On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
>
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
>
>     $ printf "foo\u2003bar" | ./wc -w
>     1
>
> I should get a word count of 2, but instead the space is ignored while
> counting words. Meanwhile, wc v9.4 gives the correct answer:
>
>     $ printf "foo\u2003bar" | wc -w
>     2
>
> It looks like the regression has been introduced by [f40c6b5] and
> would be fixed by something like the following change:
>
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                    if (width > 0)
>                      linepos += width;
>                  }
> -              in_word2 = !iswnbspace (wide_char);
> +              in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
>              }
>
>            /* Count words by counting word starts, i.e., each

Nice one.
Great to catch this before release.

I've augmented your patch with a test,
and will push the attached later.

Marking this as done.

thanks!
Pádraig