On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
> 
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
> 
> $ printf "foo\u2003bar" | ./wc -w
> 1
> 
> I should get a word count of 2, but instead the space is ignored while
> counting words. Meanwhile, wc v9.4 gives the correct answer:
> 
> $ printf "foo\u2003bar" | wc -w
> 2
> 
> It looks like the regression was introduced by [f40c6b5] and
> would be fixed by something like the following change:
> 
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                             if (width > 0)
>                               linepos += width;
>                           }
> -                      in_word2 = !iswnbspace (wide_char);
> +                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
>                       }
> 
>                     /* Count words by counting word starts, i.e., each

Nice one.
Great to catch this before the release.
I've augmented your patch with a test,
and will push the attached later.
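
For anyone following along, here is a minimal standalone sketch of the
corrected separator check. It is not the code in src/wc.c: is_nbspace,
is_word_sep and count_words are hypothetical names, and the list of
non-breaking space code points is an assumption made for illustration.
It only shows why both tests are needed: a character ends a word if it
is locale whitespace or one of the non-breaking spaces.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for wc's non-breaking-space check; the code
   points listed here are an assumption for illustration.  */
static bool
is_nbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* A character separates words if it is locale whitespace or one of
   the non-breaking spaces above.  Testing only is_nbspace would miss
   ordinary whitespace, which is the regression reported here.  */
static bool
is_word_sep (wint_t wc)
{
  return iswspace (wc) || is_nbspace (wc);
}

/* Count words in a NUL-terminated wide string by counting word
   starts, i.e. each transition from a separator to a non-separator.  */
static size_t
count_words (wchar_t const *s)
{
  size_t words = 0;
  bool in_word = false;

  for (; *s; s++)
    {
      bool in_word2 = !is_word_sep ((wint_t) *s);
      if (in_word2 && !in_word)
        words++;
      in_word = in_word2;
    }
  return words;
}
```

Whether iswspace accepts U+2003 depends on the active locale, so a
caller would normally run setlocale (LC_ALL, "") first; the ASCII and
non-breaking-space cases do not depend on that.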

Marking this as done.

thanks!
Pádraig