GNU bug report logs - #69369
wc -w ignores breaking space over UCHAR_MAX


Package: coreutils

Reported by: Aearil <aearil <at> paranoici.org>

Date: Sun, 25 Feb 2024 06:37:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Aearil <aearil <at> paranoici.org>
Subject: bug#69369: closed (Re: bug#69369: wc -w ignores breaking space
 over UCHAR_MAX)
Date: Sun, 25 Feb 2024 12:26:02 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#69369: wc -w ignores breaking space over UCHAR_MAX

which was filed against the coreutils package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 69369 <at> debbugs.gnu.org.

-- 
69369: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=69369
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Pádraig Brady <P <at> draigBrady.com>
To: Aearil <aearil <at> paranoici.org>, 69369-done <at> debbugs.gnu.org
Subject: Re: bug#69369: wc -w ignores breaking space over UCHAR_MAX
Date: Sun, 25 Feb 2024 12:16:48 +0000
[Message part 3 (text/plain, inline)]
On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
> 
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
> 
> $ printf "foo\u2003bar" | ./wc -w
> 1
> 
> I should get a word count of 2, but instead the space is ignored while
> counting words. Meanwhile, wc v9.4 gives the correct answer:
> 
> $ printf "foo\u2003bar" | wc -w
> 2
> 
> It looks like the regression has been introduced by [f40c6b5] and
> would be fixed by something like the following change:
> 
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                             if (width > 0)
>                               linepos += width;
>                           }
> -                      in_word2 = !iswnbspace (wide_char);
> +                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
>                       }
> 
>                     /* Count words by counting word starts, i.e., each

Nice one.
Great to catch this before release.
I've augmented your patch with a test,
and will push the attached later.

Marking this as done.

thanks!
Pádraig
[wc-wide-space.patch (text/x-patch, attachment)]
[Message part 5 (message/rfc822, inline)]
From: Aearil <aearil <at> paranoici.org>
To: bug-coreutils <at> gnu.org
Subject: wc -w ignores breaking space over UCHAR_MAX
Date: Sat, 24 Feb 2024 21:44:24 +0100
Hi,

wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:

$ printf "foo\u2003bar" | ./wc -w
1

I should get a word count of 2, but instead the space is ignored while
counting words. Meanwhile, wc v9.4 gives the correct answer:

$ printf "foo\u2003bar" | wc -w
2

It looks like the regression has been introduced by [f40c6b5] and
would be fixed by something like the following change:

diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                           if (width > 0)
                             linepos += width;
                         }
-                      in_word2 = !iswnbspace (wide_char);
+                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
                     }

                   /* Count words by counting word starts, i.e., each


Cheers,

--
Aearil
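
The one-line change in the patch above can be illustrated with a small standalone sketch. It is not the code from src/wc.c: the names is_nbspace, in_word_buggy, in_word_fixed and count_words are hypothetical, and the list of non-breaking space code points in is_nbspace is an assumption standing in for wc's iswnbspace helper. The sketch also applies the predicate to every character for simplicity, whereas the report indicates that only code points above UCHAR_MAX were affected in the real program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for the iswnbspace helper referenced in the patch.
   The exact set of code points is an assumption made for this sketch. */
static bool
is_nbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Pre-fix predicate: only non-breaking spaces end a word, so an
   ordinary breaking space is wrongly treated as part of a word. */
static bool
in_word_buggy (wint_t wc)
{
  return !is_nbspace (wc);
}

/* Post-fix predicate, mirroring the patched line:
   in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);  */
static bool
in_word_fixed (wint_t wc)
{
  return !iswspace (wc) && !is_nbspace (wc);
}

/* Count words by counting word starts, i.e. each transition from
   "not in a word" to "in a word". */
static size_t
count_words (const wchar_t *text, bool (*in_word_fn) (wint_t))
{
  assert (text != NULL);
  size_t words = 0;
  bool in_word = false;
  for (; *text != L'\0'; text++)
    {
      bool in_word2 = in_word_fn ((wint_t) *text);
      if (in_word2 && !in_word)
        words++;
      in_word = in_word2;
    }
  return words;
}
```

With an ASCII space, count_words (L"foo bar", in_word_fixed) returns 2 while the pre-fix predicate returns 1. To try the reported EM SPACE case, a caller would first need setlocale (LC_ALL, "") under a UTF-8 locale, since whether iswspace classifies U+2003 as a space depends on the active locale.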



This bug report was last modified 1 year and 80 days ago.
