GNU bug report logs - #69369
wc -w ignores breaking space over UCHAR_MAX

Reported by: Aearil <aearil <at> paranoici.org>

Date: Sun, 25 Feb 2024 06:37:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

Report forwarded to bug-coreutils <at> gnu.org:
bug#69369; Package coreutils. (Sun, 25 Feb 2024 06:37:02 GMT)

Acknowledgement sent to Aearil <aearil <at> paranoici.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 25 Feb 2024 06:37:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Aearil <aearil <at> paranoici.org>
To: bug-coreutils <at> gnu.org
Subject: wc -w ignores breaking space over UCHAR_MAX
Date: Sat, 24 Feb 2024 21:44:24 +0100

Hi,

wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:

$ printf "foo\u2003bar" | ./wc -w
1

I should get a word count of 2, but instead the space is ignored while
counting words. Meanwhile, wc v9.4 gives the correct answer:

$ printf "foo\u2003bar" | wc -w
2

It looks like the regression has been introduced by [f40c6b5] and
would be fixed by something like the following change:

diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                           if (width > 0)
                             linepos += width;
                         }
-                      in_word2 = !iswnbspace (wide_char);
+                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
                     }

                   /* Count words by counting word starts, i.e., each


Cheers,

--
Aearil

Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Sun, 25 Feb 2024 12:26:02 GMT)

Notification sent to Aearil <aearil <at> paranoici.org>:
bug acknowledged by developer. (Sun, 25 Feb 2024 12:26:02 GMT)

Message #10 received at 69369-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Aearil <aearil <at> paranoici.org>, 69369-done <at> debbugs.gnu.org
Subject: Re: bug#69369: wc -w ignores breaking space over UCHAR_MAX
Date: Sun, 25 Feb 2024 12:16:48 +0000

On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
> 
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
> 
> $ printf "foo\u2003bar" | ./wc -w
> 1
> 
> I should get a word count of 2, but instead the space is ignored while
> counting words. Meanwhile, wc v9.4 gives the correct answer:
> 
> $ printf "foo\u2003bar" | wc -w
> 2
> 
> It looks like the regression has been introduced by [f40c6b5] and
> would be fixed by something like the following change:
> 
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                             if (width > 0)
>                               linepos += width;
>                           }
> -                      in_word2 = !iswnbspace (wide_char);
> +                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
>                       }
> 
>                     /* Count words by counting word starts, i.e., each

Nice one.
Great to catch this before release.
I've augmented your patch with a test,
and will push the attached later.

Marking this as done.

thanks!
Pádraig

[wc-wide-space.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 27 Mar 2024 11:24:17 GMT)
