GNU bug report logs - #69369
wc -w ignores breaking space over UCHAR_MAX
Reported by: Aearil <aearil <at> paranoici.org>
Date: Sun, 25 Feb 2024 06:37:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 69369 in the body.
You can then email your comments to 69369 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-coreutils <at> gnu.org: bug#69369; Package coreutils. (Sun, 25 Feb 2024 06:37:02 GMT)
Acknowledgement sent to Aearil <aearil <at> paranoici.org>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 25 Feb 2024 06:37:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:
$ printf "foo\u2003bar" | ./wc -w
1
I should get a word count of 2, but instead the space is ignored while
counting words. Meanwhile, wc v9.4 gives the correct answer:
$ printf "foo\u2003bar" | wc -w
2
It looks like the regression has been introduced by [f40c6b5] and
would be fixed by something like the following change:
diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                       if (width > 0)
                         linepos += width;
                     }
-                  in_word2 = !iswnbspace (wide_char);
+                  in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
                 }
 
               /* Count words by counting word starts, i.e., each
Cheers,
--
Aearil
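
Editorial sketch: the effect of the one-line change above can be illustrated with a small, self-contained C model of the word-start counting loop. This is not the code from src/wc.c. The helpers is_nbspace and is_space are hypothetical, locale-independent stand-ins for iswnbspace and iswspace, their character lists are assumptions and not exhaustive, and the loop models only the wide-character branch touched by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for wc.c's iswnbspace (): an assumed list of
   non-breaking spaces.  */
static bool
is_nbspace (wchar_t c)
{
  return c == 0x00A0 || c == 0x2007 || c == 0x202F || c == 0x2060;
}

/* Hypothetical, locale-independent stand-in for iswspace (): ASCII
   whitespace plus a few Unicode breaking spaces (not exhaustive).  */
static bool
is_space (wchar_t c)
{
  return c == L' ' || (c >= L'\t' && c <= L'\r')
         || (c >= 0x2000 && c <= 0x2006)
         || (c >= 0x2008 && c <= 0x200A)
         || c == 0x3000;
}

/* Count word starts in S.  FIXED selects the patched predicate;
   otherwise only non-breaking spaces end a word, as in the report.  */
static size_t
count_words (const wchar_t *s, bool fixed)
{
  size_t words = 0;
  bool in_word = false;
  for (; *s != L'\0'; s++)
    {
      bool in_word2 = fixed
                      ? !is_space (*s) && !is_nbspace (*s)
                      : !is_nbspace (*s);
      if (in_word2 && !in_word)
        words++;                /* a word starts here */
      in_word = in_word2;
    }
  return words;
}

/* Self-check mirroring the EM SPACE example from the report.  */
static void
check_examples (void)
{
  assert (count_words (L"foo\u2003bar", false) == 1);
  assert (count_words (L"foo\u2003bar", true) == 2);
}
```

Calling check_examples () from a main function exercises both predicates on the "foo", U+2003, "bar" input: the unpatched predicate sees one word, the patched one sees two.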
Reply sent to Pádraig Brady <P <at> draigBrady.com>: You have taken responsibility. (Sun, 25 Feb 2024 12:26:02 GMT)
Notification sent to Aearil <aearil <at> paranoici.org>: bug acknowledged by developer. (Sun, 25 Feb 2024 12:26:02 GMT)
Message #10 received at 69369-done <at> debbugs.gnu.org:
On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
>
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
>
> $ printf "foo\u2003bar" | ./wc -w
> 1
>
> I should get a word count of 2, but instead the space is ignored while
> counting words. Meanwhile, wc v9.4 gives the correct answer:
>
> $ printf "foo\u2003bar" | wc -w
> 2
>
> It looks like the regression has been introduced by [f40c6b5] and
> would be fixed by something like the following change:
>
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                        if (width > 0)
>                          linepos += width;
>                      }
> -                  in_word2 = !iswnbspace (wide_char);
> +                  in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
>                  }
>
>                /* Count words by counting word starts, i.e., each
Nice one.
Great to catch this before release.
I've augmented your patch with a test,
and will push the attached later.
Marking this as done.
Thanks!
Pádraig
[wc-wide-space.patch (text/x-patch, attachment)]
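
Editorial sketch: when reproducing the report, it may help to remember that iswspace consults the LC_CTYPE category of the current locale, so its answer for U+2003 can differ between locales. The helper below is a hypothetical probe, not part of coreutils or of the attached patch. It switches LC_CTYPE for the whole process, so it is meant for a throwaway test program only.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical probe: return 1 if C is classified as whitespace in
   LOCALE_NAME, 0 if it is not, and -1 if the locale cannot be set.
   Note: this changes the process-wide LC_CTYPE locale.  */
static int
space_in_locale (const char *locale_name, wint_t c)
{
  if (setlocale (LC_CTYPE, locale_name) == NULL)
    return -1;
  return iswspace (c) != 0;
}

/* ASCII classification is the same in every locale, so it can be
   checked unconditionally.  */
static void
check_ascii (void)
{
  assert (space_in_locale ("C", L' ') == 1);
  assert (space_in_locale ("C", L'a') == 0);
}
```

Calling space_in_locale ("", 0x2003) probes the locale taken from the environment. No expected value is given here because the result depends on which locales are installed and how they classify EM SPACE.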
Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 27 Mar 2024 11:24:17 GMT)
This bug report was last modified 1 year and 79 days ago.