GNU bug report logs - #69369
wc -w ignores breaking space over UCHAR_MAX
Reported by: Aearil <aearil <at> paranoici.org>
Date: Sun, 25 Feb 2024 06:37:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Your bug report
#69369: wc -w ignores breaking space over UCHAR_MAX
which was filed against the coreutils package, has been closed.
The explanation is attached below, along with your original report.
If you require more details, please reply to 69369 <at> debbugs.gnu.org.
--
69369: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=69369
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
>
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
>
> $ printf "foo\u2003bar" | ./wc -w
> 1
>
> I should get a word count of 2, but instead the space is ignored while
> counting words. Meanwhile, wc v9.4 gives the correct answer:
>
> $ printf "foo\u2003bar" | wc -w
> 2
>
> It looks like the regression has been introduced by [f40c6b5] and
> would be fixed by something like the following change:
>
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
> if (width > 0)
> linepos += width;
> }
> - in_word2 = !iswnbspace (wide_char);
> + in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
> }
>
> /* Count words by counting word starts, i.e., each
Nice one.
Great to catch this before release.
I've augmented your patch with a test,
and will push the attached later.
Marking this as done.
thanks!
Pádraig
[wc-wide-space.patch (text/x-patch, attachment)]
Hi,
wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:
$ printf "foo\u2003bar" | ./wc -w
1
I should get a word count of 2, but instead the space is ignored while
counting words. Meanwhile, wc v9.4 gives the correct answer:
$ printf "foo\u2003bar" | wc -w
2
It looks like the regression has been introduced by [f40c6b5] and
would be fixed by something like the following change:
diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
if (width > 0)
linepos += width;
}
- in_word2 = !iswnbspace (wide_char);
+ in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
}
/* Count words by counting word starts, i.e., each
Cheers,
--
Aearil
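
For illustration, the effect of the one-line change can be reproduced with a small standalone sketch. This is not the code in src/wc.c: it models only the wide-character branch, and count_words, is_nbspace and is_break_space are hypothetical names. is_break_space is a locale-independent stand-in for iswspace, and the code points listed in is_nbspace are an assumption about what iswnbspace covers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for wc's iswnbspace: non-breaking spaces only.
   The code points listed here are an assumption of this sketch.  */
static bool
is_nbspace (wchar_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Hypothetical, locale-independent stand-in for iswspace, limited to
   ASCII whitespace plus EM SPACE (U+2003).  */
static bool
is_break_space (wchar_t wc)
{
  return wc == L' ' || (wc >= L'\t' && wc <= L'\r') || wc == 0x2003;
}

/* Count words by counting word starts.  With PATCHED false, only
   non-breaking spaces end a word (the reported behaviour); with
   PATCHED true, breaking spaces end a word as well.  */
static size_t
count_words (const wchar_t *s, bool patched)
{
  assert (s != NULL);
  size_t words = 0;
  bool in_word = false;
  for (; *s != L'\0'; s++)
    {
      bool in_word2 = patched
                      ? !is_break_space (*s) && !is_nbspace (*s)
                      : !is_nbspace (*s);
      if (in_word2 && !in_word)
        words++;
      in_word = in_word2;
    }
  return words;
}
```

With these definitions, count_words (L"foo\u2003bar", false) returns 1 and count_words (L"foo\u2003bar", true) returns 2, matching the two outputs shown in the report.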
This bug report was last modified 1 year and 80 days ago.