GNU bug report logs - #69369
wc -w ignores breaking space over UCHAR_MAX
Reported by: Aearil <aearil <at> paranoici.org>
Date: Sun, 25 Feb 2024 06:37:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi,
wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:
$ printf "foo\u2003bar" | ./wc -w
1
I should get a word count of 2, but instead the space is ignored while
counting words. Meanwhile, wc v9.4 gives the correct answer:
$ printf "foo\u2003bar" | wc -w
2
It looks like the regression was introduced by [f40c6b5] and
could be fixed by something like the following change:
diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                       if (width > 0)
                         linepos += width;
                     }
-                  in_word2 = !iswnbspace (wide_char);
+                  in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
                 }
 
               /* Count words by counting word starts, i.e., each
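
To illustrate the effect of that one-line change outside of wc.c, here is
a minimal sketch, not the actual wc code. It counts word starts in a wide
string using a pluggable separator test. The names demo_is_nbspace,
sep_unpatched, sep_patched and count_words are hypothetical, and the list
of non-breaking code points in demo_is_nbspace is an assumption standing
in for wc's iswnbspace. ASCII whitespace is handled separately here, on
the assumption that wc treats those characters before reaching the
patched line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical stand-in for wc's iswnbspace(); the code point list is
   an assumption made for this sketch.  */
bool
demo_is_nbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* ASCII space, \t, \n, \v, \f and \r.  */
bool
demo_is_ascii_space (wint_t wc)
{
  return wc == 0x20 || (0x09 <= wc && wc <= 0x0D);
}

/* Separator test mirroring the line before the patch: beyond ASCII,
   only the non-breaking spaces are considered.  */
bool
sep_unpatched (wint_t wc)
{
  return demo_is_ascii_space (wc) || demo_is_nbspace (wc);
}

/* Separator test mirroring the proposed line: iswspace() is consulted
   too, so its answer depends on the current locale.  */
bool
sep_patched (wint_t wc)
{
  return demo_is_ascii_space (wc) || iswspace (wc) || demo_is_nbspace (wc);
}

/* Count words by counting word starts, i.e., each non-separator that
   follows a separator or begins the string.  */
size_t
count_words (const wchar_t *s, bool (*is_sep) (wint_t))
{
  size_t words = 0;
  bool in_word = false;

  for (; *s != L'\0'; s++)
    {
      bool in_word2 = !is_sep ((wint_t) *s);
      if (in_word2 && !in_word)
        words++;
      in_word = in_word2;
    }
  return words;
}
```

With sep_unpatched, L"foo\u2003bar" counts as one word because U+2003 is
neither ASCII whitespace nor in the non-breaking list. With sep_patched
the result is two words whenever iswspace (0x2003) is true in the active
locale, which typically requires calling setlocale with a UTF-8 locale
first.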
Cheers,
--
Aearil
This bug report was last modified 1 year and 79 days ago.