GNU bug report logs - #34524
wc: word count incorrect when words separated only by no-break space
Reported by: vampyrebat <at> gmail.com
Date: Mon, 18 Feb 2019 08:13:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
vampyrebat <at> gmail.com wrote:
> The man page for wc states: "A word is a... sequence of characters delimited by white space."
>
> But its concept of white space only seems to include ASCII white
> space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.
Indeed. This is because wc, like the other coreutils programs and many
other programs, uses the libc locale definition of white space.
$ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 od -tx1 -c
0000000  c2  a0  0a
        302 240  \n
0000003
$ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
0
$ printf '\xC2\xA0 \n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
1
This shows that grep does not recognize \xC2\xA0 as a character in the
class of space characters either.
$ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 tr '[[:space:]]' x | od -tx1 -c
0000000  c2  a0  78
        302 240   x
0000003
Here the trailing newline, which is in the space class, matches and is
translated to x, while the no-break space bytes pass through unchanged.
Since character classes are defined as part of the locale table, there
isn't really anything we can do about it on the coreutils wc side of
things. It would need to be redefined upstream, in the libc locale data.
Bob
This bug report was last modified 6 years and 78 days ago.