GNU bug report logs - #34524
wc: word count incorrect when words separated only by no-break space
Reported by: vampyrebat <at> gmail.com
Date: Mon, 18 Feb 2019 08:13:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Message #5 received at submit <at> debbugs.gnu.org:
$ wc --version
wc (GNU coreutils) 8.29
Packaged by Gentoo (8.29-r1 (p1.0))
The man page for wc states: "A word is a... sequence of characters delimited by white space."
But its concept of white space seems to include only ASCII white space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.
If your terminal uses UTF-8 encoding:
printf 'how are\xC2\xA0you\n'
or if your terminal uses ISO 8859-1 encoding:
printf 'how are\xA0you\n'
the visible output of this printf is "how are you". In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:
$ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
2
$ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
2
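Until wc itself treats no-break space as a delimiter, one possible workaround is to convert the character to an ASCII space before counting. The following is only a sketch, not part of the original report: the function name count_words_nbsp is hypothetical, and it assumes UTF-8 input, where U+00A0 is encoded as the two bytes C2 A0 (octal 302 240).

```shell
# Hypothetical helper: count words on stdin, treating UTF-8 NO-BREAK SPACE
# (bytes C2 A0) as a word separator. Assumes the input is UTF-8 encoded.
count_words_nbsp() {
    # Build the sed script with printf so the raw bytes are written portably,
    # then run sed and wc in the C locale so both operate on plain bytes.
    LC_ALL=C sed "$(printf 's/\302\240/ /g')" | LC_ALL=C wc -w
}

# The example from the report, written with octal escapes.
printf 'how are\302\240you\n' | count_words_nbsp
```

With this filter the example line is counted as 3 words. For ISO 8859-1 input the same idea would apply to the single byte A0 (octal 240) instead.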
This bug report was last modified 6 years and 78 days ago.