#34524 - wc: word count incorrect when words separated only by no-break space

GNU bug report logs - #34524
wc: word count incorrect when words separated only by no-break space

Date: Mon, 18 Feb 2019 08:13:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Pádraig Brady <P <at> draigBrady.com>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#34524: closed (wc: word count incorrect when words separated only by no-break space)
Date: Tue, 26 Feb 2019 04:28:02 +0000

[Message part 1 (text/plain, inline)]

Your message dated Mon, 25 Feb 2019 20:26:55 -0800
with message-id <944c5643-0007-bf9b-43eb-a51c003ba1ec <at> draigBrady.com>
and subject line Re: bug#34524: wc: word count incorrect when words separated only by no-break space
has caused the debbugs.gnu.org bug report #34524,
regarding wc: word count incorrect when words separated only by no-break space
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)

-- 
34524: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=34524
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: vampyrebat <at> gmail.com
To: bug-coreutils <at> gnu.org
Subject: wc: word count incorrect when words separated only by no-break space
Date: Mon, 18 Feb 2019 02:12:15 -0600

$ wc --version
wc (GNU coreutils) 8.29
Packaged by Gentoo (8.29-r1 (p1.0))

The man page for wc states: "A word is a... sequence of characters delimited by white space."

But its concept of white space only seems to include ASCII white space.  U+00A0 NO-BREAK SPACE, for instance, is not recognized.

If your terminal displays UTF-8 encoding:

printf 'how are\xC2\xA0you\n'

or if your terminal displays ISO 8859-1 encoding:

printf 'how are\xA0you\n'

the visible output of this printf is "how are you".  In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:

$ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
2
$ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
2
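
The counts above can be reproduced outside wc with a small sketch. This is not the wc source; it is an illustrative Python model, and the names count_words, ASCII_SPACE and NO_BREAK are hypothetical. It assumes a word is a maximal run of non-separator characters, and it compares an ASCII-only separator set with one that also includes U+00A0.

```python
# Illustrative model only; not taken from coreutils.
# ASCII white space as recognised by isspace() in the C locale.
ASCII_SPACE = set(" \t\n\v\f\r")
# Assumption: only U+00A0 NO-BREAK SPACE is added as an extra separator.
NO_BREAK = {"\u00a0"}


def count_words(data: bytes, encoding: str, nbsp_is_space: bool) -> int:
    """Count maximal runs of non-separator characters in data."""
    text = data.decode(encoding)
    separators = ASCII_SPACE | (NO_BREAK if nbsp_is_space else set())
    words = 0
    in_word = False
    for ch in text:
        if ch in separators:
            in_word = False
        elif not in_word:
            # First character after a separator starts a new word.
            in_word = True
            words += 1
    return words


if __name__ == "__main__":
    utf8_input = b"how are\xc2\xa0you\n"
    latin1_input = b"how are\xa0you\n"
    print(count_words(utf8_input, "utf-8", nbsp_is_space=False))
    print(count_words(utf8_input, "utf-8", nbsp_is_space=True))
    print(count_words(latin1_input, "iso-8859-1", nbsp_is_space=False))
    print(count_words(latin1_input, "iso-8859-1", nbsp_is_space=True))
```

With the ASCII-only set, "are", the no-break space and "you" form a single word, which matches the reported count of 2. Treating U+00A0 as a separator gives 3 for both encodings.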

[Message part 3 (message/rfc822, inline)]

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: vampyrebat <at> gmail.com, 34524-done <at> debbugs.gnu.org,
 Paul Eggert <eggert <at> CS.UCLA.EDU>
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Mon, 25 Feb 2019 20:26:55 -0800

[Message part 4 (text/plain, inline)]

On 24/02/19 19:55, Pádraig Brady wrote:
> On 24/02/19 17:07, Pádraig Brady wrote:
>> So non break space is generally considered a word delimiter,
>> though there are complications you detail from unicode.
>>
>> In regard to options for enabling various behaviors for wc(1),
>> I'm thinking we might keep the strict POSIX isspace() behavior
>> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
>> by default, since that's the most common operation one would want,
>> and is consistent with libreoffice for example.
>> I'll adjust the patch along those lines.
> 
> Full patch attached.

Updated patch attached. I'll push in a few hours.
Marking this bug as done.

cheers,
Pádraig.

[wc-nbsp.patch (text/x-patch, attachment)]
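
The policy proposed in the quoted text (strict isspace() behavior under LC_CTYPE=C and/or POSIXLY_CORRECT, no-break space as a delimiter otherwise) can be sketched as a small decision function. The attached patch is not reproduced in this log, so the following is only a guess at the shape of that proposal, not the pushed code. The helper names effective_ctype and nbsp_is_separator are hypothetical, and the LC_ALL, LC_CTYPE, LANG lookup order is the usual locale precedence, assumed here rather than taken from the patch.

```python
# Sketch of the proposed policy; names and details are assumptions.
from typing import Mapping


def effective_ctype(env: Mapping[str, str]) -> str:
    """Return the locale name governing character classification."""
    # Usual precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def nbsp_is_separator(env: Mapping[str, str]) -> bool:
    """Decide whether a no-break space should delimit words."""
    # Strict POSIX behavior requested explicitly.
    if "POSIXLY_CORRECT" in env:
        return False
    # Strict behavior in the C/POSIX locale; relaxed behavior elsewhere.
    return effective_ctype(env) not in ("C", "POSIX")


if __name__ == "__main__":
    print(nbsp_is_separator({"LC_ALL": "en_US.utf8"}))
    print(nbsp_is_separator({"LC_CTYPE": "C"}))
    print(nbsp_is_separator({"LC_ALL": "en_US.utf8", "POSIXLY_CORRECT": "1"}))
```

Passing the environment as a mapping keeps the sketch testable without touching the real process environment.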

This bug report was last modified 6 years and 127 days ago.

