GNU bug report logs - #34524
wc: word count incorrect when words separated only by no-break space


Package: coreutils

Reported by: vampyrebat <at> gmail.com

Date: Mon, 18 Feb 2019 08:13:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bruno Haible <bruno <at> clisp.org>, Pádraig Brady <P <at> draigbrady.com>, bug-libunistring <at> gnu.org
Cc: vampyrebat <at> gmail.com, 34524 <at> debbugs.gnu.org
Subject: bug#34524: wc: word count incorrect when words separated only by no-break space
Date: Sun, 24 Feb 2019 09:47:02 -0800
Bruno Haible wrote:
> I would find it best to introduce an option '--unicode'
> to 'wc', that would produce Unicode compliant results, at the cost of
>    - not following POSIX to the letter,

It'd make sense to have an option. How about a more general option, --words,
that would let the user define what a word is? This option's operand could use
ERE syntax, or a shorthand beginning with '+' for common combinations. For
example, the command:

wc --words='[[:alnum:]]+'

would say that a word is a maximal contiguous sequence of alphanumeric
characters. And

wc --words='+unicode'

would use the Unicode definition of word, whatever it is.
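
To make the proposed semantics concrete, here is a rough sketch in Python of
counting words as non-empty regex matches. It is only an illustration under
stated assumptions: --words is a suggestion in this thread, not an existing wc
option; count_words and the two pattern names are hypothetical; and Python's
re module is not POSIX ERE (it has no [[:alnum:]]), so [^\W_]+ stands in as an
approximate equivalent.

```python
import re

# Hypothetical stand-in for the proposed `wc --words=PATTERN`.
# No such option exists in wc; this only illustrates the idea.
def count_words(text: str, pattern: str) -> int:
    """Count the non-empty, non-overlapping matches of pattern in text."""
    return sum(1 for m in re.finditer(pattern, text) if m.group(0))

# Approximation of the ERE '[[:alnum:]]+': word characters minus underscore.
ALNUM_RUN = r"[^\W_]+"

# Runs of anything except ASCII whitespace. NO-BREAK SPACE (U+00A0) is not
# in this set, so it does not separate words under this definition.
NON_ASCII_SPACE_RUN = r"[^ \t\n\r\f\v]+"

# "foo" and "bar" are separated only by a no-break space.
sample = "foo\u00a0bar baz"

print(count_words(sample, ALNUM_RUN))            # 3
print(count_words(sample, NON_ASCII_SPACE_RUN))  # 2
```

The second pattern reproduces the kind of count reported in this bug, where
words joined only by a no-break space are counted as one; the first shows how
a user-supplied definition would separate them.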





