GNU bug report logs - #20751
wc -m doesn't count UTF-8 characters properly
Reported by: valdis.vitolins <at> odo.lv
Date: Sat, 6 Jun 2015 17:12:03 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Message #18 received at control <at> debbugs.gnu.org (full text, mbox):
tag 20751 notabug
close 20751
stop
On 06/06/15 19:49, Valdis Vītoliņš wrote:
>>> Version: wc (GNU coreutils) 8.21
>>>
>>> When 'wc -m' is invoked, it should print the character count, but it
>>> counts UTF-8 encoded characters incorrectly. The attached files have 3,
>>> 4 and 6 bytes in them, but each has only two UTF-8 encoded characters,
>>> which you can see with any modern text editor.
>>>
>>> wc -c shows the correct number of bytes:
>>> wc -c *
>>> 3 3bytes.txt
>>> 4 4bytes.txt
>>> 6 6bytes.txt
>>> 13 total
>>>
>>> But wc -m shows an incorrect number of characters:
>>> wc -m *
>>> 3 3bytes.txt
>>> 3 4bytes.txt
>>> 3 6bytes.txt
>>> 9 total
>>>
>>> But it should be:
>>> wc -m *
>>> 2 3bytes.txt
>>> 2 4bytes.txt
>>> 2 6bytes.txt
>>> 6 total
I think it's working correctly,
i.e. the trailing \n is included in the count. Each file holds two
visible characters plus that newline, so 3 is the expected result.
thanks,
Pádraig.
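
The reply above can be checked with a small sketch. The file contents below are hypothetical stand-ins, since the attachments are not shown in the report; they are chosen only so that the byte sizes match 3, 4 and 6. Python's len() on a str counts Unicode code points, which should agree with 'wc -m' in a UTF-8 locale for simple text like this (combining sequences would count as more than one character in both).

```python
# Hypothetical stand-ins for the attached files; the real contents are not
# shown in the report. Each is two characters followed by a newline.
samples = {
    "3bytes.txt": "ab\n",  # 1 + 1 + 1 bytes in UTF-8
    "4bytes.txt": "aā\n",  # 1 + 2 + 1 bytes in UTF-8
    "6bytes.txt": "ā€\n",  # 2 + 3 + 1 bytes in UTF-8
}


def counts(text: str) -> tuple[int, int]:
    """Return (byte count, character count) for text encoded as UTF-8."""
    return len(text.encode("utf-8")), len(text)


for name, text in samples.items():
    n_bytes, n_chars = counts(text)
    # Character count once the trailing newline is stripped.
    visible = len(text.rstrip("\n"))
    print(f"{name}: {n_bytes} bytes, {n_chars} characters, "
          f"{visible} without the trailing newline")
```

Under these assumptions every sample has 3 characters, matching the 'wc -m' output in the report, and 2 once the newline is stripped, matching what the reporter expected to see.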
This bug report was last modified 9 years and 354 days ago.