GNU bug report logs - #20751
wc -m doesn't count UTF-8 characters properly
Reported by: valdis.vitolins <at> odo.lv
Date: Sat, 6 Jun 2015 17:12:03 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Version: wc (GNU coreutils) 8.21
When 'wc -m' is invoked, it should print the character count, but it counts
UTF-8 encoded characters incorrectly. The attached files have 3, 4 and 6
bytes in them, but all have only two UTF-8 encoded characters, which you
can see with any modern text editor.
wc -c shows the correct number of bytes:
wc -c *
3 3bytes.txt
4 4bytes.txt
6 6bytes.txt
13 total
But wc -m shows an incorrect number of characters:
wc -m *
3 3bytes.txt
3 4bytes.txt
3 6bytes.txt
9 total
But it should be:
wc -m *
2 3bytes.txt
2 4bytes.txt
2 6bytes.txt
6 total
I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
GNU/Linux, with the 3.13.0-53-generic kernel.
P.S.
If the attachments do not pass through the system, you can test it by
creating files with the following content:
3bytes.txt: aa
4bytes.txt: aā
6bytes.txt: a𐄈
[3bytes.txt (text/plain, attachment)]
[4bytes.txt (text/plain, attachment)]
[6bytes.txt (text/plain, attachment)]
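The byte and character arithmetic for the three files can be checked with a small sketch. It assumes each file ends with a single trailing newline. The report does not say this, but a 3-byte file whose visible content is "aa" implies one extra byte. The `counts` helper and the `files` mapping are illustrative names, not part of wc; the helper only approximates what `wc -c` and `wc -m` report in a UTF-8 locale.

```python
# File contents as listed in the report, plus an ASSUMED trailing newline
# per file (not stated in the report, inferred from the byte counts).
files = {
    "3bytes.txt": "aa\n",
    "4bytes.txt": "aā\n",  # 'ā' takes 2 bytes in UTF-8
    "6bytes.txt": "a𐄈\n",  # '𐄈' takes 4 bytes in UTF-8
}


def counts(text):
    """Return (bytes, characters), roughly what wc -c and wc -m would print."""
    return len(text.encode("utf-8")), len(text)


for name, text in files.items():
    n_bytes, n_chars = counts(text)
    print(f"{name}: {n_bytes} bytes, {n_chars} characters")
```

Under that assumption, each file holds two visible characters plus the newline, which gives three characters per file and nine in total, the same numbers wc -m printed above.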
This bug report was last modified 9 years and 353 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.