GNU bug report logs - #20751
wc -m doesn't count UTF-8 characters properly

Package: coreutils

Reported by: valdis.vitolins <at> odo.lv

Date: Sat, 6 Jun 2015 17:12:03 UTC

Severity: normal

Tags: notabug

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

From: Valdis Vītoliņš <valdis.vitolins <at> odo.lv>
To: 20751 <at> debbugs.gnu.org
Subject: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sat, 06 Jun 2015 14:12:29 +0300
Version: wc (GNU coreutils) 8.21

When 'wc -m' is invoked, it should print the character count, but it
counts UTF-8 encoded characters incorrectly. The attached files have 3, 4
and 6 bytes in them, but each has only two UTF-8 encoded characters, which
you can see with any modern text editor.

wc -c shows the correct number of bytes:
wc -c *
 3 3bytes.txt
 4 4bytes.txt
 6 6bytes.txt
13 total

But wc -m shows an incorrect number of characters:
wc -m *
 3 3bytes.txt
 3 4bytes.txt
 3 6bytes.txt
 9 total

The expected output is:
wc -m *
 2 3bytes.txt
 2 4bytes.txt
 2 6bytes.txt
 6 total

I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
GNU/Linux, kernel 3.13.0-53-generic.

P.S.
If the attachments do not make it through the system, you can test this
by creating files with the following content:

3bytes.txt: aa
4bytes.txt: aā
6bytes.txt: a𐄈
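The byte counts above can be cross-checked with a small Python sketch. It assumes each file ends with a single trailing newline, which the attachments may or may not contain; that assumption is suggested by "aa" being reported as 3 bytes rather than 2. The `samples` dictionary and the `counts` helper are illustrative names, not part of the report.

```python
# Contents listed in the report, each with an ASSUMED trailing newline.
samples = {
    "3bytes.txt": "aa\n",
    "4bytes.txt": "a\u0101\n",  # "a" + U+0101 LATIN SMALL LETTER A WITH MACRON
    "6bytes.txt": "a𐄈\n",       # "a" + a supplementary-plane character
}


def counts(text):
    """Return (UTF-8 byte count, Unicode code point count) for text."""
    return len(text.encode("utf-8")), len(text)


for name, text in samples.items():
    n_bytes, n_chars = counts(text)
    print(f"{name}: {n_bytes} bytes, {n_chars} characters")
```

Under that assumption the byte counts come out as 3, 4 and 6, matching wc -c, and each string holds 3 code points once the newline is counted, matching the 'wc -m' output above. That would be consistent with the notabug tag on this report.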



[3bytes.txt (text/plain, attachment)]
[4bytes.txt (text/plain, attachment)]
[6bytes.txt (text/plain, attachment)]

This bug report was last modified 9 years and 353 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.