GNU bug report logs - #20751
wc -m doesn't count UTF-8 characters properly
Reported by: valdis.vitolins <at> odo.lv
Date: Sat, 6 Jun 2015 17:12:03 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Note that UTF-8 characters can be counted by counting the bytes with bit
patterns 0xxxxxxx or 11xxxxxx:
https://en.wikipedia.org/wiki/UTF-8#Description
So the general logic should be that if:
a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
b) the first three bytes of the file are 0xEF 0xBB 0xBF (the UTF-8 byte order mark)
https://en.wikipedia.org/wiki/Byte_order_mark
then count the bytes with bit patterns 0xxxxxxx and 11xxxxxx.
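
The counting rule described above can be sketched in a few lines of Python. This is only an illustration, not how wc is implemented: the function name `count_utf8_chars` is hypothetical, and the sketch assumes the input is well-formed UTF-8 (it does not validate sequences, so stray or invalid bytes would be miscounted).

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping continuation bytes."""
    # Continuation bytes have the bit pattern 10xxxxxx. Every other byte
    # (0xxxxxxx or 11xxxxxx) begins a new character, so counting those
    # bytes gives the number of encoded code points.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


if __name__ == "__main__":
    sample = "a\u0101".encode("utf-8")  # "a" followed by U+0101 (a with macron)
    print(len(sample), count_utf8_chars(sample))
```

For well-formed input the result agrees with decoding the bytes and taking the length of the resulting string. Note that a newline byte (0x0A) matches 0xxxxxxx and is therefore counted as a character too.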
> You mailed submit <at> debbugs without specifying a Package:, so your bug
> report ended up on the help-debbugs list. I have reassigned it to
> coreutils. (Please note there is no "wc" package.)
>
> (My mailer is messing up the UTF-8 characters in your report.
> Interested parties can see the original at http://debbugs.gnu.org/20751#5 .)
>
> Valdis Vītoliņš wrote:
>
> > Version: wc (GNU coreutils) 8.21
> >
> > When 'wc -m' is invoked, it should print the character count, but it counts
> > UTF-8 encoded characters incorrectly. The attached files have 3, 4 and 6
> > bytes in them, but each has only two UTF-8 encoded characters, which you
> > can see with any modern text editor.
> >
> > wc -c shows the correct number of bytes:
> > wc -c *
> > 3 3bytes.txt
> > 4 4bytes.txt
> > 6 6bytes.txt
> > 13 total
> >
> > But wc -m shows an incorrect number of characters:
> > wc -m *
> > 3 3bytes.txt
> > 3 4bytes.txt
> > 3 6bytes.txt
> > 9 total
> >
> > But it should be:
> > wc -m *
> > 2 3bytes.txt
> > 2 4bytes.txt
> > 2 6bytes.txt
> > 6 total
> >
> > I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
> > GNU/Linux 3.13.0-53-generic kernel
> >
> > P.S.
> > If the attachments do not pass through the system, you can test it by creating
> > files with the following content:
> >
> > 3bytes.txt: aa
> > 4bytes.txt: aā
> > 6bytes.txt: a
>
> Attachments at http://debbugs.gnu.org/20751#5
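
The byte and character counts for the sample files can be checked without wc. The following Python sketch is a hypothetical reconstruction, not the original attachments: the visible content of 3bytes.txt ("aa") and 4bytes.txt ("aā") is taken from the report, while the trailing newline is an assumption made so that the byte counts match the reported 3 and 4. 6bytes.txt is left out because its second character did not survive in the quoted text.

```python
# Hypothetical reconstruction of two of the sample files. The trailing
# newline is an assumption chosen to match the reported byte counts.
samples = {
    "3bytes.txt": "aa\n",
    "4bytes.txt": "a\u0101\n",  # U+0101 is "ā", two bytes in UTF-8
}


def byte_and_char_counts(text):
    """Return (number of UTF-8 bytes, number of code points) for text."""
    encoded = text.encode("utf-8")
    return len(encoded), len(encoded.decode("utf-8"))


for name, text in samples.items():
    n_bytes, n_chars = byte_and_char_counts(text)
    print(f"{name}: {n_bytes} bytes, {n_chars} characters")
```

Under that assumption, both files contain three characters, because the newline is counted as a character as well. That agrees with the wc -m output quoted in the report.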
This bug report was last modified 9 years and 353 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.