GNU bug report logs - #38627
uniq -c gets wrong count with non-ascii strings

Package: coreutils

Reported by: Roy Smith <roy <at> panix.com>

Date: Sun, 15 Dec 2019 19:41:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Roy Smith <roy <at> panix.com>
To: bug-coreutils <at> gnu.org
Subject: uniq -c gets wrong count with non-ascii strings
Date: Sun, 15 Dec 2019 14:40:14 -0500
[Message part 1 (text/plain, inline)]
With the following input:

> $ cat x
> "ⁿᵘˡˡ"
> "ܥܝܪܐܩ"


Running "uniq -c" reports two copies of the same line, even though the two lines are different:

> $ uniq -c x
>       2 "ⁿᵘˡˡ"


I've attached a copy of the test file, and here's the octal dump:

> $ od -b x
> 0000000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 334 245
> 0000020 334 235 334 252 334 220 334 251 042 012
> 0000032


I'm getting this on:

> Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
> uniq (GNU coreutils) 8.26

My macOS 10.13.6 box gets it right:

> $ uniq -c x
>    1 "ⁿᵘˡˡ"
>    1 "ܥܝܪܐܩ"
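
One way to narrow this down, sketched under assumptions that this message does not confirm: GNU uniq compares lines using the collation rules of the current locale (LC_COLLATE), so a locale that gives these characters no distinct collation weights could make the two lines compare as equal. The report does not say which locale was in effect on the Linux host. The snippet below rebuilds the test file from the octal dump above and runs uniq under the C locale, where comparison is byte-wise. The variable names (tmp, bytes, groups) are illustrative only.

```shell
# Recreate the two-line test file from the octal dump in the report.
tmp=$(mktemp)
printf '\042\342\201\277\341\265\230\313\241\313\241\042\012' > "$tmp"
printf '\042\334\245\334\235\334\252\334\220\334\251\042\012' >> "$tmp"

# Size check against the dump, which ends at octal offset 0000032.
bytes=$(wc -c < "$tmp")

# Assumption: forcing the C locale makes uniq compare lines byte by byte,
# so locale collation data cannot treat different lines as equal.
groups=$(LC_ALL=C uniq -c "$tmp" | wc -l)

echo "bytes=$((bytes)) groups=$((groups))"
rm -f "$tmp"
```

If the count differs between the default locale and LC_ALL=C on the affected machine, that would point at locale collation rather than at the counting logic in uniq itself.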


[x (application/octet-stream, attachment)]

