GNU bug report logs - #38627
uniq -c gets wrong count with non-ascii strings

Package: coreutils

Reported by: Roy Smith <roy <at> panix.com>

Date: Sun, 15 Dec 2019 19:41:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Message #11 received at 38627 <at> debbugs.gnu.org (full text, mbox):

From: Roy Smith <roy <at> panix.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Mon, 16 Dec 2019 19:46:39 -0500
Yup, this does depend on the locale.  In my original example, I had LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
>       1 "ⁿᵘˡˡ"
>       1 "ܥܝܪܐܩ"


But that doesn't fully explain what's going on.  I find it difficult to believe that there's any collation sequence in the world under which those two strings should compare equal.  I've been playing around with the ICU string compare demo <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't reproduce this there.  Possibly I just haven't hit upon the right combination of options, but it seems far-fetched that any combination exists for which treating those two strings as equal is legitimate.
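
A minimal sketch for checking the locale dependence described above. It assumes the input file is named x, as in the command quoted earlier, and that uniq compares lines using the collation rules of the active locale. The helper name uniq_groups is hypothetical and only used for illustration. If en_US.UTF-8 is not installed, the second call silently behaves like the C locale, and its result otherwise depends on the C library's collation data, so no output is claimed for it.

```shell
#!/bin/sh
# Hypothetical helper: print how many groups uniq reports for a file
# when run under the given locale.
uniq_groups() {
    # $1 = locale name, $2 = input file
    LC_ALL="$1" uniq -c "$2" | wc -l
}

# Recreate the two-line input from the report.
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > x

# C locale: lines are compared byte by byte, so the two distinct
# lines form two groups.
uniq_groups C x

# en_US.UTF-8: the result depends on the installed collation tables.
# A count of 1 here would mean the locale treats the lines as equal.
uniq_groups en_US.UTF-8 x
```

Comparing the two counts separates a collation problem in the locale data from a problem in uniq itself: if only the second count differs, the locale's comparison is what merges the lines.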

This bug report was last modified 5 years and 90 days ago.

