#38627 - uniq -c gets wrong count with non-ascii strings

GNU bug report logs - #38627
uniq -c gets wrong count with non-ascii strings

Reported by: Roy Smith <roy <at> panix.com>

Date: Sun, 15 Dec 2019 19:41:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

Message #28 received at 38627-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Roy Smith <roy <at> panix.com>
Cc: 38627-done <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Sun, 23 Feb 2020 19:43:27 +0000


On 17/12/2019 17:25, Roy Smith wrote:
> I stopped short of actually building uniq.c from source (bootstrap,
> prerequisites, ...), but looking at the code, it looks like the call
> chain is:
>
> different()
> xmemcoll()
> memcoll()
> strcoll()
>
> so I tried a little test at the strcoll() level:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
>
> int
> main (int argc, char **argv)
> {
>   unsigned char null[] = {
>     0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
>   };
>   unsigned char iraq[] = {
>     0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0};
>
>   printf("%s\n", null);
>   printf("%s\n", iraq);
>
>   int m = strcoll(null, iraq);
>   printf("m = %d\n", m);
> }
>
> That correctly says the strings are different:
>
> $ LANG=en_US.UTF-8 ./a.out
> ⁿᵘˡˡ
> ܥܝܪܐܩ
> m = 6
>
>> On Dec 16, 2019, at 7:46 PM, Roy Smith <roy <at> panix.com> wrote:
>>
>> Yup, this does depend on the locale. In my original example, I had
>> LANG=en_US.UTF-8. Setting it to C.UTF-8 gets me the right result:
>>
>>> $ LANG=C.UTF-8 uniq -c x
>>> 1 "ⁿᵘˡˡ"
>>> 1 "ܥܝܪܐܩ"
>>
>> But, that doesn't fully explain what's going on. I find it difficult
>> to believe that there's any collation sequence in the world where
>> those two strings should compare the same. I've been playing around
>> with the ICU string compare demo
>> <http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col>
>> and can't reproduce this there. Possibly I just haven't hit upon the
>> right combination of options to set, but I think it's far-fetched
>> that there's any such combination for which those two strings
>> comparing equal is legitimate.

I think you ran your test on a newer glibc.
Testing on older glibc-2.22 I see the issue with strcoll() returning 0
for the above strings, while it returns an expected difference
on glibc-2.30 at least.
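One caveat when reproducing the quoted test: a C program starts in the "C" locale, and LANG only takes effect after a setlocale() call, which the quoted program does not make. Its m = 6 equals 0342 - 0334, which is consistent with a plain byte comparison rather than locale-aware collation. The sketch below is a variant that selects the collation locale explicitly. The helper names collate_in and report are made up for illustration, and the result for en_US.UTF-8 depends on the installed glibc and locale data, so no output is shown.

```c
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* The two strings from the report, as UTF-8 byte sequences
   (same bytes as the octal arrays in the quoted test).  */
static const char superscript_null[] =
  "\342\201\277\341\265\230\313\241\313\241";
static const char syriac_iraq[] =
  "\334\245\334\235\334\252\334\220\334\251";

/* Hypothetical helper: switch LC_COLLATE to LOCALE_NAME, then store
   strcoll (A, B) in *RESULT.  Returns false if the locale could not
   be selected (for example because it is not installed).  */
static bool
collate_in (const char *locale_name, const char *a, const char *b,
            int *result)
{
  if (setlocale (LC_COLLATE, locale_name) == NULL)
    return false;
  *result = strcoll (a, b);
  return true;
}

/* Hypothetical helper: print the comparison result for one locale.  */
static void
report (const char *locale_name)
{
  int m;
  if (collate_in (locale_name, superscript_null, syriac_iraq, &m))
    printf ("%s: strcoll = %d\n", locale_name, m);
  else
    printf ("%s: locale not available\n", locale_name);
}
```

Calling report ("C") and report ("en_US.UTF-8") from main() prints one line per locale, which makes it easier to see whether a given system's locale data treats the two strings as equal.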

There are a few things to reason about with removing strcoll(), namely:

  buggy strcoll implementations
  inconsistent unicode normalization
  mismatched locale settings and data
  handling of characters ignored in collation order

tl;dr is that strcoll() should be removed for all these reasons,
and I've added a test for each of the 4 cases above in the attached patch,
which I'll push later.

Marking this as done.

thanks,
Pádraig
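
A minimal sketch of what comparing lines without strcoll() could look like, assuming plain byte-wise equality is the intended replacement. This is not the attached patch, and the name lines_differ_bytewise is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the function from the actual patch.
   Two lines are treated as different unless they have the same length
   and identical bytes.  No locale data is consulted, so the result
   cannot be affected by collation tables or normalization rules.
   Explicit lengths are used so that lines containing NUL bytes are
   compared in full.  */
static bool
lines_differ_bytewise (const char *a, size_t alen,
                       const char *b, size_t blen)
{
  return alen != blen || memcmp (a, b, alen) != 0;
}
```

With the two 10-byte strings from the report, this returns true whatever LANG is set to, because the first bytes (0342 and 0334) already differ.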

