GNU bug report logs - #38627
uniq -c gets wrong count with non-ascii strings
Reported by: Roy Smith <roy <at> panix.com>
Date: Sun, 15 Dec 2019 19:41:01 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >> 2 "ⁿᵘˡˡ"
>
> Thanks for the bug report. I expect this is because GNU 'uniq' uses the
> equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
> macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
> lines compare equal in your locale, GNU 'uniq' says there's just one line.
>
> The GNU 'uniq' behavior appears to be a consequence of this commit:
>
> commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
> Author: Jim Meyering <jim <at> meyering.net>
> Date: Fri Aug 2 14:42:37 2002 +0000
>
> with a change noted this way in NEWS:
>
> * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.
>
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
> and I expect this means that the 2002 commit should be reverted so that GNU
> 'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense anyway).
>
> I'll CC: this email to Jim Meyering to see whether he has an opinion about this.
>
> In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
> of plain 'uniq' in your shell script.
Thanks for the report, Roy, and thanks Paul for diving in.
I confess I haven't done more than look at that old diff, but this
sure sounds like a bug we must fix, to get in line with the much
more recent POSIX spec.
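
For readers following the thread, the difference Paul describes between a strcoll-style and a strcmp-style comparison can be sketched in Python. This is an illustration only, not GNU or macOS source code: count_adjacent, bytes_equal and collate_equal are hypothetical names, and the en_US.UTF-8 locale is an assumption, since the thread does not say which locale Roy was using.

```python
import locale
from typing import Callable, Iterable, List, Tuple


def count_adjacent(
    lines: Iterable[str], same: Callable[[str, str], bool]
) -> List[Tuple[int, str]]:
    """Collapse adjacent lines that `same` treats as equal, roughly like `uniq -c`."""
    groups: List[Tuple[int, str]] = []
    for line in lines:
        if groups and same(groups[-1][1], line):
            # Same group as the previous line: bump the count, keep the first line.
            count, first = groups[-1]
            groups[-1] = (count + 1, first)
        else:
            groups.append((1, line))
    return groups


def bytes_equal(a: str, b: str) -> bool:
    """strcmp-like test: equal only if the encoded bytes are identical."""
    return a.encode("utf-8") == b.encode("utf-8")


def collate_equal(a: str, b: str) -> bool:
    """strcoll-like test: equal if the current LC_COLLATE locale ranks them the same."""
    return locale.strcoll(a, b) == 0


if __name__ == "__main__":
    lines = ['"ⁿᵘˡˡ"', '"ܥܝܪܐܩ"']

    # Byte comparison always keeps the two lines apart.
    print(count_adjacent(lines, bytes_equal))

    # Locale comparison: the result depends on the C library's collation data.
    # "en_US.UTF-8" is an assumed locale name and may not be installed.
    try:
        locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
        print(count_adjacent(lines, collate_equal))
    except locale.Error:
        print("assumed locale en_US.UTF-8 is not available on this system")
```

Whether the locale-based call merges the two lines into one group depends on the collation tables of the installed C library. The byte-based call reports two groups of one line each regardless of locale, which is the behavior the 'LC_ALL=C uniq' workaround gives.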
This bug report was last modified 5 years and 90 days ago.