GNU bug report logs - #38627
uniq -c gets wrong count with non-ascii strings

Package: coreutils

Reported by: Roy Smith <roy <at> panix.com>

Date: Sun, 15 Dec 2019 19:41:01 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

Message #17 received at 38627 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Roy Smith <roy <at> panix.com>, 38627 <at> debbugs.gnu.org
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Tue, 17 Dec 2019 15:10:33 -0800

On Mon, Dec 16, 2019 at 1:41 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 12/15/19 11:40 AM, Roy Smith wrote:
> > With the following input:
> >
> >> $ cat x
> >> "ⁿᵘˡˡ"
> >> "ܥܝܪܐܩ"
> >
> >
> > Running "uniq -c" says there's two copies of the same line!
> >
> >> $ uniq -c x
> >>       2 "ⁿᵘˡˡ"
>
> Thanks for the bug report. I expect this is because GNU 'uniq' uses the
> equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
> macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
> lines compare equal in your locale, GNU 'uniq' says there's just one line.
>
> The GNU 'uniq' behavior appears to be a consequence of this commit:
>
> commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
> Author: Jim Meyering <jim <at> meyering.net>
> Date:   Fri Aug 2 14:42:37 2002 +0000
>
> with a change noted this way in NEWS:
>
> * uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.
>
> However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
> and I expect this means that the 2002 commit should be reverted so that GNU
> 'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense anyway).
>
> I'll CC: this email to Jim Meyering to see whether he has an opinion about this.
>
> In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
> of plain 'uniq' in your shell script.

Thanks for the report, Roy, and thanks Paul for diving in.
I confess I haven't done more than look at that old diff, but this
sure sounds like a bug we must fix, to get in line with the much
more recent POSIX spec.
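
As a rough illustration of the workaround Paul suggests above, the
following shell sketch recreates the two-line input and runs 'uniq -c'
twice: once in the ambient locale and once with LC_ALL=C. The scratch
file created with mktemp is an assumption made for this example (the
report used a file named x), and whether the first command merges the
two lines depends on the locale data installed on the system.

```shell
#!/bin/sh
# Recreate the two-line input from the report in a scratch file
# (hypothetical path; the original report used a file named "x").
input=$(mktemp)
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > "$input"

# Ambient locale: on affected systems the two lines may collate as
# equal, in which case uniq reports a single line with a count of 2.
uniq -c "$input"

# C locale: lines are compared byte by byte, so the two distinct
# lines are counted separately.
LC_ALL=C uniq -c "$input"
```

With LC_ALL=C the comparison is byte-wise, so each of the two lines
should be reported separately with its own count.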




This bug report was last modified 5 years and 90 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.