GNU bug report logs - #40226
sort: expected sort order when -c in use

Package: coreutils

Reported by: Richard Ipsum <richardipsum <at> vx21.xyz>

Date: Wed, 25 Mar 2020 17:55:02 UTC

Severity: normal


Message #8 received at 40226 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Richard Ipsum <richardipsum <at> vx21.xyz>, 40226 <at> debbugs.gnu.org
Subject: Re: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 13:17:19 -0500

On 3/25/20 12:37 PM, Richard Ipsum wrote:
> Hi,
> 
> I'm trying to understand something and thought it would be good to ask
> here.
> 
> I get different results for a case-insensitive sort using -c. My
> understanding is that -f should lead to lower case characters with upper
> case equivalents being converted to their upper case equivalents. This
> doesn't seem to be happening for the C locale though.
> 
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_GB.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=en_US.UTF-8 sort -c -f -
> % echo -e "aaaa\nAAAA" | LC_COLLATE=C sort -c -f -
> sort: -:2: disorder: AAAA

First, 'echo -e' is not portable, so I'll be reproducing your example 
with printf.  And you are assuming that LC_ALL is not set (otherwise, 
LC_COLLATE would have no impact); so I'll set LC_ALL to be sure.  Except 
that I can't reproduce your example (I'm using Fedora 31, coreutils 8.31):

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f -
sort: -:2: disorder: AAAA

So there's probably something different in the locale libraries and/or 
your coreutils version on your system, compared to mine.
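
The remark above about LC_ALL masking LC_COLLATE follows the POSIX
precedence for locale variables: LC_ALL first, then the category variable,
then LANG.  A minimal sketch of that lookup is below; the function name
effective_collate is hypothetical, and the final fallback to "C" is an
assumption (POSIX leaves the default implementation-defined).

```shell
# Print the locale name that would govern collation, following the POSIX
# precedence LC_ALL > LC_COLLATE > LANG.  An empty value counts as unset.
# effective_collate is a hypothetical helper, not part of coreutils.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: fall back to the C locale when nothing is set.
        printf 'C\n'
    fi
}
```

If this prints something other than the value given in LC_COLLATE, then
LC_ALL is set and the LC_COLLATE assignment on the command line has no
effect on sort.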

Next, let's debug things to see why:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_US.UTF-8 sort -c -f - --debug
sort: options '-c --debug' are incompatible

Oh, bummer - I don't know why we have that restriction.  Okay, let's try 
a slightly different approach:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
AAAA
____
____
aaaa
____
____
$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - --debug -s
sort: text ordering performed using ‘en_GB.UTF-8’ sorting rules
aaaa
____
AAAA
____

See the difference?  In the first case, sort is doing its default
case-insensitive comparison of the entire line (because you passed -f
but not -k), AND a last-resort comparison of the entire line with no
ordering options applied (as shown by the two ____ lines per input).
But in the second case, when you add -s, the last-resort comparison is
omitted.  The two lines do differ under the last-resort comparison,
which explains why -c choked when -s is absent.  Or put another way, -f
affects only key comparisons, including the implied whole-line key when
you don't specify -k, and not the last-resort comparison that -s
disables.  So now that we know that, let's return to your
example:

$ printf 'aaaa\nAAAA\n' | LC_ALL=en_GB.UTF-8 sort -f - -c -s
$ echo $?
0
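
The same effect can be checked without depending on which UTF-8 locales
are installed.  The following is a minimal sketch, assuming GNU sort and
the C locale, where -f folds case for the key comparison and the
last-resort comparison is a plain byte comparison.  The variable names are
illustrative only.

```shell
# Check the same two lines with and without -s, recording sort's exit status.
# Assumes GNU sort; LC_ALL=C is used so the result does not depend on the
# locales installed on the system.
without_s=0
printf 'aaaa\nAAAA\n' | LC_ALL=C sort -c -f - 2>/dev/null || without_s=$?

with_s=0
printf 'aaaa\nAAAA\n' | LC_ALL=C sort -c -f -s - 2>/dev/null || with_s=$?

# Without -s the last-resort comparison sees 'a' after 'A' and -c fails;
# with -s only the case-folded key is compared, so the input counts as sorted.
echo "without -s: $without_s, with -s: $with_s"
```

With GNU coreutils, the status without -s should be non-zero (disorder
reported at line 2) and the status with -s should be 0.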


> 
> Is this considered a bug or an expected difference between the locales?

I don't know if it's the locale definition, or something changed between 
coreutils versions, or both; although I'm more likely to chalk it up to 
locale issues and not something where coreutils needs a patch, other 
than perhaps a documentation patch.  I'll leave the bug report itself 
open for a bit longer, in case anyone else has an opinion.
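
To narrow down whether the locale data or the coreutils version differs
between two systems, it may help to compare what each one reports.  The
sketch below is one possible way to do that; normalise_locale and
locale_listed are hypothetical helper names, and the normalisation (lower
case, hyphens removed) is an assumption based on glibc listing
"en_GB.utf8" for "en_GB.UTF-8".

```shell
# Normalise a locale name so that "en_GB.UTF-8" and "en_GB.utf8" compare equal.
normalise_locale() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-'
}

# Succeed if the locale named in $1 appears in a `locale -a` style list
# supplied on standard input.
locale_listed() {
    wanted=$(normalise_locale "$1")
    tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -Fqx -- "$wanted"
}

# Usage on a system that provides the `locale` command and GNU sort:
#   locale -a | locale_listed en_GB.UTF-8 && echo "en_GB.UTF-8 is installed"
#   sort --version | head -n 1
```

Running the two usage lines on both machines gives the locale
availability and the sort version to include in a follow-up.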

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





This bug report was last modified 5 years and 170 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.