GNU bug report logs - #40226
sort: expected sort order when -c in use

Package: coreutils

Reported by: Richard Ipsum <richardipsum <at> vx21.xyz>

Date: Wed, 25 Mar 2020 17:55:02 UTC

Severity: normal


Message #14 received at 40226 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Richard Ipsum <richardipsum <at> vx21.xyz>
Cc: 40226 <at> debbugs.gnu.org
Subject: Re: bug#40226: sort: expected sort order when -c in use
Date: Wed, 25 Mar 2020 16:35:47 -0500

On 3/25/20 3:02 PM, Richard Ipsum wrote:
> On Wed, Mar 25, 2020 at 01:17:19PM -0500, Eric Blake wrote:
>> On 3/25/20 12:37 PM, Richard Ipsum wrote:
> [snip]
>>
>> See the difference?  In the first case, sort is doing its default
>> case-insensitive comparison of the entire line (because you passed -f but
>> not -k), AND a stability comparison of the byte values of the entire line
>> (as shown by the two ____ lines per input).  But in the second case, when
>> you add -s, the stability comparison is omitted.  The two lines are indeed
>> different when the stability comparison is performed, explaining why -c
>> choked when -s is absent.  Or put another way, -f affects only -k, including
>> the implied -k1 when you don't specify anything, and not -s.  So now that we
>> know that, let's return to your example:
> 
> I'm trying to understand this relative to POSIX, which makes no mention
> of stability as far as I can see (and there is no -s in POSIX). POSIX
> says that -f should override the default ordering rules. I don't
> understand why the last-resort comparison is required when -c is in use,
> since we're not sorting with -c, just checking if the input is already sorted?

POSIX states [sort description]:

"If this collating sequence does not have a total ordering of all 
characters (see XBD LC_COLLATE), any lines of input that collate equally 
should be further compared byte-by-byte using the collating sequence for 
the POSIX locale."

As I understand it, this is true even when -f modifies the collating
sequence to compare all lowercase characters as their uppercase equivalents.
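
To make the rule above concrete, here is a minimal Python sketch of a
case-folded comparison followed by a byte-by-byte last-resort comparison.
It is only a model of the behavior described in this thread, not GNU
sort's actual code: the names compare() and is_sorted() are made up for
illustration, and it assumes ASCII input compared as in the POSIX locale.

```python
def compare(a: bytes, b: bytes, fold: bool = False, stable: bool = False) -> int:
    """Return <0, 0 or >0: key comparison plus optional last-resort fallback."""
    # fold=True models -f: compare lowercase as the uppercase equivalent.
    ka, kb = (a.upper(), b.upper()) if fold else (a, b)
    diff = (ka > kb) - (ka < kb)
    if diff == 0 and not stable:
        # Last-resort comparison on the raw bytes (skipped when stable=True,
        # which models -s).
        diff = (a > b) - (a < b)
    return diff


def is_sorted(lines, fold: bool = False, stable: bool = False) -> bool:
    """Model of a -c style check: every adjacent pair must compare <= 0."""
    return all(compare(x, y, fold, stable) <= 0 for x, y in zip(lines, lines[1:]))


lines = [b"a", b"A"]
print(is_sorted(lines, fold=True))               # False: the fallback sees "a" > "A"
print(is_sorted(lines, fold=True, stable=True))  # True: fallback skipped
```

In this model the two lines tie under folding, so the outcome of the check
depends entirely on whether the last-resort comparison runs.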

But POSIX further states [XBD LC_COLLATE]:

"All implementation-provided locales (either preinstalled or provided as 
locale definitions which can be installed later) should define a 
collation sequence that has a total ordering of all characters unless 
the locale name has an '@' modifier indicating that it has a special 
collation sequence (for example, @icase could indicate that each upper 
and lowercase character pair collates equally).

Notes:

    A future version of this standard may require these locales to
    define a collation sequence that has a total ordering of all
    characters (by changing "should" to "shall").

    Users installing their own locales should ensure that they define
    a collation sequence with a total ordering of all characters unless
    an '@' modifier in the locale name (such as @icase) indicates that
    it has a special collation sequence."

> 
> Put another way, should -c imply -s?

Maybe we compromise and state that -c implies -s only for locales that
do not include @ in their name. That is, if a locale already guarantees
a total ordering of all characters, then even when -f collapses
lowercase into uppercase we don't need the last-resort comparison; but
if a locale does not guarantee a total ordering, -s has to be given
explicitly.
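
As a rough sketch of that compromise (purely hypothetical, nothing like
this exists in coreutils), the decision could be expressed as a check on
the locale name. The function name check_implies_stable() and the sample
locale names are assumptions made for illustration only.

```python
def check_implies_stable(locale_name: str) -> bool:
    """Hypothetical rule: -c would imply -s unless the locale name has an
    '@' modifier, which may signal a special (non-total) collation."""
    return "@" not in locale_name


# Sample names; "en_US.UTF-8@icase" is an invented example of an '@' locale.
for name in ("C", "en_US.UTF-8", "en_US.UTF-8@icase"):
    print(name, check_implies_stable(name))
```

Under this rule a locale with an '@' modifier would still get the
last-resort comparison during -c unless -s is passed explicitly.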

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





This bug report was last modified 5 years and 170 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.