GNU bug report logs - #76290
"sort -u" vs "sort -h -u": possible bug


Package: coreutils;

Reported by: Rupert Gallagher <ruga <at> protonmail.com>

Date: Fri, 14 Feb 2025 17:01:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Report forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Fri, 14 Feb 2025 17:01:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Rupert Gallagher <ruga <at> protonmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 14 Feb 2025 17:01:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Rupert Gallagher <ruga <at> protonmail.com>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: "sort -u" vs "sort -h -u": possible bug
Date: Fri, 14 Feb 2025 15:44:34 +0000
[Message part 1 (text/plain, inline)]
>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787"
CVE-2018-13787 <---
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
CVE-2018-13787 <---

>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort
CVE-2018-13787 <---
CVE-2018-13787 <---
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046

>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -u
CVE-2018-13787 <---
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046

Problem:

>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -h -u
CVE-2018-13787

Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Sun, 16 Feb 2025 06:25:02 GMT) Full text and rfc822 format available.

Message #8 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rupert Gallagher <ruga <at> protonmail.com>
Cc: 76290 <at> debbugs.gnu.org
Subject: Re: "sort -u" vs "sort -h -u": possible bug
Date: Sat, 15 Feb 2025 22:23:59 -0800
I don't see a bug there, just an infelicity. -h means 'sort' should look 
for a number, and your data lines don't start with numbers.

Try 'sort --debug -h -u' to see more.




Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Sun, 16 Feb 2025 15:46:01 GMT) Full text and rfc822 format available.

Message #11 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: Rupert Gallagher <ruga <at> protonmail.com>
To: "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>
Cc: "76290 <at> debbugs.gnu.org" <76290 <at> debbugs.gnu.org>
Subject: Re: "sort -u" vs "sort -h -u": possible bug
Date: Sun, 16 Feb 2025 11:02:37 +0000
My concern is best described as follows.

~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -h
CVE-2018-13787
CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046

~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -h -u
CVE-2018-13787

The introduction of the unique operator (-u) returns a wrong answer when used with the human sorting operator (-h).

Note the problem does not occur when the human sorting operator is not used.

~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort
CVE-2018-13787
CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046

~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -u
CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046

The example suggests the existence of a programming error between the output of -h and the input of -u.






Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Sun, 16 Feb 2025 17:27:02 GMT) Full text and rfc822 format available.

Message #14 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: "Philip Rowlands" <phr+coreutils <at> dimebar.com>
To: "Paul Eggert" <eggert <at> cs.ucla.edu>,
 "Rupert Gallagher" <ruga <at> protonmail.com>
Cc: 76290 <at> debbugs.gnu.org
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Sun, 16 Feb 2025 17:23:18 +0000
On Sun, 16 Feb 2025, at 06:23, Paul Eggert wrote:
> I don't see a bug there, just an infelicity. -h means 'sort' should look 
> for a number, and your data lines don't start with numbers.
>
> Try 'sort --debug -h -u' to see more.

The --debug output here isn't as helpful as it could be. Taking a simplified example:

$ echo -e 'CVE-222\nCVE-111\nCVE-222' | sort -h -u --debug
sort: text ordering performed using simple byte comparison
sort: note numbers use '.' as a decimal point in this locale
CVE-222
^ no match for key

$ echo $'bbb\naaa' | sort -n -u --debug
sort: text ordering performed using simple byte comparison
sort: note numbers use '.' as a decimal point in this locale
bbb
^ no match for key


Due to the diligent work by maintainers, there are very few genuine bugs in sort, so we can assume --debug users need as much help as possible figuring out where the sort options have gone wrong. How could the "no match for key" output here be clearer? Could --uniq --debug show elided lines with an explanation, especially for entire lines which match nothing?


Cheers,
Phil
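
Until --debug reports elided lines itself, they can be listed by hand. The following is only a minimal sketch, assuming GNU sort and comm and the C locale; the variable `tmpdir` and the file names `input`, `all` and `kept` are hypothetical. It compares the distinct whole lines of the input with the lines that survive 'sort -h -u'.

```shell
# Sketch: list the distinct input lines that 'sort -h -u' leaves out.
# Assumes GNU sort and comm; tmpdir and the file names are hypothetical.
export LC_ALL=C                      # byte ordering, so sort and comm agree
tmpdir=$(mktemp -d)
printf '%s\n' CVE-222 CVE-111 CVE-222 > "$tmpdir/input"

# Distinct lines, judged by whole-line comparison.
sort -u "$tmpdir/input" > "$tmpdir/all"
# Lines that survive -h -u, re-sorted as plain text so comm can read them.
sort -h -u "$tmpdir/input" | sort > "$tmpdir/kept"

# Lines present in "all" but absent from "kept" were elided by -h -u.
dropped=$(comm -23 "$tmpdir/all" "$tmpdir/kept")
printf 'dropped: %s\n' "$dropped"
rm -r "$tmpdir"
```

Any line printed after "dropped:" had a key that compared equal to the key of a line that was kept.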




Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Sun, 16 Feb 2025 22:23:02 GMT) Full text and rfc822 format available.

Message #17 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rupert Gallagher <ruga <at> protonmail.com>
Cc: "76290 <at> debbugs.gnu.org" <76290 <at> debbugs.gnu.org>
Subject: Re: "sort -u" vs "sort -h -u": possible bug
Date: Sun, 16 Feb 2025 14:22:48 -0800
On 2025-02-16 03:02, Rupert Gallagher wrote:
> The introduction of the unique operator (-u) returns a wrong answer when used with the human sorting operator (-h).

The answer is "wrong" only in the sense that sort's documented and 
implemented behavior is not what you expect.

To fix this mismatch between behavior and expectations, don't use -h. It 
makes sense not to use -h, because -h is not intended for uses like that.
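
If numeric ordering of the identifiers is wanted, one possible alternative to -h is to name the numeric fields as keys. This is only a sketch, assuming GNU sort and that every line has the form CVE-YEAR-NUMBER; the variable `cves` and the extra identifier CVE-2019-9999 are made up for illustration.

```shell
# Sketch: order CVE identifiers by year, then by sequence number, and
# drop duplicates. Assumes every line looks like CVE-YEAR-NUMBER.
# With -u, two lines are equal only if both keys (year and number) match.
cves=$(printf '%s\n' CVE-2019-9999 CVE-2018-13787 CVE-2019-16649 CVE-2018-13787 |
  LC_ALL=C sort -u -t '-' -k 2,2n -k 3,3n)
printf '%s\n' "$cves"
# prints:
#   CVE-2018-13787
#   CVE-2019-9999
#   CVE-2019-16649
```

Plain 'sort -u' would also keep every distinct identifier, but it would place CVE-2019-16649 before CVE-2019-9999 because it compares the digits as text.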




Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Mon, 17 Feb 2025 23:32:02 GMT) Full text and rfc822 format available.

Message #20 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rupert Gallagher <ruga <at> protonmail.com>
Cc: "76290 <at> debbugs.gnu.org" <76290 <at> debbugs.gnu.org>
Subject: Re: "sort -u" vs "sort -h -u": possible bug
Date: Mon, 17 Feb 2025 15:31:10 -0800
On 2025-02-17 15:13, Rupert Gallagher wrote:
> ~ $ echo -e "a1\na2" | sort
> a1
> a2
> 
> ~ $ echo -e "a1\na2" | sort -h
> a1
> a2
> 
> Since A = B, the result of -u must be the same on both sets, by logic.

By that logic, since the output of these two commands:

  echo -e 'a1\na2' | sort
  echo -e 'a1\na2' | sort -n

are the same, the result of -u must be the same on both sets. But this 
logic is wrong, in the sense that it disagrees with both longstanding 
practice and with the POSIX.1-2024 standard 
<https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html>, 
which say that plain 'sort' uses the entire line as a key whereas 'sort 
-n' uses a leading integer prefix (which in this example is empty so the 
keys compare equal).

I get it that 'sort' doesn't behave the way you expected. But that's a 
mismatch of expectations vs implementation, not a bug in the implementation.
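
The difference can be checked directly. A minimal sketch, assuming GNU sort and the C locale; the variable names `with_u` and `with_uniq` are hypothetical.

```shell
# Sketch (assumes GNU sort): neither 'a1' nor 'a2' starts with a number,
# so under -n both lines have the same (empty) numeric key.
with_u=$(printf '%s\n' a1 a2 | LC_ALL=C sort -n -u)        # equality judged on the key
with_uniq=$(printf '%s\n' a1 a2 | LC_ALL=C sort -n | uniq)  # equality judged on the whole line
printf '%s\n' "sort -n -u:" "$with_u" "sort -n | uniq:" "$with_uniq"
# prints:
#   sort -n -u:
#   a1
#   sort -n | uniq:
#   a1
#   a2
```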





Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Tue, 18 Feb 2025 04:16:05 GMT) Full text and rfc822 format available.

Message #23 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: Rupert Gallagher <ruga <at> protonmail.com>
To: "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>
Cc: "76290 <at> debbugs.gnu.org" <76290 <at> debbugs.gnu.org>
Subject: Re: "sort -u" vs "sort -h -u": possible bug
Date: Mon, 17 Feb 2025 23:13:39 +0000
No, I expect the program to do exactly what the manual says. 

   -h, --human-numeric-sort
      compare human readable numbers (e.g., 2K 1G)

Applying -h to the list in my example is expected to be semantically equivalent to not applying -h:

A = { echo -e "a1\na2" | sort }
B = { echo -e "a1\na2" | sort -h }

~ $ echo -e "a1\na2" | sort
a1
a2

~ $ echo -e "a1\na2" | sort -h
a1
a2

Since A = B, the result of -u must be the same on both sets, by logic. The program, however, has a mind of its own.

~ $ echo -e "a1\na2" | sort -u
a1
a2

~ $ echo -e "a1\na2" | sort -h -u
a1








Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Tue, 18 Feb 2025 06:26:02 GMT) Full text and rfc822 format available.

Notification sent to Rupert Gallagher <ruga <at> protonmail.com>:
bug acknowledged by developer. (Tue, 18 Feb 2025 06:26:02 GMT) Full text and rfc822 format available.

Message #28 received at 76290-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rupert Gallagher <ruga <at> protonmail.com>
Cc: 76290-done <at> debbugs.gnu.org
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Mon, 17 Feb 2025 22:25:12 -0800
On 2025-02-17 15:13, Rupert Gallagher via GNU coreutils Bug Reports wrote:
> I expect the program to do exactly what the manual says.

Here's what the manual says about -u in 
<https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html#index-uniquifying-output>:

>     Normally, output only the first of a sequence of lines that compare equal....
> 
>     This option also disables the default last-resort comparison.
> 
>     The commands sort -u and sort | uniq are equivalent, but this equivalence does not extend to arbitrary sort options. For example, sort -n -u inspects only the value of the initial numeric string when checking for uniqueness, whereas sort -n | uniq inspects the entire line.

This is the part of the manual that you're disagreeing with. The example 
in my previous email (an example that you did not reply to) is a 
demonstration of this part of the manual.

I am taking the liberty of closing this bug report, as "sort" is 
behaving as documented here.
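
The "last-resort comparison" mentioned in the manual excerpt can be made visible with -s, which GNU sort documents as disabling that comparison. The following is a minimal sketch, assuming GNU sort and the C locale; the variable names are hypothetical.

```shell
# Sketch (assumes GNU sort): both lines have the same -h key, because
# neither starts with a number.
# Without -s, the tie is broken by comparing the whole lines as text.
default_order=$(printf '%s\n' a2 a1 | LC_ALL=C sort -h)
# With -s, the tie is not broken, so the input order is kept.
stable_order=$(printf '%s\n' a2 a1 | LC_ALL=C sort -h -s)
printf '%s\n' "sort -h:" "$default_order" "sort -h -s:" "$stable_order"
# prints:
#   sort -h:
#   a1
#   a2
#   sort -h -s:
#   a2
#   a1
```

So the output of 'sort -h' matches that of plain 'sort' on this input only because of the last-resort comparison, and -u turns that comparison off.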





Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Tue, 18 Feb 2025 08:20:02 GMT) Full text and rfc822 format available.

Message #31 received at 76290-done <at> debbugs.gnu.org (full text, mbox):

From: Rupert Gallagher <ruga <at> protonmail.com>
To: "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>
Cc: "76290-done <at> debbugs.gnu.org" <76290-done <at> debbugs.gnu.org>
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Tue, 18 Feb 2025 08:19:25 +0000
According to GNU sort -h -u and what you claim to be common practice, a list of possibly redundant strings, some beginning with a number, is reduced to an ordered set of the numbered strings only.

Since I expect the resulting ordered set to include the original elements, I will then stop using gnu sort to avoid data loss.






Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Tue, 18 Feb 2025 18:47:02 GMT) Full text and rfc822 format available.

Message #34 received at 76290-done <at> debbugs.gnu.org (full text, mbox):

From: Rupert Gallagher <ruga <at> protonmail.com>
To: "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>
Cc: "76290-done <at> debbugs.gnu.org" <76290-done <at> debbugs.gnu.org>
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Tue, 18 Feb 2025 18:45:57 +0000
Dictionary sort corresponds to the intended behaviour.

> echo -e "abc\n123\n456\nCVE-2011-234\nAbc\ndef\nCVE-2024-123" | sort --debug -dfu

123
___
456
___
abc
___
CVE-2011-234
____________
CVE-2024-123
____________
def
___


By comparison, human (-h) and numeric (-n) sort cause data loss:

> echo -e "abc\n123\n456\nCVE-2011-234\nAbc\ndef\nCVE-2024-123" | sort --debug -hu
sort: note numbers use ‘.’ as a decimal point in this locale
abc
^ no match for key
123
___
456
___

If I were the author of GNU sort, I would delete the -h and -n options.




Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Wed, 19 Feb 2025 17:16:02 GMT) Full text and rfc822 format available.

Message #37 received at 76290-done <at> debbugs.gnu.org (full text, mbox):

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Rupert Gallagher <ruga <at> protonmail.com>,
 "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>
Cc: "76290-done <at> debbugs.gnu.org" <76290-done <at> debbugs.gnu.org>
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Wed, 19 Feb 2025 18:14:13 +0100

On 2/18/25 7:45 PM, Rupert Gallagher via GNU coreutils Bug Reports wrote:
> By comparison, human (-h) and numeric (-n) sort cause data loss:

not really.  That's the difference between
a)
  "I have a list containing numbers; I merely care about numbers and want to get a unique, sorted list of them."
  ('sort -h -u')

and
b)
  "I have a list containing numbers; I want to have it sorted by numbers, and then throw away duplicates."
  ('sort -h | uniq')

The point is: in case a), the numerical value of each non-number entry is Zero.

Consider the following:

  $ printf "%s\n" 0 1 X-1 Ab2 3 ma | LC_ALL=C sort -nu
  0
  1
  3

Here, the entries 0, "X-1", "Ab2" and "ma" all have the numerical value 0.
That's why the first Zero is output.

Now let's remove the literal/numerical 0 from the input:

  $ printf "%s\n"  1 X-1 Ab2 3 ma | LC_ALL=C sort -nu
  X-1
  1
  3

Now, the first entry which represents numerically 0 is "X-1".
Now let's even put the 0 back into the input, but at the end:

  $ printf "%s\n"  1 X-1 Ab2 3 ma 0 | LC_ALL=C sort -nu
  X-1
  1
  3

Still, sort(1) outputs the first entry which has a numerical value of Zero: "X-1".

Have a nice day,
Berny
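
Cases a) and b) can be compared side by side on similar data. This is a minimal sketch, assuming GNU sort and the C locale; the variable names are hypothetical, and the third command (a numeric key followed by a whole-line text key) is one possible way to get the result of case b) from a single sort invocation, not something proposed in this thread.

```shell
# Sketch (assumes GNU sort): same data as above, with a duplicate "1" appended.
input=$(printf '%s\n' 1 X-1 Ab2 3 ma 0 1)

# Case a): uniqueness judged on the numeric value only.
case_a=$(printf '%s\n' "$input" | LC_ALL=C sort -n -u)
# Case b): sort numerically, then drop adjacent identical lines.
case_b=$(printf '%s\n' "$input" | LC_ALL=C sort -n | uniq)
# One command: numeric key first, then the whole line as a text key,
# so -u treats two lines as equal only when they are identical.
two_keys=$(printf '%s\n' "$input" | LC_ALL=C sort -u -k 1,1n -k 1)

printf '%s\n' "case a:" "$case_a" "case b:" "$case_b"
# prints:
#   case a:
#   X-1
#   1
#   3
#   case b:
#   0
#   Ab2
#   X-1
#   ma
#   1
#   3
```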





Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Wed, 19 Feb 2025 19:16:01 GMT) Full text and rfc822 format available.

Message #40 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Rainer Canavan <coreutils <at> canavan.de>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Wed, 19 Feb 2025 20:15:17 +0100
On 19.02.25 18:14, Bernhard Voelker wrote:
> On 2/18/25 7:45 PM, Rupert Gallagher via GNU coreutils Bug Reports wrote:
>> By comparison, human (-h) and numeric (-n) sort cause data loss:
>
> not really.  That's the difference between
> a)
>   "I have a list containing numbers; I merely care about numbers and want to get a unique, sorted list of them."
>   ('sort -h -u')
>
> and
> b)
>   "I have a list containing numbers; I want to have it sorted by numbers, and then throw away duplicates."
>   ('sort -h | uniq')
>
> The point is: in case a), the numerical value of each non-number entry is Zero.


I have no issue with the way 'sort -u' is currently working, but the man 
page isn't clear at all about the fact that 'sort -h -u' and 'sort -h | 
uniq' behave differently.

Specifically, the explanation for -u

-u, --unique
             with -c, check for strict ordering; without -c, output 
only the first of an equal run

does not provide any explanation what 'equal' or 'run' may mean. Maybe 
add something like "where equality is assessed only based on the keys 
and rules used to sort the output".


Rainer
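
The same rule, that equality is judged only on the sort keys, also applies to field keys given with -k. A minimal sketch, assuming GNU sort and the C locale; the sample lines and variable names are made up for illustration.

```shell
# Sketch (assumes GNU sort): with -k 1,1 only the first field is the key,
# so 'a 1' and 'a 2' form one "equal run" and only one of them is output.
kept=$(printf '%s\n' 'a 1' 'a 2' 'b 3' | LC_ALL=C sort -u -k 1,1)
printf '%s\n' "$kept"

# The first fields of the surviving lines are all distinct.
first_fields=$(printf '%s\n' "$kept" | cut -d ' ' -f 1)
printf '%s\n' "$first_fields"
# prints:
#   a
#   b
```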





Information forwarded to bug-coreutils <at> gnu.org:
bug#76290; Package coreutils. (Wed, 19 Feb 2025 21:22:02 GMT) Full text and rfc822 format available.

Message #43 received at 76290 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rainer Canavan <coreutils <at> canavan.de>
Cc: 76290 <at> debbugs.gnu.org
Subject: Re: bug#76290: "sort -u" vs "sort -h -u": possible bug
Date: Wed, 19 Feb 2025 13:21:14 -0800
[Message part 1 (text/plain, inline)]
On 2/19/25 11:15, Rainer Canavan wrote:
> -u, --unique
>               with -c, check for strict ordering; without -c, output 
> only the first of an equal run
> 
> does not provide any explanation what 'equal' or 'run' may mean. Maybe 
> add something like "where equality is assessed only based on the keys 
> and rules used to sort the output".

Thanks for the suggestion. Although that's a bit long for "sort --help", 
I take the point that equality comparison could be mentioned and 
installed the attached to try to make things clearer without adding so 
much length.
[0001-sort-improve-u-brief-doc.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 20 Mar 2025 11:24:13 GMT) Full text and rfc822 format available.
