GNU bug report logs - #76290
"sort -u" vs "sort -h -u": possible bug
Report forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Fri, 14 Feb 2025 17:01:02 GMT)
Acknowledgement sent to Rupert Gallagher <ruga <at> protonmail.com>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 14 Feb 2025 17:01:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787"
CVE-2018-13787 <---
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
CVE-2018-13787 <---
>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort
CVE-2018-13787 <---
CVE-2018-13787 <---
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -u
CVE-2018-13787 <---
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
Problem:
>echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -h -u
CVE-2018-13787
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Sun, 16 Feb 2025 06:25:02 GMT)
Message #8 received at 76290 <at> debbugs.gnu.org (full text, mbox):
I don't see a bug there, just an infelicity. -h means 'sort' should look
for a number, and your data lines don't start with numbers.
Try 'sort --debug -h -u' to see more.
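What "look for a number" means in practice can be seen with a minimal sketch, assuming GNU sort from coreutils (-h is a GNU extension). The variable names are illustrative only. None of the sample lines starts with a number, so under -h every key is empty and all lines compare equal; -u then keeps a single line for the whole run.

```shell
# Sketch only: assumes GNU sort (coreutils); variable names are illustrative.
input='CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
CVE-2018-13787'

# Default keys are whole lines, so only the exact duplicate is dropped.
by_line=$(printf '%s\n' "$input" | LC_ALL=C sort -u)

# With -h the key is a leading number. No line has one, so all keys
# compare equal and -u emits a single line for the entire input.
by_number=$(printf '%s\n' "$input" | LC_ALL=C sort -h -u)

printf 'sort -u keeps:\n%s\n\n' "$by_line"
printf 'sort -h -u keeps:\n%s\n' "$by_number"
```

Adding --debug to either command, as suggested above, marks the part of each line that sort actually used as the key.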
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Sun, 16 Feb 2025 15:46:01 GMT)
Message #11 received at 76290 <at> debbugs.gnu.org (full text, mbox):
My concern is best described as follows.
~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -h
CVE-2018-13787
CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -h -u
CVE-2018-13787
The introduction of the unique operator (-u) returns a wrong answer when used with the human sorting operator (-h).
Note the problem does not occur when the human sorting operator is not used.
~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort
CVE-2018-13787
CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
~ $ echo -e "CVE-2018-13787\nCVE-2019-16649\nCVE-2019-16650\nCVE-2020-15046\nCVE-2018-13787" | sort -u
CVE-2018-13787
CVE-2019-16649
CVE-2019-16650
CVE-2020-15046
The example suggests the existence of a programming error between the output of -h and the input of -u.
-------- Original Message --------
On 2/16/25 07:23, Paul Eggert wrote:
> I don't see a bug there, just an infelicity. -h means 'sort' should look for a number, and your data lines don't start with numbers.
>
> Try 'sort --debug -h -u' to see more.
>
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Sun, 16 Feb 2025 17:27:02 GMT)
Message #14 received at 76290 <at> debbugs.gnu.org (full text, mbox):
On Sun, 16 Feb 2025, at 06:23, Paul Eggert wrote:
> I don't see a bug there, just an infelicity. -h means 'sort' should look
> for a number, and your data lines don't start with numbers.
>
> Try 'sort --debug -h -u' to see more.
The --debug output here isn't as helpful as it could be; taking a simplified example:
$ echo -e 'CVE-222\nCVE-111\nCVE-222' | sort -h -u --debug
sort: text ordering performed using simple byte comparison
sort: note numbers use '.' as a decimal point in this locale
CVE-222
^ no match for key
$ echo $'bbb\naaa' | sort -n -u --debug
sort: text ordering performed using simple byte comparison
sort: note numbers use '.' as a decimal point in this locale
bbb
^ no match for key
Due to the diligent work by maintainers, there are very few genuine bugs in sort, so we can assume --debug users need as much help as possible figuring out where the sort options have gone wrong. How could the "no match for key" output here be clearer? Could --uniq --debug show elided lines with an explanation, especially for entire lines which match nothing?
Cheers,
Phil
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Sun, 16 Feb 2025 22:23:02 GMT)
Message #17 received at 76290 <at> debbugs.gnu.org (full text, mbox):
On 2025-02-16 03:02, Rupert Gallagher wrote:
> The introduction of the unique operator (-u) returns a wrong answer when used with the human sorting operator (-h).
The answer is "wrong" only in the sense that sort's documented and
implemented behavior is not what you expect.
To fix this mismatch between behavior and expectations, don't use -h. It
makes sense not to use -h, as -h is not intended for uses like that.
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Mon, 17 Feb 2025 23:32:02 GMT)
Message #20 received at 76290 <at> debbugs.gnu.org (full text, mbox):
On 2025-02-17 15:13, Rupert Gallagher wrote:
> ~ $ echo -e "a1\na2" | sort
> a1
> a2
>
> ~ $ echo -e "a1\na2" | sort -h
> a1
> a2
>
> Since A = B, the result of -u must be the same on both sets, by logic.
By that logic, since the output of these two commands:
echo -e 'a1\na2' | sort
echo -e 'a1\na2' | sort -n
are the same, then the result of -u must be the same on both sets. But this
logic is wrong, in the sense that it disagrees with both longstanding
practice and with the POSIX.1-2024 standard
<https://pubs.opengroup.org/onlinepubs/9799919799/utilities/sort.html>,
which say that plain 'sort' uses the entire line as a key whereas 'sort
-n' uses a leading integer prefix (which in this example is empty so the
keys compare equal).
I get it that 'sort' doesn't behave the way you expected. But that's a
mismatch of expectations vs implementation, not a bug in the implementation.
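The difference in keys becomes visible with input whose lines do have numeric prefixes. This is a small sketch, assuming GNU sort; the sample data and variable names are made up for illustration. Lines whose leading integers are equal collapse under -n -u even though the text after the number differs.

```shell
# Sketch only: sample data is hypothetical.
# Keys under -n: "2b" -> 2, "1a" -> 1, "1c" -> 1.
whole_line=$(printf '%s\n' 2b 1a 1c | LC_ALL=C sort -u)
numeric=$(printf '%s\n' 2b 1a 1c | LC_ALL=C sort -n -u)

# With no numeric prefix at all, every key is empty and compares equal.
empty_keys=$(printf '%s\n' a1 a2 | LC_ALL=C sort -n -u)

printf '%s\n' "sort -u:" "$whole_line"
printf '%s\n' "sort -n -u:" "$numeric"
printf '%s\n' "sort -n -u on a1, a2:" "$empty_keys"
```

Here plain -u keeps all three lines, while -n -u keeps one line per distinct leading number.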
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Tue, 18 Feb 2025 04:16:05 GMT)
Message #23 received at 76290 <at> debbugs.gnu.org (full text, mbox):
No, I expect the program to do exactly what the manual says.
       -h, --human-numeric-sort
              compare human readable numbers (e.g., 2K 1G)
Applying -h to the list in my example is expected to be semantically equivalent to not applying -h:
A = { echo -e "a1\na2" | sort }
B = { echo -e "a1\na2" | sort -h }
~ $ echo -e "a1\na2" | sort
a1
a2
~ $ echo -e "a1\na2" | sort -h
a1
a2
Since A = B, the result of -u must be the same on both sets, by logic. The program, however, has a mind of its own.
~ $ echo -e "a1\na2" | sort -u
a1
a2
~ $ echo -e "a1\na2" | sort -h -u
a1
-------- Original Message --------
On 2/16/25 23:22, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 2025-02-16 03:02, Rupert Gallagher wrote:
> > The introduction of the unique operator (-u) returns a wrong answer when used with the human sorting operator (-h).
>
> The answer is "wrong" only in the sense that sort's documented and
> implemented behavior is not what you expect.
>
> To fix this mismatch between behavior and expectations, don't use -h. It
> makes sense not to use -h, as -h is not intended for uses like that.
>
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Tue, 18 Feb 2025 06:26:02 GMT)
Notification sent to Rupert Gallagher <ruga <at> protonmail.com>: bug acknowledged by developer. (Tue, 18 Feb 2025 06:26:02 GMT)
Message #28 received at 76290-done <at> debbugs.gnu.org (full text, mbox):
On 2025-02-17 15:13, Rupert Gallagher via GNU coreutils Bug Reports wrote:
> I expect the program to do exactly what the manual says.
Here's what the manual says about -u in
<https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html#index-uniquifying-output>:
> Normally, output only the first of a sequence of lines that compare equal....
>
> This option also disables the default last-resort comparison.
>
> The commands sort -u and sort | uniq are equivalent, but this equivalence does not extend to arbitrary sort options. For example, sort -n -u inspects only the value of the initial numeric string when checking for uniqueness, whereas sort -n | uniq inspects the entire line.
This is the part of the manual that you're disagreeing with. The example
in my previous email (an example that you did not reply to) is a
demonstration of this part of the manual.
I am taking the liberty of closing this bug report, as "sort" is
behaving as documented here.
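The manual's distinction can be reproduced directly. The sketch below assumes GNU sort and uses made-up sample data and a hypothetical sample() helper. The last command is one possible way to keep numeric ordering while judging uniqueness on the whole line; it is not proposed anywhere in this thread, and it relies on -u treating lines as equal only when every -k key compares equal.

```shell
# Sketch only: sample data and the sample() helper are hypothetical.
sample() { printf '%s\n' 1b 1a 2x 1a; }

# Uniqueness judged on the leading number only: one line per number.
key_unique=$(sample | LC_ALL=C sort -n -u)

# Sort numerically, then let uniq compare entire lines.
line_unique=$(sample | LC_ALL=C sort -n | uniq)

# Two keys: numeric first, then the same field as plain text, so only
# lines that match on both keys count as duplicates.
two_keys=$(sample | LC_ALL=C sort -u -k1n -k1)

printf '%s\n' "sort -n -u:" "$key_unique"
printf '%s\n' "sort -n | uniq:" "$line_unique"
printf '%s\n' "sort -u -k1n -k1:" "$two_keys"
```

The same key syntax accepts h in place of n (for example -k1h), since both are per-key ordering options.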
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Tue, 18 Feb 2025 08:20:02 GMT)
Message #31 received at 76290-done <at> debbugs.gnu.org (full text, mbox):
According to GNU sort -h -u and what you claim to be common practice, a list of possibly redundant strings, some beginning with a number, is reduced to an ordered set of the numbered strings only.
Since I expect the resulting ordered set to include the original elements, I will then stop using GNU sort to avoid data loss.
-------- Original Message --------
On 2/18/25 07:25, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 2025-02-17 15:13, Rupert Gallagher via GNU coreutils Bug Reports wrote:
> > I expect the program to do exactly what the manual says.
>
> Here's what the manual says about -u in
> <https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html#index-uniquifying-output>:
>
> > Normally, output only the first of a sequence of lines that compare equal....
> >
> > This option also disables the default last-resort comparison.
> >
> > The commands sort -u and sort | uniq are equivalent, but this equivalence does not extend to arbitrary sort options. For example, sort -n -u inspects only the value of the initial numeric string when checking for uniqueness, whereas sort -n | uniq inspects the entire line.
>
> This is the part of the manual that you're disagreeing with. The example
> in my previous email (an example that you did not reply to) is a
> demonstration of this part of the manual.
>
> I am taking the liberty of closing this bug report, as "sort" is
> behaving as documented here.
>
>
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Tue, 18 Feb 2025 18:47:02 GMT)
Message #34 received at 76290-done <at> debbugs.gnu.org (full text, mbox):
Dictionary sort corresponds to the intended behaviour.
> echo -e "abc\n123\n456\nCVE-2011-234\nAbc\ndef\nCVE-2024-123" | sort --debug -dfu
123
___
456
___
abc
___
CVE-2011-234
____________
CVE-2024-123
____________
def
___
By comparison, human (-h) and numeric (-n) sort cause data loss:
> echo -e "abc\n123\n456\nCVE-2011-234\nAbc\ndef\nCVE-2024-123" | sort --debug -hu
sort: note numbers use ‘.’ as a decimal point in this locale
abc
^ no match for key
123
___
456
___
If I were the author of GNU sort, I would delete the -h and -n options.
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Wed, 19 Feb 2025 17:16:02 GMT)
Message #37 received at 76290-done <at> debbugs.gnu.org (full text, mbox):
On 2/18/25 7:45 PM, Rupert Gallagher via GNU coreutils Bug Reports wrote:
> By comparison, human (-h) and numeric (-n) sort cause data loss:
not really. That's the difference between
a) "I have a list containing numbers; I merely care about numbers and want to get a unique, sorted list of them."
   ('sort -h -u')
and
b) "I have a list containing numbers; I want to have it sorted by numbers, and then throw away duplicates."
   ('sort -h | uniq')
The point is: in case a), the numerical value of each non-number entry is Zero.
Consider the following:
$ printf "%s\n" 0 1 X-1 Ab2 3 ma | LC_ALL=C sort -nu
0
1
3
Here, the entries 0, "X-1", "Ab2" and "ma" all have the numerical value 0.
That's why the first Zero is output.
Now let's remove the literal/numerical 0 from the input:
$ printf "%s\n" 1 X-1 Ab2 3 ma | LC_ALL=C sort -nu
X-1
1
3
Now, the first entry which represents numerically 0 is "X-1".
Now let's even put the 0 back into the input, but at the end:
$ printf "%s\n" 1 X-1 Ab2 3 ma 0 | LC_ALL=C sort -nu
X-1
1
3
Still, sort(1) outputs the first entry which has a numerical value of Zero: "X-1".
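The "first entry of the run" rule can be checked by comparing three invocations on the last input above. This is a sketch assuming GNU sort; the entries() helper and variable names are illustrative. The -s option keeps lines with equal keys in input order, which exposes the run from which -u picks its first member.

```shell
# Sketch only: reuses the input from the example above; helper name is illustrative.
entries() { printf '%s\n' 1 X-1 Ab2 3 ma 0; }

# -s keeps input order among lines whose keys compare equal,
# so the zero-valued run appears as X-1, Ab2, ma, 0.
stable=$(entries | LC_ALL=C sort -s -n)

# -u emits only the first line of each equal run.
unique=$(entries | LC_ALL=C sort -n -u)

# Without -u every line survives; ties between equal keys are broken
# by comparing whole lines, and uniq finds no identical neighbours.
piped=$(entries | LC_ALL=C sort -n | uniq)

printf '%s\n' "sort -s -n:" "$stable"
printf '%s\n' "sort -n -u:" "$unique"
printf '%s\n' "sort -n | uniq:" "$piped"
```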
Have a nice day,
Berny
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Wed, 19 Feb 2025 19:16:01 GMT)
Message #40 received at submit <at> debbugs.gnu.org (full text, mbox):
On 19.02.25 18:14, Bernhard Voelker wrote:
> On 2/18/25 7:45 PM, Rupert Gallagher via GNU coreutils Bug Reports wrote:
> > By comparison, human (-h) and numeric (-n) sort cause data loss:
>
> not really. That's the difference between
> a) "I have a list containing numbers; I merely care about numbers and
>    want to get a unique, sorted list of them."
>    ('sort -h -u')
> and
> b) "I have a list containing numbers; I want to have it sorted by
>    numbers, and then throw away duplicates."
>    ('sort -h | uniq')
>
> The point is: in case a), the numerical value of each non-number entry
> is Zero.
I have no issue with the way 'sort -u' is currently working, but the man
page isn't clear at all about the fact that 'sort -h -u' and 'sort -h |
uniq' behave differently.
Specifically, the explanation for -u
       -u, --unique
              with -c, check for strict ordering; without -c, output
              only the first of an equal run
does not provide any explanation what 'equal' or 'run' may mean. Maybe
add something like "where equality is assessed only based on the keys
and rules used to sort the output".
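The "with -c" half of that description follows the same key-based notion of equality. This is a minimal sketch, assuming GNU sort, with hypothetical sample data and variable names: under -c -u the check requires each line's key to be strictly greater than the previous one, so two different lines with equal keys count as disorder (exit status 1).

```shell
# Sketch only: assumes GNU sort; sample data is hypothetical.
# Whole-line keys: a1 < a2, so the input is strictly ordered.
whole_line_status=0
printf '%s\n' a1 a2 | LC_ALL=C sort -c -u 2>/dev/null || whole_line_status=$?

# Numeric keys: both keys are empty and compare equal, so the strict
# ordering check fails even though the lines differ.
numeric_status=0
printf '%s\n' a1 a2 | LC_ALL=C sort -n -c -u 2>/dev/null || numeric_status=$?

# Without -u the check is not strict, and this input passes.
plain_check_status=0
printf '%s\n' a1 a2 | LC_ALL=C sort -n -c 2>/dev/null || plain_check_status=$?

echo "sort -c -u: $whole_line_status"
echo "sort -n -c -u: $numeric_status"
echo "sort -n -c: $plain_check_status"
```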
Rainer
Information forwarded to bug-coreutils <at> gnu.org: bug#76290; Package coreutils. (Wed, 19 Feb 2025 21:22:02 GMT)
Message #43 received at 76290 <at> debbugs.gnu.org (full text, mbox):
On 2/19/25 11:15, Rainer Canavan wrote:
> -u, --unique
> with -c, check for strict ordering; without -c, output
> only the first of an equal run
>
> does not provide any explanation what 'equal' or 'run' may mean. Maybe
> add something like "where equality is assessed only based on the keys
> and rules used to sort the output".
Thanks for the suggestion. Although that's a bit long for "sort --help",
I take the point that equality comparison could be mentioned and
installed the attached to try to make things clearer without adding so
much length.
[0001-sort-improve-u-brief-doc.patch (text/x-patch, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 20 Mar 2025 11:24:13 GMT)