GNU bug report logs - #22109
Sort gives incorrect order when changing delimiters


Package: coreutils

Reported by: Ed Brambley <edbrambley <at> gmail.com>

Date: Mon, 7 Dec 2015 16:17:03 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22109 in the body.
You can then email your comments to 22109 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#22109; Package coreutils. (Mon, 07 Dec 2015 16:17:03 GMT)

Acknowledgement sent to Ed Brambley <edbrambley <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 07 Dec 2015 16:17:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ed Brambley <edbrambley <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Sort gives incorrect order when changing delimiters
Date: Mon, 7 Dec 2015 15:36:12 +0000
The following problem came to light following a StackOverflow question [1].
The lexical ordering of sort appears to depend on the delimiter used, and I
believe it shouldn't. As a minimal example:

### Correct ordering ###
$ printf "1,a,1\n2,aa,2" | LC_ALL=C sort -k2 -t,
1,a,1
2,aa,2

### Incorrect ordering by replacing the "," delimiter by "~" ###
$ printf "1~a~1\n2~aa~2" | LC_ALL=C sort -k2 -t~
2~aa~2
1~a~1

I think this is because, in ASCII, "," < "a" < "~".

Many thanks,
Ed

[1]
http://stackoverflow.com/questions/34134677/trying-to-understand-the-sort-utilty-in-linux

Information forwarded to bug-coreutils <at> gnu.org:
bug#22109; Package coreutils. (Mon, 07 Dec 2015 16:49:02 GMT)

Message #8 received at 22109 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Ed Brambley <edbrambley <at> gmail.com>, 22109 <at> debbugs.gnu.org
Subject: Re: bug#22109: Sort gives incorrect order when changing delimiters
Date: Mon, 7 Dec 2015 11:49:39 -0500
tag 22109 notabug
close 22109
stop

Hello Ed,

On 12/07/2015 10:36 AM, Ed Brambley wrote:
> The following problem came to light following a StackOverflow question [1]. The lexical ordering of sort appears to depend on the delimiter used, and I believe it shouldn't. As a minimal example:
>
> ### Correct ordering ###
> $ printf "1,a,1\n2,aa,2" | LC_ALL=C sort -k2 -t,
> 1,a,1
> 2,aa,2
>
> ### Incorrect ordering by replacing the "," delimiter by "~" ###
> $ printf "1~a~1\n2~aa~2" | LC_ALL=C sort -k2 -t~
> 2~aa~2
> 1~a~1
>

This is not a bug in 'sort', but simply an incorrect usage of the key options.

The parameter "-k2" means: start the sort key at the second field *and extend it through the end of the line*.
In this case, the delimiter after the second field (',' or '~') is part of the key, so it does come into play.

The correct usage is to specify the key as "-k2,2", meaning: sort by the second field alone (then resolve equal keys by comparing the entire line, unless --stable is used).

    $ printf "1~a~1\n2~aa~2" | LC_ALL=C sort -k2,2 -t~
    1~a~1
    2~aa~2
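
The tie-break mentioned above can be seen with a small sketch. The two-record input is made up for illustration; both records share the same second field, so only the handling of equal keys differs. It assumes a sort that supports --stable, as GNU sort does.

```shell
# Both records have the same field-2 key ("x"), so the keys compare equal.
tie_default=$(printf 'b,x\na,x\n' | LC_ALL=C sort -t, -k2,2)
tie_stable=$(printf 'b,x\na,x\n' | LC_ALL=C sort -t, -k2,2 --stable)

# Without --stable, the whole line is compared as a last resort: a,x then b,x
printf 'default:\n%s\n' "$tie_default"
# With --stable, equal keys keep their input order: b,x then a,x
printf 'stable:\n%s\n' "$tie_stable"
```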


Using sort's "--debug" option will illustrate the difference (note the underscores, which mark the part of each line used as the key):

Incorrect usage (-k2):

    $ printf "1~a~1\n2~aa~2" | LC_ALL=C sort --debug -k2 -t~
    sort: using simple byte comparison
    2~aa~2
      ____
    ______
    1~a~1
      ___
    _____


Better usage (-k2,2):

    $ printf "1~a~1\n2~aa~2" | LC_ALL=C sort --debug -k2,2 -t~
    sort: using simple byte comparison
    1~a~1
      _
    _____
    2~aa~2
      __
    ______




regards,
 - assaf





Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Mon, 07 Dec 2015 16:50:03 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Mon, 07 Dec 2015 16:50:04 GMT)

Notification sent to Ed Brambley <edbrambley <at> gmail.com>:
bug acknowledged by developer. (Mon, 07 Dec 2015 16:50:05 GMT)

Message #15 received at 22109-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Ed Brambley <edbrambley <at> gmail.com>, 22109-done <at> debbugs.gnu.org
Subject: Re: bug#22109: Sort gives incorrect order when changing delimiters
Date: Mon, 7 Dec 2015 09:49:08 -0700
tag 22109 notabug
thanks

On 12/07/2015 08:36 AM, Ed Brambley wrote:
> The following problem came to light following a StackOverflow question [1].
> The lexical ordering of sort appears to depend on the delimiter used, and I
> believe it shouldn't. As a minimal example:

Thanks for the report.  However, you have not found a bug in sort, only
in your misuse of the command line and in your incorrect assumptions.

Let's investigate further with the --debug option:

> 
> ### Correct ordering ###
> $ printf "1,a,1\n2,aa,2" | LC_ALL=C sort -k2 -t,
> 1,a,1
> 2,aa,2

$ printf '1,a,1\n2,aa,2' | LC_ALL=C sort -k2 -t, --debug
sort: using simple byte comparison
1,a,1
  ___
_____
2,aa,2
  ____
______

You are comparing the string "a,1" with "aa,2"; so the relative relation
between ',' and 'a' matters.
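
The byte values involved can be checked from the shell. This is a minimal sketch; `byte_of` is a hypothetical helper name, and it relies on the POSIX printf rule that an argument with a leading quote yields the numeric value of the character that follows.

```shell
# Hypothetical helper: print the numeric byte value of a single character.
byte_of() { printf '%d' "'$1"; }

for c in ',' 'a' '~'; do
    printf '%s = %s\n' "$c" "$(byte_of "$c")"
done
# prints:
# , = 44
# a = 97
# ~ = 126
```

With LC_ALL=C the comparison is byte by byte, so ',' sorts before 'a' and '~' sorts after it.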

> 
> ### Incorrect ordering by replacing the "," delimiter by "~" ###
> $ printf "1~a~1\n2~aa~2" | LC_ALL=C sort -k2 -t~
> 2~aa~2
> 1~a~1

Same goes for here.

$ printf '1~a~1\n2~aa~2' | LC_ALL=C sort -k2 -t~ --debug
sort: using simple byte comparison
2~aa~2
  ____
______
1~a~1
  ___
_____

You compared the string "aa~2" with "a~1".
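
One way to see the compared strings without --debug is to cut out the same spans by hand. This is only an approximation of how sort extracts keys (it assumes a single-character -t delimiter and no -b), and the variable names are made up for the sketch.

```shell
input='1~a~1
2~aa~2'

# Field 2 through end of line: the text that -k2 compares.
key_to_eol=$(printf '%s\n' "$input" | cut -d'~' -f2-)
# Field 2 only: the text that -k2,2 compares.
key_field2=$(printf '%s\n' "$input" | cut -d'~' -f2)

printf '%s\n' '-k2 keys:' "$key_to_eol" '-k2,2 keys:' "$key_field2"
```

The first pair is "a~1" and "aa~2", where the second byte is '~' versus 'a'; the second pair is just "a" and "aa".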


> 
> I think this is because, in ASCII, "," < "a" < "~".

Yes, so you saw exactly what you asked for.  But what you asked for
("sort starting from the second field through to the end of the
line") is probably not what you wanted.  It sounds like you wanted "sort
on ONLY the second field", which is spelled differently:

$ printf '1~a~1\n2~aa~2' | LC_ALL=C sort -k2,2 -t~ --debug
sort: using simple byte comparison
1~a~1
  _
_____
2~aa~2
  __
______


Note that there is a very distinct difference between '-k2' and '-k2,2';
only the latter one limits the sort to JUST the second key ("a" vs.
"aa", regardless of delimiter), while the former slurps in the rest of
the line such that the spelling of the delimiter affects the result.
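
The "regardless of delimiter" point can be checked with a short sketch. `sort_field2` is a hypothetical helper that builds the two sample records from this report with whatever delimiter it is given and sorts them on the second field only.

```shell
# Hypothetical helper: sort the two sample records on field 2 alone,
# using the single-character delimiter passed as $1.
sort_field2() {
    d=$1
    printf '1%sa%s1\n2%saa%s2\n' "$d" "$d" "$d" "$d" | LC_ALL=C sort -t"$d" -k2,2
}

sort_field2 ','    # 1,a,1 then 2,aa,2
sort_field2 '~'    # 1~a~1 then 2~aa~2
```

In both runs the record whose second field is "a" comes first, because the delimiter is no longer part of the key.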

I'm marking this as not a bug in the database, but feel free to add
further comments.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#22109; Package coreutils. (Mon, 07 Dec 2015 18:08:02 GMT)

Message #18 received at 22109 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 22109 <at> debbugs.gnu.org, edbrambley <at> gmail.com
Subject: Re: bug#22109: Sort gives incorrect order when changing delimiters
Date: Mon, 7 Dec 2015 10:07:19 -0800
This confusion happens often enough that I installed the attached 
documentation patch to try to make things clearer.
[0001-doc-promote-sort-debug.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#22109; Package coreutils. (Mon, 07 Dec 2015 21:08:01 GMT)

Message #21 received at 22109 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 22109 <at> debbugs.gnu.org,
 edbrambley <at> gmail.com
Subject: Re: bug#22109: Sort gives incorrect order when changing delimiters
Date: Mon, 7 Dec 2015 14:07:10 -0700
On 12/07/2015 11:07 AM, Paul Eggert wrote:
> This confusion happens often enough that I installed the attached
> documentation patch to try to make things clearer.

Should we also modify this paragraph in 'sort --help'?  Maybe:

> KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
> field number and C a character position in the field; both are origin 1, and
> the stop position defaults to the line's end. If neither -t nor -b is in
> effect, characters in a field are counted from the beginning of the preceding
> whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
> which override global ordering options for that key. If no key is given, use
>-the entire line as the key.
>+the entire line as the key.  Use --debug to diagnose incorrect key usage.
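
As a small illustration of the F.C part of KEYDEF, the sketch below sorts on the second character of the second field only. The two-record input is invented for the example.

```shell
# Key is field 2, character 2 through field 2, character 2:
# "b" for the first record and "a" for the second.
by_char=$(printf 'x,ab\ny,ba\n' | LC_ALL=C sort -t, -k2.2,2.2)
printf '%s\n' "$by_char"
# prints:
# y,ba
# x,ab
```

Adding --debug to the same command underlines exactly that one character on each line.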

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


Information forwarded to bug-coreutils <at> gnu.org:
bug#22109; Package coreutils. (Mon, 07 Dec 2015 21:09:02 GMT)

Message #24 received at 22109 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, 22109 <at> debbugs.gnu.org, edbrambley <at> gmail.com
Subject: Re: bug#22109: Sort gives incorrect order when changing delimiters
Date: Mon, 7 Dec 2015 13:08:18 -0800
On 12/07/2015 01:07 PM, Eric Blake wrote:
> Should we also modify this paragraph in 'sort --help'?

Works for me.




Information forwarded to bug-coreutils <at> gnu.org:
bug#22109; Package coreutils. (Tue, 08 Dec 2015 09:51:03 GMT)

Message #27 received at 22109 <at> debbugs.gnu.org:

From: Ed Brambley <edbrambley <at> gmail.com>
To: 22109 <at> debbugs.gnu.org
Subject: Re: bug#22109: Sort gives incorrect order when changing delimiters
Date: Tue, 8 Dec 2015 09:50:34 +0000
Dear All,

Thanks Assaf and Eric for the explanation.  It's very well hidden in the
man page.  I know it would break backward compatibility (so don't do it)
and, as Eric pointed out to me, would break POSIX compatibility, but I
would think most people's expectation would be that -k2 would be shorthand
for -k2,2 rather than -k2,end.

Updating the documentation would really help.  Your proposals so far seem
good, but they are really missing the point as far as I'm concerned, which
is that *field separators are included in the comparison*.  So I think
Paul's update is a bit misleading, as it says "Sort compares each pair of
fields, in the order specified on the command line, according to the
associated ordering options, until a difference is found or no fields are
left", but doesn't mention that it also uses the field separators when
comparing fields.
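
The effect of separators inside a multi-field key can be shown with a sketch. The input is invented for the purpose: the second fields are "a" and "a!", so the result depends on whether the '~' after the field is part of the key.

```shell
data='x~a~b
y~a!~a'

# One key spanning fields 2-3: compares "a~b" with "a!~a", separator included.
one_key=$(printf '%s\n' "$data" | LC_ALL=C sort -t'~' -k2,3)
# Two separate keys: compares "a" with "a!" first, with no separator involved.
two_keys=$(printf '%s\n' "$data" | LC_ALL=C sort -t'~' -k2,2 -k3,3)

printf 'one key (-k2,3):\n%s\n' "$one_key"              # y~a!~a then x~a~b
printf 'two keys (-k2,2 -k3,3):\n%s\n' "$two_keys"      # x~a~b then y~a!~a
```

With the single key, '!' (33) is compared against '~' (126) at the second byte, which reverses the order that the field-by-field keys give.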

If I'd seen the documentation suggesting using --debug, I would have used
it, but still reported a bug as --debug would have just confirmed that sort
was doing what I thought it was doing, which I thought was wrong.

So perhaps we could say somewhere in the documentation something like:

> KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
> field number and C a character position in the field; both default to 1, and
> the stop position defaults to the line's end.  Note that any field separators
> between the start and stop positions are also included in the comparison.

And also possibly something like:

> ... A line's trailing newline is not part of the line for comparison purposes,
> but field separators are included in the comparison...

Thanks again,
Ed

PS: Sorry for emailing you directly, Eric.  My fault for not replying all.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 05 Jan 2016 12:24:03 GMT)

This bug report was last modified 9 years and 166 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.