GNU bug report logs - #7961
sort


Package: coreutils

Reported by: Francesco Bettella <francesb <at> decode.is>

Date: Wed, 2 Feb 2011 14:42:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#7961; Package coreutils. (Wed, 02 Feb 2011 14:42:02 GMT)

Acknowledgement sent to Francesco Bettella <francesb <at> decode.is>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 02 Feb 2011 14:42:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Francesco Bettella <francesb <at> decode.is>
To: bug-coreutils <at> gnu.org
Subject: sort
Date: Wed, 2 Feb 2011 12:42:01 +0000
hi,
I may have bumped into an undesired feature/bug of sort, which appears to be 
still present in the version 8.9 of coreutils.

I'm issuing the following sort commands (see attached files):

[prompt1] > sort -k 1.4,1n asd1 > asd1.sorted

[prompt2] > sort -k 2.4,2n asd2 > asd2.sorted

the first one works as I would expect, the second one doesn't.

cheers.

Francesco
[asd1 (text/plain, attachment)]
[asd1.sorted (text/plain, attachment)]
[asd2.sorted (text/plain, attachment)]
[asd2 (text/plain, attachment)]

Message #8 received at 7961 <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Francesco Bettella <francesb <at> decode.is>
Cc: 7961 <at> debbugs.gnu.org
Subject: Re: bug#7961: sort
Date: Wed, 02 Feb 2011 10:44:00 -0700
On 02/02/2011 05:42 AM, Francesco Bettella wrote:
> hi,
> I may have bumped into an undesired feature/bug of sort, which appears to be 
> still present in the version 8.9 of coreutils.

Thanks for the report.  However, this is a feature, and not a bug, of sort.

> 
> I'm issuing the following sort commands (see attached files):
> 
> [prompt1] > sort -k 1.4,1n asd1 > asd1.sorted
> 
> [prompt2] > sort -k 2.4,2n asd2 > asd2.sorted

If I'm correct, asd1 and asd2 have the same contents, except that you
have swapped columns 1 and 2 between the two and resorted the lines.
And your desired goal is that the output matches asd1.sorted, again with
the columns swapped for asd2.sorted.

> 
> the first one works as I would expect, the second one doesn't.

Let's examine why:

$ head -3 asd1 | sort -k 1.4,1n --debug
sort: using `en_US.UTF-8' sorting rules
sort: leading blanks are significant in key 1; consider also specifying `b'
chr>coding_gene
   ^ no match for key
_______________
chr1>PRAMEF1
   _
____________
chr1>PRAMEF4
   _
____________
$ head -3 asd1 | LC_ALL=C sort -k 1.4,1n --debug
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
chr>coding_gene
   ^ no match for key
_______________
chr1>PRAMEF1
   _
____________
chr1>PRAMEF4
   _
____________

In both cases, a line with no match for a numeric key is compared as if
the key were 0, so it sorts first; lines whose keys compare equal are
then ordered by the fallback comparison of the complete line, so the end
result matches asd1.sorted whether you use the C locale or en_US.UTF-8
collation.
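
As a rough illustration of that fallback, here is a minimal sketch,
assuming GNU sort and made-up tab-separated lines modelled on the asd1
excerpt above (the sample data and variable names are hypothetical).
Lines whose numeric key ties are ordered by a comparison of the whole
line unless -s (--stable) is given:

```shell
# Hypothetical sample modelled on the asd1 excerpt (tab-separated).
input=$(printf 'chr1\tPRAMEF4\nchr\tcoding_gene\nchr1\tPRAMEF1\n')

# Default: the "chr" line has an empty numeric key (treated as 0) and sorts
# first; the two chr1 lines tie on the key and are ordered by the whole line.
with_fallback=$(printf '%s\n' "$input" | LC_ALL=C sort -k 1.4,1n)

# -s disables the whole-line fallback, so tied lines keep their input order.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1.4,1n)

printf '%s\n\n%s\n' "$with_fallback" "$stable"
```

In the first result PRAMEF1 precedes PRAMEF4; in the second the two chr1
lines stay in the order they were read.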

But notice that warning about not using -b, and how it affects asd2 (and
also, how the difference in dictionary vs. byte-ordering plays a role in
the secondary sorting):

$ head -3 asd2 | sort -k 2.4,2n --debug
sort: using `en_US.UTF-8' sorting rules
sort: leading blanks are significant in key 1; consider also specifying `b'
coding_gene>chr
              ^ no match for key
_______________
PRAMEF1>chr1
          ^ no match for key
____________
PRAMEF4>chr1
          ^ no match for key
____________
$ head -3 asd2 | LC_ALL=C sort -k 2.4,2n --debug
sort: using simple byte comparison
sort: leading blanks are significant in key 1; consider also specifying `b'
PRAMEF1>chr1
          ^ no match for key
____________
PRAMEF4>chr1
          ^ no match for key
____________
coding_gene>chr
              ^ no match for key
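
The difference between the two runs above comes only from the whole-line
fallback comparison, which follows the locale's collation rules. A small
sketch of that effect, assuming GNU sort and hypothetical sample lines
(whether en_US.UTF-8 is installed on a given machine is also an
assumption, so only the C-locale result is stated):

```shell
# Hypothetical sample modelled on the asd2 excerpt (tab-separated).
input=$(printf 'coding_gene\tchr\nPRAMEF1\tchr1\n')

# C locale: plain byte comparison, so "P" (0x50) sorts before "c" (0x63).
c_order=$(printf '%s\n' "$input" | LC_ALL=C sort)

# A locale such as en_US.UTF-8 typically collates "coding_gene" before
# "PRAMEF1"; the result depends on the locale being available, and any
# locale warning from the shell is discarded.
locale_order=$( { printf '%s\n' "$input" | LC_ALL=en_US.UTF-8 sort; } 2>/dev/null )

printf 'C locale:\n%s\n\nen_US.UTF-8 (if available):\n%s\n' "$c_order" "$locale_order"
```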

But when you add -b (note, b is the one option you have to add to the
start field, since it affects start and end fields specially; all other
options can be added to start, end, or both, and affect the entire key):

$ head -3 asd2 | sort -k 2.4b,2n --debug
sort: using `en_US.UTF-8' sorting rules
coding_gene>chr
               ^ no match for key
_______________
PRAMEF1>chr1
           _
____________
PRAMEF4>chr1
           _
____________
$ head -3 asd2 | LC_ALL=C coreutils/src/sort -k 2.4b,2n --debug
coreutils/src/sort: using simple byte comparison
coding_gene>chr
               ^ no match for key
_______________
PRAMEF1>chr1
           _
____________
PRAMEF4>chr1
           _
____________

That is, your command was underspecified: sort correctly followed what
you told it to do, but what you told it was not what you meant.  And the
--debug option is your [new] friend :)
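
The effect of 'b' can be reproduced without the attachments. This is a
minimal sketch assuming GNU sort and two made-up tab-separated lines (the
data and variable names are hypothetical). Without -t, a field includes
the blanks that precede it, so character 4 of field 2 is "r" unless 'b'
skips the tab; declaring the tab as the separator with -t is another way
to get the same key:

```shell
# Hypothetical two-line sample; field 2 is "chr" followed by a number.
input=$(printf 'A\tchr10\nB\tchr2\n')

# No b: field 2 starts at the tab, so 2.4 is "r10"/"r2". Neither is a
# number, both keys compare as 0, and the whole-line fallback puts A first.
no_b=$(printf '%s\n' "$input" | LC_ALL=C sort -k 2.4,2n)

# b on the start position skips the leading tab, so 2.4 is "10"/"2".
with_b=$(printf '%s\n' "$input" | LC_ALL=C sort -k 2.4b,2n)

# With an explicit separator the tab is not part of the field at all.
tab=$(printf '\t')
with_t=$(printf '%s\n' "$input" | LC_ALL=C sort -t "$tab" -k 2.4,2n)

printf '%s\n\n%s\n\n%s\n' "$no_b" "$with_b" "$with_t"
```

Only the first command leaves chr10 ahead of chr2.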

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org


Message #11 received at 7961 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Francesco Bettella <francesb <at> decode.is>
Cc: 7961 <at> debbugs.gnu.org
Subject: Re: bug#7961: sort
Date: Wed, 02 Feb 2011 13:50:35 -0500
On a somewhat off-topic note,

Francesco Bettella wrote, On 02/02/2011 07:42 AM:
> 
> I'm issuing the following sort commands (see attached files):
> [prompt1] > sort -k 1.4,1n asd1 > asd1.sorted
> [prompt2] > sort -k 2.4,2n asd2 > asd2.sorted
> 
> the first one works as I would expect, the second one doesn't.

When sorting chromosome names, the version sort option (-V, introduced in coreutils 7.0) sorts as you would expect,
saving you the need to skip three characters in the sort key, and also accommodating mixing letters and numbers.

Example:

$ cat chrom.txt
chr1
chrUn_gl000232
chrY
chr2
chr13
chrM
chrUn_gl000218
chr6_hap
chr2R
chr16
chr10
chr6_dbb_hap3
chr4
chr3L
chr4_ctg9_hap1
chr3R
chr3
chrX

$ sort -k1,1V chrom.txt
chr1
chr2
chr2R
chr3
chr3L
chr3R
chr4
chr4_ctg9_hap1
chr6_dbb_hap3
chr6_hap
chr10
chr13
chr16
chrM
chrUn_gl000218
chrUn_gl000232
chrX
chrY
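
The same option can be applied to a two-column layout like asd2. A
minimal sketch, assuming GNU sort 7.0 or later and made-up tab-separated
lines (the data and variable names are hypothetical); the tab is declared
as the separator so that the key is exactly the second column:

```shell
# Hypothetical gene/chromosome pairs, tab-separated.
input=$(printf 'A\tchr10\nB\tchr2\nC\tchrX\nD\tchr1\n')

# Version-sort on column 2 only; -t makes the tab the field separator.
tab=$(printf '\t')
by_chrom=$(printf '%s\n' "$input" | LC_ALL=C sort -t "$tab" -k 2,2V)

printf '%s\n' "$by_chrom"
```

This orders the lines as chr1, chr2, chr10, chrX.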


-gordon





Message #14 received at 7961 <at> debbugs.gnu.org:

From: Francesco Bettella <francesb <at> decode.is>
To: Eric Blake <eblake <at> redhat.com>
Cc: 7961 <at> debbugs.gnu.org
Subject: Re: bug#7961: sort
Date: Wed, 2 Feb 2011 19:05:33 +0000
Thank you very much for your time, and sorry for the trouble.
If I understand this right, specifying 'b' in the start field spares me the
fallback sort of the complete line, and this actually does the trick.
I remain a little in the dark regarding the dictionary vs. byte (locale vs. C)
ordering. I've tried both on asd2 (without the 'b') with the same result, but
I trust you on this one.

Francesco

P.S.: just got Gordon's reply. thank you for that.







Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Wed, 02 Feb 2011 22:53:02 GMT)

Notification sent to Francesco Bettella <francesb <at> decode.is>:
bug acknowledged by developer. (Wed, 02 Feb 2011 22:53:02 GMT)

Message #19 received at 7961-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 7961-done <at> debbugs.gnu.org, Francesco Bettella <francesb <at> decode.is>
Subject: Re: bug#7961: sort
Date: Wed, 02 Feb 2011 22:59:02 +0000
On 02/02/11 17:44, Eric Blake wrote:
> $ head -3 asd2 | LC_ALL=C sort -k 2.4,2n --debug
> sort: using simple byte comparison
> sort: leading blanks are significant in key 1; consider also specifying `b'
> PRAMEF1>chr1
>           ^ no match for key
> ____________
> PRAMEF4>chr1
>           ^ no match for key
> ____________
> coding_gene>chr
>               ^ no match for key
> 
> But when you add -b (note, b is the one option you have to add to the
> start field, since it affects start and end fields specially; all other
> options can be added to start, end, or both, and affect the entire key):
> 
> $ head -3 asd2 | sort -k 2.4b,2n --debug
> sort: using `en_US.UTF-8' sorting rules
> coding_gene>chr
>                ^ no match for key
> _______________
> PRAMEF1>chr1
>            _


Yep. The 'b' option is one of the main reasons for --debug.
Note, sort --debug will warn until you put it in the right place.

Hmm, I just noticed a bug with --debug, introduced with bdde34f9:

$ printf "A\tchr10\nB\tchr1\n" | ./sort -s --debug -k2.4b,2.3n 2>/dev/null
A>chr10
     __
B>chr1
     _

This should fix it up:

diff --git a/src/sort.c b/src/sort.c
index 06b0d95..365634d 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -2214,7 +2214,9 @@ debug_key (struct line const *line, struct keyfield const *key)

           char *tighter_lim = beg;

-          if (key->month)
+          if (lim < beg)
+            tighter_lim = lim;
+          else if (key->month)
             getmonth (beg, &tighter_lim);
           else if (key->general_numeric)
             ignore_value (strtold (beg, &tighter_lim));

cheers,
Pádraig.




Message #22 received at 7961 <at> debbugs.gnu.org:

From: Jim Meyering <jim <at> meyering.net>
To: 7961 <at> debbugs.gnu.org
Cc: P <at> draigBrady.com
Subject: Re: bug#7961: sort
Date: Thu, 03 Feb 2011 09:20:14 +0100
Pádraig Brady wrote:
> Hmm, I just noticed a bug with --debug, introduced with bdde34f9:
>
> $ printf "A\tchr10\nB\tchr1\n" | ./sort -s --debug -k2.4b,2.3n 2>/dev/null
> A>chr10
>      __
> B>chr1
>      _
>
> This should fix it up:

Good catch.  That looks right and works for me:

  $ printf "A\tchr10\nB\tchr1\n" | ./sort -s --debug -k2.4b,2.3n 2>/dev/null
  A>chr10
       ^ no match for key
  B>chr1
       ^ no match for key

If you have time, please push that today.
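
For reference, the key in that test case ends (2.3) before it starts
(2.4b), so it is empty on every line. A minimal sketch of how such a key
behaves when actually sorting, assuming a GNU sort build and made-up
input in reverse order (the data and variable names are hypothetical):

```shell
# Hypothetical input, deliberately not in sorted order.
input=$(printf 'B\tchr1\nA\tchr10\n')

# Empty numeric key on every line: all keys tie, and -s keeps input order.
stable_empty=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 2.4b,2.3n)

# Without -s the whole-line fallback decides, so "A" comes first.
default_empty=$(printf '%s\n' "$input" | LC_ALL=C sort -k 2.4b,2.3n)

printf '%s\n\n%s\n' "$stable_empty" "$default_empty"
```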





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 03 Mar 2011 12:24:04 GMT)


