GNU bug report logs - #13638
linux-sort inconsistency

Package: coreutils;

Reported by: Knud Arnbjerg Christensen <kc <at> sund.ku.dk>

Date: Wed, 6 Feb 2013 16:55:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13638 in the body.
You can then email your comments to 13638 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#13638; Package coreutils. (Wed, 06 Feb 2013 16:55:02 GMT)

Acknowledgement sent to Knud Arnbjerg Christensen <kc <at> sund.ku.dk>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 06 Feb 2013 16:55:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Knud Arnbjerg Christensen <kc <at> sund.ku.dk>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: linux-sort inconsistency 
Date: Wed, 6 Feb 2013 10:49:38 +0000
Hi
A linux-sort inconsistency occurs when sorting an alphanumeric field:
the order differs depending on whether the following field is numeric (file 1) or alphanumeric (file 2). In case 1 the length of the shorter fields is extended by 'zeros'; in case 2 the fields are extended by blanks, which causes the different sorting order.

knud c

sort -k 1 file1>file1-sorted
Seq_101615 00022   x 03262 03068
Seq_101656 00001   x 03068 00470
Seq_101744 00001   x 00470 00586
Seq_10187 00001   x 00181 00553
Seq_10190 00001   x 00553 01182
Seq_101903 00001   x 00586 00331
Seq_101949 00001   x 00331 00822
Seq_10201 00001   x 01182 00396
Seq_10203 00001   x 00396 00499
Seq_10205 00001   x 00499 00603
Seq_10210 00013   x 00603 00370
Seq_1021 00001   x 00744 01203
Seq_102103 00001   x 00822 01356
Seq_102146 00001   x 01356 00303
Seq_10224 00001   x 00370 00864
Seq_10226 00001   x 00864 00205
Seq_102287 00001   x 00303 00290
Seq_102291 00001   x 00290 01632
Seq_1023 00025   x 01203 02268
Seq_102331 00001   x 01632 00204
Seq_102334 00001   x 00204 00354
Seq_102389 00001   x 00354 00303
Seq_1024 00001   x 02268 01267
Seq_102421 00001   x 00303 00281
Seq_102427 00001   x 00281 00757
Seq_10247 00001   x 00205 00406
Seq_10250 00001   x 00406 00647
Seq_102555 00001   x 00757 01351

sort -k 1 file2 >file2-sorted
Seq_101615 complete MYRIP Rab effector MyRIP   3161
Seq_101656 incomplete BFSP2 Phakinin   590
Seq_101744 incomplete CK048 Uncharacterized protein C11orf48   678
Seq_10187 incomplete B4DN50 Gap junction protein   640
Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1   416
Seq_10190 incomplete HSF2 Heat shock factor protein 2   1273
Seq_101949 incomplete TCEA3 Transcription elongation factor A protein 3   906
Seq_10201 incomplete E9PNK6 Tumor protein D52-like 1   482
Seq_102103 incomplete ATR Serine/threonine-protein kinase ATR   1456
Seq_10210 complete CENPW Centromere protein W   470
Seq_102146 incomplete E7ET15 U2 snRNP-associated SURP domain-containing   388
Seq_1021 incomplete B1AMR4 Cdc42 guanine nucleotide exchange factor (GEF) 9   1293
Seq_10224 complete SAMD3 Sterile alpha motif domain-containing protein 3   964
Seq_10226 incomplete Q6R5J7 4.1G isoform   292
Seq_102287 incomplete CBPB1 Carboxypeptidase B   387
Seq_102291 incomplete CBPA3 Mast cell carboxypeptidase A   1721
Seq_102331 incomplete T4S1 Transmembrane 4 L6 family member 1   290
Seq_102334 incomplete F8WBG6 Transmembrane 4 L six family member 1   439
Seq_102389 incomplete C9JQ45 Profilin   388
Seq_1023 complete ELF4 ETS-related transcription factor Elf-4   2353
Seq_102421 incomplete KRR1 KRR1 small subunit processome component homolog   368
Seq_102427 incomplete MD12L Mediator of RNA polymerase II transcription subunit 12-like protein  857
Seq_10247 incomplete ERD21 ER lumen protein retaining receptor 1   493
Seq_1024 incomplete JKIP3 Janus kinase and microtubule-interacting protein 3   1374
Seq_10250 incomplete S35D2 UDP-N-acetylglucosamine/UDP-glucose/GDP-mannose transporter   740
Seq_102555 incomplete GP149 Probable G-protein coupled receptor 149   1451

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Wed, 06 Feb 2013 18:08:01 GMT)

Notification sent to Knud Arnbjerg Christensen <kc <at> sund.ku.dk>:
bug acknowledged by developer. (Wed, 06 Feb 2013 18:08:02 GMT)

Message #10 received at 13638-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Knud Arnbjerg Christensen <kc <at> sund.ku.dk>
Cc: 13638-done <at> debbugs.gnu.org
Subject: Re: bug#13638: linux-sort inconsistency
Date: Wed, 06 Feb 2013 10:05:58 -0800
On 02/06/13 02:49, Knud Arnbjerg Christensen wrote:
> A linux-sort inconsistency occurs when sorting an alphanumeric field:
> the order differs depending on whether the following field is numeric (file 1) or alphanumeric (file 2). In case 1 the length of the shorter fields is extended by 'zeros'; in case 2 the fields are extended by blanks, which causes the different sorting order.
> 
> knud c
> 
> sort -k 1 file1>file1-sorted

It looks to me like 'sort' is behaving as documented.
'-k 1' means "use the concatenation of all the fields, starting
with field 1, as the key".  It does not mean "use
just field 1 as the key".  The documentation explains
this in some detail.
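
The difference between the two key forms can be sketched as follows. This is a minimal illustration, assuming GNU sort and a POSIX shell; the two-line input and the variable names are made up for the example and are not taken from the reporter's files. LC_ALL=C is set so that locale collation does not affect the result, and -s is added to the second command so that lines whose keys tie keep their input order instead of falling back to a whole-line comparison.

```shell
# Hypothetical input: both lines share the same first field.
input='a 2
a 1'

# -k 1: the key runs from field 1 to the end of the line,
# so the second field ("1" vs "2") decides the order.
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort -k 1)

# -k 1,1: the key is field 1 only. With -s (stable), lines whose
# keys compare equal stay in their original input order.
first_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1,1)

printf '%s\n---\n%s\n' "$whole_line" "$first_field"
```

With '-k 1' the line "a 1" comes first; with '-s -k 1,1' the input order is preserved, because the text after field 1 is no longer part of the key.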




Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Wed, 06 Feb 2013 18:23:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#13638; Package coreutils. (Wed, 06 Feb 2013 18:23:02 GMT)

Message #15 received at 13638-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: Knud Arnbjerg Christensen <kc <at> sund.ku.dk>
Cc: 13638-done <at> debbugs.gnu.org
Subject: Re: bug#13638: linux-sort inconsistency
Date: Wed, 06 Feb 2013 11:21:07 -0700
tag 13638 notabug
thanks

On 02/06/2013 03:49 AM, Knud Arnbjerg Christensen wrote:
> Hi
> A linux-sort inconsistency occurs when sorting an alphanumeric field:
> the order differs depending on whether the following field is numeric (file 1) or alphanumeric (file 2). In case 1 the length of the shorter fields is extended by 'zeros'; in case 2 the fields are extended by blanks, which causes the different sorting order.

This is most likely a product of your locale; you may find this FAQ
addresses your issue:
https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

> sort -k 1 file1>file1-sorted

Oops - this says to use the first field _and on to the rest of the line_
as the single sort key.  You probably want to limit the sort to just the
first field, using -k1,1 instead.

Extracting portions of just 3 lines that went differently between your
two invocations:

> Seq_10187 00001   x 00181 00553
> Seq_10190 00001   x 00553 01182
> Seq_101903 00001   x 00586 00331

vs.

> Seq_10187 incomplete B4DN50 Gap junction protein   640
> Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1   416
> Seq_10190 incomplete HSF2 Heat shock factor protein 2   1273

Using sort's --debug option will make it quite obvious what is going on:

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -k 1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
Seq_10187 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________
Seq_10190 incomplete
____________________
____________________


You specified the entire line as the first sort key, and in the
en_US.UTF-8 locale, punctuation (including space) is ignored during
collation.  Since "903i" sorts before "90in" when spacing is removed,
that explains why the sort order differs based on whether the text after
the space is numeric or alphabetic.  Now note what happens when you
force the C locale, where every byte is significant during collation,
and where "90 in" sorts before "903 i":

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | LC_ALL=C sort -k 1 --debug
sort: using simple byte comparison
Seq_10187 incomplete
____________________
____________________
Seq_10190 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________
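
The effect of ignoring the space can be approximated without the en_US.UTF-8 locale installed. The sketch below is only an approximation, assuming a POSIX shell with sort and tr: it deletes spaces with tr and then compares bytes in the C locale, which is not how the C library actually computes collation weights. The variable names are illustrative.

```shell
# Two first fields where one is a prefix of the other, followed by
# either a numeric or an alphabetic second field (as in file1 and file2).
numeric='Seq_10190 00001
Seq_101903 00001'
alpha='Seq_10190 incomplete
Seq_101903 incomplete'

# Space kept: ' ' (0x20) sorts before '3' (0x33), so Seq_10190 is first.
alpha_with_space=$(printf '%s\n' "$alpha" | LC_ALL=C sort)

# Space deleted: '3' is now compared with 'i', and digits sort
# before letters, so Seq_101903 moves ahead.
alpha_no_space=$(printf '%s\n' "$alpha" | tr -d ' ' | LC_ALL=C sort)

# Space deleted, numeric second field: '0' is compared with '3',
# so Seq_10190 stays first.
numeric_no_space=$(printf '%s\n' "$numeric" | tr -d ' ' | LC_ALL=C sort)

printf '%s\n\n%s\n\n%s\n' "$alpha_with_space" "$alpha_no_space" "$numeric_no_space"
```

The second and third results differ in which sequence ID comes first, which mirrors the difference the reporter saw between file2 and file1.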

Meanwhile, what you probably wanted is to sort by JUST the first field
(note how I added -b as suggested, and used -k1,1 instead of -k1).

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -b -k 1,1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
Seq_10187 incomplete
_________
____________________
Seq_10190 incomplete
_________
____________________
Seq_101903 incomplete
__________
_____________________
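
One way to check that restricting the key makes the two files agree is to sort both on field 1 only and compare the resulting first columns. This is a rough sketch, assuming a POSIX shell with sort and cut; the three-line samples are excerpts of the reporter's data in shuffled order, and the variable names are invented for the example.

```shell
# Excerpts of file1 and file2, deliberately given in different orders.
file1='Seq_101903 00001
Seq_10187 00001
Seq_10190 00001'
file2='Seq_10190 incomplete
Seq_101903 incomplete
Seq_10187 incomplete'

# Sort on field 1 only, then keep just that field for comparison.
order1=$(printf '%s\n' "$file1" | sort -b -k 1,1 | cut -d ' ' -f 1)
order2=$(printf '%s\n' "$file2" | sort -b -k 1,1 | cut -d ' ' -f 1)

if [ "$order1" = "$order2" ]; then
  echo "first-field order matches"
else
  echo "first-field order differs"
fi
```

Because the key ends at field 1, the text that follows no longer takes part in the primary comparison, so both samples come out in the same sequence-ID order.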


As such, I'm closing this bug report, although you may feel free to add
further comments or questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 07 Mar 2013 12:24:04 GMT)

This bug report was last modified 12 years and 113 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.