From unknown Sat Aug 16 16:21:25 2025
From: Knud Arnbjerg Christensen
To: 13638@debbugs.gnu.org
X-Debbugs-Original-To: "bug-coreutils@gnu.org"
Subject: bug#13638: linux-sort inconsistency
Date: Wed, 6 Feb 2013 10:49:38 +0000
Message-ID: <3AB1A1F128718F4DB7206297C73966EA416130DB@P2KITMBX02WC01.unicph.domain>
X-GNU-PR-Message: report 13638
X-GNU-PR-Package: coreutils

Hi

linux-sort inconsistency occurs when sorting an alphanumeric field:
the order becomes different depending on whether the following field is numeric (file 1) or alphanumeric (file 2). In case 1 the length of the shorter fields is extended by ´zeros´; in case 2 the fields are extended by blanks, which causes the different sorting order.

knud c

sort -k 1 file1>file1-sorted
Seq_101615 00022 x 03262 03068
Seq_101656 00001 x 03068 00470
Seq_101744 00001 x 00470 00586
Seq_10187 00001 x 00181 00553
Seq_10190 00001 x 00553 01182
Seq_101903 00001 x 00586 00331
Seq_101949 00001 x 00331 00822
Seq_10201 00001 x 01182 00396
Seq_10203 00001 x 00396 00499
Seq_10205 00001 x 00499 00603
Seq_10210 00013 x 00603 00370
Seq_1021 00001 x 00744 01203
Seq_102103 00001 x 00822 01356
Seq_102146 00001 x 01356 00303
Seq_10224 00001 x 00370 00864
Seq_10226 00001 x 00864 00205
Seq_102287 00001 x 00303 00290
Seq_102291 00001 x 00290 01632
Seq_1023 00025 x 01203 02268
Seq_102331 00001 x 01632 00204
Seq_102334 00001 x 00204 00354
Seq_102389 00001 x 00354 00303
Seq_1024 00001 x 02268 01267
Seq_102421 00001 x 00303 00281
Seq_102427 00001 x 00281 00757
Seq_10247 00001 x 00205 00406
Seq_10250 00001 x 00406 00647
Seq_102555 00001 x 00757 01351

sort -k 1 file2 >file2-sorted
Seq_101615 complete MYRIP Rab effector MyRIP 3161
Seq_101656 incomplete BFSP2 Phakinin 590
Seq_101744 incomplete CK048 Uncharacterized protein C11orf48 678
Seq_10187 incomplete B4DN50 Gap junction protein 640
Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1 416
Seq_10190 incomplete HSF2 Heat shock factor protein 2 1273
Seq_101949 incomplete TCEA3 Transcription elongation factor A protein 3 906
Seq_10201 incomplete E9PNK6 Tumor protein D52-like 1 482
Seq_102103 incomplete ATR Serine/threonine-protein kinase ATR 1456
Seq_10210 complete CENPW Centromere protein W 470
Seq_102146 incomplete E7ET15 U2 snRNP-associated SURP domain-containing 388
Seq_1021 incomplete B1AMR4 Cdc42 guanine nucleotide exchange factor (GEF) 9 1293
Seq_10224 complete SAMD3 Sterile alpha motif domain-containing protein 3 964
Seq_10226 incomplete Q6R5J7 4.1G isoform 292
Seq_102287 incomplete CBPB1 Carboxypeptidase B 387
Seq_102291 incomplete CBPA3 Mast cell carboxypeptidase A 1721
Seq_102331 incomplete T4S1 Transmembrane 4 L6 family member 1 290
Seq_102334 incomplete F8WBG6 Transmembrane 4 L six family member 1 439
Seq_102389 incomplete C9JQ45 Profilin 388
Seq_1023 complete ELF4 ETS-related transcription factor Elf-4 2353
Seq_102421 incomplete KRR1 KRR1 small subunit processome component homolog 368
Seq_102427 incomplete MD12L Mediator of RNA polymerase II transcription subunit 12-like protein 857
Seq_10247 incomplete ERD21 ER lumen protein retaining receptor 1 493
Seq_1024 incomplete JKIP3 Janus kinase and microtubule-interacting protein 3 1374
Seq_10250 incomplete S35D2 UDP-N-acetylglucosamine/UDP-glucose/GDP-mannose transporter 740
Seq_102555 incomplete GP149 Probable G-protein coupled receptor 149 1451

From unknown Sat Aug 16 16:21:25 2025
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Knud Arnbjerg Christensen
Subject: bug#13638: closed (Re: bug#13638: linux-sort inconsistency)
References: <51129B86.90400@cs.ucla.edu> <3AB1A1F128718F4DB7206297C73966EA416130DB@P2KITMBX02WC01.unicph.domain>
Reply-To: 13638@debbugs.gnu.org
Date: Wed, 06 Feb 2013 18:08:02 +0000
X-Gnu-PR-Message: they-closed 13638
X-Gnu-PR-Package: coreutils

Your bug report

#13638: linux-sort inconsistency

which was filed against the coreutils package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 13638@debbugs.gnu.org.

-- 
13638: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=13638
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

Message-ID: <51129B86.90400@cs.ucla.edu>
Date: Wed, 06 Feb 2013 10:05:58 -0800
From: Paul Eggert
To: Knud Arnbjerg Christensen
Cc: 13638-done@debbugs.gnu.org
Subject: Re: bug#13638: linux-sort inconsistency
In-Reply-To: <3AB1A1F128718F4DB7206297C73966EA416130DB@P2KITMBX02WC01.unicph.domain>

On 02/06/13 02:49, Knud Arnbjerg Christensen wrote:
> linux-sort inconsistency occurs when sorting an alphanumeric field:
> the order becomes different depending on whether the following field is numeric (file 1) or alphanumeric (file 2). In case 1 the length of the shorter fields is extended by ´zeros´; in case 2 the fields are extended by blanks, which causes the different sorting order.
>
> knud c
>
> sort -k 1 file1>file1-sorted

It looks to me like 'sort' is behaving as documented.
'-k 1' means "use the concatenation of all the fields,
starting with field 1, as the key". It does not mean
"use just field 1 as the key". The documentation
explains this in some detail.
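
The difference between a key that runs to the end of the line and a key limited to field 1 can be seen on a tiny input. The following is only a sketch: the three input lines are made up and do not come from the report, LC_ALL=C is forced so the result does not depend on the local collation rules, and the -s (stable) option is assumed to be available, as it is in GNU sort.

```shell
# Hypothetical sample input, not taken from the bug report.
input=$(printf 'b 2\na 9\na 1\n')

# -k 1: the key runs from field 1 to the end of the line,
# so "a 1" and "a 9" are also ordered by what follows field 1.
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort -k 1)

# -k 1,1: the key is field 1 only. -s disables the last-resort
# whole-line comparison, so lines with equal keys keep input order.
first_field=$(printf '%s\n' "$input" | LC_ALL=C sort -s -k 1,1)

printf '%s\n---\n%s\n' "$whole_line" "$first_field"
```

With -k 1 the two "a" lines come out as "a 1" then "a 9"; with -s -k 1,1 they stay in input order ("a 9" then "a 1"), which shows that nothing after the first field took part in the comparison.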

From debbugs-submit-bounces@debbugs.gnu.org Wed Feb 06 13:22:35 2013
Message-ID: <51129F13.6080800@redhat.com>
Date: Wed, 06 Feb 2013 11:21:07 -0700
From: Eric Blake
Organization: Red Hat, Inc.
To: Knud Arnbjerg Christensen
Cc: 13638-done@debbugs.gnu.org
Subject: Re: bug#13638: linux-sort inconsistency
In-Reply-To: <3AB1A1F128718F4DB7206297C73966EA416130DB@P2KITMBX02WC01.unicph.domain>

tag 13638 notabug
thanks

On 02/06/2013 03:49 AM, Knud Arnbjerg Christensen wrote:
> Hi
> linux-sort inconsistency occours when sorting an alfpha-numeric field,
> then the order becomes different depending on if the following field is numeric (file 1) or alfanumeric (file 2). In case one the length of the shorter fields is extended by ´zeros´ in case 2 the fields is extended by blanks which cause the different sorting order.

This is most likely a product of your locale; you may find this FAQ
addresses your issue:

https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

> sort -k 1 file1>file1-sorted

Oops - this says to use the first field _and on to the rest of the line_
as the single sort key. You probably want to limit the sort to just the
first field, using -k1,1 instead.

Extracting portions of just 3 lines that went differently between your
two invocations:

> Seq_10187 00001 x 00181 00553
> Seq_10190 00001 x 00553 01182
> Seq_101903 00001 x 00586 00331

vs.

> Seq_10187 incomplete B4DN50 Gap junction protein 640
> Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1 416
> Seq_10190 incomplete HSF2 Heat shock factor protein 2 1273

Using sort's --debug option will make it quite obvious what is going on:

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -k 1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
Seq_10187 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________
Seq_10190 incomplete
____________________
____________________

You specified the entire line as the first sort key, and in the
en_US.UTF-8 locale, punctuation (including space) is ignored during
collation. Since "903i" sorts before "90in" when spacing is removed,
that explains why the sort order differs based on whether the text after
the space is numeric or alphabetic.

Now note what happens when you force the C locale, where every byte is
significant during collation, and where "90 in" sorts before "903 i":

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | LC_ALL=C sort -k 1 --debug
sort: using simple byte comparison
Seq_10187 incomplete
____________________
____________________
Seq_10190 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________

Meanwhile, what you probably wanted is to sort by JUST the first field
(note how I added -b as suggested, and used -k1,1 instead of -k1).

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -b -k 1,1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
Seq_10187 incomplete
_________
____________________
Seq_10190 incomplete
_________
____________________
Seq_101903 incomplete
__________
_____________________

As such, I'm closing this bug report, although you may feel free to add
further comments or questions.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
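
The effect described above can be reproduced without an en_US.UTF-8 locale installed. The sketch below is only a rough model, not how locale collation is actually implemented: it deletes the blanks with tr and then compares bytes under LC_ALL=C, which stands in for "space is ignored during collation". The helper name strip_and_sort is made up for this illustration, and the sample lines are shortened versions of three rows from the report.

```shell
# Rough model only: deleting blanks and comparing bytes stands in for
# "blanks are ignored during collation"; real locale rules are more involved.
strip_and_sort() {
  tr -d ' ' | LC_ALL=C sort
}

# file2-style lines: the character following the ID is a letter.
alpha=$(printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | strip_and_sort)

# file1-style lines: the character following the ID is the digit 0.
numeric=$(printf 'Seq_10187 00001\nSeq_10190 00001\nSeq_101903 00001\n' | strip_and_sort)

printf '%s\n---\n%s\n' "$alpha" "$numeric"
```

In the first group Seq_101903 lands before Seq_10190, because "3" compares lower than "i" once the blank is gone, which matches the file2 order in the report. In the second group Seq_10190 lands before Seq_101903, because "0" compares lower than "3", which matches the file1 order.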