From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 20 21:35:55 2011
From: Randall Lewis
To: "bug-coreutils@gnu.org"
Date: Thu, 20 Jan 2011 18:40:01 -0800
Subject: "sort" bug--inconsistent single-column sorting influenced by other columns?
Message-ID: <5EC5BED18D9C0A4891B1DDA894FEC7180282B52682@SP2-EX07VS01.ds.corp.yahoo.com>

"sort" does inconsistent sorting.

I'm pretty sure it has NOTHING to do with the following warning, although I could be totally wrong.

" *** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values. "

See the attached shell script and text files.

bash-3.2$

cat test1.txt
323|1
36|2
406|3
40|4
587|5
cat test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Note that the first column is the same for both files.

sort test1.txt
323|1
36|2
40|4
406|3
587|5
sort test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

The rows are in a different order depending on the dataset--and it is NOT a numeric sort. I'm not even sure it is ANY type of sort.

sort -k1 test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1 test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Trying to fix the problem by focusing on the first column doesn't work.

sort -t "|" test1.txt
323|1
36|2
40|4
406|3
587|5
sort -t "|" test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -t '|' test1.txt
323|1
36|2
40|4
406|3
587|5
sort -t '|' test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -k1 -t "|" test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1 -t "|" test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -k1 -t '|' test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1 -t '|' test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Trying to fix the problem by including delimiter information doesn't work.

sort -k1d test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1d test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -s test1.txt
323|1
36|2
40|4
406|3
587|5
sort -s test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -s -k1 test1.txt
323|1
36|2
40|4
406|3
587|5
sort -s -k1 test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5

Neither does dictionary order or stable matching.

sort -g test1.txt
36|2
40|4
323|1
406|3
587|5
sort -g test7.txt
36|C2
40|B4
323|B1
406|B3
587|C5
sort -n test1.txt
36|2
40|4
323|1
406|3
587|5
sort -n test7.txt
36|C2
40|B4
323|B1
406|B3
587|C5

Using numeric or general sorting appears to fix the problem on this numeric example. But why did it sort inconsistently in the first place based on the other contents of the file rather than just focusing on the first column--even when I told it to?

sort test1.txt | join -a1 -a2 -t "\|" - test7.txt
323|1|B1
36|2|C2
40|4
406|3|B3
40|B4
587|5|C5

Inconsistent sorting when combined with 'join' provides incorrect matches and duplication of records. This is a mess.

sort test1.txt | sort -c
sort test7.txt | sort -c

Yet, sort -c says that it is sorted correctly.

sort test1.txt
323|1
36|2
40|4
406|3
587|5
sort test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort test1.txt | join -a1 -a2 -j1 -t "\|" -e "0" -o "1.1,1.2,2.2" - test7.txt

See COMMENTED Cygwin output.

# $ sort test1.txt
# 323|1
# 36|2
# 406|3
# 40|4
# 587|5

# $ sort test7.txt
# 323|B1
# 36|C2
# 406|B3
# 40|B4
# 587|C5

# $ sort test1.txt | join -a1 -a2 -j1 -t "|" -e "0" -o "1.1,1.2,2.2" - test7.txt
# |B1|1
# |C22
# |B3|3
# |B44
# |C5|5

And finally, Cygwin does this sort consistently across all three examples (but it does mess up the 'join'). ????? Sucks to be me with a defective Cygwin and an unreliable sort and work to get done. Any advice?

randall lewis
research scientist
ralewis@yahoo-inc.com
mobile 617-671-8294
4401 great america parkway, santa clara, ca, 95054, us

[Attachments: SortBug.sh (3098 bytes), test7.txt (38 bytes), test1.txt (33 bytes)]

From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 21 00:54:23 2011
From: Bob Proulx
To: Randall Lewis
Cc: 7878@debbugs.gnu.org
Date: Thu, 20 Jan 2011 23:02:11 -0700
Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Message-ID: <20110121060211.GA6946@hysteria.proulx.com>
In-Reply-To: <5EC5BED18D9C0A4891B1DDA894FEC7180282B52682@SP2-EX07VS01.ds.corp.yahoo.com>

Randall Lewis wrote:
> "sort" does inconsistent sorting.

You are sure about that? :-)

> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
>
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "

You read this, know that sort will base the sorting upon the locale
setting, but didn't tell us what locale you were using to sort? Shame
on you. Because you *know* I am going to ask you about it! :-)

What locale are you using? C? en_US.UTF-8? Some other? The locale
command will print this information. Here is an example from my
system.

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5
> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5

Looks okay to me for the en_US.UTF-8 locale. But it will of course be
different in the C locale.

  $ LC_ALL=en_US.UTF-8 sort test1.txt
  323|1
  36|2
  40|4
  406|3
  587|5

  $ LC_ALL=C sort test1.txt
  323|1
  36|2
  406|3
  40|4
  587|5

What ordering did you expect there? I assume you are expecting to see
these sorted as in the C locale?

> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.

It is a character sort. A string sort. It is comparing the line of
characters from start to finish. But it uses the system's collation
tables based upon the locale. In the en_US.UTF-8 locale punctuation
is ignored and case is folded. I don't like it but the powers that be
have decreed it.

Please see the FAQ:

  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

The standards documentation:

  http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html

Variables that control localization:

  http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02

> sort -k1 -t "|" test1.txt

Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing. Otherwise it compares all of the way to the end of
the line.

> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?

You never told it not to continue comparing all of the way to the end
of the line. For example this way:

  $ sort -t'|' -k1,1n -k2,2n test1.txt
  36|2
  40|4
  323|1
  406|3
  587|5

That won't help you with join since that expects a non-numeric sort
ordering.

> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.

Yes. Recent versions of join detect and warn about this. Recent
versions of sort have a --debug option that can help to identify
problem cases.

Bob

From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 21 02:22:06 2011
From: Randall Lewis
To: Bob Proulx
Cc: "7878@debbugs.gnu.org" <7878@debbugs.gnu.org>
Date: Thu, 20 Jan 2011 23:29:42 -0800
Subject: RE: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Message-ID: <5EC5BED18D9C0A4891B1DDA894FEC7180282B526B9@SP2-EX07VS01.ds.corp.yahoo.com>
In-Reply-To: <20110121060211.GA6946@hysteria.proulx.com>

Hi Bob--

Wow! So, a couple comments about how I seem to have figured out every wrong way to use "sort" when also using "join."

Who would've thought that

sort -k1 test1.txt

would default to sort on the entire line? (I normally would've thought that [,POS2] means "optional if you want to have it keep going beyond the first field.")

Also, who would've thought that the default "sort" would be incompatible with "join" and that you would need to write the command like this every time you wanted to use "join"?

LC_ALL=C sort test1.txt

Or that you would need a special type of "pre-sort" on the column (which I was executing wrong)?

sort -k1,1 -t "|" test1.txt

Regardless, here is "locale" (for the record, I'm pretty new to the utilities--and love them. I'm not a computer scientist, but rather an economist trying to fit in at Yahoo! with the engineers and computer scientists). I'm sure there's a good reason why there are two, and it's pretty clear that I'm novice enough that I'll have to learn that later.

bash-3.2$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thanks, Bob, for sharing two separate ways that I could get the answer the way I need it--two ways I could not have come up with on my own.

Thanks!

--Randall

P.S. So, the reason why sorting on the column didn't work for me was because it was plucking out the delimiter and then doing a string sort? Then it was string sorting, putting numbers before letters (as you might expect it to)?

bash-3.2$ sort test1.txt
323|1
36|2
406|3
40|7    <-- Changing the 4 to a 7 changed the sort order.
587|5

bash-3.2$ sort test1.txt
323|1
36|2
40|4
406|3
587|5

From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 21 04:18:00 2011
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Randall Lewis
Cc: "7878-done@debbugs.gnu.org" <7878@debbugs.gnu.org>
Date: Fri, 21 Jan 2011 01:25:41 -0800
Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Message-ID: <4D395115.7080904@cs.ucla.edu>
In-Reply-To: <5EC5BED18D9C0A4891B1DDA894FEC7180282B526B9@SP2-EX07VS01.ds.corp.yahoo.com>

On 01/20/2011 11:29 PM, Randall Lewis wrote:
> Also, who would've thought that the default "sort" would be incompatible with "join" and that you would need to write the command like this every time you wanted to use "join"?
>
> LC_ALL=C sort test1.txt

No, "sort" and "join" use the same collating sequence by default.
It sounds like you have a different problem: you weren't sorting
by the same field that you were joining on. For example, if you
want to use plain "join" then you need to sort via "sort -k 1b,1".
Or, if you want to use "join -t '|'" then you also need to use
"sort -k 1,1 -t '|'". This is documented in the coreutils manual.

It may be that "LC_ALL=C sort" worked around your problem on your
particular test case, but it won't work in general.
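
A minimal sketch of the pairing described above, assuming the two sample files from the original report with LF line endings. Each file is sorted on field 1 only, with the same '|' delimiter that join then uses. The scratch directory and the *.sorted and joined.txt file names are illustrative and do not come from the thread.

```shell
# Scratch directory so the example does not touch existing files.
workdir=$(mktemp -d)

# Recreate the two sample files from the report (LF line endings assumed).
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$workdir/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$workdir/test7.txt"

# Sort each file on the join field only (-k 1,1), using the same
# field separator that join will be given.
sort -t '|' -k 1,1 "$workdir/test1.txt" > "$workdir/test1.sorted"
sort -t '|' -k 1,1 "$workdir/test7.txt" > "$workdir/test7.sorted"

# Join on field 1; -a1 -a2 would also print unpairable lines from
# either file, so any mismatch in ordering shows up in the output.
join -t '|' -a1 -a2 "$workdir/test1.sorted" "$workdir/test7.sorted" \
    > "$workdir/joined.txt"
cat "$workdir/joined.txt"
```

With the key limited to field 1, 40 sorts ahead of 406 in both files, so each key should pair exactly once and no line should appear unpaired.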
From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 21 04:26:51 2011 Received: (at 7878) by debbugs.gnu.org; 21 Jan 2011 09:26:51 +0000 Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1PgDGd-0003MU-Jk for submit@debbugs.gnu.org; Fri, 21 Jan 2011 04:26:51 -0500 Received: from mrout2.yahoo.com ([216.145.54.172]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1PgDGX-0003MI-0W for 7878@debbugs.gnu.org; Fri, 21 Jan 2011 04:26:45 -0500 Received: from SP2-EX07CAS05.ds.corp.yahoo.com (sp2-ex07cas05.corp.sp2.yahoo.com [98.137.59.39]) by mrout2.yahoo.com (8.14.4/8.14.4/y.out) with ESMTP id p0L9YO9S084674; Fri, 21 Jan 2011 01:34:24 -0800 (PST) Received: from SP2-EX07VS01.ds.corp.yahoo.com ([98.137.59.29]) by SP2-EX07CAS05.ds.corp.yahoo.com ([98.137.59.39]) with mapi; Fri, 21 Jan 2011 01:34:24 -0800 From: Randall Lewis To: Paul Eggert Date: Fri, 21 Jan 2011 01:34:22 -0800 Subject: RE: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns? Thread-Topic: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns? 
Thread-Index: Acu5TT1MdWu5qtVUQuSQvT/hvw3H9QAACdxQ Message-ID: <5EC5BED18D9C0A4891B1DDA894FEC7180282B526C5@SP2-EX07VS01.ds.corp.yahoo.com> References: <5EC5BED18D9C0A4891B1DDA894FEC7180282B52682@SP2-EX07VS01.ds.corp.yahoo.com> <20110121060211.GA6946@hysteria.proulx.com> <5EC5BED18D9C0A4891B1DDA894FEC7180282B526B9@SP2-EX07VS01.ds.corp.yahoo.com> <4D395115.7080904@cs.ucla.edu> In-Reply-To: <4D395115.7080904@cs.ucla.edu> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Spam-Score: -18.6 (------------------) X-Debbugs-Envelope-To: 7878 Cc: "7878-done@debbugs.gnu.org" <7878@debbugs.gnu.org> X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -18.4 (------------------)

Thanks Paul.

Yes, it would seem that the true solution to my problem is doing the following (as you suggested):

use "sort -k 1,1 -t '|'"

This ensures that I sort on the first field--whereas "sort -k1 -t '|'" does not, as much as I wanted it to. ;) Since I was joining on only the first field I should've only been sorting on the first field. So, perhaps the only logical conflict with my usage here is that "join" works on the first field by default (as far as I can tell from join --help) while "sort" does not. But I guess this makes sense since "sort" is used for much more (bizarre use cases) than just as a pre-step to "join."

I'll read up on the coreutils manual next time.

Thanks for being patient with me and for the great feedback.
:)
--Randall

-----Original Message-----
From: Paul Eggert [mailto:eggert@cs.ucla.edu]
Sent: Friday, January 21, 2011 1:26 AM
To: Randall Lewis
Cc: 7878-done@debbugs.gnu.org
Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?

On 01/20/2011 11:29 PM, Randall Lewis wrote:
> Also, who would've thought that the default "sort" would be incompatible with "join" and that you would need to write the command like this every time you wanted to use "join"?
>
> LC_ALL=C sort test1.txt

No, "sort" and "join" use the same collating sequence by default.
It sounds like you have a different problem: you weren't sorting
by the same field that you were joining on. For example, if
you want to use plain "join" then you need to sort via
"sort -k 1b,1". Or, if you want to use "join -t '|'" then
you also need to use "sort -k 1,1 -t '|'". This is documented
in the coreutils manual.

It may be that "LC_ALL=C sort" worked around your problem on
your particular test case, but it won't work in general.

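A minimal sketch of the plain "join" case from the quoted advice, where the inputs are sorted with "sort -k 1b,1". The files left.txt and right.txt and their contents are hypothetical, and LC_ALL=C is pinned only so that the result is reproducible. The "b" modifier makes sort skip leading blanks when locating the key, which matches how join finds its first field when no -t is given.

```shell
#!/bin/sh
# Hypothetical data; LC_ALL=C is an assumption made for reproducibility.
LC_ALL=C
export LC_ALL

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# The first line of left.txt starts with a blank.
printf '%s\n' ' b 1' 'a 2' > "$workdir/left.txt"
printf '%s\n' 'b y' 'a x' > "$workdir/right.txt"

# Without "b" the leading blank is part of field 1, so " b" sorts first.
no_b=$(sort -k 1,1 "$workdir/left.txt")

# With "b" the key starts at the first non-blank character.
sort -k 1b,1 "$workdir/left.txt" > "$workdir/left.sorted"
sort -k 1b,1 "$workdir/right.txt" > "$workdir/right.sorted"

joined=$(join "$workdir/left.sorted" "$workdir/right.sorted")

printf 'sort -k 1,1:\n%s\n\n' "$no_b"
printf 'sort -k 1b,1 then join:\n%s\n' "$joined"
```

Here "sort -k 1,1" leaves the line for key b ahead of the line for key a, an order that join would treat as unsorted, while "sort -k 1b,1" yields a before b.
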
From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 21 04:37:12 2011 Received: (at 7878-done) by debbugs.gnu.org; 21 Jan 2011 09:37:12 +0000 Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1PgDQd-0003be-RG for submit@debbugs.gnu.org; Fri, 21 Jan 2011 04:37:12 -0500 Received: from joseki.proulx.com ([216.17.153.58]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1PgDQb-0003bB-Du for 7878-done@debbugs.gnu.org; Fri, 21 Jan 2011 04:37:10 -0500 Received: from hysteria.proulx.com (hysteria.proulx.com [192.168.230.119]) by joseki.proulx.com (Postfix) with ESMTP id 50AE921311; Fri, 21 Jan 2011 02:45:02 -0700 (MST) Received: by hysteria.proulx.com (Postfix, from userid 1000) id 3D2952DCD5; Fri, 21 Jan 2011 02:45:02 -0700 (MST) Date: Fri, 21 Jan 2011 02:45:02 -0700 From: Bob Proulx To: Randall Lewis Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns? Message-ID: <20110121094502.GA10491@hysteria.proulx.com> References: <5EC5BED18D9C0A4891B1DDA894FEC7180282B52682@SP2-EX07VS01.ds.corp.yahoo.com> <20110121060211.GA6946@hysteria.proulx.com> <5EC5BED18D9C0A4891B1DDA894FEC7180282B526B9@SP2-EX07VS01.ds.corp.yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5EC5BED18D9C0A4891B1DDA894FEC7180282B526B9@SP2-EX07VS01.ds.corp.yahoo.com> User-Agent: Mutt/1.5.20 (2009-06-14) X-Spam-Score: -2.4 (--) X-Debbugs-Envelope-To: 7878-done Cc: 7878-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.4 (--) Hi Randall, Randall Lewis wrote: > Wow! 
So, a couple comments about how I seem to have figured out
> every wrong way to use "sort" when also using "join."

You did have an impressive number of cases examined!

> Who would've thought that
>
>   sort -k1 test1.txt
>
> would default to sort on the entire line? (I normally would've
> thought that [,POS2] means "optional if you want to have it keep
> going beyond the first field.")

You are not the only one to have had that misconception. But that is
the way that it has always worked. Here is the GNU sort documentation.

  `-k POS1[,POS2]'
  `--key=POS1[,POS2]'
       Specify a sort field that consists of the part of the line between
       POS1 and POS2 (or the end of the line, if POS2 is omitted),
       _inclusive_.

This behavior goes back at least to Unix v7 days and actually very
likely well before that time. When you are a programmer in the
mid-1970s writing a sorting program, and you make a simple decision
about how to control sorting using command line arguments, would you
have had any idea that we would still be using virtually the same
program and interface in 2011, forty years later? And you are working
on the problem for what amounts to the first time on a new operating
system. Having done interface design and having been less successful
I can't complain. :-)

Some of the decisions were less than great. Other decisions were
excellent and visionary. On average they were better than most of us
can do on our best days.

> Also, who would've thought that the default "sort" would be
> incompatible with "join" and that you would need to write the
> command like this every time you wanted to use "join"?

When sort and join were written they were compatible. Back then the
collation sequence was strictly byte ordering. That is the standard C
locale ordering. It wasn't until more recently, when locales such as
en_US were introduced, that the problems appeared.
For reasons unfathomable to me the powers that be made sort ordering
dictionary ordering where case is folded and punctuation is ignored.
They failed to see how this would negatively impact almost everything.
Creeping features. Because punctuation is ignored in the en_US locale
it causes a lot of problems. You didn't have to say LC_ALL=C for the
first thirty years.

Don't get me started. I have been a rather outspoken critic of this
design decision. Personally I have the following set in my shell
environment.

  export LANG=en_US.UTF-8
  export LC_COLLATE=C

I want the traditional collation sequence and so set LC_COLLATE. But
I also want the fancy new characters with umlauts and that requires
(along with a Unicode charset) a UTF-8 capable locale. The above is a
compromise but for me a good one.

> LC_ALL=C sort test1.txt
>
> Or that you would need a special type of "pre-sort" on the column
> (which I was executing wrong)?
>
> sort -k1,1 -t "|" test1.txt

Since you had two fields you probably want to sort on the second field
too.

  sort -k1,1 -k2,2 -t "|" test1.txt

That will sort on the first field and then the second field.

> Regardless, here is "locale" (for the record, I'm pretty new to the
> utilities--and love them. I'm not a computer scientist, but rather
> an economist trying to fit in at Yahoo! with the engineers and
> computer scientists). I'm sure there's a good reason why there are
> two, and it's pretty clear that I novice enough that I'll have to
> learn that later.

I didn't follow where the "two" was attached. Two as in economists
and computer scientists? Or two as in engineers and computer
scientists? Full disclosure: I am an electrical engineer. :-)

> Thanks, Bob, for sharing two separate ways that I could get the
> answer the way I need it--two ways I could not have come up with on
> my own.

Just to nudge in a particular direction there are two other mailing
lists that are good to know about.
The coreutils@gnu.org mailing list is for general discussion of the
coreutils. Here on bug-coreutils is where bug reports are collected;
every message thread opens a bug ticket in the bug tracking system.
That is great for bug reports, but not so good for general discussion,
since it keeps opening bugs that need to be triaged. That is why we
have the coreutils mailing list, which is just a normal list for
normal discussion. Additionally there is a list for general help,
help-gnu-utils@gnu.org, that is also a good resource.

> P.S. So, the reason why sorting on the column didn't work for me was
> because it was plucking out the delimiter and then doing a string
> sort?

Correct.

> Then it was string sorting, putting numbers before letters (as
> you might expect it to)?

It would look like this to sort:

  $ sed 's/[[:punct:]]//' test1.txt
  3231
  362
  4063
  404
  5875

  $ sed 's/[[:punct:]]//' test1.txt | LC_ALL=C sort
  3231
  362
  404
  4063
  5875

> 323|1
> 36|2
> 406|3
> 40|7 <-- Changed from 4 to 7 changed the sort order.
> 587|5

  $ sed 's/[[:punct:]]//' test1.txt | LC_ALL=C sort
  3231
  362
  4063
  407
  5875

And case is folded too. But that didn't come into play here.

And this affects everything that sorts everywhere on the system.
Including the shell.

  echo *
  for f in *; do ...
  ls

Bob

From unknown Sat Jun 21 05:17:20 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Fri, 18 Feb 2011 12:24:04 +0000 User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator