“sort” does inconsistent sorting.

 

I’m pretty sure it has NOTHING to do with the following warning, although I could be totally wrong.

 

“*** WARNING ***

The locale specified by the environment affects sort order.

Set LC_ALL=C to get the traditional sort order that uses

native byte values.”

 

 

See the attached shell script and text files.

 

bash-3.2$ cat test1.txt

323|1

36|2

406|3

40|4

587|5

cat test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

Note that the first column is the same for both files.

 

sort test1.txt

323|1

36|2

40|4

406|3

587|5

sort test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

The rows are in a different order depending on the dataset--and it is NOT a numeric sort. I'm not even sure it is ANY type of sort.

 

sort -k1 test1.txt

323|1

36|2

40|4

406|3

587|5

sort -k1 test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

Trying to fix the problem by focusing on the first column doesn't work.

 

sort -t "|" test1.txt

323|1

36|2

40|4

406|3

587|5

sort -t "|" test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

sort -t '|' test1.txt

323|1

36|2

40|4

406|3

587|5

sort -t '|' test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

sort -k1 -t "|" test1.txt

323|1

36|2

40|4

406|3

587|5

sort -k1 -t "|" test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

sort -k1 -t '|' test1.txt

323|1

36|2

40|4

406|3

587|5

sort -k1 -t '|' test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

Trying to fix the problem by including delimiter information doesn't work.
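
For reference, a key given as -k1 with no end position runs from field 1 to the end of the line, so the second column still takes part in the comparison. The following is a minimal sketch, not a confirmed fix: it recreates the two files in a scratch directory (tmp, key1 and key7 are illustrative names) and ends the key at field 1 with -k1,1. Under that assumption both files should come out with the first column in the same order (323, 36, 40, 406, 587).

```shell
# Recreate the two sample files in a scratch directory (tmp is an illustrative name).
tmp=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$tmp/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$tmp/test7.txt"

# -t '|' sets the field separator; -k1,1 starts AND ends the key at field 1,
# so the second column is no longer part of the sort key.
key1=$(sort -t '|' -k1,1 "$tmp/test1.txt")
key7=$(sort -t '|' -k1,1 "$tmp/test7.txt")

printf '%s\n\n%s\n' "$key1" "$key7"
rm -rf "$tmp"
```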

sort -k1d test1.txt

323|1

36|2

40|4

406|3

587|5

sort -k1d test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

sort -s test1.txt

323|1

36|2

40|4

406|3

587|5

sort -s test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

sort -s -k1 test1.txt

323|1

36|2

40|4

406|3

587|5

sort -s -k1 test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

Neither dictionary order (-d) nor a stable sort (-s) helps.

sort -g test1.txt

36|2

40|4

323|1

406|3

587|5

sort -g test7.txt

36|C2

40|B4

323|B1

406|B3

587|C5

sort -n test1.txt

36|2

40|4

323|1

406|3

587|5

sort -n test7.txt

36|C2

40|B4

323|B1

406|B3

587|C5

Using numeric or general sorting appears to fix the problem on this numeric example. But why did it sort inconsistently in the first place based on the other contents of the file rather than just focusing on the first column--even when I told it to?
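
One way to test whether the locale is involved after all is to rerun the plain sort with LC_ALL=C, as the quoted warning suggests. This is a hedged sketch, assuming the cause is locale collation that gives little or no weight to the '|' character (that assumption is not confirmed by anything above). In byte order '6' sorts before '|', so both files should put the 406 row ahead of the 40 row. The names tmp, c1 and c7 are illustrative.

```shell
# Recreate the two sample files in a scratch directory (tmp is an illustrative name).
tmp=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$tmp/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$tmp/test7.txt"

# LC_ALL=C makes sort compare raw byte values instead of using
# the collation rules of the locale set in the environment.
c1=$(LC_ALL=C sort "$tmp/test1.txt")
c7=$(LC_ALL=C sort "$tmp/test7.txt")

printf '%s\n\n%s\n' "$c1" "$c7"
rm -rf "$tmp"
```

If the two listings now agree on the row order, the locale is the likely culprit; if they still differ, something else is going on.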

sort test1.txt | join -a1 -a2 -t "\|" - test7.txt

323|1|B1

36|2|C2

40|4

406|3|B3

40|B4

587|5|C5

Inconsistent sorting, when combined with 'join', produces incorrect matches and duplicated records. This is a mess.
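
join expects both inputs to be sorted on the join field under the same collation it uses itself. The following is a sketch under the assumption that the mismatch comes from the locale: it sorts both files and runs join with LC_ALL=C, so all three steps compare bytes the same way. The names tmp and joined are illustrative.

```shell
# Recreate the two sample files in a scratch directory (tmp is an illustrative name).
tmp=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$tmp/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$tmp/test7.txt"

# Sort the second file in byte order first, so join sees consistently ordered input.
LC_ALL=C sort "$tmp/test7.txt" > "$tmp/test7.sorted"

# Sort the first file and join on field 1, all under the same C collation.
# -a1 -a2 also prints unpaired lines from either file.
joined=$(LC_ALL=C sort "$tmp/test1.txt" | LC_ALL=C join -a1 -a2 -t '|' - "$tmp/test7.sorted")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

With consistently ordered inputs, each key should pair up once, with no stray unpaired 40 rows.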

sort test1.txt | sort -c

sort test7.txt | sort -c

Yet, sort -c says that it is sorted correctly.

sort test1.txt

323|1

36|2

40|4

406|3

587|5

sort test7.txt

323|B1

36|C2

406|B3

40|B4

587|C5

sort test1.txt | join -a1 -a2 -j1 -t "\|" -e "0" -o "1.1,1.2,2.2" - test7.txt

See COMMENTED Cygwin output.

 

# $ sort test1.txt

# 323|1

# 36|2

# 406|3

# 40|4

# 587|5

 

# $ sort test7.txt

# 323|B1

# 36|C2

# 406|B3

# 40|B4

# 587|C5

 

# $ sort test1.txt | join -a1 -a2 -j1 -t "|" -e "0" -o "1.1,1.2,2.2" - test7.txt

# |B1|1

# |C22

# |B3|3

# |B44

# |C5|5

 

 

And finally, Cygwin does this sort consistently across all three examples (but it does mess up the 'join'). ????? Sucks to be me with a defective Cygwin and an unreliable sort and work to get done. Any advice?
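
The shape of the Cygwin join output (the key column apparently overwritten by the second file's field) is what a terminal shows when each line ends in a carriage return. This is only a guess: it assumes the files on the Cygwin side have Windows CRLF line endings, which nothing above confirms. A minimal sketch that simulates such files and strips the carriage returns before sorting and joining (tmp and fixed are illustrative names):

```shell
# Simulate the two sample files with CRLF line endings (an assumption, see above).
tmp=$(mktemp -d)
printf '%s\r\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$tmp/test1.txt"
printf '%s\r\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$tmp/test7.txt"

# Strip the carriage returns, then sort the second file in byte order.
tr -d '\r' < "$tmp/test7.txt" | LC_ALL=C sort > "$tmp/test7.clean"

# Same join as in the commented Cygwin run, on cleaned and consistently sorted input.
fixed=$(tr -d '\r' < "$tmp/test1.txt" | LC_ALL=C sort |
        LC_ALL=C join -a1 -a2 -j 1 -t '|' -e '0' -o '1.1,1.2,2.2' - "$tmp/test7.clean")

printf '%s\n' "$fixed"
rm -rf "$tmp"
```

Running od -c on one of the real files would show whether the \r characters are actually there.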

 

 

randall lewis
research scientist
 
ralewis@yahoo-inc.com
mobile 617-671-8294
 
4401 great america parkway, santa clara, ca, 95054, us