“sort” does inconsistent sorting.
I’m pretty sure it has NOTHING to do with the
following warning, although I could be totally wrong.
“*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.”
See the attached shell script and text files.
bash-3.2$ cat test1.txt
323|1
36|2
406|3
40|4
587|5
cat test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
Note that the first column is the same for both
files.
sort test1.txt
323|1
36|2
40|4
406|3
587|5
sort test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
The rows come out in a different order depending on the
dataset, and it is NOT a numeric sort. I'm not even sure it is ANY type of
sort.
sort -k1 test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1 test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
Trying to fix the problem by focusing on the first
column doesn't work.
sort -t "|" test1.txt
323|1
36|2
40|4
406|3
587|5
sort -t "|" test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -t '|' test1.txt
323|1
36|2
40|4
406|3
587|5
sort -t '|' test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -k1 -t "|" test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1 -t "|" test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -k1 -t '|' test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1 -t '|' test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
Trying to fix the problem by including delimiter
information doesn't work.
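For reference, `-k1` on its own defines a key that starts at field 1 and runs to the end of the line, so it does not isolate the first column. The form that stops at the first field is `-k1,1` together with `-t`. A minimal sketch, assuming a POSIX shell with `mktemp`; the scratch directory and variable names are only illustrative:

```shell
# Recreate the two sample files from this report in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$workdir/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$workdir/test7.txt"

# -t '|' sets the field separator; -k1,1 starts AND ends the key at field 1.
text1=$(sort -t '|' -k1,1 "$workdir/test1.txt")
text7=$(sort -t '|' -k1,1 "$workdir/test7.txt")

# Appending n compares that single field as a number instead of as text.
num1=$(sort -t '|' -k1,1n "$workdir/test1.txt")
num7=$(sort -t '|' -k1,1n "$workdir/test7.txt")

printf '%s\n' "$text1" "$text7" "$num1" "$num7"
rm -rf "$workdir"
```

With the key limited this way, the second column should no longer influence the order, so both files should come out in the same first-column order for a given key type.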
sort -k1d test1.txt
323|1
36|2
40|4
406|3
587|5
sort -k1d test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -s test1.txt
323|1
36|2
40|4
406|3
587|5
sort -s test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort -s -k1 test1.txt
323|1
36|2
40|4
406|3
587|5
sort -s -k1 test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
Neither does dictionary order (-d) or a stable sort (-s).
sort -g test1.txt
36|2
40|4
323|1
406|3
587|5
sort -g test7.txt
36|C2
40|B4
323|B1
406|B3
587|C5
sort -n test1.txt
36|2
40|4
323|1
406|3
587|5
sort -n test7.txt
36|C2
40|B4
323|B1
406|B3
587|C5
Using numeric or general sorting appears to fix the
problem on this numeric example. But why did it sort inconsistently in the
first place based on the other contents of the
file rather than just focusing on the first
column--even when I told it to?
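One check that would confirm or rule out the locale warning is to rerun the plain sort with LC_ALL=C, as the warning itself suggests, and compare the first-column order of the two results. A minimal sketch, assuming a POSIX shell with `mktemp`; the scratch directory and variable names are only illustrative:

```shell
# Recreate the two sample files from this report in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$workdir/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$workdir/test7.txt"

# Sort both files with the locale forced to C (plain byte-value comparison).
c_sorted1=$(LC_ALL=C sort "$workdir/test1.txt")
c_sorted7=$(LC_ALL=C sort "$workdir/test7.txt")

# Keep only the first column of each result so the two orders can be compared.
order1=$(printf '%s\n' "$c_sorted1" | cut -d '|' -f 1)
order7=$(printf '%s\n' "$c_sorted7" | cut -d '|' -f 1)

if [ "$order1" = "$order7" ]; then
    echo "first-column order matches under LC_ALL=C"
else
    echo "first-column order still differs under LC_ALL=C"
fi
rm -rf "$workdir"
```

If the orders agree here but differ when LC_ALL is not set, the collation rules of the active locale are the likely explanation for the difference.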
sort test1.txt | join -a1 -a2 -t "\|" - test7.txt
323|1|B1
36|2|C2
40|4
406|3|B3
40|B4
587|5|C5
Inconsistent sorting when combined with 'join'
provides incorrect matches and duplication of records. This is a mess.
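For comparison, `join` expects both inputs to be sorted on the join field, using the same collation that `join` itself runs under. A minimal sketch of one way to arrange that, assuming a POSIX shell with `mktemp`; pinning everything to LC_ALL=C and the scratch file names are choices made for this illustration, not the only option:

```shell
# Recreate the two sample files from this report in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' '323|1' '36|2' '406|3' '40|4' '587|5' > "$workdir/test1.txt"
printf '%s\n' '323|B1' '36|C2' '406|B3' '40|B4' '587|C5' > "$workdir/test7.txt"

# Sort each input on the first field only, in the same locale join will use.
LC_ALL=C sort -t '|' -k1,1 "$workdir/test1.txt" > "$workdir/sorted1.txt"
LC_ALL=C sort -t '|' -k1,1 "$workdir/test7.txt" > "$workdir/sorted7.txt"

# Full outer join on field 1; unpaired lines from either side are kept.
joined=$(LC_ALL=C join -t '|' -a1 -a2 "$workdir/sorted1.txt" "$workdir/sorted7.txt")
printf '%s\n' "$joined"
rm -rf "$workdir"
```

On the sample data every key exists in both files, so each output line should carry both second columns and no key should appear twice.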
sort test1.txt | sort -c
sort test7.txt | sort -c
Yet, sort -c says that it is sorted correctly.
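Note that `sort -c` only verifies the input against the ordering rule that the same `sort` invocation would apply, so passing it does not show that the file is ordered on the first field alone. A minimal sketch, assuming a POSIX shell and using the test7.txt order reported above; the variable names are illustrative:

```shell
# The order that plain `sort test7.txt` produced in this report.
reported='323|B1
36|C2
406|B3
40|B4
587|C5'

# Check 1: whole-line comparison in the C locale.
if printf '%s\n' "$reported" | LC_ALL=C sort -c 2>/dev/null; then
    whole_line=sorted
else
    whole_line=unsorted
fi

# Check 2: first field only, same locale.
if printf '%s\n' "$reported" | LC_ALL=C sort -c -t '|' -k1,1 2>/dev/null; then
    first_field=sorted
else
    first_field=unsorted
fi

echo "whole line: $whole_line, first field: $first_field"
```

The same five lines pass one check and fail the other, because "406" sorts after "40" when only the first field is compared.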
sort test1.txt
323|1
36|2
40|4
406|3
587|5
sort test7.txt
323|B1
36|C2
406|B3
40|B4
587|C5
sort test1.txt | join -a1 -a2 -j1 -t "\|" -e "0" -o "1.1,1.2,2.2" - test7.txt
See COMMENTED Cygwin output.
# $ sort test1.txt
# 323|1
# 36|2
# 406|3
# 40|4
# 587|5
# $ sort test7.txt
# 323|B1
# 36|C2
# 406|B3
# 40|B4
# 587|C5
# $ sort test1.txt | join -a1 -a2 -j1 -t "|" -e "0" -o "1.1,1.2,2.2" - test7.txt
# |B1|1
# |C22
# |B3|3
# |B44
# |C5|5
And finally, Cygwin does this sort consistently
across all three examples (but it does mess up the 'join'). ????? Sucks to be
me with a defective Cygwin and an unreliable sort
and work to get done. Any advice?
randall lewis
research scientist
ralewis@yahoo-inc.com
mobile 617-671-8294
4401 great