tag 13638 notabug
thanks

On 02/06/2013 03:49 AM, Knud Arnbjerg Christensen wrote:
> Hi
> linux-sort inconsistency occours when sorting an alfpha-numeric field,
> then the order becomes different depending on if the following field is
> numeric (file 1) or alfanumeric (file 2). In case one the length of the
> shorter fields is extended by ´zeros´ in case 2 the fields is extended
> by blanks which cause the different sorting order.

This is most likely a product of your locale; you may find this FAQ
addresses your issue:

https://www.gnu.org/software/coreutils/faq/coreutils-faq.html#Sort-does-not-sort-in-normal-order_0021

> sort -k 1 file1>file1-sorted

Oops - this says to use the first field _and on to the rest of the
line_ as the single sort key. You probably want to limit the sort to
just the first field, using -k1,1 instead.

Extracting portions of just 3 lines that went differently between your
two invocations:

> Seq_10187 00001 x 00181 00553
> Seq_10190 00001 x 00553 01182
> Seq_101903 00001 x 00586 00331

vs.

> Seq_10187 incomplete B4DN50 Gap junction protein 640
> Seq_101903 incomplete FAIM1 Fas apoptotic inhibitory molecule 1 416
> Seq_10190 incomplete HSF2 Heat shock factor protein 2 1273

Using sort's --debug option will make it quite obvious what is going on:

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -k 1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
Seq_10187 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________
Seq_10190 incomplete
____________________
____________________

You specified the entire line as the first sort key, and in the
en_US.UTF-8 locale, punctuation (including space) is ignored during
collation. Since "903i" sorts before "90in" when spacing is removed,
that explains why the sort order differs based on whether the text after
the space is numeric or alphabetic.
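As a rough illustration of that effect (not how the locale is actually implemented: real en_US.UTF-8 collation is a multi-level comparison), you can delete the spaces yourself and then compare plain bytes. The helper name sort_ignoring_spaces is made up for this sketch; it only approximates "space is ignored" for this particular input.

```shell
# Hypothetical helper: approximate "spaces are ignored during collation"
# by stripping every space and then sorting with plain byte comparison.
# This is a simplification of locale collation, not a replacement for it.
sort_ignoring_spaces() {
    # tr deletes all spaces; LC_ALL=C makes sort compare raw bytes
    tr -d ' ' | LC_ALL=C sort
}

printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' |
    sort_ignoring_spaces
```

With the spaces gone, "Seq_101903incomplete" lands before "Seq_10190incomplete", because the byte "3" compares lower than "i". That is the same ordering the en_US.UTF-8 run produced.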

Now note what happens when you force the C locale, where every byte is
significant during collation, and where "90 in" sorts before "903 i":

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | LC_ALL=C sort -k 1 --debug
sort: using simple byte comparison
Seq_10187 incomplete
____________________
____________________
Seq_10190 incomplete
____________________
____________________
Seq_101903 incomplete
_____________________
_____________________

Meanwhile, what you probably wanted is to sort by JUST the first field
(note how I added -b as suggested, and used -k1,1 instead of -k1).

$ printf 'Seq_10187 incomplete\nSeq_10190 incomplete\nSeq_101903 incomplete\n' | sort -b -k 1,1 --debug
sort: using ‘en_US.UTF-8’ sorting rules
Seq_10187 incomplete
_________
____________________
Seq_10190 incomplete
_________
____________________
Seq_101903 incomplete
__________
_____________________

As such, I'm closing this bug report, although you may feel free to add
further comments or questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org