tag 19329 notabug
thanks

On 12/09/2014 05:47 AM, throwaway1024@nurfuerspam.de wrote:
> Dear developers,
> The following command results in a document where valbug.txt contains multiple
> duplicate lines (e.g. "(TDCYXT)" without quotes)

Thanks for the report.

> --------
> cat validation.csv | sed 's!\r\n!\n!g' | sed 's! !\n!g' | LC_ALL=C sort | uniq
> > valbug.txt

Useless use of cat, and unnecessary use of two sed processes. Why not
just use:

sed 's!\r\n!\n!g; s! !\n!g' validation.csv | LC_ALL=C sort | uniq

> --------
> It is definitely a problem of uniq, since the second command line of:
> --------
> cat validation.csv | sed 's!\r\n!\n!g' | sed 's! !\n!g' | LC_ALL=C sort > valo.txt
> cat valo.txt | uniq > valbug.txt

Again, useless use of cat; why not:

uniq < valo.txt > valbug.txt

> --------
> would result in the same bug and the valo.txt file is sorted correctly.

The problem is NOT in uniq, but in your incorrect usage of sed.

$ file valo.txt
valo.txt: ASCII text, with CRLF, LF line terminators

Observe that you have created a file with mixed line endings, and what's
more, that you have not gotten rid of the carriage returns. So let's look
a bit closer at the output of that sort command (which in turn is based
on your botched sed command):

$ sed 's!\r\n!\n!g; s! !\n!g' validation.csv | LC_ALL=C sort \
  | grep '(TDCYXT)' | od -tx1z
0000000 28 54 44 43 59 58 54 29 0a 28 54 44 43 59 58 54  >(TDCYXT).(TDCYXT<
0000020 29 0a 28 54 44 43 59 58 54 29 0a 28 54 44 43 59  >).(TDCYXT).(TDCY<
0000040 58 54 29 0a 28 54 44 43 59 58 54 29 0a 28 54 44  >XT).(TDCYXT).(TD<
0000060 43 59 58 54 29 0a 28 54 44 43 59 58 54 29 0a 28  >CYXT).(TDCYXT).(<
0000100 54 44 43 59 58 54 29 0d 0a 28 54 44 43 59 58 54  >TDCYXT)..(TDCYXT<
0000120 29 0d 0a 28 54 44 43 59 58 54 29 0d 0a 28 54 44  >)..(TDCYXT)..(TD<
0000140 43 59 58 54 29 0d 0a 28 54 44 43 59 58 54 29 0d  >CYXT)..(TDCYXT).<
0000160 0a 28 54 44 43 59 58 54 29 0d 0a 42 4e 58 4e 28  >.(TDCYXT)..BNXN(<
0000200 54 44 43 59 58 54 29 0a 42 4e 58 4e 28 54 44 43  >TDCYXT).BNXN(TDC<
0000220 59 58 54 29 0a 42 4e 58 4e 28 54 44 43 59 58 54  >YXT).BNXN(TDCYXT<
0000240 29 0d 0a 42 4e 58 4e 28 54 44 43 59 58 54 29 0d  >)..BNXN(TDCYXT).<
0000260 0a                                               >.<
0000261

Look closely - you'll see some of the lines have \r still embedded in
them. Which means that the lines are NOT identical, and therefore uniq is
CORRECTLY showing the line twice in the output (once without \r, once
with).

So the problem to figure out is why \r is still in the output. That's
because your use of sed did NOT do what you think. Remember, in sed, the
way things work is that sed FIRST reads in a line of input (excluding the
trailing newline, but including a carriage return), then operates on
that line in memory, then outputs the line (adding back a newline on
output). While manipulating a line in memory, you can add newlines, and
you can match newlines against the pattern in memory, but the line you
read from input will NOT include any newlines unless you first
manipulate the line to add newlines.

Thus:

's!\r\n!\n!g'

is a no-op when there were no preceding operations to manipulate the
in-memory representation. There are NO lines of the input that include a
\n to be matched, so you ended up NOT stripping any carriage returns.
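
To make the read/operate/output cycle concrete, here is a minimal
sketch, assuming GNU sed (which accepts \r and \n as regex escapes).
The helper names hex, crlf_noop, crlf_joined and crlf_stripped are made
up for this illustration and are not part of the report. It feeds the
same two CRLF-terminated lines through three sed scripts and dumps the
resulting bytes as hex, so you can see exactly when a newline is
available to match against:

```shell
#!/bin/sh
# Sketch only: assumes GNU sed, where \r and \n are valid regex escapes.
# hex, crlf_noop, crlf_joined and crlf_stripped are hypothetical names.

# Dump stdin as one unbroken string of hex byte values.
hex() { od -An -tx1 | tr -d ' \n'; }

# Line at a time: the pattern space holds "a\r", then "b\r". It never
# contains a newline, so \r\n cannot match and both CRs survive.
crlf_noop() { printf 'a\r\nb\r\n' | sed 's/\r\n/\n/g' | hex; }

# N appends the next input line, so the pattern space becomes "a\r\nb\r".
# Now \r\n matches once (between the lines); the final CR is still there.
crlf_joined() { printf 'a\r\nb\r\n' | sed 'N; s/\r\n/\n/' | hex; }

# Anchoring on end of line strips exactly the CR that ends each line.
crlf_stripped() { printf 'a\r\nb\r\n' | sed 's/\r$//' | hex; }

echo "no-op:    $(crlf_noop)"      # 610d0a620d0a
echo "joined:   $(crlf_joined)"    # 610a620d0a
echo "stripped: $(crlf_stripped)"  # 610a620a
```

The middle case is only there to show that \n becomes matchable once a
command such as N has put one into the pattern space; it is not a
recommended way to convert line endings.
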

You MEANT to do something like:

$ sed 's!\r!!g; s! !\n!g' validation.csv | LC_ALL=C sort \
  | uniq | grep '(TDCYXT)'
(TDCYXT)
BNXN(TDCYXT)

> I also could not reduce the example since using only a few lines around the
> buggy lines would make it work again correctly.

1.4 megabytes was a bit much; the problem can be demonstrated in a much
smaller input space:

$ printf 'a\r\na\n' | sed 's/\r\n/\n/g' | uniq | od -tx1z
0000000 61 0d 0a 61 0a                                   >a..a.<
0000005
$ printf 'a\r\na\n' | sed 's/\r//g' | uniq | od -tx1z
0000000 61 0a                                            >a.<
0000002

Also, it appears that the data you posted may have used some sort of
simple substitution cypher; 1.4 megabytes might be plenty of information
for someone to reverse-engineer the original text, so I hope you didn't
leak something that you thought was secure.

> You can find all files in attached zip or under

You do realize that .zip is not the best file format in the world; there
are other file formats that both compress better, and are more preferred
in the free software world because they do not suffer from patent
problems like zip.

I've marked this as not a bug, but feel free to add further comments or
replies if you have more questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org