tag 19329 notabug
thanks

On 12/09/2014 05:47 AM, throwaway1024@nurfuerspam.de wrote:
> Dear developers,
> The following command results in a document where valbug.txt contains multiple 
> duplicate lines (e.g. "(TDCYXT)" without quotes)

Thanks for the report.

> --------
> cat validation.csv | sed 's!\r\n!\n!g' | sed 's! !\n!g' | LC_ALL=C sort | uniq > 
> valbug.txt

Useless use of cat, and unnecessary use of two sed processes.  Why not
just use:

sed 's!\r\n!\n!g; s! !\n!g' validation.csv | LC_ALL=C sort | uniq

> --------
> It is definitely a problem of uniq, since the second command line of:
> --------
> cat validation.csv | sed 's!\r\n!\n!g' | sed 's! !\n!g' | LC_ALL=C sort > valo.txt
> cat valo.txt | uniq > valbug.txt

Again, useless use of cat; why not:

uniq < valo.txt > valbug.txt

> --------
> would result in the same bug and the valo.txt file is sorted correctly.

The problem is NOT in uniq, but in your incorrect usage of sed.

$ file valo.txt
valo.txt: ASCII text, with CRLF, LF line terminators

Observe that you have created a file with mixed line endings, and what's
more, that you have not gotten rid of the carriage returns.  So let's look
a bit closer at the output of that sort command (which in turn is based on
your botched sed command):

$ sed 's!\r\n!\n!g; s! !\n!g' validation.csv | LC_ALL=C sort \
  | grep '(TDCYXT)' | od -tx1z
0000000 28 54 44 43 59 58 54 29 0a 28 54 44 43 59 58 54  >(TDCYXT).(TDCYXT<
0000020 29 0a 28 54 44 43 59 58 54 29 0a 28 54 44 43 59  >).(TDCYXT).(TDCY<
0000040 58 54 29 0a 28 54 44 43 59 58 54 29 0a 28 54 44  >XT).(TDCYXT).(TD<
0000060 43 59 58 54 29 0a 28 54 44 43 59 58 54 29 0a 28  >CYXT).(TDCYXT).(<
0000100 54 44 43 59 58 54 29 0d 0a 28 54 44 43 59 58 54  >TDCYXT)..(TDCYXT<
0000120 29 0d 0a 28 54 44 43 59 58 54 29 0d 0a 28 54 44  >)..(TDCYXT)..(TD<
0000140 43 59 58 54 29 0d 0a 28 54 44 43 59 58 54 29 0d  >CYXT)..(TDCYXT).<
0000160 0a 28 54 44 43 59 58 54 29 0d 0a 42 4e 58 4e 28  >.(TDCYXT)..BNXN(<
0000200 54 44 43 59 58 54 29 0a 42 4e 58 4e 28 54 44 43  >TDCYXT).BNXN(TDC<
0000220 59 58 54 29 0a 42 4e 58 4e 28 54 44 43 59 58 54  >YXT).BNXN(TDCYXT<
0000240 29 0d 0a 42 4e 58 4e 28 54 44 43 59 58 54 29 0d  >)..BNXN(TDCYXT).<
0000260 0a                                               >.<
0000261

Look closely - you'll see some of the lines have \r still embedded in
them.  Which means that the lines are NOT unique, and therefore uniq is
CORRECTLY showing the line twice in the output (once without \r, once with).
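
If reading a hex dump is tedious, a quicker check is to count the lines
that still end in a carriage return.  This is only a sketch: the name
count_cr_lines is made up for illustration, and the sample input is
synthetic rather than taken from your file.

```shell
# Hypothetical helper: count the stdin lines that end in a carriage return.
count_cr_lines() {
  cr=$(printf '\r')      # a literal CR, so the regex does not depend on \r
  grep -c "${cr}\$"      # grep strips the \n terminator before matching
}

# Synthetic sample: two CRLF-terminated lines and one LF-terminated line.
printf 'a\r\nb\n(TDCYXT)\r\n' | count_cr_lines
```

On the sample above this prints 2; any non-zero count on your sorted
file means uniq will see those lines as distinct from their LF-only twins.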

So the problem to figure out is why \r is still in the output.  That's
because your use of sed did NOT do what you think.  Remember, in sed,
the way things work is that sed FIRST reads in a line of input
(excluding the trailing newline, but including a carriage return), then
operates on that line in memory, then outputs the line (adding back a
newline on output).  While manipulating a line in memory, you can add
newlines, and you can match newlines against the pattern in memory, but
the line you read from input will NOT include any newlines unless you
first manipulate the line to add newlines.  Thus:

's!\r\n!\n!g'

is a no-op when there were no preceding operations to manipulate the
in-memory representation.  There are NO lines of the input that include
a \n to be matched, so you ended up NOT stripping any carriage returns.
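
You can see this for yourself by asking sed to print only those lines
whose in-memory copy contains a newline.  The following is a small
sketch on synthetic input; both function names are made up for
illustration.  Reading the input alone never produces a match, while
first pulling in the next line with N does:

```shell
# Hypothetical helper: print pattern spaces that contain an embedded newline.
lines_with_newline() {
  sed -n '/\n/p'
}

# Same check, but N first appends the next input line to the pattern space,
# joined by a newline, so now there is a \n for the regex to match.
pairs_with_newline() {
  sed -n 'N; /\n/p'
}

printf 'a\r\nb\r\n' | lines_with_newline | wc -l   # no line ever matches
printf 'a\r\nb\r\n' | pairs_with_newline | wc -l   # the joined pair is printed
```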

You MEANT to do something like:

$ sed 's!\r!!g; s! !\n!g' validation.csv | LC_ALL=C sort \
  | uniq | grep '(TDCYXT)'
(TDCYXT)
BNXN(TDCYXT)
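
Since neither substitution actually needs a regular expression, the same
cleanup could also be done with tr, with sort -u standing in for
sort | uniq.  This is a sketch rather than a drop-in replacement: the
name split_sort_unique is made up for illustration, and the sample
input is synthetic.

```shell
# Hypothetical helper: delete every CR, turn spaces into newlines,
# then sort bytewise and drop duplicate lines.
split_sort_unique() {
  tr -d '\r' | tr ' ' '\n' | LC_ALL=C sort -u
}

# Synthetic sample mixing CRLF and LF endings.
printf 'b a\r\na\nb\r\n' | split_sort_unique
```

With the carriage returns gone before sorting, the CRLF and LF copies of
each word collapse into a single line.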

> I also could not reduce the example since using only a few lines around the 
> buggy lines would make it work again correctly.

1.4 megabytes was a bit much; the problem can be demonstrated in a much
smaller input space:

$ printf 'a\r\na\n' | sed 's/\r\n/\n/g' | uniq | od -tx1z
0000000 61 0d 0a 61 0a                                   >a..a.<
0000005
$ printf 'a\r\na\n' | sed 's/\r//g' | uniq | od -tx1z
0000000 61 0a                                            >a.<
0000002

Also, it appears that the data you posted may have used some sort of
simple substitution cypher; 1.4 megabytes might be plenty of information
for someone to reverse-engineer the original text, so I hope you didn't
leak something that you thought was secure.

> You can find all files in attached zip or under 

You do realize that .zip is not the best file format in the world; there
are other file formats that both compress better and are preferred in
the free software world because they do not suffer from patent problems
like zip.

I've marked this as not a bug, but feel free to add further comments or
replies if you have more questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org