GNU bug report logs - #19329
Uniq Bug

Package: coreutils;

Reported by: throwaway1024 <at> nurfuerspam.de

Date: Tue, 9 Dec 2014 16:07:01 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.


Report forwarded to bug-coreutils <at> gnu.org:
bug#19329; Package coreutils. (Tue, 09 Dec 2014 16:07:02 GMT)

Acknowledgement sent to throwaway1024 <at> nurfuerspam.de:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 09 Dec 2014 16:07:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: throwaway1024 <at> nurfuerspam.de
To: bug-coreutils <at> gnu.org
Subject: Uniq Bug
Date: Tue, 9 Dec 2014 13:47:01 +0100
[Message part 1 (text/html, inline)]
[Uniq Bug.zip (application/zip, attachment)]

Added tag(s) notabug. Request was from Eric Blake <eblake <at> redhat.com> to control <at> debbugs.gnu.org. (Tue, 09 Dec 2014 18:41:02 GMT)

Reply sent to Eric Blake <eblake <at> redhat.com>:
You have taken responsibility. (Tue, 09 Dec 2014 18:41:03 GMT)

Notification sent to throwaway1024 <at> nurfuerspam.de:
bug acknowledged by developer. (Tue, 09 Dec 2014 18:41:04 GMT)

Message #12 received at 19329-done <at> debbugs.gnu.org:

From: Eric Blake <eblake <at> redhat.com>
To: throwaway1024 <at> nurfuerspam.de, 19329-done <at> debbugs.gnu.org
Subject: Re: bug#19329: Uniq Bug
Date: Tue, 09 Dec 2014 11:40:20 -0700
[Message part 1 (text/plain, inline)]
tag 19329 notabug
thanks

On 12/09/2014 05:47 AM, throwaway1024 <at> nurfuerspam.de wrote:
> Dear developers,
> The following command results in a document where valbug.txt contains multiple 
> duplicate lines (e.g. "(TDCYXT)" without quotes)

Thanks for the report.

> --------
> cat validation.csv | sed 's!\r\n!\n!g' | sed 's! !\n!g' | LC_ALL=C sort | uniq > 
> valbug.txt

Useless use of cat, and unnecessary use of two sed processes.  Why not
just use:

sed 's!\r\n!\n!g; s! !\n!g' validation.csv | LC_ALL=C sort | uniq

> --------
> It is definitely a problem of uniq, since the second command line of:
> --------
> cat validation.csv | sed 's!\r\n!\n!g' | sed 's! !\n!g' | LC_ALL=C sort > valo.txt
> cat valo.txt | uniq > valbug.txt

Again, useless use of cat; why not:

uniq < valo.txt > valbug.txt

> --------
> would result in the same bug and the valo.txt file is sorted correctly.

The problem is NOT in uniq, but in your incorrect usage of sed.

$ file valo.txt
valo.txt: ASCII text, with CRLF, LF line terminators

Observe that you have created a file with mixed line endings, and what's
more, that you have not gotten rid of carriage return.  So let's look a
bit closer at the output of that sort command (which in turn is based on
your botched sed command):

$ sed 's!\r\n!\n!g; s! !\n!g' validation.csv | LC_ALL=C sort \
  | grep '(TDCYXT)' | od -tx1z
0000000 28 54 44 43 59 58 54 29 0a 28 54 44 43 59 58 54  >(TDCYXT).(TDCYXT<
0000020 29 0a 28 54 44 43 59 58 54 29 0a 28 54 44 43 59  >).(TDCYXT).(TDCY<
0000040 58 54 29 0a 28 54 44 43 59 58 54 29 0a 28 54 44  >XT).(TDCYXT).(TD<
0000060 43 59 58 54 29 0a 28 54 44 43 59 58 54 29 0a 28  >CYXT).(TDCYXT).(<
0000100 54 44 43 59 58 54 29 0d 0a 28 54 44 43 59 58 54  >TDCYXT)..(TDCYXT<
0000120 29 0d 0a 28 54 44 43 59 58 54 29 0d 0a 28 54 44  >)..(TDCYXT)..(TD<
0000140 43 59 58 54 29 0d 0a 28 54 44 43 59 58 54 29 0d  >CYXT)..(TDCYXT).<
0000160 0a 28 54 44 43 59 58 54 29 0d 0a 42 4e 58 4e 28  >.(TDCYXT)..BNXN(<
0000200 54 44 43 59 58 54 29 0a 42 4e 58 4e 28 54 44 43  >TDCYXT).BNXN(TDC<
0000220 59 58 54 29 0a 42 4e 58 4e 28 54 44 43 59 58 54  >YXT).BNXN(TDCYXT<
0000240 29 0d 0a 42 4e 58 4e 28 54 44 43 59 58 54 29 0d  >)..BNXN(TDCYXT).<
0000260 0a                                               >.<
0000261

Look closely - you'll see some of the lines have \r still embedded in
them.  Which means that the lines are NOT identical, and therefore uniq is
CORRECTLY showing the line twice in the output (once without \r, once with).
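
A quick way to confirm this on any file is to count the lines that
contain a carriage return.  The sketch below uses a three-line sample
invented for illustration (not the reporter's data) and relies only on
printf and grep; the variable names are hypothetical:

```shell
# Invented sample: two LF-terminated lines and one CRLF-terminated line.
# A literal carriage return is kept in a variable so the grep pattern
# does not depend on any escape-sequence support in grep itself.
cr=$(printf '\r')
lines_with_cr=$(printf 'a\na\r\nb\n' | grep -c "$cr")
echo "lines containing a carriage return: $lines_with_cr"
```

A nonzero count on data that is supposed to be LF-only suggests that
lines which look the same on screen may still differ byte for byte.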

So the problem to figure out is why \r is still in the output.  That's
because your use of sed did NOT do what you think.  Remember, in sed,
the way things work is that sed FIRST reads in a line of input
(excluding the trailing newline, but including a carriage return), then
operates on that line in memory, then outputs the line (adding back a
newline on output).  While manipulating a line in memory, you can add
newlines, and you can match newlines against the pattern in memory, but
the line you read from input will NOT include any newlines unless you
first manipulate the line to add newlines.  Thus:

's!\r\n!\n!g'

is a no-op when there were no preceding operations to manipulate the
in-memory representation.  There are NO lines of the input that include
a \n to be matched, so you ended up NOT stripping any carriage returns.
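
The difference can be seen on a single CRLF-terminated line.  This is a
small sketch that assumes GNU sed (for the \r escape, as in the commands
above) and od; the variable names are only for illustration:

```shell
# The pattern \r\n can never match here: sed removes the trailing newline
# before the script sees the line, so the carriage return survives.
noop=$(printf 'a\r\n' | sed 's/\r\n/\n/g' | od -An -tx1 | tr -d ' \n')

# Anchoring on end-of-line matches the carriage return that is actually
# present in the line held in memory, and removes it.
fixed=$(printf 'a\r\n' | sed 's/\r$//' | od -An -tx1 | tr -d ' \n')

echo "unchanged: $noop  stripped: $fixed"
# With GNU sed this prints: unchanged: 610d0a  stripped: 610a
```

Anchoring with $ only removes a carriage return at the end of a line;
whether that or a global s/\r//g is preferable depends on whether stray
carriage returns can occur elsewhere in the data.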

You MEANT to do something like:

$ sed 's!\r!!g; s! !\n!g' validation.csv | LC_ALL=C sort \
  | uniq | grep '(TDCYXT)'
(TDCYXT)
BNXN(TDCYXT)
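
An equivalent result can be had without sed at all.  The following
sketch assumes that carriage returns are never wanted anywhere in the
data, and uses a two-line input invented for illustration in place of
validation.csv; sort -u stands in for sort | uniq:

```shell
# Invented stand-in for validation.csv: space-separated tokens,
# one CRLF-terminated line and one LF-terminated line.
result=$(printf 'b a\r\na b\n' \
  | tr -d '\r' \
  | tr ' ' '\n' \
  | LC_ALL=C sort -u)
printf '%s\n' "$result"
# prints:
# a
# b
```

tr works on bytes rather than lines, so it has no pattern-space
subtlety of the kind that affected the sed script.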

> I also could not reduce the example since using only a few lines around the 
> buggy lines would make it work again correctly.

1.4 megabytes was a bit much; the problem can be demonstrated in a much
smaller input space:

$ printf 'a\r\na\n' | sed 's/\r\n/\n/g' | uniq | od -tx1z
0000000 61 0d 0a 61 0a                                   >a..a.<
0000005
$ printf 'a\r\na\n' | sed 's/\r//g' | uniq | od -tx1z
0000000 61 0a                                            >a.<
0000002

Also, it appears that the data you posted may have used some sort of
simple substitution cypher; 1.4 megabytes might be plenty of information
for someone to reverse-engineer the original text, so I hope you didn't
leak something that you thought was secure.

> You can find all files in attached zip or under 

You do realize that .zip is not the best file format in the world; there
are other file formats that both compress better and are preferred
in the free software world because they do not suffer from patent
problems like zip.

I've marked this as not a bug, but feel free to add further comments or
replies if you have more questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#19329; Package coreutils. (Fri, 19 Dec 2014 16:57:03 GMT)

Message #15 received at 19329 <at> debbugs.gnu.org:

From: throwaway1024 <at> nurfuerspam.de
To: 19329 <at> debbugs.gnu.org
Subject: Bug report
Date: Fri, 19 Dec 2014 12:01:13 +0100
[Message part 1 (text/html, inline)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 17 Jan 2015 12:24:03 GMT)

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.