GNU bug report logs - #9780
sort -u throws out non-duplicates

Package: coreutils

Reported by: Bernhard Rosenkraenzer <bero <at> bero.eu>

Date: Tue, 18 Oct 2011 01:04:02 UTC

Severity: normal

Tags: moreinfo

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 9780 <at> debbugs.gnu.org, Benno Schulenberg <bensberg <at> justemail.net>, Bruce Dubbs <bruce.dubbs <at> gmail.com>, Rasmus Borup Hansen <rbh <at> intomics.com>
Subject: bug#9780: sort -u data loss deserves new release ASAP [Re: bug#9780: sort -u...
Date: Fri, 17 Aug 2012 12:00:24 +0200
Jim Meyering wrote:
> Jim Meyering wrote:
> ...
>> In case anyone is chomping at the bit, here's a preliminary patch:
>>
>> Here's a smaller test case that appears to be host/nproc-independent:
>> It should print two lines: 1, then 7.
>> Without this patch, it prints only "7".
>>
>>     (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u
...
> Here's a complete patch:
>
> From 431102766cbf7c360ee6fa1f157ebcd7d8b9ca0e Mon Sep 17 00:00:00 2001
> From: Jim Meyering <meyering <at> redhat.com>
> Date: Wed, 15 Aug 2012 12:30:44 +0200
> Subject: [PATCH] sort: sort --unique (-u) could cause data loss
>
> sort -u could omit one or more lines of expected output.
> This bug arose because sort recorded the most recently printed line via
> reference, and if you were unlucky, the storage for that line would be
> reused (overwritten) as additional input was read into memory.  If you
> were doubly unlucky, the new value of the "saved" line would not only
> match the very next line, but if that next line were also the first in
> a series of identical, not-yet-printed lines, then the corrupted "saved"
> line value would result in the omission of all matching lines.
>
> * src/sort.c (saved_line): New static/global, renamed and moved from...
> (write_unique): ...here.  Old name was "saved", which was too generic
> for its new role as file-scoped global.
> (fillbuf): With --unique, when we're about to read into a buffer that
> overlaps the saved "preceding" line (saved_line), copy the line's .text
> member to a realloc'd-as-needed temporary buffer and adjust the line's
> key-defining members if they're set.
> (overlap): New function.
> * tests/misc/sort: New tests.
> * NEWS (Bug fixes): Mention it.
> * THANKS.in: Update.
> Bug introduced via commit v8.5-89-g9face83.
> Reported by Rasmus Borup Hansen in
> http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/23173/focus=24647
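
For readers who have not looked at src/sort.c, the failure mode and the shape of the fix described in the commit message above can be illustrated with a small C sketch. This is not the code from the patch. The struct layout, the signature given to overlap here, and the names saved_copy, protect_saved_line and demo_saved_line_survives are all hypothetical and chosen only for illustration. The idea it shows is the one from the log: before a buffer is refilled, check whether the remembered "previous" line points into it, and if so copy that line's text into a separately allocated buffer that grows as needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a line record. */
struct line {
    char *text;     /* start of the line's bytes */
    size_t length;  /* byte count, including the trailing NUL */
};

/* Hypothetical private storage for a copy of the saved line. */
struct saved_copy {
    char *data;
    size_t capacity;
};

/* Return true if the line's bytes intersect [buf, buf + buf_size).
   Addresses are compared as integers so that unrelated objects can be
   compared without relying on pointer ordering. */
static bool overlap(const char *buf, size_t buf_size, const struct line *ln)
{
    uintptr_t b = (uintptr_t) buf;
    uintptr_t t = (uintptr_t) ln->text;
    return ln->length != 0 && t < b + buf_size && b < t + ln->length;
}

/* If the saved line lives inside the buffer that is about to be
   overwritten, move its text into private storage, growing that storage
   when it is too small.  Returns false only if allocation fails. */
static bool protect_saved_line(struct line *saved, struct saved_copy *copy,
                               const char *buf, size_t buf_size)
{
    assert(saved != NULL && copy != NULL);
    if (saved->text == NULL || !overlap(buf, buf_size, saved))
        return true;                       /* nothing to protect */
    if (copy->capacity < saved->length) {
        char *p = realloc(copy->data, saved->length);
        if (p == NULL)
            return false;
        copy->data = p;
        copy->capacity = saved->length;
    }
    memcpy(copy->data, saved->text, saved->length);
    saved->text = copy->data;              /* no longer points into buf */
    return true;
}

/* Remember "7" by reference, then reuse the buffer for "1".
   Returns true if the remembered line still reads "7" afterwards. */
static bool demo_saved_line_survives(bool apply_fix)
{
    char buf[8];
    struct saved_copy copy = { NULL, 0 };

    strcpy(buf, "7");
    struct line saved = { buf, 2 };        /* "7" plus NUL, stored in buf */

    if (apply_fix && !protect_saved_line(&saved, &copy, buf, sizeof buf))
        return false;

    strcpy(buf, "1");                      /* the buffer is refilled */

    bool intact = strcmp(saved.text, "7") == 0;
    free(copy.data);
    return intact;
}
```

In this sketch, demo_saved_line_survives(false) returns false because the remembered line silently becomes "1", which is the kind of corruption that lets a unique-line comparison discard lines it should have printed. With the copy step enabled, it returns true.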

That sort -u can cause data loss is a big deal.
I want to make a release with this fix as soon as possible.
Since I'm making this a mostly-bug-fix release, the du and md5 --tag
changes will have to wait for 8.20.
However, I'll be happy to apply documentation-correcting changes
if someone would post a complete, updated patch or two.

If Bruce and Paul find that changing gnulib's parse-datetime test
will avoid a failure on LFS, I'll pull in a gnulib update for that.

Any other bug-fix-like changes that people can suggest?



