GNU bug report logs - #26422
historical feature or grand daddy bug?

Previous Next

Package: coreutils;

Reported by: Kyle Sallee <kyle.sallee <at> gmail.com>

Date: Sun, 9 Apr 2017 18:38:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


Message #16 received at 26422-done <at> debbugs.gnu.org (full text, mbox):

From: Kyle Sallee <kyle.sallee <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 26422-done <at> debbugs.gnu.org
Subject: Re: bug#26422: historical feature or grand daddy bug?
Date: Sun, 9 Apr 2017 15:53:26 -0700
[Message part 1 (text/plain, inline)]
Thanks for the fast response.
Right or wrong POSIX is POSIX,
Yet a LF as part of a line does seem worth counting.
A line must terminate with a line feed.
Yet a string does not require a line feed.

Paul Eggert's assumption seems correct.
During line indexing
the lines which consist of only a line feed can be counted
and excluded from the sort.
And then the correct amount of line feeds can be output
before the sorted lines, or after if the -r parameter is present.
For test data consecutive LF seems plausible,
but for actual sorting tasks; would consecutive LF be common?
It might be a potential optimization worthy of omission.

If the sort function's compare function was inlined
rather than called from a pointer
then a modest 5% performance boon could become.
To implement some creativity would be required.

If the input data was not copied
and string conversion was omitted
then another 5% performance boon could become.

The sort method used is not known.
However, a merge sort has some surprisingly frequent
uhm code paths like a 3 way comparison
which can be implemented for 2 or 3 comparisons
and 0 to 4 memory moves.

A 15% to 20% overall performance improvement
from the three suggestions is not implausible.
Thanks for making it faster.

On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Historically, 'sort' ignored the \n at the end of each line, so that empty
> lines (i.e., lines consisting only of a single \n) collated before all
> other lines. An earlier version of the POSIX spec was (mis)written to
> require treating the \n as part of the data, and during development in 1999
> GNU sort was briefly changed to conform to that, but this was an error in
> the POSIX spec that was eventually fixed and GNU sort was changed back to
> the traditional behavior, before any release was made with the funky
> behavior.
>
> So, it's not a bug that \t\n collates after \n, since "\t" is
> lexicographically after "".
>
> As I understand it, the empty string should collate before all other
> strings in all POSIX locales, so empty lines should always sort first in
> 'sort' output. I'm by no means a collation expert, though, and if I'm wrong
> I'd like to see a counterexample.
>
> Come to think of it, 'sort' might be able to improve performance in the
> common case of sorting text files containing many empty lines, by merely
> counting the lines rather than storing them internally. I suppose this is a
> different topic, though.
>
[Message part 2 (text/html, inline)]

This bug report was last modified 8 years and 38 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.