GNU bug report logs -
#26422
historical feature or grand daddy bug?
Previous Next
Full log
Message #16 received at 26422-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks for the fast response.
Right or wrong POSIX is POSIX,
Yet a LF as part of a line does seem worth counting.
A line must terminate with a line feed.
Yet a string does not require a line feed.
Paul Eggert's assumption seems correct.
During line indexing
the lines which consist of only a line feed can be counted
and excluded from the sort.
And then the correct amount of line feeds can be output
before the sorted lines, or after if the -r parameter is present.
For test data consecutive LF seems plausible,
but for actual sorting tasks; would consecutive LF be common?
It might be a potential optimization worthy of omission.
If the sort function's compare function was inlined
rather than called from a pointer
then a modest 5% performance boon could become.
To implement some creativity would be required.
If the input data was not copied
and string conversion was omitted
then another 5% performance boon could become.
The sort method used is not known.
However, a merge sort has some surprisingly frequent
uhm code paths like a 3 way comparison
which can be implemented for 2 or 3 comparisons
and 0 to 4 memory moves.
A 15% to 20% overall performance improvement
from the three suggestions is not implausible.
Thanks for making it faster.
On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Historically, 'sort' ignored the \n at the end of each line, so that empty
> lines (i.e., lines consisting only of a single \n) collated before all
> other lines. An earlier version of the POSIX spec was (mis)written to
> require treating the \n as part of the data, and during development in 1999
> GNU sort was briefly changed to conform to that, but this was an error in
> the POSIX spec that was eventually fixed and GNU sort was changed back to
> the traditional behavior, before any release was made with the funky
> behavior.
>
> So, it's not a bug that \t\n collates after \n, since "\t" is
> lexicographically after "".
>
> As I understand it, the empty string should collate before all other
> strings in all POSIX locales, so empty lines should always sort first in
> 'sort' output. I'm by no means a collation expert, though, and if I'm wrong
> I'd like to see a counterexample.
>
> Come to think of it, 'sort' might be able to improve performance in the
> common case of sorting text files containing many empty lines, by merely
> counting the lines rather than storing them internally. I suppose this is a
> different topic, though.
>
[Message part 2 (text/html, inline)]
This bug report was last modified 8 years and 38 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.