#26422 - historical feature or grand daddy bug?

GNU bug report logs - #26422
historical feature or grand daddy bug?

Reported by: Kyle Sallee <kyle.sallee <at> gmail.com>

Date: Sun, 9 Apr 2017 18:38:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Message #16 received at 26422-done <at> debbugs.gnu.org (full text, mbox):

From: Kyle Sallee <kyle.sallee <at> gmail.com> To: Paul Eggert <eggert <at> cs.ucla.edu> Cc: 26422-done <at> debbugs.gnu.org Subject: Re: bug#26422: historical feature or grand daddy bug? Date: Sun, 9 Apr 2017 15:53:26 -0700

[Message part 1 (text/plain, inline)]

Thanks for the fast response. Right or wrong POSIX is POSIX, Yet a LF as part of a line does seem worth counting. A line must terminate with a line feed. Yet a string does not require a line feed. Paul Eggert's assumption seems correct. During line indexing the lines which consist of only a line feed can be counted and excluded from the sort. And then the correct amount of line feeds can be output before the sorted lines, or after if the -r parameter is present. For test data consecutive LF seems plausible, but for actual sorting tasks; would consecutive LF be common? It might be a potential optimization worthy of omission. If the sort function's compare function was inlined rather than called from a pointer then a modest 5% performance boon could become. To implement some creativity would be required. If the input data was not copied and string conversion was omitted then another 5% performance boon could become. The sort method used is not known. However, a merge sort has some surprisingly frequent uhm code paths like a 3 way comparison which can be implemented for 2 or 3 comparisons and 0 to 4 memory moves. A 15% to 20% overall performance improvement from the three suggestions is not implausible. Thanks for making it faster. On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote: > Historically, 'sort' ignored the \n at the end of each line, so that empty > lines (i.e., lines consisting only of a single \n) collated before all > other lines. An earlier version of the POSIX spec was (mis)written to > require treating the \n as part of the data, and during development in 1999 > GNU sort was briefly changed to conform to that, but this was an error in > the POSIX spec that was eventually fixed and GNU sort was changed back to > the traditional behavior, before any release was made with the funky > behavior. > > So, it's not a bug that \t\n collates after \n, since "\t" is > lexicographically after "". > > As I understand it, the empty string should collate before all other > strings in all POSIX locales, so empty lines should always sort first in > 'sort' output. I'm by no means a collation expert, though, and if I'm wrong > I'd like to see a counterexample. > > Come to think of it, 'sort' might be able to improve performance in the > common case of sorting text files containing many empty lines, by merely > counting the lines rather than storing them internally. I suppose this is a > different topic, though. >

[Message part 2 (text/html, inline)]

This bug report was last modified 8 years and 99 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #26422 historical feature or grand daddy bug?

GNU bug report logs - #26422
historical feature or grand daddy bug?