Thanks for the fast response.

Right or wrong, POSIX is POSIX.

Still, an LF that is part of a line does seem worth counting.

A line must terminate with a line feed.
Yet a string does not require a line feed.

Paul Eggert's assumption seems correct.
During line indexing,
the lines which consist of only a line feed can be counted
and excluded from the sort.
Then the correct number of line feeds can be output
before the sorted lines, or after them if the -r option is present.
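
A rough sketch of that idea in C, to make it concrete. This is not
GNU sort's code: sort_skipping_empty and cmp_lines are made-up names,
lines are assumed to be already split into NUL-terminated strings
without their trailing LF, and plain strcmp stands in for the real
locale-aware comparison.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical comparator: bytewise order, ignoring locale. */
static int cmp_lines(const void *a, const void *b)
{
    const char *const *x = a;
    const char *const *y = b;
    return strcmp(*x, *y);
}

/* Sort lines[0..n) in place and return the number of empty lines.
   Empty lines are only counted, never passed to qsort; afterwards
   they are put back first, or last when reverse is nonzero. */
static size_t sort_skipping_empty(const char **lines, size_t n, int reverse)
{
    assert(lines != NULL || n == 0);

    /* Compact the non-empty lines to the front of the array. */
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (lines[i][0] != '\0')
            lines[kept++] = lines[i];
    size_t empty = n - kept;

    qsort(lines, kept, sizeof *lines, cmp_lines);

    if (reverse) {
        /* Descending order, then the empty lines at the end. */
        if (kept > 1)
            for (size_t lo = 0, hi = kept - 1; lo < hi; lo++, hi--) {
                const char *t = lines[lo];
                lines[lo] = lines[hi];
                lines[hi] = t;
            }
        for (size_t i = kept; i < n; i++)
            lines[i] = "";
    } else {
        /* Shift the sorted lines up and put the empty lines first. */
        memmove(lines + empty, lines, kept * sizeof *lines);
        for (size_t i = 0; i < empty; i++)
            lines[i] = "";
    }
    return empty;
}
```

In a real implementation the empty lines would presumably not be
stored at all; only the count would be kept and that many LFs written
at output time.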

For test data, consecutive LFs seem plausible,

but for actual sorting tasks, would consecutive LFs be common?
If not, it might be an optimization worth leaving out.

If the sort function's compare function were inlined
rather than called through a pointer,
then a modest 5% performance gain could result.

Some creativity would be required to implement this.
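
To illustrate what is meant by inlining, here is a small sketch.
The names line_cmp and insertion_sort_lines are invented for the
example, and insertion sort is used only because it is short; it is
not a claim about how GNU sort works. The point is that the compare
is a direct call the compiler can inline, unlike the pointer passed
to qsort. The 5% figure above is my estimate and is not measured by
this code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical comparator, visible to the compiler at the call site
   so it can be inlined into the sort loop. */
static inline int line_cmp(const char *a, const char *b)
{
    return strcmp(a, b);
}

/* Sort v[0..n) ascending, calling line_cmp directly instead of
   going through a function pointer. */
static void insertion_sort_lines(const char **v, size_t n)
{
    assert(v != NULL || n == 0);

    for (size_t i = 1; i < n; i++) {
        const char *key = v[i];
        size_t j = i;
        /* Shift larger elements one slot right to make room for key. */
        while (j > 0 && line_cmp(v[j - 1], key) > 0) {
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}
```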

If the input data were not copied
and string conversion were omitted,
then another 5% performance gain could result.

The sort method used is not known to me.
However, a merge sort has some surprisingly frequent
code paths, like a 3-way comparison,
which can be implemented with 2 or 3 comparisons
and 0 to 4 memory moves.
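
One way to read the 3-way case is ordering a run of three elements.
A sketch under that assumption follows; sort3 is a made-up name and
strcmp again stands in for the real comparison. Each path makes 2 or
3 comparisons, and a swap costs 3 moves while a rotation costs 4.

```c
#include <assert.h>
#include <string.h>

/* Order v[0..3) ascending. Equal elements keep their relative order. */
static void sort3(const char **v)
{
    assert(v != NULL);
    const char *t;

    if (strcmp(v[0], v[1]) <= 0) {
        if (strcmp(v[1], v[2]) <= 0) {
            /* v0 <= v1 <= v2: 2 comparisons, 0 moves. */
        } else if (strcmp(v[0], v[2]) <= 0) {
            /* v0 <= v2 < v1: swap the last two, 3 moves. */
            t = v[1]; v[1] = v[2]; v[2] = t;
        } else {
            /* v2 < v0 <= v1: rotate right, 4 moves. */
            t = v[2]; v[2] = v[1]; v[1] = v[0]; v[0] = t;
        }
    } else {
        if (strcmp(v[0], v[2]) <= 0) {
            /* v1 < v0 <= v2: swap the first two, 2 comparisons, 3 moves. */
            t = v[0]; v[0] = v[1]; v[1] = t;
        } else if (strcmp(v[1], v[2]) <= 0) {
            /* v1 <= v2 < v0: rotate left, 4 moves. */
            t = v[0]; v[0] = v[1]; v[1] = v[2]; v[2] = t;
        } else {
            /* v2 < v1 < v0: swap the ends, 3 moves. */
            t = v[0]; v[0] = v[2]; v[2] = t;
        }
    }
}
```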

A 15% to 20% overall performance improvement
from the three suggestions is not implausible.

Thanks for making it faster.

On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:

Historically, 'sort' ignored the \n at the end of each line, so that empty lines (i.e., lines consisting only of a single \n) collated before all other lines. An earlier version of the POSIX spec was (mis)written to require treating the \n as part of the data, and during development in 1999 GNU sort was briefly changed to conform to that, but this was an error in the POSIX spec that was eventually fixed and GNU sort was changed back to the traditional behavior, before any release was made with the funky behavior.

So, it's not a bug that \t\n collates after \n, since "\t" is lexicographically after "".

As I understand it, the empty string should collate before all other strings in all POSIX locales, so empty lines should always sort first in 'sort' output. I'm by no means a collation expert, though, and if I'm wrong I'd like to see a counterexample.

Come to think of it, 'sort' might be able to improve performance in the common case of sorting text files containing many empty lines, by merely counting the lines rather than storing them internally. I suppose this is a different topic, though.