GNU bug report logs -
#26422
historical feature or grand daddy bug?
Previous Next
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 26422 in the body.
You can then email your comments to 26422 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-coreutils <at> gnu.org
:
bug#26422
; Package
coreutils
.
(Sun, 09 Apr 2017 18:38:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Kyle Sallee <kyle.sallee <at> gmail.com>
:
New bug report received and forwarded. Copy sent to
bug-coreutils <at> gnu.org
.
(Sun, 09 Apr 2017 18:38:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
By the sort program
when a file is sorted
the lines which start with line feed
output earlier than lines which begin with tab.
Tab ASCII value is 9.
LF ASCII value is 10.
Tabs should be first?
However, to strings
if the lines are converted
then to mitigate a larger address space
presumably with 0 the LF are replaced.
Yet after the LF if the 0 byte was placed
then the expected output would become.
If expected behavior becomes
then historical behavior relied upon scripts might break.
The sort.c source code was not viewed.
Therefore, a patch is not offered.
Discussion is solicited.
Concerning empty lines first.
Is it a bug?
Should it be fixed?
Because I am not on the email list;
if the topic is worth discussion
if a decision is made
then please forward.
Thanks for maintaining and sharing awesome software.
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#26422
; Package
coreutils
.
(Sun, 09 Apr 2017 18:52:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 26422 <at> debbugs.gnu.org (full text, mbox):
On 09/04/17 11:37, Kyle Sallee wrote:
> By the sort program
> when a file is sorted
> the lines which start with line feed
> output earlier than lines which begin with tab.
>
> Tab ASCII value is 9.
> LF ASCII value is 10.
> Tabs should be first?
>
> However, to strings
> if the lines are converted
> then to mitigate a larger address space
> presumably with 0 the LF are replaced.
> Yet after the LF if the 0 byte was placed
> then the expected output would become.
>
> If expected behavior becomes
> then historical behavior relied upon scripts might break.
>
> The sort.c source code was not viewed.
> Therefore, a patch is not offered.
> Discussion is solicited.
> Concerning empty lines first.
> Is it a bug?
> Should it be fixed?
>
> Because I am not on the email list;
> if the topic is worth discussion
> if a decision is made
> then please forward.
> Thanks for maintaining and sharing awesome software.
I think you're hitting locale issues.
Do you see the same issue when you specify LC_ALL=C to sort?
For example:
$ printf '\nLF\0\tTAB\0' | sort -z | tr '\n\t\0' 'nt\n'
nLF
tTAB
$ printf '\nLF\0\tTAB\0' | LC_ALL=C sort -z | tr '\n\t\0' 'nt\n'
tTAB
nLF
thanks,
Pádraig
Reply sent
to
Paul Eggert <eggert <at> cs.ucla.edu>
:
You have taken responsibility.
(Sun, 09 Apr 2017 19:05:02 GMT)
Full text and
rfc822 format available.
Notification sent
to
Kyle Sallee <kyle.sallee <at> gmail.com>
:
bug acknowledged by developer.
(Sun, 09 Apr 2017 19:05:02 GMT)
Full text and
rfc822 format available.
Message #13 received at 26422-done <at> debbugs.gnu.org (full text, mbox):
Historically, 'sort' ignored the \n at the end of each line, so that
empty lines (i.e., lines consisting only of a single \n) collated before
all other lines. An earlier version of the POSIX spec was (mis)written
to require treating the \n as part of the data, and during development
in 1999 GNU sort was briefly changed to conform to that, but this was an
error in the POSIX spec that was eventually fixed and GNU sort was
changed back to the traditional behavior, before any release was made
with the funky behavior.
So, it's not a bug that \t\n collates after \n, since "\t" is
lexicographically after "".
As I understand it, the empty string should collate before all other
strings in all POSIX locales, so empty lines should always sort first in
'sort' output. I'm by no means a collation expert, though, and if I'm
wrong I'd like to see a counterexample.
Come to think of it, 'sort' might be able to improve performance in the
common case of sorting text files containing many empty lines, by merely
counting the lines rather than storing them internally. I suppose this
is a different topic, though.
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#26422
; Package
coreutils
.
(Sun, 09 Apr 2017 22:54:02 GMT)
Full text and
rfc822 format available.
Message #16 received at 26422-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Thanks for the fast response.
Right or wrong POSIX is POSIX,
Yet a LF as part of a line does seem worth counting.
A line must terminate with a line feed.
Yet a string does not require a line feed.
Paul Eggert's assumption seems correct.
During line indexing
the lines which consist of only a line feed can be counted
and excluded from the sort.
And then the correct amount of line feeds can be output
before the sorted lines, or after if the -r parameter is present.
For test data consecutive LF seems plausible,
but for actual sorting tasks; would consecutive LF be common?
It might be a potential optimization worthy of omission.
If the sort function's compare function was inlined
rather than called from a pointer
then a modest 5% performance boon could become.
To implement some creativity would be required.
If the input data was not copied
and string conversion was omitted
then another 5% performance boon could become.
The sort method used is not known.
However, a merge sort has some surprisingly frequent
uhm code paths like a 3 way comparison
which can be implemented for 2 or 3 comparisons
and 0 to 4 memory moves.
A 15% to 20% overall performance improvement
from the three suggestions is not implausible.
Thanks for making it faster.
On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Historically, 'sort' ignored the \n at the end of each line, so that empty
> lines (i.e., lines consisting only of a single \n) collated before all
> other lines. An earlier version of the POSIX spec was (mis)written to
> require treating the \n as part of the data, and during development in 1999
> GNU sort was briefly changed to conform to that, but this was an error in
> the POSIX spec that was eventually fixed and GNU sort was changed back to
> the traditional behavior, before any release was made with the funky
> behavior.
>
> So, it's not a bug that \t\n collates after \n, since "\t" is
> lexicographically after "".
>
> As I understand it, the empty string should collate before all other
> strings in all POSIX locales, so empty lines should always sort first in
> 'sort' output. I'm by no means a collation expert, though, and if I'm wrong
> I'd like to see a counterexample.
>
> Come to think of it, 'sort' might be able to improve performance in the
> common case of sorting text files containing many empty lines, by merely
> counting the lines rather than storing them internally. I suppose this is a
> different topic, though.
>
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#26422
; Package
coreutils
.
(Mon, 10 Apr 2017 06:00:02 GMT)
Full text and
rfc822 format available.
Message #19 received at 26422-done <at> debbugs.gnu.org (full text, mbox):
Kyle Sallee wrote:
> Thanks for the fast response.
> Right or wrong POSIX is POSIX,
> Yet a LF as part of a line does seem worth counting.
> A line must terminate with a line feed.
> Yet a string does not require a line feed.
>
----
how is that important? Sort sorts lines, not strings.
> but for actual sorting tasks; would consecutive LF be common?
>
----
Anytime you have multiple blank lines in a row,
you have consecutive line feeds.
> If the sort function's compare function was inlined
> rather than called from a pointer
> then a modest 5% performance boon could become.
> To implement some creativity would be required.
>
----
I'm sure if you submitted a working patch + documentation
+ rights assigned to GNU, and first born child given to FSF,
the coreutil maintainers would consider it.
(ok maybe the first born isn't required these days,
I think some POSIX update changed that)
> If the input data was not copied
> and string conversion was omitted
> then another 5% performance boon could become.
>
----
patches patches patches...
> The sort method used is not known.
> However, a merge sort has some surprisingly frequent
> uhm code paths like a 3 way comparison
> which can be implemented for 2 or 3 comparisons
> and 0 to 4 memory moves.
>
um... it's open source... note -- something that might
affect your algorithm design: it has to handle sort
input that is greater than the size of memory
and in different character encodings.
*cheers*
-linda
Information forwarded
to
bug-coreutils <at> gnu.org
:
bug#26422
; Package
coreutils
.
(Mon, 10 Apr 2017 18:02:02 GMT)
Full text and
rfc822 format available.
Message #22 received at 26422-done <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On Sun, Apr 9, 2017 at 10:59 PM, L A Walsh <coreutils <at> tlinx.org> wrote:
> Anytime you have multiple blank lines in a row,
> you have consecutive line feeds.
>
For typical sort processed data;
concurrent LF might be uncommon.
When the event does not become
then by the specialized code
the CPU cycles could be wasted.
> ----
> I'm sure if you submitted a working patch + documentation
> + rights assigned to GNU, and first born child given to FSF,
> the coreutil maintainers would consider it.
>
Past self authored cat patches were declined. :(
For self only desired modifications;
self authored software gains immediate approval. :)
From not parsing sort source code;
accidental source copy is mitigated
and the boons and banes can not be inherited.
From dissenting output the question became;
"Why LF before TAB?"
Program sort's performance is fine.
A complaint did not exist.
In all executed software
the same performance bane exists.
A fork + execve overhead or
a relevant functions + posix_spawnp overhead exists.
Or more succinctly put
for program start CPU cycles are required.
The overhead might seem insignificant,
but in a script for each program launch
the program reliance overhead accumulates.
Performance is lost.
In contrast to program provided code;
library provided code can be
loaded once and used frequently.
As compared to program invocation duration
the library load duration also is less.
To sort a small line amount
by program sort invocation
a considerable program launch
overhead duration becomes.
From a library perspective the following
potentially complex tasks seem attractive:
cp; mv; sort; tsort; wc.
cp and mv implementations can be surprisingly complex.
Self authored implementations already exist.
For coreutils provided cp and mv;
parameter options that from the kernel cache
purge used file data could be useful.
From files that were copied or moved
the content is probably not again immediately useful
yet lingers in the kernel cache.
By the kernel cache
when almost all available RAM is used
then file copy performance tanks.
By a large and irrelevantly stocked kernel cache search
performance also tanks.
A less than ideally configured in use kernel seems plausible.
For this task perhaps .../vfs_cache_pressure and
.../drop_caches might not suffice?
Function posix_fadvise seems useful.
But on descriptors posix_fadvise must be invoked.
For directory cache data, however,
posix_fadvise does not seem useful.
Thanks again for maintaining and sharing coreutils.
For coreutils if a library interface existed
then it would gain use.
P.S.
If the Linux kernel provided sendfile function
became POSIX approved,
then
http://lists.gnu.org/archive/html/bug-fileutils/2003-03/msg00030.html
might merit reconsideration.
For systems where mass storage device data
throughput is not bottle-necked
or where a cached conclusion suffices
then by user space buffer omission;
significant CPU cycles can be conserved.
When between descriptors; data must be transferred
the sendfile function is useful.
[Message part 2 (text/html, inline)]
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Tue, 09 May 2017 11:24:04 GMT)
Full text and
rfc822 format available.
This bug report was last modified 8 years and 38 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.