From debbugs-submit-bounces@debbugs.gnu.org Sun Apr 09 14:37:48 2017
From: Kyle Sallee
Date: Sun, 9 Apr 2017 11:37:34 -0700
Subject: historical feature or grand daddy bug?
To: bug-coreutils@gnu.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1139d8fafdaa79054cc026df

--001a1139d8fafdaa79054cc026df
Content-Type: text/plain; charset=UTF-8

When the sort program sorts a file,
the lines which start with line feed
are output earlier than lines which begin with tab.

Tab ASCII value is 9.
LF  ASCII value is 10.
Tabs should be first?

However, if the lines are converted to strings,
then presumably the LF are replaced with 0
to mitigate a larger address space.
Yet if the 0 byte was placed after the LF
then the expected output would result.

If the expected behavior is adopted
then scripts that rely upon the historical behavior might break.

The sort.c source code was not viewed.
Therefore, a patch is not offered.
Discussion is solicited
concerning empty lines sorting first.
Is it a bug?
Should it be fixed?

Because I am not on the email list,
if the topic is worth discussion
and a decision is made
then please forward.
Thanks for maintaining and sharing awesome software.

--001a1139d8fafdaa79054cc026df
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a1139d8fafdaa79054cc026df--

From debbugs-submit-bounces@debbugs.gnu.org Sun Apr 09 14:51:55 2017
Subject: Re: bug#26422: historical feature or grand daddy bug?
To: Kyle Sallee, 26422@debbugs.gnu.org
From: Pádraig Brady
Date: Sun, 9 Apr 2017 11:51:47 -0700
Content-Type: text/plain; charset=utf-8

On 09/04/17 11:37, Kyle Sallee wrote:
> When the sort program sorts a file,
> the lines which start with line feed
> are output earlier than lines which begin with tab.
>
> Tab ASCII value is 9.
> LF  ASCII value is 10.
> Tabs should be first?
>
> However, if the lines are converted to strings,
> then presumably the LF are replaced with 0
> to mitigate a larger address space.
> Yet if the 0 byte was placed after the LF
> then the expected output would result.
>
> If the expected behavior is adopted
> then scripts that rely upon the historical behavior might break.
>
> The sort.c source code was not viewed.
> Therefore, a patch is not offered.
> Discussion is solicited
> concerning empty lines sorting first.
> Is it a bug?
> Should it be fixed?
>
> Because I am not on the email list,
> if the topic is worth discussion
> and a decision is made
> then please forward.
> Thanks for maintaining and sharing awesome software.

I think you're hitting locale issues.
Do you see the same issue when you specify LC_ALL=C to sort?
For example:

  $ printf '\nLF\0\tTAB\0' | sort -z | tr '\n\t\0' 'nt\n'
  nLF
  tTAB
  $ printf '\nLF\0\tTAB\0' | LC_ALL=C sort -z | tr '\n\t\0' 'nt\n'
  tTAB
  nLF

thanks,
Pádraig

From debbugs-submit-bounces@debbugs.gnu.org Sun Apr 09 15:04:45 2017
Subject: Re: bug#26422: historical feature or grand daddy bug?
To: Kyle Sallee, 26422-done@debbugs.gnu.org
From: Paul Eggert
Organization: UCLA Computer Science Department
Date: Sun, 9 Apr 2017 12:04:34 -0700
Content-Type: text/plain; charset=utf-8; format=flowed

Historically, 'sort' ignored the \n at the end of each line, so that empty
lines (i.e., lines consisting only of a single \n) collated before all
other lines. An earlier version of the POSIX spec was (mis)written to
require treating the \n as part of the data, and during development in 1999
GNU sort was briefly changed to conform to that, but this was an error in
the POSIX spec that was eventually fixed and GNU sort was changed back to
the traditional behavior, before any release was made with the funky
behavior.

So, it's not a bug that \t\n collates after \n, since "\t" is
lexicographically after "".

As I understand it, the empty string should collate before all other
strings in all POSIX locales, so empty lines should always sort first in
'sort' output. I'm by no means a collation expert, though, and if I'm wrong
I'd like to see a counterexample.
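
A minimal sketch of the collation described above, assuming a POSIX shell and the C locale. The sample lines and the '>' stand-in for the tab are illustrative and not taken from the thread. Because the trailing \n is not part of the sort key, the empty line compares as "" and should come out before the line that starts with a tab:

```shell
# Three input lines: a tab followed by x, an empty line, and a.
# tr makes the tab visible as '>' after sorting.
sorted=$(printf '\tx\n\na\n' | LC_ALL=C sort | tr '\t' '>')
printf '%s\n' "$sorted"
```

Under those assumptions the empty line is printed first, then ">x", then "a".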

Come to think of it, 'sort' might be able to improve performance in the
common case of sorting text files containing many empty lines, by merely
counting the lines rather than storing them internally. I suppose this is a
different topic, though.

From debbugs-submit-bounces@debbugs.gnu.org Sun Apr 09 18:53:36 2017
From: Kyle Sallee
Date: Sun, 9 Apr 2017 15:53:26 -0700
Subject: Re: bug#26422: historical feature or grand daddy bug?
To: Paul Eggert
Cc: 26422-done@debbugs.gnu.org
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=001a1143172a00e561054cc3bad1

--001a1143172a00e561054cc3bad1
Content-Type: text/plain; charset=UTF-8

Thanks for the fast response.
Right or wrong, POSIX is POSIX.
Yet a LF as part of a line does seem worth counting.
A line must terminate with a line feed.
Yet a string does not require a line feed.

Paul Eggert's assumption seems correct.
During line indexing
the lines which consist of only a line feed can be counted
and excluded from the sort.
And then the correct number of line feeds can be output
before the sorted lines, or after if the -r parameter is present.

For test data consecutive LF seems plausible,
but for actual sorting tasks, would consecutive LF be common?
It might be a potential optimization worthy of omission.

If the sort function's compare function was inlined
rather than called from a pointer
then a modest 5% performance gain could result.
Some creativity would be required to implement it.
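
The empty-line counting idea earlier in this message can be sketched at the shell level. This is only an illustration under assumptions (a POSIX shell with mktemp available, the C locale, and the hypothetical helper names emit_newlines and sort_counting_empty); it says nothing about how sort.c is or should be written:

```shell
# Hypothetical helpers, for illustration only.
emit_newlines() {
    # Print $1 empty lines.
    i=0
    while [ "$i" -lt "$1" ]; do
        echo
        i=$((i + 1))
    done
}

sort_counting_empty() {
    # Usage: sort_counting_empty [-r] < input
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    # Lines consisting of only a line feed are counted, not sorted.
    empty=$(grep -c '^$' "$tmp" || true)
    if [ "${1:-}" = "-r" ]; then
        # Reverse order: sorted lines first, empty lines last.
        grep -v '^$' "$tmp" | LC_ALL=C sort -r
        emit_newlines "$empty"
    else
        # Normal order: empty lines first, then the sorted lines.
        emit_newlines "$empty"
        grep -v '^$' "$tmp" | LC_ALL=C sort
    fi
    rm -f "$tmp"
}

printf 'b\n\na\n\n' | sort_counting_empty
```

The output should match what plain `LC_ALL=C sort` (or `sort -r`) produces on the same input, since empty lines collate before everything else.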

If the input data was not copied
and string conversion was omitted
then another 5% performance gain could result.

The sort method used is not known.
However, a merge sort has some surprisingly frequent
uhm code paths like a 3 way comparison
which can be implemented with 2 or 3 comparisons
and 0 to 4 memory moves.

A 15% to 20% overall performance improvement
from the three suggestions is not implausible.

Thanks for making it faster.

On Sun, Apr 9, 2017 at 12:04 PM, Paul Eggert wrote:

> Historically, 'sort' ignored the \n at the end of each line, so that empty
> lines (i.e., lines consisting only of a single \n) collated before all
> other lines. An earlier version of the POSIX spec was (mis)written to
> require treating the \n as part of the data, and during development in 1999
> GNU sort was briefly changed to conform to that, but this was an error in
> the POSIX spec that was eventually fixed and GNU sort was changed back to
> the traditional behavior, before any release was made with the funky
> behavior.
>
> So, it's not a bug that \t\n collates after \n, since "\t" is
> lexicographically after "".
>
> As I understand it, the empty string should collate before all other
> strings in all POSIX locales, so empty lines should always sort first in
> 'sort' output. I'm by no means a collation expert, though, and if I'm wrong
> I'd like to see a counterexample.
>
> Come to think of it, 'sort' might be able to improve performance in the
> common case of sorting text files containing many empty lines, by merely
> counting the lines rather than storing them internally. I suppose this is a
> different topic, though.
>

--001a1143172a00e561054cc3bad1
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a1143172a00e561054cc3bad1--

From debbugs-submit-bounces@debbugs.gnu.org Mon Apr 10 01:59:55 2017
Message-ID: <58EB1F4C.20404@tlinx.org>
Date: Sun, 09 Apr 2017 22:59:40 -0700
From: L A Walsh
To: Kyle Sallee
Cc: 26422-done@debbugs.gnu.org, Paul Eggert
Subject: Re: bug#26422: historical feature or grand daddy bug?
Content-Type: text/plain; charset=UTF-8; format=flowed

Kyle Sallee wrote:
> Thanks for the fast response.
> Right or wrong, POSIX is POSIX.
> Yet a LF as part of a line does seem worth counting.
> A line must terminate with a line feed.
> Yet a string does not require a line feed.
>
----
    how is that important?  Sort sorts lines, not strings.

> but for actual sorting tasks, would consecutive LF be common?
>
----
    Anytime you have multiple blank lines in a row,
you have consecutive line feeds.

> If the sort function's compare function was inlined
> rather than called from a pointer
> then a modest 5% performance gain could result.
> Some creativity would be required to implement it.
>
----
    I'm sure if you submitted a working patch + documentation
+ rights assigned to GNU, and first born child given to FSF,
the coreutils maintainers would consider it.  (ok maybe the
first born isn't required these days, I think some POSIX update
changed that)

> If the input data was not copied
> and string conversion was omitted
> then another 5% performance gain could result.
>
----
    patches patches patches...

> The sort method used is not known.
> However, a merge sort has some surprisingly frequent
> uhm code paths like a 3 way comparison
> which can be implemented with 2 or 3 comparisons
> and 0 to 4 memory moves.
>
um... it's open source...

note -- something that might affect your algorithm design:
it has to handle sort input that is greater than the size of
memory and in different character encodings.

*cheers*
-linda

From debbugs-submit-bounces@debbugs.gnu.org Mon Apr 10 14:01:52 2017
In-Reply-To: <58EB1F4C.20404@tlinx.org>
References: <58EB1F4C.20404@tlinx.org>
From: Kyle Sallee
Date: Mon, 10 Apr 2017 11:01:40 -0700
Subject: Re: bug#26422: historical feature or grand daddy bug?
To: L A Walsh
Cc: 26422-done@debbugs.gnu.org, Paul Eggert
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=f4030436212669a416054cd3c447

--f4030436212669a416054cd3c447
Content-Type: text/plain; charset=UTF-8

On Sun, Apr 9, 2017 at 10:59 PM, L A Walsh wrote:

>    Anytime you have multiple blank lines in a row,
> you have consecutive line feeds.
>

For data typically processed by sort,
consecutive LF might be uncommon.
When that case does not occur,
the CPU cycles spent by the specialized code
would be wasted.

> ----
>    I'm sure if you submitted a working patch + documentation
> + rights assigned to GNU, and first born child given to FSF,
> the coreutils maintainers would consider it.
>

Past self-authored cat patches were declined.  :(
For modifications only I desire,
self-authored software gains immediate approval.  :)

By not reading the sort source code,
accidental source copying is avoided
and its boons and banes cannot be inherited.
From the unexpected output the question arose:
"Why LF before TAB?"

The sort program's performance is fine.
No complaint was intended.

All executed programs
share the same performance bane.

A fork + execve overhead, or
a posix_spawnp (plus related functions) overhead, exists.
Or more succinctly put:
starting a program costs CPU cycles.

The overhead might seem insignificant,
but in a script it accumulates
with each program launch.
Performance is lost.

In contrast to program-provided code,
library-provided code can be
loaded once and used frequently.
Loading a library also takes less time
than invoking a program.

To sort a small number of lines
by invoking the sort program,
a considerable program launch
overhead is incurred.

From a library perspective the following
potentially complex tasks seem attractive:
cp; mv; sort; tsort; wc.
cp and mv implementations can be surprisingly complex.
Self-authored implementations already exist.

For the cp and mv provided by coreutils,
options that purge the used file data
from the kernel cache could be useful.
The content of files that were copied or moved
is probably not immediately useful again,
yet it lingers in the kernel cache.
When the kernel cache uses almost all available RAM,
file copy performance tanks.
With a large and irrelevantly stocked kernel cache,
search performance also tanks.

It seems plausible that the kernel in use
is less than ideally configured.
For this task perhaps .../vfs_cache_pressure and
.../drop_caches might not suffice?

The posix_fadvise function seems useful.
But posix_fadvise must be invoked on descriptors.
For directory cache data, however,
posix_fadvise does not seem useful.

Thanks again for maintaining and sharing coreutils.
If a library interface to coreutils existed,
it would gain use.

P.S.
If the sendfile function provided by the Linux kernel
became POSIX approved, then
http://lists.gnu.org/archive/html/bug-fileutils/2003-03/msg00030.html
might merit reconsideration.

On systems where mass storage device data
throughput is not the bottleneck,
or where a cached conclusion suffices,
omitting the user space buffer
can conserve significant CPU cycles.
When data must be transferred between descriptors,
the sendfile function is useful.
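
On the wish for options that purge copied data from the kernel cache: cp and mv are not known here to offer one, but GNU dd documents a nocache flag that requests this, and it appears to rest on posix_fadvise. The following is a hedged sketch, assuming a reasonably recent GNU coreutils dd and mktemp; the temporary files are placeholders created only so the example is self-contained:

```shell
# Placeholder files, created here so the sketch can run on its own.
src=$(mktemp) || exit 1
dst=$(mktemp) || exit 1
printf 'some data\n' > "$src"

# Copy while requesting that the data not be kept in the cache
# (the data streams through using just the read-ahead cache).
dd if="$src" of="$dst" iflag=nocache oflag=nocache 2>/dev/null

# Advise the kernel to drop cached data for the whole source file.
# With count=0 nothing is copied; only the advice is issued.
dd if="$src" iflag=nocache count=0 2>/dev/null

cmp "$src" "$dst" && echo "copy is identical"
```

The request is advisory, so the kernel may keep some pages cached anyway (dirty pages in particular), and directory cache data is not affected.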
--f4030436212669a416054cd3c447 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
--f4030436212669a416054cd3c447--

From unknown Sun Jun 15 08:49:13 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Tue, 09 May 2017 11:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator