From debbugs-submit-bounces@debbugs.gnu.org Mon Oct 17 21:03:56 2011 Received: (at submit) by debbugs.gnu.org; 18 Oct 2011 01:03:56 +0000 Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RFy5z-0002FA-LP for submit@debbugs.gnu.org; Mon, 17 Oct 2011 21:03:56 -0400 Received: from eggs.gnu.org ([140.186.70.92]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RFxwy-000214-SV for submit@debbugs.gnu.org; Mon, 17 Oct 2011 20:54:37 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1RFxw0-0005iP-R3 for submit@debbugs.gnu.org; Mon, 17 Oct 2011 20:53:37 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-2.4 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD autolearn=unavailable version=3.3.1 Received: from lists.gnu.org ([140.186.70.17]:43300) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RFxw0-0005iL-Pd for submit@debbugs.gnu.org; Mon, 17 Oct 2011 20:53:36 -0400 Received: from eggs.gnu.org ([140.186.70.92]:49421) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RFxvz-0000pt-Iv for bug-coreutils@gnu.org; Mon, 17 Oct 2011 20:53:36 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1RFxvy-0005i4-EO for bug-coreutils@gnu.org; Mon, 17 Oct 2011 20:53:35 -0400 Received: from bero.eu ([88.198.22.18]:41343 helo=mail.bero.eu) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RFxvy-0005gT-81 for bug-coreutils@gnu.org; Mon, 17 Oct 2011 20:53:34 -0400 Received: from mail.bero.eu (unknown [127.0.0.1]) by mail.bero.eu (Postfix) with ESMTP id F256B9B83 for ; Tue, 18 Oct 2011 02:58:48 +0200 (CEST) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Date: Tue, 18 Oct 2011 01:59:12 +0100 From: Bernhard Rosenkraenzer To: Subject: sort -u throws out non-duplicates 
Message-ID: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> X-Sender: bero@bero.eu User-Agent: Ark Linux Roundcube Webmail/0.5.3 X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 140.186.70.17 X-Spam-Score: -6.6 (------) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Mon, 17 Oct 2011 21:03:54 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -6.6 (------) [bero@matterhorn tmp]$ wget http://bero.eu/java-source-list [...] [bero@matterhorn tmp]$ tr ' ' '\n' ) id 1RFyDx-0002RX-8H for submit@debbugs.gnu.org; Mon, 17 Oct 2011 21:12:09 -0400 Received: from eggs.gnu.org ([140.186.70.92]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RFy0a-00026H-0Y for submit@debbugs.gnu.org; Mon, 17 Oct 2011 20:58:20 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1RFxzc-0008MF-1R for submit@debbugs.gnu.org; Mon, 17 Oct 2011 20:57:20 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-2.4 required=5.0 tests=BAYES_00,RP_MATCHES_RCVD autolearn=unavailable version=3.3.1 Received: from lists.gnu.org ([140.186.70.17]:39658) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RFxzb-0008MA-VK for submit@debbugs.gnu.org; Mon, 17 Oct 2011 20:57:19 -0400 Received: from eggs.gnu.org ([140.186.70.92]:48487) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RFxza-00010h-NY for bug-coreutils@gnu.org; Mon, 17 Oct 2011 20:57:19 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1RFxzY-0008KQ-Qp for bug-coreutils@gnu.org; Mon, 17 Oct 2011 20:57:18 -0400 
Received: from bero.eu ([88.198.22.18]:56648 helo=mail.bero.eu) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1RFxzY-0008KD-KB for bug-coreutils@gnu.org; Mon, 17 Oct 2011 20:57:16 -0400 Received: from mail.bero.eu (unknown [127.0.0.1]) by mail.bero.eu (Postfix) with ESMTP id 5A581480E for ; Tue, 18 Oct 2011 03:02:33 +0200 (CEST) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Date: Tue, 18 Oct 2011 02:02:57 +0100 From: Bernhard Rosenkraenzer To: Subject: Re: sort -u throws out non-duplicates In-Reply-To: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> Message-ID: <738eae486d2b12975fb55c4774aa9068@bero.eu> X-Sender: bero@bero.eu User-Agent: Ark Linux Roundcube Webmail/0.5.3 X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3) X-Received-From: 140.186.70.17 X-Spam-Score: -6.6 (------) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Mon, 17 Oct 2011 21:12:08 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -6.6 (------) On Tue, 18 Oct 2011 01:59:12 +0100, Bernhard Rosenkraenzer wrote: > Note the missing .../java/java/security/cert/X509Certificate.java > > The problem occurs (at least) with sort from coreutils 8.12, 8.13 and > 8.14. This is locale related... Seems to happen in any non-C locale. 
[bero@matterhorn ~]$ tr ' ' '\n' ) id 1RFzLO-0004kc-4g for submit@debbugs.gnu.org; Mon, 17 Oct 2011 22:23:54 -0400 Received: from mx1.redhat.com ([209.132.183.28]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RFzLK-0004kP-EQ; Mon, 17 Oct 2011 22:23:52 -0400 Received: from int-mx12.intmail.prod.int.phx2.redhat.com (int-mx12.intmail.prod.int.phx2.redhat.com [10.5.11.25]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id p9I2MrqQ015774 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Mon, 17 Oct 2011 22:22:53 -0400 Received: from [10.3.113.158] (ovpn-113-158.phx2.redhat.com [10.3.113.158]) by int-mx12.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id p9I2MqsR016163; Mon, 17 Oct 2011 22:22:52 -0400 Message-ID: <4E9CE2FC.9070107@redhat.com> Date: Mon, 17 Oct 2011 20:22:52 -0600 From: Eric Blake Organization: Red Hat User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.23) Gecko/20110928 Fedora/3.1.15-1.fc14 Lightning/1.0b3pre Mnenhy/0.8.4 Thunderbird/3.1.15 MIME-Version: 1.0 To: Bernhard Rosenkraenzer Subject: Re: bug#9780: sort -u throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> In-Reply-To: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Scanned-By: MIMEDefang 2.68 on 10.5.11.25 X-Spam-Score: -10.3 (----------) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -10.3 (----------) tag 9780 moreinfo thanks On 10/17/2011 06:59 PM, Bernhard Rosenkraenzer wrote: > [bero@matterhorn tmp]$ wget http://bero.eu/java-source-list > [...] 
> [bero@matterhorn tmp]$ tr ' ' '\n' X509Certificate
> libcore/luni/src/main/java/java/security/cert/X509Certificate.java
> libcore/luni/src/main/java/javax/security/cert/X509Certificate.java
>
> This is correct...
>
> [bero@matterhorn tmp]$ tr ' ' '\n' X509Certificate
> libcore/luni/src/main/java/javax/security/cert/X509Certificate.java
>
> Note the missing .../java/java/security/cert/X509Certificate.java

Thanks for the report. Unfortunately, you did not provide enough
information to reproduce this - for example, what platform are you
running on? Can you narrow it down to a single file of say 5 or so
lines? Can you reproduce the problem with shorter input lines?

My guess, although I need more info to confirm it, is that this is
not a bug, but rather that java-source-list contains some lines that
differ in case and/or punctuation but happen to collate identically.
If so, then sort -u is picking the lower-case version as the unique
line, at which point your grep for the case-sensitive X509Certificate
is obviously failing. The fact that you already proved that LC_ALL=C
changes the behavior lends credence to my supposition, since C is
byte-sensitive, but most other languages collate case-insensitively.

See also the FAQ:
https://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

> The problem occurs (at least) with sort from coreutils 8.12, 8.13 and 8.14.

Use 'sort --debug' to help decipher sort's behavior. Here's my
demonstration that I cannot reproduce it using coreutils.git with just
two input lines:

$ printf 'libcore/luni/src/main/java/java/security/cert/X509Certificate.java\nlibcore/luni/src/main/java/javax/security/cert/X509Certificate.java\n' | sort -u --debug
sort: using `en_US.UTF-8' sorting rules
libcore/luni/src/main/java/java/security/cert/X509Certificate.java
__________________________________________________________________
libcore/luni/src/main/java/javax/security/cert/X509Certificate.java
___________________________________________________________________

So there's definitely something else in java-source-list that we aren't
seeing that is (probably correctly) affecting your output.

-- 
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 18 04:43:19 2011
Received: (at 9780) by debbugs.gnu.org; 18 Oct 2011 08:43:19 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RG5GZ-0006Wv-F2 for submit@debbugs.gnu.org; Tue, 18 Oct 2011 04:43:19 -0400
Received: from bero.eu ([88.198.22.18] helo=mail.bero.eu) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RG5GV-0006Wk-4I for 9780@debbugs.gnu.org; Tue, 18 Oct 2011 04:43:16 -0400
Received: from mail.bero.eu (unknown [127.0.0.1]) by mail.bero.eu (Postfix) with ESMTP id 67E0B9E2D; Tue, 18 Oct 2011 10:47:36 +0200 (CEST)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Date: Tue, 18 Oct 2011 09:48:00 +0100
From: Bernhard Rosenkraenzer
To: Eric Blake
Subject: Re: bug#9780: sort -u throws out non-duplicates
In-Reply-To: <4E9CE2FC.9070107@redhat.com>
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <4E9CE2FC.9070107@redhat.com>
Message-ID: <7ccc8c25284b6fc4ca4c1d89dcbe0d8f@bero.eu>
X-Sender: bero@bero.eu
User-Agent: Ark Linux Roundcube Webmail/0.5.3
X-Spam-Score: -4.6 (----)
X-Debbugs-Envelope-To: 9780
Cc: 9780@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -4.2 (----)

On Mon, 17 Oct 2011 20:22:52 -0600, Eric Blake wrote:
> On 10/17/2011 06:59 PM, Bernhard Rosenkraenzer wrote:
> Thanks for the report. Unfortunately, you did not provide enough
> information to reproduce this - for example, what platform are you
> running on?

Fairly current Linux -- kernel 3.1-rc9, eglibc 2.14.1

> Can you narrow it down to a single file of say 5 or so
> lines? Can you reproduce the problem with shorter input lines?

Yes:
[bero@matterhorn ~]$ echo 'libcore/luni/src/main/java/java/security/cert/X509CRLSelector.java libcore/luni/src/main/java/java/security/cert/X509CertSelector.java libcore/luni/src/main/java/java/security/cert/X509Certificate.java libcore/luni/src/main/java/javax/security/cert/X509Certificate.java' |tr ' ' '\n' |sort -u --debug
sort: using `en_US' sorting rules
libcore/luni/src/main/java/java/security/cert/X509CertSelector.java
___________________________________________________________________
libcore/luni/src/main/java/java/security/cert/X509CRLSelector.java
__________________________________________________________________
libcore/luni/src/main/java/javax/security/cert/X509Certificate.java
___________________________________________________________________

It starts working correctly if any of the entries are removed, yet none
of those should match as a duplicate as far as I can see.

> My guess, although I need more info to confirm it, is that this is
> not a bug, but rather that java-source-list contains some lines that
> differ in case and/or punctuation but happen to collate identically.
> If so, then sort -u is picking the lower-case version as the unique
> line, at which point your grep for the case-sensitive X509Certificate
> is obviously failing.

FWIW changing everything to lower case doesn't change anything
[bero@matterhorn ~]$ echo 'libcore/luni/src/main/java/java/security/cert/x509crlselector.java libcore/luni/src/main/java/java/security/cert/x509certselector.java libcore/luni/src/main/java/java/security/cert/x509certificate.java libcore/luni/src/main/java/javax/security/cert/x509certificate.java' |tr ' ' '\n' |sort -u --debug
sort: using `en_US' sorting rules
libcore/luni/src/main/java/java/security/cert/x509certselector.java
___________________________________________________________________
libcore/luni/src/main/java/java/security/cert/x509crlselector.java
__________________________________________________________________
libcore/luni/src/main/java/javax/security/cert/x509certificate.java
___________________________________________________________________

ttyl
bero

From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 18 05:30:29 2011
Received: (at 9780) by debbugs.gnu.org; 18 Oct 2011 09:30:29 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RG60C-0007cA-Os for submit@debbugs.gnu.org; Tue, 18 Oct 2011 05:30:29 -0400
Received: from mx.meyering.net ([88.168.87.75]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RG60B-0007c2-1g for 9780@debbugs.gnu.org; Tue, 18 Oct 2011 05:30:28 -0400
Received: from rho.meyering.net (localhost.localdomain [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id ABCFF6004E; Tue, 18 Oct 2011 11:29:29 +0200 (CEST)
From: Jim Meyering
To: Bernhard Rosenkraenzer
Subject: Re: bug#9780: sort -u throws out non-duplicates
In-Reply-To: <7ccc8c25284b6fc4ca4c1d89dcbe0d8f@bero.eu> (Bernhard Rosenkraenzer's message of "Tue, 18 Oct 2011 09:48:00 +0100")
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu>
<4E9CE2FC.9070107@redhat.com> <7ccc8c25284b6fc4ca4c1d89dcbe0d8f@bero.eu>
Date: Tue, 18 Oct 2011 11:29:29 +0200
Message-ID: <87hb36bqbq.fsf@rho.meyering.net>
Lines: 32
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -2.8 (--)
X-Debbugs-Envelope-To: 9780
Cc: 9780@debbugs.gnu.org, Eric Blake
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -2.8 (--)

Bernhard Rosenkraenzer wrote:
> On Mon, 17 Oct 2011 20:22:52 -0600, Eric Blake wrote:
>> On 10/17/2011 06:59 PM, Bernhard Rosenkraenzer wrote:
>> Thanks for the report. Unfortunately, you did not provide enough
>> information to reproduce this - for example, what platform are you
>> running on?
>
> Fairly current Linux -- kernel 3.1-rc9, eglibc 2.14.1
>
>> Can you narrow it down to a single file of say 5 or so
>> lines? Can you reproduce the problem with shorter input lines?
>
> Yes:
> [bero@matterhorn ~]$ echo
> libcore/luni/src/main/java/java/security/cert/X509CRLSelector.java
> libcore/luni/src/main/java/java/security/cert/X509CertSelector.java
> libcore/luni/src/main/java/java/security/cert/X509Certificate.java
> libcore/luni/src/main/java/javax/security/cert/X509Certificate.java'
> |tr ' ' '\n' |sort -u --debug

So far, I've been unable to reproduce that on Fedora 16 or Debian
unstable both x86_64. I.e., the following (equivalent to above, but with
no long lines) always prints the four input lines:

    cert=libcore/luni/src/main/java/java/security/cert
    echo \
      $cert/X509CRLSelector.java \
      $cert/X509CertSelector.java \
      $cert/X509Certificate.java \
      libcore/luni/src/main/java/javax/security/cert/X509Certificate.java \
      |tr ' ' '\n' |sort -u --debug

From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 18 08:03:35 2011
Received: (at 9780) by debbugs.gnu.org; 18 Oct 2011 12:03:35 +0000
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RG8OM-0003Pd-Ug for submit@debbugs.gnu.org; Tue, 18 Oct 2011 08:03:35 -0400
Received: from mx1.redhat.com ([209.132.183.28]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1RG8OK-0003PV-7R for 9780@debbugs.gnu.org; Tue, 18 Oct 2011 08:03:33 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id p9IC2Wtr021572 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 18 Oct 2011 08:02:32 -0400
Received: from [10.3.113.116] (ovpn-113-116.phx2.redhat.com [10.3.113.116]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id p9IC2UZx028305 (version=TLSv1/SSLv3 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 18 Oct 2011 08:02:31 -0400
Message-ID: <4E9D6A98.9030602@draigBrady.com>
Date: Tue, 18 Oct 2011 13:01:28 +0100
From: Pádraig Brady
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0
MIME-Version: 1.0
To: Bernhard Rosenkraenzer
Subject: Re: bug#9780: sort -u throws out non-duplicates
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <4E9CE2FC.9070107@redhat.com> <7ccc8c25284b6fc4ca4c1d89dcbe0d8f@bero.eu>
In-Reply-To: <7ccc8c25284b6fc4ca4c1d89dcbe0d8f@bero.eu>
X-Enigmail-Version: 1.3.2
Content-Type: text/plain; charset=UTF-8
X-Scanned-By: MIMEDefang
2.68 on 10.5.11.22
Content-Transfer-Encoding: quoted-printable
X-MIME-Autoconverted: from 8bit to quoted-printable by mx1.redhat.com id p9IC2Wtr021572
X-Spam-Score: -10.6 (----------)
X-Debbugs-Envelope-To: 9780
Cc: 9780@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -10.6 (----------)

On 10/18/2011 09:48 AM, Bernhard Rosenkraenzer wrote:
> On Mon, 17 Oct 2011 20:22:52 -0600, Eric Blake wrote:
>> On 10/17/2011 06:59 PM, Bernhard Rosenkraenzer wrote:
>> Thanks for the report. Unfortunately, you did not provide enough
>> information to reproduce this - for example, what platform are you
>> running on?
>
> Fairly current Linux -- kernel 3.1-rc9, eglibc 2.14.1
>
>> Can you narrow it down to a single file of say 5 or so
>> lines? Can you reproduce the problem with shorter input lines?
>
> Yes:
> [bero@matterhorn ~]$ echo 'libcore/luni/src/main/java/java/security/cert/X509CRLSelector.java libcore/luni/src/main/java/java/security/cert/X509CertSelector.java libcore/luni/src/main/java/java/security/cert/X509Certificate.java libcore/luni/src/main/java/javax/security/cert/X509Certificate.java' |tr ' ' '\n' |sort -u --debug
> sort: using `en_US' sorting rules
> libcore/luni/src/main/java/java/security/cert/X509CertSelector.java
> ___________________________________________________________________
> libcore/luni/src/main/java/java/security/cert/X509CRLSelector.java
> __________________________________________________________________
> libcore/luni/src/main/java/javax/security/cert/X509Certificate.java
> ___________________________________________________________________
>
>
> It starts working correctly if any of the entries are removed, yet none of those should match as a duplicate as far as I can see.
>
>> My guess, although I need more info to confirm it, is that this is
>> not a bug, but rather that java-source-list contains some lines that
>> differ in case and/or punctuation but happen to collate identically.
>> If so, then sort -u is picking the lower-case version as the unique
>> line, at which point your grep for the case-sensitive X509Certificate
>> is obviously failing.
>
> FWIW changing everything to lower case doesn't change anything
> [bero@matterhorn ~]$ echo 'libcore/luni/src/main/java/java/security/cert/x509crlselector.java libcore/luni/src/main/java/java/security/cert/x509certselector.java libcore/luni/src/main/java/java/security/cert/x509certificate.java libcore/luni/src/main/java/javax/security/cert/x509certificate.java' |tr ' ' '\n' |sort -u --debug
> sort: using `en_US' sorting rules
> libcore/luni/src/main/java/java/security/cert/x509certselector.java
> ___________________________________________________________________
> libcore/luni/src/main/java/java/security/cert/x509crlselector.java
> __________________________________________________________________
> libcore/luni/src/main/java/javax/security/cert/x509certificate.java
> ___________________________________________________________________
>
>

I can't reproduce this.
There may be some issues currently with debian locale defs?
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=636286

cheers,
Pádraig.
From debbugs-submit-bounces@debbugs.gnu.org Tue Aug 14 11:58:15 2012 Received: (at 9780) by debbugs.gnu.org; 14 Aug 2012 15:58:15 +0000 Received: from localhost ([127.0.0.1]:56495 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1JVW-0007V2-E5 for submit@debbugs.gnu.org; Tue, 14 Aug 2012 11:58:15 -0400 Received: from columba.intomics.com ([77.72.50.68]:37172) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1CRu-0003ZX-AP for 9780@debbugs.gnu.org; Tue, 14 Aug 2012 04:26:03 -0400 Received: from localhost (localhost [127.0.0.1]) by columba.intomics.com (Postfix) with ESMTP id D8C5222039C for <9780@debbugs.gnu.org>; Tue, 14 Aug 2012 10:17:23 +0200 (CEST) X-Virus-Scanned: Debian amavisd-new at intomics.dk Received: from columba ([127.0.0.1]) by localhost (columba.intomics.com [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 9QJh0gWc4Al6 for <9780@debbugs.gnu.org>; Tue, 14 Aug 2012 10:17:22 +0200 (CEST) Received: from dhcp-0-1-112.intomics.com (dhcp-0-1-112.intomics.com [10.0.1.112]) (Authenticated sender: rbh) by columba.intomics.com (Postfix) with ESMTPSA id 1EA1022034D for <9780@debbugs.gnu.org>; Tue, 14 Aug 2012 10:17:22 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=intomics.com; s=201205; t=1344932242; bh=/VB0CnoJtNn3j0AN3qYjDPUsGhEs6Ub0NW//G2jgZ20=; h=From:Subject:Date:To; b=emBn7VwpHUWJdcQURpHkcfKPwwHVEgMW6JDHyhf/ensL4UGXYk2sPeRbiQ+aj0+zP IBbc0ZitbHZssn4QdBjI6MuvpkWSyfVmXPspDBIfpccZNchbYEt21Rzh6CvHBlTxjs HsUdBxoGXfzbxozJpTZMy92QCKgbmx8PQOnNv+s0= From: Rasmus Borup Hansen Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable Subject: sort -u throws out non-duplicates Date: Tue, 14 Aug 2012 10:17:21 +0200 Message-Id: <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> To: 9780@debbugs.gnu.org Mime-Version: 1.0 (Apple Message framework v1278) X-Mailer: Apple Mail (2.1278) X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 X-Mailman-Approved-At: Tue, 14 Aug 
2012 11:58:12 -0400
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -1.9 (-)

I came across this bug and have written a small shell script (below) that reproduces it on recent Linux distributions. It also reproduces the error using the latest coreutils compiled from sources.

Best regards,

Rasmus

#!/bin/sh

# Generate a file consisting only of 24 long sequences of lines with
# numbers from 0 to 23. This is actually a file that strongly
# resembles one we came upon during our work.

(
for i in `seq 1 18624` ; do echo 18; done
for i in `seq 1 69001` ; do echo 10; done
for i in `seq 1 37950` ; do echo 20; done
for i in `seq 1 124026` ; do echo 2; done
for i in `seq 1 52202` ; do echo 15; done
for i in `seq 1 3660` ; do echo 0; done
for i in `seq 1 71627` ; do echo 5; done
for i in `seq 1 69989` ; do echo 19; done
for i in `seq 1 65192` ; do echo 9; done
for i in `seq 1 51058` ; do echo 16; done
for i in `seq 1 26810` ; do echo 13; done
for i in `seq 1 56387` ; do echo 23; done
for i in `seq 1 77273` ; do echo 7; done
for i in `seq 1 159425` ; do echo 1; done
for i in `seq 1 36851` ; do echo 22; done
for i in `seq 1 102583` ; do echo 12; done
for i in `seq 1 75429` ; do echo 17; done
for i in `seq 1 82322` ; do echo 6; done
for i in `seq 1 101135` ; do echo 3; done
for i in `seq 1 63726` ; do echo 4; done
for i in `seq 1 57302` ; do echo 14; done
for i in `seq 1 57770` ; do echo 8; done
for i in `seq 1 18032` ; do echo 21; done
for i in `seq 1 101938` ; do echo 11; done
) > inputfile

# There should be 24 unique lines in inputfile no matter what the -S
# parameter to sort is.

for SIZE in `seq 128 140` ; do
  sort -S $SIZE -u inputfile | wc -l
done

# Ubuntu 12.04    OpenSuSE 11.4   SLES 10 SP1     Gentoo
# coreutils 8.12  coreutils 8.9   coreutils 5.93  coreutils 8.14
# 23              24              24              24
# 24              24              24              24
# 24              24              24              23
# 23              23              24              24
# 24              21              24              22
# 24              24              24              23
# 22              23              24              23
# 24              23              24              23
# 22              24              24              23
# 24              24              24              24
# 24              24              24              24
# 24              24              24              22
# 24              24              24              23

From debbugs-submit-bounces@debbugs.gnu.org Tue Aug 14 14:37:17 2012
Received: (at 9780) by debbugs.gnu.org; 14 Aug 2012 18:37:17 +0000
Received: from localhost ([127.0.0.1]:56844 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1LzR-00088v-9W for submit@debbugs.gnu.org; Tue, 14 Aug 2012 14:37:17 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:50407) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1LzO-00088m-EP for 9780@debbugs.gnu.org; Tue, 14 Aug 2012 14:37:15 -0400
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 66E6539E8014; Tue, 14 Aug 2012 11:28:34 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id lTN5y7Gy0SQA; Tue, 14 Aug 2012 11:28:34 -0700 (PDT)
Received: from [10.10.73.118] (unknown [208.181.80.18]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id E103139E8008; Tue, 14 Aug 2012 11:28:33 -0700 (PDT)
Message-ID: <502A98C5.40302@cs.ucla.edu>
Date: Tue, 14 Aug 2012 11:28:21 -0700
From: Paul Eggert
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0
MIME-Version: 1.0
To: Rasmus Borup Hansen
Subject: Re: bug#9780: sort -u throws out non-duplicates
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com>
In-Reply-To: <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
X-Spam-Score: -1.9 (-)
X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Thanks very much for that test case; I've confirmed the bug on my platform with the latest 'sort'. If nobody else gets to it I will try to take a look at it when I find the time (most likely in a week or so). From debbugs-submit-bounces@debbugs.gnu.org Tue Aug 14 17:17:56 2012 Received: (at 9780) by debbugs.gnu.org; 14 Aug 2012 21:17:56 +0000 Received: from localhost ([127.0.0.1]:57120 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1OUu-00058U-Gx for submit@debbugs.gnu.org; Tue, 14 Aug 2012 17:17:56 -0400 Received: from mx.meyering.net ([88.168.87.75]:50758) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1OUs-00058N-CV for 9780@debbugs.gnu.org; Tue, 14 Aug 2012 17:17:55 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id F382D60098; Tue, 14 Aug 2012 23:09:11 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <502A98C5.40302@cs.ucla.edu> (Paul Eggert's message of "Tue, 14 Aug 2012 11:28:21 -0700") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> Date: Tue, 14 Aug 2012 23:09:11 +0200 Message-ID: <87obmdp4eg.fsf@rho.meyering.net> Lines: 69 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: 
debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
X-Spam-Score: -1.9 (-)

Paul Eggert wrote:
> Thanks very much for that test case; I've confirmed the bug on
> my platform with the latest 'sort'. If nobody else gets to it
> I will try to take a look at it when I find the time
> (most likely in a week or so).

Yes, thanks again!
That is a serious bug.
It has been around for a long time.

This small reproducer (on a 2-core i686 system)

    perl -e 'printf "33\n"x2 ."7\n"x31 ."1\n"' | src/sort -S1 -u

prints this:

    33
    7

but should print this:

    1
    33
    7

On a multi-core x86_64, I need slightly different input to trigger
the failure. Note how this uses only 22 '7's and --parallel=1.

    $ perl -e 'printf "33\n"x2 ."7\n"x22 ."1\n"'|src/sort --para=1 -S1 -u
    33
    7

The problem starts with write_unique's static variable:

    static void
    write_unique (struct line const *line, FILE *tfp, char const *temp_output)
    {
      static struct line saved;
      if (unique)
        {
          if (saved.text && ! compare (line, &saved))
            return;
          saved = *line;
        }
      ...

Note how that merely makes a shallow copy of "*line".
I.e., it merely copies line's 4 members, 3 of which are pointers.

    (gdb) p *line
    $1 = {
      text = 0x806221e "1",
      length = 2,
      keybeg = 0x62626262,
      keylim = 0x62626262

In that example, the two key* variables are not even initialized,
which is not a problem, since they're not used in this example.
The one that causes trouble is the .text pointer.
The line buffer storage into which it points ends
up being overwritten when new data is read in (via fread),
and so if you are unlucky, you'll get an accidental match
and mistakenly skip the sole (in reduced temp files) occurrence
of a line, resulting in incorrect output.

The solution may be to make a deep copy, and store it in
an extensible (probably never-freed) buffer.
However, that looks like it will be comparatively expensive,
since determining whether keybeg and keylim must
also be copied depends on many global options,
the same ones used by sort's compare function.

A slower-yet-correct "sort -u" is obviously preferable to
our currently faster-yet-buggy one.

From debbugs-submit-bounces@debbugs.gnu.org Wed Aug 15 13:56:23 2012
Received: (at 9780) by debbugs.gnu.org; 15 Aug 2012 17:56:23 +0000
Received: from localhost ([127.0.0.1]:59459 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1hpO-0001aL-N5 for submit@debbugs.gnu.org; Wed, 15 Aug 2012 13:56:23 -0400
Received: from mx.meyering.net ([88.168.87.75]:53828) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1hpL-0001aB-I4 for 9780@debbugs.gnu.org; Wed, 15 Aug 2012 13:56:21 -0400
Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 30AE9600B1; Wed, 15 Aug 2012 19:47:33 +0200 (CEST)
From: Jim Meyering
To: Paul Eggert
Subject: Re: bug#9780: sort -u throws out non-duplicates
In-Reply-To: <87obmdp4eg.fsf@rho.meyering.net> (Jim Meyering's message of "Tue, 14 Aug 2012 23:09:11 +0200")
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net>
Date: Wed, 15 Aug 2012 19:47:33 +0200
Message-ID:
<87r4r8m4i2.fsf@rho.meyering.net> Lines: 173 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: > Paul Eggert wrote: >> Thanks very much for that test case; I've confirmed the bug on >> my platform with the latest 'sort'. If nobody else gets to it >> I will try to take a look at it when I find the time >> (most likely in a week or so). > > Yes, thanks again! > That is a serious bug. > It has been around for a long time. > > This small reproducer (on a 2-core i686 system) > perl -e 'printf "33\n"x2 ."7\n"x31 ."1\n"' | src/sort -S1 -u > prints this: > 33 > 7 > but should print this: > 1 > 33 > 7 > > On a multi-core x86_64, I need slightly different input to trigger > the failure. Note how this uses only 22 '7's and --parallel=1. > > $ perl -e 'printf "33\n"x2 ."7\n"x22 ."1\n"'|src/sort --para=1 -S1 -u > 33 > 7 > > The problem starts with write_unique's static variable: > > static void > write_unique (struct line const *line, FILE *tfp, char const *temp_output) > { > static struct line saved; > if (unique) > { > if (saved.text && ! compare (line, &saved)) > return; > saved = *line; > } > ... > > Note how that merely makes a shallow copy of "*line". > I.e., it merely copies line's 4 members, 3 of which are pointers. > > (gdb) p *line > $1 = { > text = 0x806221e "1", > length = 2, > keybeg = 0x62626262
, > keylim = 0x62626262
> > In that example, the two key* variables are not even initialized, > which is not a problem, since they're not used in this example. > The one that causes trouble is the .text pointer. > The line buffer storage into which it points ends > up being overwritten when new data is read in (via fread), > and so if you are unlucky, you'll get an accidental match > and mistakenly skip the sole (in reduced temp files) occurrence > of a line, resulting in incorrect output. > > The solution may be to make a deep copy, and store it in > an extensible (probably never-freed) buffer. > However, that looks like it will be comparatively expensive, > since determining whether keybeg and keylim must > also be copied depends on many global options, > the same ones used by sort's compare function. > > A slower-yet-correct "sort -u" is obviously preferable to > our currently faster-yet-buggy one. I'm technically "off" today, so have had little time. In case anyone is chomping at the bit, here's a preliminary patch: Here's a smaller test case that appears to be host/nproc-independent: It should print two lines: 1, then 7. Without this patch, it prints only "7". (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u Of course, it needs more/better comments, NEWS and tests -- and not just the one above, but also one that demonstrates the need for the key* adjustments below. This solution doesn't incur much of a performance penalty because it copies the line only rarely: just before an fread call that might modify the currently-saved text. >From 24f9646e3b954b7c914c0f3139054dfce466d314 Mon Sep 17 00:00:00 2001 From: Jim Meyering Date: Wed, 15 Aug 2012 12:30:44 +0200 Subject: [PATCH] sort: fix bug with --unique --- src/sort.c | 37 ++++++++++++++++++++++++++++++++++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/src/sort.c b/src/sort.c index d362dc5..6b07c22 100644 --- a/src/sort.c +++ b/src/sort.c @@ -262,6 +262,9 @@ struct merge_node_queue when popping. 
*/ }; +/* Used to implement --unique (-u). */ +static struct line saved_line; + /* FIXME: None of these tables work with multibyte character sets. Also, there are many other bugs when handling multibyte characters. One way to fix this is to rewrite 'sort' to use wide characters @@ -1702,6 +1705,14 @@ limfield (struct line const *line, struct keyfield const *key) return ptr; } +/* Return true if LINE and the buffer BUF of length LEN overlap. */ +static inline bool +overlap (char const *buf, size_t len, struct line const *line) +{ + char const *line_end = line->text + line->length; + return !(line_end <= buf || buf + len <= line->text); +} + /* Fill BUF reading from FP, moving buf->left bytes from the end of buf->buf to the beginning first. If EOF is reached and the file wasn't terminated by a newline, supply one. Set up BUF's line @@ -1742,6 +1753,27 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file) rest of the input file consists entirely of newlines, except that the last byte is not a newline. */ size_t readsize = (avail - 1) / (line_bytes + 1); + if (unique && overlap (ptr, readsize, &saved_line)) + { + /* Copy saved_line.text into a buffer where it won't be clobbered + and if KEY is non-NULL, adjust saved_line.key* to match. 
*/ + static char *safe_text; + static size_t safe_text_n_alloc; + if (safe_text_n_alloc < saved_line.length) + { + safe_text_n_alloc = saved_line.length; + safe_text = x2nrealloc (safe_text, &safe_text_n_alloc, 1); + } + memcpy (safe_text, saved_line.text, saved_line.length); + if (key) + { + #define s saved_line + s.keybeg = safe_text + (s.keybeg - s.text); + s.keylim = safe_text + (s.keylim - s.text); + #undef s + } + saved_line.text = safe_text; + } size_t bytes_read = fread (ptr, 1, readsize, fp); char *ptrlim = ptr + bytes_read; char *p; @@ -3348,13 +3380,12 @@ queue_pop (struct merge_node_queue *queue) static void write_unique (struct line const *line, FILE *tfp, char const *temp_output) { - static struct line saved; if (unique) { - if (saved.text && ! compare (line, &saved)) + if (saved_line.text && ! compare (line, &saved_line)) return; - saved = *line; + saved_line = *line; } write_line (line, tfp, temp_output); -- 1.7.12.rc2.16.g034161a From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 16 03:36:14 2012 Received: (at 9780) by debbugs.gnu.org; 16 Aug 2012 07:36:14 +0000 Received: from localhost ([127.0.0.1]:60380 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1uco-0005r9-Cz for submit@debbugs.gnu.org; Thu, 16 Aug 2012 03:36:14 -0400 Received: from mx.meyering.net ([88.168.87.75]:55827) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1ucn-0005r2-0Q for 9780@debbugs.gnu.org; Thu, 16 Aug 2012 03:36:13 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id DA0C4600BB; Thu, 16 Aug 2012 09:27:23 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <87r4r8m4i2.fsf@rho.meyering.net> (Jim Meyering's message of "Wed, 15 Aug 2012 19:47:33 +0200") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> 
<87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> Date: Thu, 16 Aug 2012 09:27:23 +0200 Message-ID: <87628jmh44.fsf@rho.meyering.net> Lines: 19 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: ... > Here's a smaller test case that appears to be host/nproc-independent: > It should print two lines: 1, then 7. > Without this patch, it prints only "7". > > (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u > > Of course, it needs more/better comments, NEWS and > tests -- and not just the one above, but also one that > demonstrates the need for the key* adjustments below. FYI, here's the required test: (yes 7|head -10; echo 1)|sed 's/^/1 /'|sort -k2,2 --p=1 -S32b -u Without the if (key) { ... } part of my patch, it would fail. I had to tweak the number of '7's (s/11/10) in the input to make it trigger. 
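The key* adjustment that this test exercises can be sketched in isolation. What follows is only a minimal illustration, not the code in src/sort.c: `struct line` is a simplified stand-in with the four members shown in the gdb output earlier in the thread, and `relocate_line` and its `has_key` parameter are hypothetical names made up for this example (sort itself tests its global key list instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for sort's struct line (an assumption: only the
   four members visible in the gdb output are modelled).  */
struct line
{
  char *text;     /* start of the line's bytes */
  size_t length;  /* byte count, including the terminator */
  char *keybeg;   /* start of the key, pointing into text */
  char *keylim;   /* end of the key, pointing into text */
};

/* Copy L's text into freshly allocated storage and, when HAS_KEY is
   nonzero, rebase keybeg/keylim so they keep the same offsets inside
   the copy.  Return 0 on success, -1 if allocation fails.
   Hypothetical helper, for illustration only.  */
static int
relocate_line (struct line *l, int has_key)
{
  char *copy = malloc (l->length ? l->length : 1);
  if (!copy)
    return -1;
  memcpy (copy, l->text, l->length);
  if (has_key)
    {
      /* Take the offsets while l->text still names the old storage.  */
      ptrdiff_t beg = l->keybeg - l->text;
      ptrdiff_t lim = l->keylim - l->text;
      l->keybeg = copy + beg;
      l->keylim = copy + lim;
    }
  l->text = copy;
  return 0;
}
```

If only .text were redirected, keybeg and keylim would still point into the old buffer, so a keyed comparison such as -k2,2 would read whatever the next fread put there.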
From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 16 04:18:16 2012 Received: (at 9780) by debbugs.gnu.org; 16 Aug 2012 08:18:16 +0000 Received: from localhost ([127.0.0.1]:60450 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1vHT-0006n5-St for submit@debbugs.gnu.org; Thu, 16 Aug 2012 04:18:16 -0400 Received: from mx.meyering.net ([88.168.87.75]:55928) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1vHR-0006my-Qs for 9780@debbugs.gnu.org; Thu, 16 Aug 2012 04:18:14 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id C9F0660103; Thu, 16 Aug 2012 10:09:24 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <87628jmh44.fsf@rho.meyering.net> (Jim Meyering's message of "Thu, 16 Aug 2012 09:27:23 +0200") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <87628jmh44.fsf@rho.meyering.net> Date: Thu, 16 Aug 2012 10:09:24 +0200 Message-ID: <87zk5vl0ln.fsf@rho.meyering.net> Lines: 24 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: > Jim Meyering wrote: > ... >> Here's a smaller test case that appears to be host/nproc-independent: >> It should print two lines: 1, then 7. >> Without this patch, it prints only "7". 
>> >> (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u >> >> Of course, it needs more/better comments, NEWS and >> tests -- and not just the one above, but also one that >> demonstrates the need for the key* adjustments below. > > FYI, here's the required test: > > (yes 7|head -10; echo 1)|sed 's/^/1 /'|sort -k2,2 --p=1 -S32b -u > > Without the if (key) { ... } part of my patch, it would fail. > I had to tweak the number of '7's (s/11/10) in the input to make > it trigger. Hmm... The above is arch-specific. It triggers the bug on i686, but not on x86_64. From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 16 04:37:12 2012 Received: (at 9780) by debbugs.gnu.org; 16 Aug 2012 08:37:13 +0000 Received: from localhost ([127.0.0.1]:60465 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1vZo-0007DJ-1c for submit@debbugs.gnu.org; Thu, 16 Aug 2012 04:37:12 -0400 Received: from moutng.kundenserver.de ([212.227.17.8]:57551) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1vZl-0007DC-Qi for 9780@debbugs.gnu.org; Thu, 16 Aug 2012 04:37:11 -0400 Received: from [192.168.2.108] (p4FF74EFC.dip.t-dialin.net [79.247.78.252]) by mrelayeu.kundenserver.de (node=mreu2) with ESMTP (Nemesis) id 0MR93L-1T8Gid3AYZ-00Ug5z; Thu, 16 Aug 2012 10:28:01 +0200 Message-ID: <502CAF0E.3090805@bernhard-voelker.de> Date: Thu, 16 Aug 2012 10:27:58 +0200 From: Bernhard Voelker User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120713 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <87628jmh44.fsf@rho.meyering.net> <87zk5vl0ln.fsf@rho.meyering.net> In-Reply-To: <87zk5vl0ln.fsf@rho.meyering.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit 
X-Provags-ID: V02:K0:UPomHcKMxgmbFSEOw9OMWO1DKjAZLC5dM/9Xpr39gg6 WGkcdjrUMNj0eRU671O4XDygJtOySXB2Bik/yz7zGzntbfi7r0 d9B7Brjo4ypODFci7zMk5dve4ssDcKZfiwyWmWTkaGMLzTThEX 2PSm2HX380yWMIWw0vyLfkl0+vVyGe6HGshcXlayn4+skQgKrP kLyH8OQSDn2w/ysWD+6U0l9BvpxF2DN/dBIRKQ98KSakbqdt8t dk3/76GFKu7qS3un5DTnxYn+0ghHlIl63ulQdJhm8NGU4CYrga t9v3iImI2xjY2i3NodFpANGHltPLzTX0puftaFtfhJ3qTO3cqf 728lWQtn5A58TPho/n0MtTM7zyPvQZ4bZeV9tQY6f X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Paul Eggert , Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) On 08/16/2012 10:09 AM, Jim Meyering wrote: >> FYI, here's the required test: >> > >> > (yes 7|head -10; echo 1)|sed 's/^/1 /'|sort -k2,2 --p=1 -S32b -u >> > >> > Without the if (key) { ... } part of my patch, it would fail. >> > I had to tweak the number of '7's (s/11/10) in the input to make >> > it trigger. > Hmm... The above is arch-specific. > It triggers the bug on i686, but not on x86_64. 
This triggers the bug on my x86_64: $ ~/cu> (yes 7|head -n 100; echo 1)|sed 's/^/1 /'| src/sort -k2,2 --p=1 -S1k -u 1 7 However, a slightly different command line does not: $ ~/cu> (yes 7|head -n 10; echo 1)|sed 's/^/1 /'| src/sort -k2,2 --p=1 -S1k -u 1 1 1 7 $ ~/cu> (yes 7|head -n 100; echo 1)|sed 's/^/1 /'| src/sort -k2,2 --p=1 -S1M -u 1 1 1 7 Have a nice day, Berny From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 16 04:37:22 2012 Received: (at 9780) by debbugs.gnu.org; 16 Aug 2012 08:37:22 +0000 Received: from localhost ([127.0.0.1]:60468 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1vZu-0007Dd-VZ for submit@debbugs.gnu.org; Thu, 16 Aug 2012 04:37:20 -0400 Received: from mx.meyering.net ([88.168.87.75]:55977) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T1vZr-0007DS-I4 for 9780@debbugs.gnu.org; Thu, 16 Aug 2012 04:37:16 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id A707560103; Thu, 16 Aug 2012 10:28:25 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <87zk5vl0ln.fsf@rho.meyering.net> (Jim Meyering's message of "Thu, 16 Aug 2012 10:09:24 +0200") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <87628jmh44.fsf@rho.meyering.net> <87zk5vl0ln.fsf@rho.meyering.net> Date: Thu, 16 Aug 2012 10:28:25 +0200 Message-ID: <87txw3kzpy.fsf@rho.meyering.net> Lines: 41 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: 
debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: > Jim Meyering wrote: >> Jim Meyering wrote: >> ... >>> Here's a smaller test case that appears to be host/nproc-independent: >>> It should print two lines: 1, then 7. >>> Without this patch, it prints only "7". >>> >>> (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u >>> >>> Of course, it needs more/better comments, NEWS and >>> tests -- and not just the one above, but also one that >>> demonstrates the need for the key* adjustments below. >> >> FYI, here's the required test: >> >> (yes 7|head -10; echo 1)|sed 's/^/1 /'|sort -k2,2 --p=1 -S32b -u >> >> Without the if (key) { ... } part of my patch, it would fail. >> I had to tweak the number of '7's (s/11/10) in the input to make >> it trigger. > > Hmm... The above is arch-specific. > It triggers the bug on i686, but not on x86_64. Here's an interesting one, this time x86_64-specific: perl -e 'print "0\n"x5000 ."6\n"x6000 ."8\n"x3000 ."4\n"x8000 ."1\n"x2000' \ | sed 's/^/a /'| sort -k2,2 -u --par=1 -S1k It prints a single line: a 1 rather than the required five: a 0 a 1 a 4 a 6 a 8 From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 16 17:11:59 2012 Received: (at 9780) by debbugs.gnu.org; 16 Aug 2012 21:11:59 +0000 Received: from localhost ([127.0.0.1]:34151 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T27ME-0001uW-Uh for submit@debbugs.gnu.org; Thu, 16 Aug 2012 17:11:59 -0400 Received: from mx.meyering.net ([88.168.87.75]:57996) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T27MA-0001uL-O2 for 9780@debbugs.gnu.org; Thu, 16 Aug 2012 17:11:57 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 3B48964045; Thu, 16 Aug 2012 23:03:02 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <87r4r8m4i2.fsf@rho.meyering.net> (Jim Meyering's message of 
"Wed, 15 Aug 2012 19:47:33 +0200") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> Date: Thu, 16 Aug 2012 23:03:02 +0200 Message-ID: <878vdea6t5.fsf@rho.meyering.net> Lines: 191 MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: ... > In case anyone is chomping at the bit, here's a preliminary patch: > > Here's a smaller test case that appears to be host/nproc-independent: > It should print two lines: 1, then 7. > Without this patch, it prints only "7". > > (yes 7|head -11; echo 1)|sort --parallel=3D1 -S32b -u > > Of course, it needs more/better comments, NEWS and > tests -- and not just the one above, but also one that > demonstrates the need for the key* adjustments below. > > This solution doesn't incur much of a performance penalty > because it copies the line only rarely: just before an fread > call that might modify the currently-saved text. > ... > Subject: [PATCH] sort: fix bug with --unique > > --- > src/sort.c | 37 ++++++++++++++++++++++++++++++++++--- > 1 file changed, 34 insertions(+), 3 deletions(-) Here's a complete patch: >From 431102766cbf7c360ee6fa1f157ebcd7d8b9ca0e Mon Sep 17 00:00:00 2001 From: Jim Meyering Date: Wed, 15 Aug 2012 12:30:44 +0200 Subject: [PATCH] sort: sort --unique (-u) could cause data loss sort -u could omit one or more lines of expected output. 
This bug arose because sort recorded the most recently printed line via reference, and if you were unlucky, the storage for that line would be reused (overwritten) as additional input was read into memory. If you were doubly unlucky, the new value of the "saved" line would not only match the very next line, but if that next line were also the first in a series of identical, not-yet-printed lines, then the corrupted "saved" line value would result in the omission of all matching lines. * src/sort.c (saved_line): New static/global, renamed and moved from... (write_unique): ...here. Old name was "saved", which was too generic for its new role as file-scoped global. (fillbuf): With --unique, when we're about to read into a buffer that overlaps the saved "preceding" line (saved_line), copy the line's .text member to a realloc'd-as-needed temporary buffer and adjust the line's key-defining members if they're set. (overlap): New function. * tests/misc/sort: New tests. * NEWS (Bug fixes): Mention it. * THANKS.in: Update. Bug introduced via commit v8.5-89-g9face83. Reported by Rasmus Borup Hansen in http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/23173/focus=24647 --- NEWS | 5 +++++ THANKS.in | 1 + src/sort.c | 44 ++++++++++++++++++++++++++++++++++++++++---- tests/misc/sort | 9 +++++++++ 4 files changed, 55 insertions(+), 4 deletions(-) diff --git a/NEWS b/NEWS index 012a633..f39a76a 100644 --- a/NEWS +++ b/NEWS @@ -9,6 +9,11 @@ GNU coreutils NEWS -*- outline -*- certain options like -a, -l, -t and -x. [This bug was present in "the beginning".] + sort -u could fail to output one or more result lines. + For example, this command would fail to print "1": + (yes 7 | head -11; echo 1) | sort --p=1 -S32b -u + [bug introduced in coreutils-8.6] + ** New features rm now accepts the --dir (-d) option which makes it remove empty directories. 
diff --git a/THANKS.in b/THANKS.in index 5db443b..a736201 100644 --- a/THANKS.in +++ b/THANKS.in @@ -508,6 +508,7 @@ Primoz PETERLIN primozz.peterlin@gmail.com Rainer Orth ro@TechFak.Uni-Bielefeld.DE Ralf W. Stephan stephan@tmt.de Ralph Loader loader@maths.ox.ac.uk +Rasmus Borup Hansen rbh@intomics.com Raul Miller moth@magenta.com Raúl Núñez de Arenas Coronado raul@pleyades.net Richard A Downing richard.downing@bcs.org.uk diff --git a/src/sort.c b/src/sort.c index d362dc5..c2d2d49 100644 --- a/src/sort.c +++ b/src/sort.c @@ -262,6 +262,9 @@ struct merge_node_queue when popping. */ }; +/* Used to implement --unique (-u). */ +static struct line saved_line; + /* FIXME: None of these tables work with multibyte character sets. Also, there are many other bugs when handling multibyte characters. One way to fix this is to rewrite 'sort' to use wide characters @@ -1702,6 +1705,14 @@ limfield (struct line const *line, struct keyfield const *key) return ptr; } +/* Return true if LINE and the buffer BUF of length LEN overlap. */ +static inline bool +overlap (char const *buf, size_t len, struct line const *line) +{ + char const *line_end = line->text + line->length; + return !(line_end <= buf || buf + len <= line->text); +} + /* Fill BUF reading from FP, moving buf->left bytes from the end of buf->buf to the beginning first. If EOF is reached and the file wasn't terminated by a newline, supply one. Set up BUF's line @@ -1742,6 +1753,33 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file) rest of the input file consists entirely of newlines, except that the last byte is not a newline. */ size_t readsize = (avail - 1) / (line_bytes + 1); + + /* With --unique, when we're about to read into a buffer that + overlaps the saved "preceding" line (saved_line), copy the line's + .text member to a realloc'd-as-needed temporary buffer and adjust + the line's key-defining members if they're set. 
*/ + if (unique && overlap (ptr, readsize, &saved_line)) + { + /* Copy saved_line.text into a buffer where it won't be clobbered + and if KEY is non-NULL, adjust saved_line.key* to match. */ + static char *safe_text; + static size_t safe_text_n_alloc; + if (safe_text_n_alloc < saved_line.length) + { + safe_text_n_alloc = saved_line.length; + safe_text = x2nrealloc (safe_text, &safe_text_n_alloc, 1); + } + memcpy (safe_text, saved_line.text, saved_line.length); + if (key) + { + #define s saved_line + s.keybeg = safe_text + (s.keybeg - s.text); + s.keylim = safe_text + (s.keylim - s.text); + #undef s + } + saved_line.text = safe_text; + } + size_t bytes_read = fread (ptr, 1, readsize, fp); char *ptrlim = ptr + bytes_read; char *p; @@ -3348,13 +3386,11 @@ queue_pop (struct merge_node_queue *queue) static void write_unique (struct line const *line, FILE *tfp, char const *temp_output) { - static struct line saved; - if (unique) { - if (saved.text && ! compare (line, &saved)) + if (saved_line.text && ! compare (line, &saved_line)) return; - saved = *line; + saved_line = *line; } write_line (line, tfp, temp_output); diff --git a/tests/misc/sort b/tests/misc/sort index 5d15d75..050d2f8 100755 --- a/tests/misc/sort +++ b/tests/misc/sort @@ -227,6 +227,15 @@ my @Tests = ["15d", '-i -u', {IN=>"\1a\na\n"}, {OUT=>"\1a\n"}], ["15e", '-i -u', {IN=>"a\n\1\1\1\1\1a\1\1\1\1\n"}, {OUT=>"a\n"}], +# This would fail (printing only the 7) for 8.6..8.18. +["unique-1", '--p=1 -S32b -u', {IN=>"7\n"x11 . "1\n"}, {OUT=>"1\n7\n"}], +# Demonstrate that 8.19's key-spec-adjusting code is required. +# These are more finicky in that they are arch-dependent. +["unique-key-i686", '-k2,2 --p=1 -S32b -u', + {IN=>"a 7\n"x10 . "b 1\n"}, {OUT=>"b 1\na 7\n"}], +["unique-key-x86_64", '-k2,2 --p=1 -S1k -u', + {IN=>"a 7\n"x20 . 
"b 1\n"}, {OUT=3D>"b 1\na 7\n"}], + # From Erick Branderhorst -- fixed around 1.19e ["16a", '-f', {IN=3D>"=E9minence\n=FCberhaupt\n's-Gravenhage\na=EBroclub\nAag\naagtappe= ls\n"}, -- 1.7.12.rc2 From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 06:09:24 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 10:09:24 +0000 Received: from localhost ([127.0.0.1]:34808 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2JUZ-00050j-MD for submit@debbugs.gnu.org; Fri, 17 Aug 2012 06:09:24 -0400 Received: from mx.meyering.net ([88.168.87.75]:59958) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2JUV-00050Z-Fv for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 06:09:21 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 2D441601ED; Fri, 17 Aug 2012 12:00:24 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: sort -u data loss deserves new release ASAP [Re: bug#9780: sort -u... In-Reply-To: <878vdea6t5.fsf@rho.meyering.net> (Jim Meyering's message of "Thu, 16 Aug 2012 23:03:02 +0200") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> Date: Fri, 17 Aug 2012 12:00:24 +0200 Message-ID: <87fw7l7s93.fsf_-_@rho.meyering.net> Lines: 53 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: Paul Eggert , 9780@debbugs.gnu.org, Benno Schulenberg , Bruce Dubbs , Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: > Jim Meyering wrote: > ... 
>> In case anyone is chomping at the bit, here's a preliminary patch: >> >> Here's a smaller test case that appears to be host/nproc-independent: >> It should print two lines: 1, then 7. >> Without this patch, it prints only "7". >> >> (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u ... > Here's a complete patch: > >>>From 431102766cbf7c360ee6fa1f157ebcd7d8b9ca0e Mon Sep 17 00:00:00 2001 > From: Jim Meyering > Date: Wed, 15 Aug 2012 12:30:44 +0200 > Subject: [PATCH] sort: sort --unique (-u) could cause data loss > > sort -u could omit one or more lines of expected output. > This bug arose because sort recorded the most recently printed line via > reference, and if you were unlucky, the storage for that line would be > reused (overwritten) as additional input was read into memory. If you > were doubly unlucky, the new value of the "saved" line would not only > match the very next line, but if that next line were also the first in > a series of identical, not-yet-printed lines, then the corrupted "saved" > line value would result in the omission of all matching lines. > > * src/sort.c (saved_line): New static/global, renamed and moved from... > (write_unique): ...here. Old name was "saved", which was too generic > for its new role as file-scoped global. > (fillbuf): With --unique, when we're about to read into a buffer that > overlaps the saved "preceding" line (saved_line), copy the line's .text > member to a realloc'd-as-needed temporary buffer and adjust the line's > key-defining members if they're set. > (overlap): New function. > * tests/misc/sort: New tests. > * NEWS (Bug fixes): Mention it. > * THANKS.in: Update. > Bug introduced via commit v8.5-89-g9face83. > Reported by Rasmus Borup Hansen in > http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/23173/focus=24647 That sort -u can cause data loss is a big deal. I want to make a release with this fix as soon as possible. 
Since I'm making this a mostly-bug-fix release, the du and md5 --tag changes will have to wait for 8.20. However, I'll be happy to apply documentation-correcting changes if someone would post a complete, updated patch or two. If Bruce and Paul find that changing gnulib's parse-datetime test will avoid a failure on LFS, I'll pull in a gnulib update for that. Any other bug-fix-like changes that people can suggest? From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 07:17:01 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 11:17:01 +0000 Received: from localhost ([127.0.0.1]:34895 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2KY0-0006cc-OL for submit@debbugs.gnu.org; Fri, 17 Aug 2012 07:17:00 -0400 Received: from moutng.kundenserver.de ([212.227.126.171]:56924) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2KXy-0006cT-QY for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 07:16:59 -0400 Received: from [192.168.2.108] (p4FF74796.dip.t-dialin.net [79.247.71.150]) by mrelayeu.kundenserver.de (node=mrbap2) with ESMTP (Nemesis) id 0Linfx-1TYdnf18IP-00dW9X; Fri, 17 Aug 2012 13:06:54 +0200 Message-ID: <502E25CD.3080709@bernhard-voelker.de> Date: Fri, 17 Aug 2012 13:06:53 +0200 From: Bernhard Voelker User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:14.0) Gecko/20120713 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u data loss deserves new release ASAP [Re: bug#9780: sort -u... 
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <87fw7l7s93.fsf_-_@rho.meyering.net> In-Reply-To: <87fw7l7s93.fsf_-_@rho.meyering.net> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:xfwiqx9x06uz6iGyryRd8CIUBGpNDsrwKHo4ndekqAY MAlxN1Sd/bpEBW7PNmhWtKR2luaTPOKsYWak4EbdCi/aorW4L3 HMGySM3JcSVPQeTQxD+MWrjOnB+U+jLdrNXrmbJCu0c3fT5ef5 nGrYI8YjnUu1SiqYZ/DAN01oGK2MSEayyCmOadQhJ2thGRQNdE VwYi2FEuLn5cXYN/KI0Ro+nFyy9lCSFt1G7NGusPUj8WNbAcAH igrNIRKn3s3zumPxrise9MuefgXjG73SNZ6X3PTDWFa3r4w7CQ XNujOYvpbgJTFM/jVALY1rmTbsDdF7EOEfi3ZOvLuXWj8R2NKO G7qXjOMfLTfJKDitG1E35Txw7WJbWVmjKd0hF9aDP X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: Benno Schulenberg , 9780@debbugs.gnu.org, Paul Eggert , Bruce Dubbs , Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) On 08/17/2012 12:00 PM, Jim Meyering wrote: > I want to make a release with this fix as soon as possible. > Since I'm making this a mostly-bug-fix release, the du and md5 --tag > changes will have to wait for 8.20. > However, I'll be happy to apply documentation-correcting changes > if someone would post a complete, updated patch or two. > > If Bruce and Paul find that changing gnulib's parse-datetime test > will avoid a failure on LFS, I'll pull in a gnulib update for that. > > Any other bug-fix-like changes that people can suggest? Hi Jim, the first part of Benno's patch is a trivial documentation fix: http://debbugs.gnu.org/12212 [PATCH 1/2] dd: the word BLOCKS no longer occurs in the help text It fixes the man-page of dd. 
I replied that the same is necessary in coreutils.texi, but there's no committable patch yet. Now I see that you already CC'ed Benno ... Have a nice day, Berny
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 15:28:35 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 19:28:35 +0000 Received: from localhost ([127.0.0.1]:35836 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SDi-0002Pf-No for submit@debbugs.gnu.org; Fri, 17 Aug 2012 15:28:35 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:33258) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SDg-0002PY-CU for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 15:28:33 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 8E14039E800A; Fri, 17 Aug 2012 12:19:35 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xm1-bGR3VOC8; Fri, 17 Aug 2012 12:19:35 -0700 (PDT) Received: from [192.168.1.3] (pool-108-23-119-2.lsanca.fios.verizon.net [108.23.119.2]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 1FDEA39E8007; Fri, 17 Aug 2012 12:19:35 -0700 (PDT) Message-ID: <502E9947.1090902@cs.ucla.edu> Date: Fri, 17 Aug 2012 12:19:35 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> In-Reply-To: <878vdea6t5.fsf@rho.meyering.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780
Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-)

On 08/16/2012 02:03 PM, Jim Meyering wrote:
> * src/sort.c (saved_line): New static/global, renamed and moved from...
> (write_unique): ...here.

I see a couple of problems with this patch.  Pedantically,
the behavior of 'overlap' is undefined on hosts that
use a segmented architecture, because '<=' is not reliable
on pointers into different buffers.  (I have the vague recollection
that some compilers even rely on this to generate faster code
on flat architectures....)  More importantly, suppose the
buffer is reallocated (because it grows)?  Won't 'overlap'
do the wrong thing after that?  And it'd be nice if we didn't
have to worry about making a copy of that line.

I'll see if I can come up with something that addresses these
objections.
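
A minimal sketch of the portability point above, not code from sort.c. In ISO C, relational operators such as '<=' are only defined for pointers into the same array object, so an overlap test between two unrelated buffers is formally undefined. One common workaround, assumed here, is to compare the addresses as integers. The helper name demo_ranges_overlap is hypothetical, uintptr_t is an optional type, and the pointer-to-integer mapping is implementation-defined, so this is only meaningful on hosts with a flat address space.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of sort.c: return true if the byte
   ranges [A, A + ALEN) and [B, B + BLEN) share at least one byte.
   The addresses are compared as integers, which sidesteps the
   undefined relational comparison of unrelated pointers but relies
   on an implementation-defined conversion.  */
static bool
demo_ranges_overlap (char const *a, size_t alen, char const *b, size_t blen)
{
  assert (a != NULL && b != NULL);
  uintptr_t a0 = (uintptr_t) a;
  uintptr_t b0 = (uintptr_t) b;
  return a0 < b0 + blen && b0 < a0 + alen;
}
```

The predicate is the same one the removed 'overlap' function expressed with '!(line_end <= buf || buf + len <= line->text)', just written on integers.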
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 15:45:31 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 19:45:32 +0000 Received: from localhost ([127.0.0.1]:35853 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SU7-0002ne-8p for submit@debbugs.gnu.org; Fri, 17 Aug 2012 15:45:31 -0400 Received: from mx.meyering.net ([88.168.87.75]:33276) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SU5-0002nX-Qg for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 15:45:30 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id C4B8E60115; Fri, 17 Aug 2012 21:36:32 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <502E9947.1090902@cs.ucla.edu> (Paul Eggert's message of "Fri, 17 Aug 2012 12:19:35 -0700") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> Date: Fri, 17 Aug 2012 21:36:32 +0200 Message-ID: <87y5ld1fb3.fsf@rho.meyering.net> Lines: 39 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Paul Eggert wrote: > On 08/16/2012 02:03 PM, Jim Meyering wrote: >> * src/sort.c (saved_line): New static/global, renamed and moved from... >> (write_unique): ...here. > > I see a couple of problems with this patch. 
Pedantically,
> the behavior of 'overlap' is undefined on hosts that
> use a segmented architecture, because '<=' is not reliable
> on pointers into different buffers.  (I have the vague recollection
> that some compilers even rely on this to generate faster code
> on flat architectures....)

I pushed the change seconds before your message arrived.
But that's probably best.  If you can change it to do the job
reliably even on fringe systems, that would be welcome.

> More importantly, suppose the
> buffer is reallocated (because it grows)?  Won't 'overlap'
> do the wrong thing after that?

How?
The first time the safe_text buffer is allocated
it will have to be disjoint from the line.text buffer
and from the buffer into which we're about to fread.
Thereafter, regardless of reallocation, overlap should
always be false.

> And it'd be nice if we didn't
> have to worry about making a copy of that line.

It appears that the need to copy a line (overlap) is very rare,
in practice.  If you find a way to avoid it, it seems like it would
have to be small and simple to be worthwhile.

> I'll see if I can come up with something that addresses these
> objections.

Thanks!  And thanks for the review.
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 15:49:37 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 19:49:37 +0000 Received: from localhost ([127.0.0.1]:35859 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SY4-0002tX-7C for submit@debbugs.gnu.org; Fri, 17 Aug 2012 15:49:36 -0400 Received: from mx.meyering.net ([88.168.87.75]:33286) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SY1-0002tQ-GU for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 15:49:34 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id A32DA600BB; Fri, 17 Aug 2012 21:40:36 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u data loss deserves new release ASAP [Re: bug#9780: sort -u... In-Reply-To: <87fw7l7s93.fsf_-_@rho.meyering.net> (Jim Meyering's message of "Fri, 17 Aug 2012 12:00:24 +0200") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <87fw7l7s93.fsf_-_@rho.meyering.net> Date: Fri, 17 Aug 2012 21:40:36 +0200 Message-ID: <87vcgh1f4b.fsf@rho.meyering.net> Lines: 18 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Benno Schulenberg , Bruce Dubbs , Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Jim Meyering wrote: ... > That sort -u can cause data loss is a big deal. > I want to make a release with this fix as soon as possible. 
> Since I'm making this a mostly-bug-fix release, the du and md5 --tag > changes will have to wait for 8.20. > However, I'll be happy to apply documentation-correcting changes > if someone would post a complete, updated patch or two. On second thought, these changes would require translation adjustments. Thus, I will defer these until after 8.19. > If Bruce and Paul find that changing gnulib's parse-datetime test > will avoid a failure on LFS, I'll pull in a gnulib update for that. Paul fixed it, so I'll update from gnulib shortly. > Any other bug-fix-like changes that people can suggest? From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 15:50:39 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 19:50:39 +0000 Received: from localhost ([127.0.0.1]:35863 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SZ5-0002vD-4C for submit@debbugs.gnu.org; Fri, 17 Aug 2012 15:50:39 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:34277) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SZ3-0002v4-K3 for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 15:50:38 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id E109239E800A; Fri, 17 Aug 2012 12:41:40 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id bNn5yIZra+Cc; Fri, 17 Aug 2012 12:41:40 -0700 (PDT) Received: from [192.168.1.3] (pool-108-23-119-2.lsanca.fios.verizon.net [108.23.119.2]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 4973C39E8007; Fri, 17 Aug 2012 12:41:40 -0700 (PDT) Message-ID: <502E9E75.8080209@cs.ucla.edu> Date: Fri, 17 Aug 2012 12:41:41 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u 
throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> In-Reply-To: <87y5ld1fb3.fsf@rho.meyering.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) On 08/17/2012 12:36 PM, Jim Meyering wrote: > The first time the safe_text buffer is allocated > it will have to be disjoint from the line.text buffer > and from the buffer into which we're about to fread. > Thereafter, regardless of reallocation, overlap should > always be false. I haven't thought it through entirely, but I was worried about the case where there is a saved line but no saved_text, the buffer is reallocated, and then we test for overlap. If the reallocated buffer does not overlap the original buffer, the test for overlap will fail even though the saved line needs to be copied into a new saved_text buffer. I'll stare at the code some more.... 
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 16:02:07 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 20:02:07 +0000 Received: from localhost ([127.0.0.1]:35869 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SkA-0003B0-LB for submit@debbugs.gnu.org; Fri, 17 Aug 2012 16:02:07 -0400 Received: from mx.meyering.net ([88.168.87.75]:33329) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2Sk7-0003As-QG for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 16:02:04 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 08AB6605ED; Fri, 17 Aug 2012 21:53:06 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <502E9E75.8080209@cs.ucla.edu> (Paul Eggert's message of "Fri, 17 Aug 2012 12:41:41 -0700") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> <502E9E75.8080209@cs.ucla.edu> Date: Fri, 17 Aug 2012 21:53:06 +0200 Message-ID: <87pq6p1ejh.fsf@rho.meyering.net> Lines: 59 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Paul Eggert wrote: > On 08/17/2012 12:36 PM, Jim Meyering wrote: >> The first time the safe_text buffer is allocated >> it will have to be disjoint from the line.text buffer >> and from the buffer into which we're about to fread. 
>> Thereafter, regardless of reallocation, overlap should
>> always be false.
>
> I haven't thought it through entirely, but I was
> worried about the case where there is a saved line
> but no saved_text, the buffer is reallocated, and

That is precisely what happens when this "(unique && ..." condition
is true for the first time (presuming you mean s/saved_text/safe_text/)

          /* With --unique, when we're about to read into a buffer that
             overlaps the saved "preceding" line (saved_line), copy the line's
             .text member to a realloc'd-as-needed temporary buffer and adjust
             the line's key-defining members if they're set.  */
          if (unique && overlap (ptr, readsize, &saved_line))
            {
              /* Copy saved_line.text into a buffer where it won't be clobbered
                 and if KEY is non-NULL, adjust saved_line.key* to match.  */
              static char *safe_text;
              static size_t safe_text_n_alloc;
              if (safe_text_n_alloc < saved_line.length)
                {
                  safe_text_n_alloc = saved_line.length;
                  safe_text = x2nrealloc (safe_text, &safe_text_n_alloc, 1);
                }
              memcpy (safe_text, saved_line.text, saved_line.length);
              if (key)
                {
                  #define s saved_line
                  s.keybeg = safe_text + (s.keybeg - s.text);
                  s.keylim = safe_text + (s.keylim - s.text);
                  #undef s
                }
              saved_line.text = safe_text;
            }

safe_text is initially NULL and we enter that block
only when we're about to fread into a buffer that overlaps
the current saved_line.text buffer.
In that case, we allocate an initial safe_text buffer,
copy saved_line.text into it, and update saved_line.text
to point to the just-allocated/initialized buffer.

Any test of overlap that compares that just-allocated (or realloc'd)
buffer with the about-to-be-fread-into buffer will return false.

> then we test for overlap.  If the reallocated buffer
> does not overlap the original buffer, the test for
> overlap will fail even though the saved line needs
> to be copied into a new saved_text buffer.
>
> I'll stare at the code some more....
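
The copy-and-rebase step described above can be sketched in isolation. This is only an illustration under stated assumptions: struct demo_line is a cut-down stand-in for sort.c's struct line, demo_detach is a hypothetical name, plain realloc replaces gnulib's x2nrealloc, and the line's text is assumed not to live in the destination buffer already.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Cut-down stand-in for sort.c's struct line (an assumption).  */
struct demo_line
{
  char *text;      /* start of the line's bytes */
  size_t length;   /* number of bytes, including the terminator */
  char *keybeg;    /* start of the key, somewhere inside text */
  char *keylim;    /* end of the key, somewhere inside text */
};

/* Copy L's bytes into *SAFE, growing it as needed, and repoint L and
   its key pointers at the copy.  Return 1 on success, 0 if the
   allocation fails.  L->text must not already point into *SAFE.  */
static int
demo_detach (struct demo_line *l, char **safe, size_t *safe_alloc)
{
  assert (l->length > 0);
  if (*safe_alloc < l->length)
    {
      char *p = realloc (*safe, l->length);
      if (p == NULL)
        return 0;
      *safe = p;
      *safe_alloc = l->length;
    }

  /* Record the key positions as offsets before switching buffers.  */
  ptrdiff_t beg = l->keybeg - l->text;
  ptrdiff_t lim = l->keylim - l->text;

  memcpy (*safe, l->text, l->length);
  l->keybeg = *safe + beg;
  l->keylim = *safe + lim;
  l->text = *safe;
  return 1;
}
```

After the call, the original buffer can be overwritten without disturbing the saved line, which is the property the quoted block is after.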
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 16:10:46 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 20:10:46 +0000 Received: from localhost ([127.0.0.1]:35890 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SsX-0003Mt-WB for submit@debbugs.gnu.org; Fri, 17 Aug 2012 16:10:46 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:35267) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2SsW-0003Mm-DF for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 16:10:44 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 8CCED39E800E; Fri, 17 Aug 2012 13:01:47 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id x4OuozG0rgig; Fri, 17 Aug 2012 13:01:47 -0700 (PDT) Received: from [192.168.1.3] (pool-108-23-119-2.lsanca.fios.verizon.net [108.23.119.2]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 2747239E8008; Fri, 17 Aug 2012 13:01:47 -0700 (PDT) Message-ID: <502EA32C.7010403@cs.ucla.edu> Date: Fri, 17 Aug 2012 13:01:48 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> <502E9E75.8080209@cs.ucla.edu> <87pq6p1ejh.fsf@rho.meyering.net> In-Reply-To: <87pq6p1ejh.fsf@rho.meyering.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, 
Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) On 08/17/2012 12:53 PM, Jim Meyering wrote: > safe_text is initially NULL and we enter that block > only when we're about to fread into a buffer that overlaps > the current saved_line.text buffer. Sorry, I wasn't clear enough. I was worried about the case when saved_line.text does not overlap the buffer we're about to read into, because the buffer we're about to read into has been realloc'ed. The idea is that we saved a line, then realloc'ed the buffer, and now we're doing the overlap test. There won't be an overlap (assuming realloc gave us fresh space), but the saved line points into freed memory. From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 16:40:22 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 20:40:22 +0000 Received: from localhost ([127.0.0.1]:35908 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2TLC-00041d-2T for submit@debbugs.gnu.org; Fri, 17 Aug 2012 16:40:22 -0400 Received: from mx.meyering.net ([88.168.87.75]:33429) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2TL9-00041U-Rb for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 16:40:21 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 0E4C66018D; Fri, 17 Aug 2012 22:31:21 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <502EA32C.7010403@cs.ucla.edu> (Paul Eggert's message of "Fri, 17 Aug 2012 13:01:48 -0700") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> 
<87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> <502E9E75.8080209@cs.ucla.edu> <87pq6p1ejh.fsf@rho.meyering.net> <502EA32C.7010403@cs.ucla.edu> Date: Fri, 17 Aug 2012 22:31:21 +0200 Message-ID: <87ipch1crq.fsf@rho.meyering.net> Lines: 74 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-)

Paul Eggert wrote:
> On 08/17/2012 12:53 PM, Jim Meyering wrote:
>> safe_text is initially NULL and we enter that block
>> only when we're about to fread into a buffer that overlaps
>> the current saved_line.text buffer.
>
> Sorry, I wasn't clear enough.  I was worried about the
> case when saved_line.text does not overlap the buffer
> we're about to read into, because the buffer we're about
> to read into has been realloc'ed.  The idea is that we
> saved a line, then realloc'ed the buffer, and now we're
> doing the overlap test.  There won't be an overlap (assuming
> realloc gave us fresh space), but the saved line points
> into freed memory.

Ohhh.  Good catch.
That is a related, but independent bug.
It also afflicts the code from before today's change.

Here's the part of fillbuf that can realloc "buf->buf",
leaving saved_line.text pointing into freed memory:

      {
        /* The current input line is too long to fit in the buffer.
           Increase the buffer size and try again, keeping it properly
           aligned.  */
        size_t line_alloc = buf->alloc / sizeof (struct line);
        buf->buf = x2nrealloc (buf->buf, &line_alloc, sizeof (struct line));
        buf->alloc = line_alloc * sizeof (struct line);
      }
    }
}

One way to work around that is to update saved_line.text, if needed,
right after that x2nrealloc call.

Understanding that scenario, it was easy to construct a case that
triggers a free memory read:

  $ perl -le 'print "a\n"."0"x900'|valgrind ./sort --p=1 -S32b -u>/dev/null
  ==5263== Memcheck, a memory error detector
  ==5263== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
  ==5263== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
  ==5263== Command: ./sort --p=1 -S32b -u
  ==5263==
  ==5263== Invalid read of size 1
  ==5263==    at 0x4A0AD1C: bcmp (mc_replace_strmem.c:889)
  ==5263==    by 0x408118: compare (sort.c:2736)
  ==5263==    by 0x408425: write_unique (sort.c:3391)
  ==5263==    by 0x40467A: main (sort.c:3959)
  ==5263==  Address 0x4c34270 is 0 bytes inside a block of size 576 free'd
  ==5263==    at 0x4A0892E: realloc (vg_replace_malloc.c:632)
  ==5263==    by 0x410130: xrealloc (xmalloc.c:63)
  ==5263==    by 0x406C04: fillbuf (sort.c:1857)
  ==5263==    by 0x40462C: main (sort.c:3916)
  ==5263==
  ==5263==
  ==5263== HEAP SUMMARY:
  ==5263==     in use at exit: 128 bytes in 1 blocks
  ==5263==   total heap usage: 23 allocs, 22 frees, 530,779 bytes allocated
  ==5263==
  ==5263== LEAK SUMMARY:
  ==5263==    definitely lost: 0 bytes in 0 blocks
  ==5263==    indirectly lost: 0 bytes in 0 blocks
  ==5263==      possibly lost: 0 bytes in 0 blocks
  ==5263==    still reachable: 128 bytes in 1 blocks
  ==5263==         suppressed: 0 bytes in 0 blocks
  ==5263== Rerun with --leak-check=full to see details of leaked memory
  ==5263==
  ==5263== For counts of detected and suppressed errors, rerun with: -v
  ==5263== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 2 from 2)

So we definitely have a *second* bug here.
Thanks!
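
The workaround mentioned above (fixing up a saved pointer right after the buffer is reallocated) might look roughly like the following. This is a sketch under assumptions, not a proposed change to sort.c: struct demo_buf and demo_grow are hypothetical names, plain realloc stands in for x2nrealloc, and the saved pointer is assumed to be either NULL or to point into the buffer being grown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable buffer, standing in for sort.c's struct buffer.  */
struct demo_buf
{
  char *base;     /* start of the allocation */
  size_t alloc;   /* allocated size in bytes */
};

/* Grow B to NEW_ALLOC bytes.  If *SAVED is non-NULL it must point into
   B's current allocation; on success it is rebased so that it refers
   to the same offset in the (possibly moved) allocation.  Return 1 on
   success and 0 on allocation failure, leaving B and *SAVED untouched.  */
static int
demo_grow (struct demo_buf *b, size_t new_alloc, char **saved)
{
  assert (new_alloc >= b->alloc);

  /* Take the offset while the old allocation is still valid.  */
  size_t off = *saved ? (size_t) (*saved - b->base) : 0;

  char *p = realloc (b->base, new_alloc);
  if (p == NULL)
    return 0;

  b->base = p;
  b->alloc = new_alloc;
  if (*saved)
    *saved = p + off;
  return 1;
}
```

The important detail is the ordering: the offset is computed before realloc, because the old pointer value may not be used once the block has been freed or moved.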
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 16:43:56 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 20:43:56 +0000 Received: from localhost ([127.0.0.1]:35914 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2TOd-00046S-Mc for submit@debbugs.gnu.org; Fri, 17 Aug 2012 16:43:56 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:36606) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2TOa-00046J-HR for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 16:43:54 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 532D8A60006; Fri, 17 Aug 2012 13:34:55 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xeevWicG9bK7; Fri, 17 Aug 2012 13:34:54 -0700 (PDT) Received: from [192.168.1.3] (pool-108-23-119-2.lsanca.fios.verizon.net [108.23.119.2]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 7A16139E800A; Fri, 17 Aug 2012 13:34:54 -0700 (PDT) Message-ID: <502EAAEF.4010109@cs.ucla.edu> Date: Fri, 17 Aug 2012 13:34:55 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> In-Reply-To: <87y5ld1fb3.fsf@rho.meyering.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org 
X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-)

OK, I scratched my head for a bit and came up with the following
further patch, which addresses the issues that I mentioned.

>From ac405d343c379096c7ed51b481d5ed08ee18d6e0 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Fri, 17 Aug 2012 13:26:00 -0700
Subject: [PATCH] sort: simpler fix for sort -u data-loss bug

* src/sort.c (overlap): Remove.
(fillbuf): Do not try to copy saved lines, as that is too risky
in the presence of parallelism, reallocated buffers, etc.
(sort): Invalidate any saved line before sorting a new batch.
---
 src/sort.c |   36 +-----------------------------------
 1 files changed, 1 insertions(+), 35 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index c2d2d49..9dbfee1 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1705,14 +1705,6 @@ limfield (struct line const *line, struct keyfield const *key)
   return ptr;
 }
 
-/* Return true if LINE and the buffer BUF of length LEN overlap.  */
-static inline bool
-overlap (char const *buf, size_t len, struct line const *line)
-{
-  char const *line_end = line->text + line->length;
-  return !(line_end <= buf || buf + len <= line->text);
-}
-
 /* Fill BUF reading from FP, moving buf->left bytes from the end
    of buf->buf to the beginning first.  If EOF is reached and the
    file wasn't terminated by a newline, supply one.  Set up BUF's line
@@ -1753,33 +1745,6 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
              rest of the input file consists entirely of newlines,
              except that the last byte is not a newline.  */
           size_t readsize = (avail - 1) / (line_bytes + 1);
-
-          /* With --unique, when we're about to read into a buffer that
-             overlaps the saved "preceding" line (saved_line), copy the line's
-             .text member to a realloc'd-as-needed temporary buffer and adjust
-             the line's key-defining members if they're set.  */
-          if (unique && overlap (ptr, readsize, &saved_line))
-            {
-              /* Copy saved_line.text into a buffer where it won't be clobbered
-                 and if KEY is non-NULL, adjust saved_line.key* to match.  */
-              static char *safe_text;
-              static size_t safe_text_n_alloc;
-              if (safe_text_n_alloc < saved_line.length)
-                {
-                  safe_text_n_alloc = saved_line.length;
-                  safe_text = x2nrealloc (safe_text, &safe_text_n_alloc, 1);
-                }
-              memcpy (safe_text, saved_line.text, saved_line.length);
-              if (key)
-                {
-                  #define s saved_line
-                  s.keybeg = safe_text + (s.keybeg - s.text);
-                  s.keylim = safe_text + (s.keylim - s.text);
-                  #undef s
-                }
-              saved_line.text = safe_text;
-            }
-
           size_t bytes_read = fread (ptr, 1, readsize, fp);
           char *ptrlim = ptr + bytes_read;
           char *p;
@@ -3928,6 +3893,7 @@ sort (char *const *files, size_t nfiles, char const *output_file,
               break;
             }
 
+          saved_line.text = NULL;
           line = buffer_linelim (&buf);
           if (buf.eof && !nfiles && !ntemps && !buf.left)
             {
-- 
1.7.6.5

From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 16:47:35 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 20:47:35 +0000 Received: from localhost ([127.0.0.1]:35919 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2TSA-0004Dd-AG for submit@debbugs.gnu.org; Fri, 17 Aug 2012 16:47:34 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:36797) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2TS6-0004DT-Nk for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 16:47:31 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id D7D0BA60006; Fri, 17 Aug 2012 13:38:33 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id OLydPL1XZ8vB; Fri, 17 Aug 2012 13:38:33 -0700 (PDT) Received: from [192.168.1.3] (pool-108-23-119-2.lsanca.fios.verizon.net [108.23.119.2]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 812D9A60003; Fri, 17 Aug 2012 13:38:33 -0700 (PDT) Message-ID: <502EABCA.4070207@cs.ucla.edu> Date: Fri, 17 Aug 2012 13:38:34 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0 MIME-Version: 1.0 To: Jim Meyering Subject: Re: bug#9780: sort -u throws out non-duplicates References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> <502E9E75.8080209@cs.ucla.edu> <87pq6p1ejh.fsf@rho.meyering.net> <502EA32C.7010403@cs.ucla.edu> <87ipch1crq.fsf@rho.meyering.net> In-Reply-To: <87ipch1crq.fsf@rho.meyering.net> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) On 08/17/2012 01:31 PM, Jim Meyering wrote: > So we definitely have a *second* bug here. Yes, I noticed. It definitely counts as a double-ouch. I'm glad the bug report prompted us to read this code more carefully. My latest patch should fix both bugs. 
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 17 17:18:16 2012 Received: (at 9780) by debbugs.gnu.org; 17 Aug 2012 21:18:16 +0000 Received: from localhost ([127.0.0.1]:35939 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2Tvr-0004t5-RH for submit@debbugs.gnu.org; Fri, 17 Aug 2012 17:18:16 -0400 Received: from mx.meyering.net ([88.168.87.75]:33536) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2Tvo-0004sx-HV for 9780@debbugs.gnu.org; Fri, 17 Aug 2012 17:18:13 -0400 Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id 07AA060167; Fri, 17 Aug 2012 23:09:15 +0200 (CEST) From: Jim Meyering To: Paul Eggert Subject: Re: bug#9780: sort -u throws out non-duplicates In-Reply-To: <502EAAEF.4010109@cs.ucla.edu> (Paul Eggert's message of "Fri, 17 Aug 2012 13:34:55 -0700") References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> <502EAAEF.4010109@cs.ucla.edu> Date: Fri, 17 Aug 2012 23:09:15 +0200 Message-ID: <87d32p1b0k.fsf@rho.meyering.net> Lines: 33 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 9780 Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) Paul Eggert wrote: > OK, I scratched my head for a bit and came up with the following > further patch, which addresses the issues that I mentioned. ... 
> Subject: [PATCH] sort: simpler fix for sort -u data-loss bug
>
> * src/sort.c (overlap): Remove.
> (fillbuf): Do not try to copy saved lines, as that is too risky
> in the presence of parallelism, reallocated buffers, etc.
> (sort): Invalidate any saved line before sorting a new batch.
> ---
>  src/sort.c | 36 +-----------------------------------

Very nice!
That fixes not just the original bug, but also the FMR,
and eliminates my entire patch.

The only cost is in writing at most one more line per buffer.

I hate to look such a nice gift horse in the mouth,
but it's getting late here...

Would you mind adjusting that to add NEWS and mention that you've
fixed the second, free-memory-read bug, too? And even add the test?
If you don't find time, I'll get to that over the weekend.

===============
Regarding your patch...
For the record, at first I thought an input that used one (long)
line per buffer would make --unique a no-op, but then I realized
that in that case, each buffers-worth (one line each) would be written
to its own temporary file, and the merge phase would handle the
--unique semantics.

Thanks again!

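The reasoning in the last paragraph can be tried by hand. What follows is a minimal sketch, not code from src/sort.c: it assumes each single-line buffer ends up as its own sorted temporary file (the names run1..run3 and the sample lines are made up for illustration) and lets `sort -m -u` stand in for the merge phase.

```shell
#!/bin/sh
# Hand-rolled emulation of the two phases; illustrative only.
# The run1..run3 file names and the sample lines are hypothetical.
tmpdir=$(mktemp -d) || exit 1

# Phase 1: every "buffer" holds exactly one line, so -u applied to a
# single buffer has nothing to remove.  run1 and run3 hold the same line.
printf 'pear\n'  | sort -u > "$tmpdir/run1"
printf 'apple\n' | sort -u > "$tmpdir/run2"
printf 'pear\n'  | sort -u > "$tmpdir/run3"

# Phase 2: merging the already-sorted runs with -u is where the
# duplicate that spans two runs gets dropped.
merged=$(sort -m -u "$tmpdir/run1" "$tmpdir/run2" "$tmpdir/run3")
rm -rf "$tmpdir"

printf '%s\n' "$merged"   # prints "apple", then "pear"
```

No single run loses a line, yet the merged output lists each line once, which matches the behavior described above for one-line-per-buffer input.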
From debbugs-submit-bounces@debbugs.gnu.org Sat Aug 18 01:40:59 2012
Received: (at 9780) by debbugs.gnu.org; 18 Aug 2012 05:40:59 +0000
Received: from localhost ([127.0.0.1]:36372 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2bmN-0001xU-0k for submit@debbugs.gnu.org; Sat, 18 Aug 2012 01:40:59 -0400
Received: from mx.meyering.net ([88.168.87.75]:34682) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2bmJ-0001xL-Da for 9780@debbugs.gnu.org; Sat, 18 Aug 2012 01:40:57 -0400
Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id E160760103; Sat, 18 Aug 2012 07:40:53 +0200 (CEST)
From: Jim Meyering
To: Paul Eggert
Subject: Re: bug#9780: sort -u throws out non-duplicates
In-Reply-To: <502EAAEF.4010109@cs.ucla.edu> (Paul Eggert's message of "Fri, 17 Aug 2012 13:34:55 -0700")
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net> <502EAAEF.4010109@cs.ucla.edu>
Date: Sat, 18 Aug 2012 07:40:53 +0200
Message-ID: <87mx1s69lm.fsf@rho.meyering.net>
Lines: 190
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -1.9 (-)
X-Debbugs-Envelope-To: 9780
Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org

Paul Eggert wrote:
> OK, I scratched my head for a bit and came up with the following
> further patch, which addresses the issues that I mentioned.
>
> Subject: [PATCH] sort: simpler fix for sort -u data-loss bug
>
> * src/sort.c (overlap): Remove.
> (fillbuf): Do not try to copy saved lines, as that is too risky
> in the presence of parallelism, reallocated buffers, etc.
> (sort): Invalidate any saved line before sorting a new batch.

Hi Paul,

I've adjusted your commit log to look like this.
Is that ok with you?

  commit eb6427938ffe009ca7d8bcb4fc768bb9bc6bd135
  Author: Paul Eggert
  Date:   Fri Aug 17 13:26:00 2012 -0700

      sort: simpler fix for sort -u data-loss bug, and for a FMR bug

      This also fixes a free-memory-read (FMR) bug: when fillbuf's
      realloc of buf->buf frees the buffer into which saved_line.text
      points, the processing of that just-read longer line includes
      comparison against the saved line in freed memory.

      * src/sort.c (overlap): Remove.
      (fillbuf): Do not try to copy saved lines, as that is too risky
      in the presence of parallelism, reallocated buffers, etc.
      (sort): Invalidate any saved line before sorting a new batch.

I've also written these two commits:

  tests: wrap the valgrind-requiring assertion in a function
  tests: trigger the sort -u free-memory-read bug

-----
 NEWS                             |  5 +++++
 tests/Makefile.am                |  1 +
 tests/init.cfg                   |  6 ++++++
 tests/misc/sort                  |  4 ++++
 tests/misc/sort-stale-thread-mem |  2 +-
 tests/misc/sort-u-FMR            | 29 +++++++++++++++++++++++++++++
 6 files changed, 46 insertions(+), 1 deletion(-)

>From d46873d2eb35f4fa6e735c1e094613fb0ae0dadb Mon Sep 17 00:00:00 2001
From: Jim Meyering
Date: Sat, 18 Aug 2012 07:25:28 +0200
Subject: [PATCH 1/2] tests: wrap the valgrind-requiring assertion in a function

* tests/init.cfg (require_valgrind_): New function...
* tests/misc/sort-stale-thread-mem: ...extracted from here.
---
 tests/init.cfg                   | 6 ++++++
 tests/misc/sort-stale-thread-mem | 2 +-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tests/init.cfg b/tests/init.cfg
index 4ff5ad4..f223f13 100644
--- a/tests/init.cfg
+++ b/tests/init.cfg
@@ -160,6 +160,12 @@ require_strace_()
   fi
 }
 
+# Skip the current test if valgrind doesn't work.
+require_valgrind_()
+{
+  valgrind --help >/dev/null || skip_ "requires valgrind"
+}
+
 require_setfacl_()
 {
   setfacl -m user::rwx . \
diff --git a/tests/misc/sort-stale-thread-mem b/tests/misc/sort-stale-thread-mem
index c19f62e..05cc9ba 100755
--- a/tests/misc/sort-stale-thread-mem
+++ b/tests/misc/sort-stale-thread-mem
@@ -22,8 +22,8 @@ print_ver_ sort
 very_expensive_
+require_valgrind_
-valgrind --help >/dev/null || skip_ "requires valgrind"
 grep '^#define HAVE_PTHREAD_T 1' "$CONFIG_HEADER" > /dev/null ||
   skip_ 'requires pthreads'
--
1.7.12.rc2

>From d33e68da52bd0457acdc861ab2effba4f45a71fc Mon Sep 17 00:00:00 2001
From: Jim Meyering
Date: Sat, 18 Aug 2012 07:26:30 +0200
Subject: [PATCH 2/2] tests: trigger the sort -u free-memory-read bug

* tests/misc/sort-u-FMR: New file.
* tests/Makefile.am (TESTS): Add it.
* tests/misc/sort: Add the test here, too.
* NEWS (Bug fixes): Mention it.
---
 NEWS                  |  5 +++++
 tests/Makefile.am     |  1 +
 tests/misc/sort       |  4 ++++
 tests/misc/sort-u-FMR | 29 +++++++++++++++++++++++++++++
 4 files changed, 39 insertions(+)
 create mode 100755 tests/misc/sort-u-FMR

diff --git a/NEWS b/NEWS
index f39a76a..1737235 100644
--- a/NEWS
+++ b/NEWS
@@ -14,6 +14,11 @@ GNU coreutils NEWS -*- outline -*-
     (yes 7 | head -11; echo 1) | sort --p=1 -S32b -u
   [bug introduced in coreutils-8.6]
 
+  sort -u could read freed memory.
+  For example, this evokes a read from freed memory:
+    perl -le 'print "a\n"."0"x900'|valgrind sort --p=1 -S32b -u>/dev/null
+  [bug introduced in coreutils-8.6]
+
 ** New features
 
   rm now accepts the --dir (-d) option which makes it remove empty directories.
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 09d2658..69078bd 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -260,6 +260,7 @@ TESTS = \
   misc/sort-unique-segv \
   misc/sort-version \
   misc/sort-NaN-infloop \
+  misc/sort-u-FMR \
   split/filter \
   split/suffix-auto-length \
   split/suffix-length \
diff --git a/tests/misc/sort b/tests/misc/sort
index 4e51161..894a59a 100755
--- a/tests/misc/sort
+++ b/tests/misc/sort
@@ -237,6 +237,10 @@ my @Tests =
  {IN=>"a 7\n"x10 . "b 1\n"}, {OUT=>"b 1\na 7\n"}],
 ["unique-key-x86_64", '-u -k2,2 --p=1 -S32b',
  {IN=>"a 7\n"x11 . "b 1\n"}, {OUT=>"b 1\na 7\n"}],
+# Before 8.19, this would trigger a free-memory read.
+["unique-free-mem-read", '-u --p=1 -S32b',
+ {IN=>"a\n"."b\n"x900},
+ {OUT=>"a\n"."b\n"x900}],
 # From Erick Branderhorst -- fixed around 1.19e
 ["16a", '-f',
diff --git a/tests/misc/sort-u-FMR b/tests/misc/sort-u-FMR
new file mode 100755
index 0000000..303b429
--- /dev/null
+++ b/tests/misc/sort-u-FMR
@@ -0,0 +1,29 @@
+#!/bin/sh
+# Before 8.19, this would trigger a free-memory read.
+
+# Copyright (C) 2012 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see .
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+print_ver_ sort
+require_valgrind_
+
+{ echo 0; printf '%0900d\n' 1; } > in || framework_failure_
+
+valgrind --error-exitcode=1 sort --p=1 -S32b -u in > out || fail=1
+
+compare in out || fail=1
+
+Exit $fail
--
1.7.12.rc2

From debbugs-submit-bounces@debbugs.gnu.org Sat Aug 18 01:47:33 2012
Received: (at 9780) by debbugs.gnu.org; 18 Aug 2012 05:47:33 +0000
Received: from localhost ([127.0.0.1]:36386 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2bsj-0002xH-3E for submit@debbugs.gnu.org; Sat, 18 Aug 2012 01:47:33 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:53312) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T2bsh-0002xA-4v for 9780@debbugs.gnu.org; Sat, 18 Aug 2012 01:47:32 -0400
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 3C07939E800A; Fri, 17 Aug 2012 22:47:30 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 1tnU-crucFps; Fri, 17 Aug 2012 22:47:29 -0700 (PDT)
Received: from [192.168.1.3] (pool-108-23-119-2.lsanca.fios.verizon.net [108.23.119.2]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id BCB0B39E8007; Fri, 17 Aug 2012 22:47:29 -0700 (PDT)
Message-ID: <502F2C75.3030905@cs.ucla.edu>
Date: Fri, 17 Aug 2012 22:47:33 -0700
From: Paul Eggert
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:14.0) Gecko/20120714 Thunderbird/14.0
MIME-Version: 1.0
To: Jim Meyering
Subject: Re: bug#9780: sort -u throws out non-duplicates
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu>
 <87y5ld1fb3.fsf@rho.meyering.net> <502EAAEF.4010109@cs.ucla.edu> <87mx1s69lm.fsf@rho.meyering.net>
In-Reply-To: <87mx1s69lm.fsf@rho.meyering.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Spam-Score: -1.9 (-)
X-Debbugs-Envelope-To: 9780
Cc: 9780@debbugs.gnu.org, Rasmus Borup Hansen
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org

On 08/17/2012 10:40 PM, Jim Meyering wrote:
> I've adjusted your commit log to look like this.
> Is that ok with you?

Sure, that all looks good. Thanks for doing that.

From debbugs-submit-bounces@debbugs.gnu.org Mon Aug 20 15:20:51 2012
Received: (at 9780-done) by debbugs.gnu.org; 20 Aug 2012 19:20:51 +0000
Received: from localhost ([127.0.0.1]:40399 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T3XWn-0001WU-PT for submit@debbugs.gnu.org; Mon, 20 Aug 2012 15:20:51 -0400
Received: from mx.meyering.net ([88.168.87.75]:43275) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1T3XWh-0001WJ-6A for 9780-done@debbugs.gnu.org; Mon, 20 Aug 2012 15:20:44 -0400
Received: from rho.meyering.net (rho.meyering.net [127.0.0.1]) by rho.meyering.net (Acme Bit-Twister) with ESMTP id B053560115; Mon, 20 Aug 2012 21:20:23 +0200 (CEST)
From: Jim Meyering
To: Paul Eggert
Subject: Re: bug#9780: sort -u throws out non-duplicates
In-Reply-To: <502EABCA.4070207@cs.ucla.edu> (Paul Eggert's message of "Fri, 17 Aug 2012 13:38:34 -0700")
References: <8ff7de89ff9e93d19ef76b533f4997cc@bero.eu> <1DC2888B-7F28-4AB5-B997-7BBF47170D12@intomics.com> <502A98C5.40302@cs.ucla.edu> <87obmdp4eg.fsf@rho.meyering.net> <87r4r8m4i2.fsf@rho.meyering.net> <878vdea6t5.fsf@rho.meyering.net> <502E9947.1090902@cs.ucla.edu> <87y5ld1fb3.fsf@rho.meyering.net>
 <502E9E75.8080209@cs.ucla.edu> <87pq6p1ejh.fsf@rho.meyering.net> <502EA32C.7010403@cs.ucla.edu> <87ipch1crq.fsf@rho.meyering.net> <502EABCA.4070207@cs.ucla.edu>
Date: Mon, 20 Aug 2012 21:20:23 +0200
Message-ID: <87ipcdl6a0.fsf@rho.meyering.net>
Lines: 12
MIME-Version: 1.0
Content-Type: text/plain
X-Spam-Score: -2.1 (--)
X-Debbugs-Envelope-To: 9780-done
Cc: 9780-done@debbugs.gnu.org, Rasmus Borup Hansen
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.13
Precedence: list
Sender: debbugs-submit-bounces@debbugs.gnu.org
Errors-To: debbugs-submit-bounces@debbugs.gnu.org

Paul Eggert wrote:
> On 08/17/2012 01:31 PM, Jim Meyering wrote:
>> So we definitely have a *second* bug here.
>
> Yes, I noticed. It definitely counts as a double-ouch.
> I'm glad the bug report prompted us to read this code
> more carefully.
>
> My latest patch should fix both bugs.

Thanks again.
Closing this bug, finally.

From unknown Fri Jun 20 19:45:56 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Tue, 18 Sep 2012 11:24:03 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator