From unknown Mon Aug 18 17:54:23 2025 X-Loop: help-debbugs@gnu.org Subject: bug#6669: [PATCH] sort: -R no longer disables multithreading Resent-From: Paul Eggert Original-Sender: debbugs-submit-bounces@debbugs.gnu.org Resent-To: owner@debbugs.gnu.org Resent-CC: bug-coreutils@gnu.org Resent-Date: Mon, 19 Jul 2010 18:05:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 6669 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: patch To: 6669@debbugs.gnu.org X-Debbugs-Original-To: Bug Coreutils Received: via spool by submit@debbugs.gnu.org id=B.12795626577215 (code B ref -1); Mon, 19 Jul 2010 18:05:02 +0000 Received: (at submit) by debbugs.gnu.org; 19 Jul 2010 18:04:17 +0000 Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OauhM-0001sK-K2 for submit@debbugs.gnu.org; Mon, 19 Jul 2010 14:04:17 -0400 Received: from mx10.gnu.org ([199.232.76.166]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1OauhK-0001sE-Ft for submit@debbugs.gnu.org; Mon, 19 Jul 2010 14:04:15 -0400 Received: from lists.gnu.org ([199.232.76.165]:59686) by monty-python.gnu.org with esmtps (TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60) (envelope-from ) id 1Oauhg-0001GC-4h for submit@debbugs.gnu.org; Mon, 19 Jul 2010 14:04:36 -0400 Received: from [140.186.70.92] (port=35193 helo=eggs.gnu.org) by lists.gnu.org with esmtp (Exim 4.43) id 1Oauhe-0003Xk-6g for bug-coreutils@gnu.org; Mon, 19 Jul 2010 14:04:35 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,T_RP_MATCHES_RCVD autolearn=unavailable version=3.3.1 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69) (envelope-from ) id 1Oauhc-0003WV-EX for bug-coreutils@gnu.org; Mon, 19 Jul 2010 14:04:33 -0400 Received: from kiwi.cs.ucla.edu ([131.179.128.19]:37560) by eggs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1Oauhc-0003W7-4N for bug-coreutils@gnu.org; Mon, 19 Jul 2010 14:04:32 -0400 Received: from [131.179.64.200] (Penguin.CS.UCLA.EDU [131.179.64.200]) by kiwi.cs.ucla.edu (8.13.8+Sun/8.13.8/UCLACS-6.0) with ESMTP id o6JI4TwX023801 for ; Mon, 19 Jul 2010 11:04:29 -0700 (PDT) Message-ID: <4C4493AD.10506@cs.ucla.edu> Date: Mon, 19 Jul 2010 11:04:29 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.10) Gecko/20100527 Thunderbird/3.0.5 MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit X-detected-operating-system: by eggs.gnu.org: Solaris 10 (beta) X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6, seldom 2.4 (older, 4) X-Spam-Score: -3.8 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -5.1 (-----) While looking into the random-number issues I mentioned last week I discovered that sort's -R flag disables the new multithreading stuff. (Actually, I knew that, but I'd forgotten.) This is pretty easy to fix, albeit at the theoretical price of not being random when there are MD5 collisions. I don't think MD5 collisions are a problem in practice with sort -R, but if they do turn out to be a problem we should simply use a better checksum algorithm. So I took the liberty of installing the following, which simplifies the code. (I don't see a reasonable way to test this sort of patch.) >From ce379d4449f30ff33daa272978869eb517a62467 Mon Sep 17 00:00:00 2001 From: Paul R. Eggert Date: Mon, 19 Jul 2010 10:46:41 -0700 Subject: [PATCH] sort: -R no longer disables multithreading * src/sort.c (random_md5_state): New static var. (random_md5_state_init): New function, to initialize random_md5_state. (random_state, randread_source): Remove. (cmp_hashes): Use random_md5_state rather than random_state. Break ties using memcmp, not by getting more randomness. If MD5 collisions turn into a problem in practice, we should simply use a better checksum. (main): If -R is given, call random_md5_state_init rather than going single-threaded. --- src/sort.c | 80 ++++++++++++++++++++--------------------------------------- 1 files changed, 27 insertions(+), 53 deletions(-) diff --git a/src/sort.c b/src/sort.c index 960df74..afa45a8 100644 --- a/src/sort.c +++ b/src/sort.c @@ -2028,42 +2028,23 @@ getmonth (char const *month, size_t len, char const **ea) return 0; } -/* A source of random data. */ -static struct randread_source *randread_source; +/* A randomly chosen MD5 state, used for random comparison. */ +static struct md5_ctx random_md5_state; -/* Return the Ith randomly-generated state. The caller must invoke - random_state (H) for all H less than I before invoking random_state - (I). */ +/* Initialize the randomly chosen MD5 state. */ -static struct md5_ctx -random_state (size_t i) +static void +random_md5_state_init (char const *random_source) { - /* An array of states resulting from the random data, and counts of - its used and allocated members. */ - static struct md5_ctx *state; - static size_t used; - static size_t allocated; - - struct md5_ctx *s = &state[i]; - - if (used <= i) - { - unsigned char buf[MD5_DIGEST_SIZE]; - - used++; - - if (allocated <= i) - { - state = X2NREALLOC (state, &allocated); - s = &state[i]; - } - - randread (randread_source, buf, sizeof buf); - md5_init_ctx (s); - md5_process_bytes (buf, sizeof buf, s); - } - - return *s; + unsigned char buf[MD5_DIGEST_SIZE]; + struct randread_source *r = randread_new (random_source, sizeof buf); + if (! r) + die (_("open failed"), random_source); + randread (r, buf, sizeof buf); + if (randread_free (r) != 0) + die (_("close failed"), random_source); + md5_init_ctx (&random_md5_state); + md5_process_bytes (buf, sizeof buf, &random_md5_state); } /* Compare the hashes of TEXTA with length LENGTHA to those of TEXTB @@ -2078,19 +2059,19 @@ cmp_hashes (char const *texta, size_t lena, first pair of random hashes agree, check whether the keys are identical and if so report no difference. */ int diff; - size_t i; - for (i = 0; ; i++) + uint32_t dig[2][MD5_DIGEST_SIZE / sizeof (uint32_t)]; + struct md5_ctx s[2]; + s[0] = s[1] = random_md5_state; + md5_process_bytes (texta, lena, &s[0]); md5_finish_ctx (&s[0], dig[0]); + md5_process_bytes (textb, lenb, &s[1]); md5_finish_ctx (&s[1], dig[1]); + diff = memcmp (dig[0], dig[1], sizeof dig[0]); + if (! diff) { - uint32_t dig[2][MD5_DIGEST_SIZE / sizeof (uint32_t)]; - struct md5_ctx s[2]; - s[0] = s[1] = random_state (i); - md5_process_bytes (texta, lena, &s[0]); md5_finish_ctx (&s[0], dig[0]); - md5_process_bytes (textb, lenb, &s[1]); md5_finish_ctx (&s[1], dig[1]); - diff = memcmp (dig[0], dig[1], sizeof dig[0]); - if (diff != 0) - break; - if (i == 0 && lena == lenb && memcmp (texta, textb, lena) == 0) - break; + /* Break ties with memcmp. This is good enough given the + astronomically low probability of an MD5 collision. */ + diff = memcmp (texta, textb, MIN (lena, lenb)); + if (! diff) + diff = lena < lenb ? -1 : lena != lenb; } return diff; @@ -4485,14 +4466,7 @@ main (int argc, char **argv) reverse = gkey.reverse; if (need_random) - { - /* Threading does not lock the randread_source structure, so - downgrade to one thread to avoid race conditions. */ - nthreads = 1; - randread_source = randread_new (random_source, MD5_DIGEST_SIZE); - if (! randread_source) - die (_("open failed"), random_source); - } + random_md5_state_init (random_source); if (temp_dir_count == 0) { -- 1.7.1 From debbugs-submit-bounces@debbugs.gnu.org Tue Apr 19 03:15:45 2011 Received: (at control) by debbugs.gnu.org; 19 Apr 2011 07:15:46 +0000 Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1QC5A1-0000WX-2z for submit@debbugs.gnu.org; Tue, 19 Apr 2011 03:15:45 -0400 Received: from mx.meyering.net ([82.230.74.64]) by debbugs.gnu.org with esmtp (Exim 4.69) (envelope-from ) id 1QC59Z-000079-TJ for control@debbugs.gnu.org; Tue, 19 Apr 2011 03:15:43 -0400 Received: by rho.meyering.net (Acme Bit-Twister, from userid 1000) id 6D2326010E; Tue, 19 Apr 2011 09:15:12 +0200 (CEST) From: Jim Meyering To: control@debbugs.gnu.org Subject: sort: -R no longer disables multithreading Date: Tue, 19 Apr 2011 09:15:12 +0200 Message-ID: <87ipua1zlr.fsf@rho.meyering.net> Lines: 4 MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -5.9 (-----) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.11 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -5.9 (-----) close 6669 thanks Pushed nearly a year ago.