From: Ole Tange
Date: Fri, 10 Apr 2020 13:11:31 +0200
Subject: GNU sort could be faster on machines with more cores
To: bug-coreutils@gnu.org
$ sort --version
sort (GNU coreutils) 8.28

I am the author of GNU Parallel, and my fetish is to try to run more stuff in parallel.

I recently sorted a 2.4 GB / 100 M line file:

export LC_ALL=C
time sort --parallel=48 bigfile >/dev/null

This takes 87 seconds on my 48-core machine. Everything is done on a RAM disk.

By adjusting the 48 I find the optimum is --parallel=46: 80 seconds.

But it is possible to do better than that. A big part of the final phase of the sort runs in a single thread. It is likely merging the results of the other threads, but it leaves the other 47 cores idle.

So I thought: can we parallelize that more? The reasoning is: if you have N-1 cores sitting idle, you really do not care about wasting some CPU cycles if you can get the result faster. So I tested.
Let us start by looking at an easier case: we do not have one big file; instead, bigfile is split into 48 files of similar size. In this case we can build a tree of merges (shown indented; it is a single command):

sort -m \
  <(sort -m \
    <(sort -m \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file1) <(sort file2) ) \
          <(sort -m <(sort file3) <(sort file4) ) ) \
        <(sort -m \
          <(sort -m <(sort file5) <(sort file6) ) \
          <(sort -m <(sort file7) <(sort file8) ) ) ) \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file9) <(sort file10) ) \
          <(sort -m <(sort file11) <(sort file12) ) ) \
        <(sort -m \
          <(sort -m <(sort file13) <(sort file14) ) \
          <(sort -m <(sort file15) <(sort file16) ) ) ) ) \
    <(sort -m \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file17) <(sort file18) ) \
          <(sort -m <(sort file19) <(sort file20) ) ) \
        <(sort -m \
          <(sort -m <(sort file21) <(sort file22) ) \
          <(sort -m <(sort file23) <(sort file24) ) ) ) \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file25) <(sort file26) ) \
          <(sort -m <(sort file27) <(sort file28) ) ) \
        <(sort -m \
          <(sort -m <(sort file29) <(sort file30) ) \
          <(sort -m <(sort file31) <(sort file32) ) ) ) ) ) \
  <(sort -m \
    <(sort -m \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file33) <(sort file34) ) \
          <(sort -m <(sort file35) <(sort file36) ) ) \
        <(sort -m \
          <(sort -m <(sort file37) <(sort file38) ) \
          <(sort -m <(sort file39) <(sort file40) ) ) ) \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file41) <(sort file42) ) \
          <(sort -m <(sort file43) <(sort file44) ) ) \
        <(sort -m \
          <(sort -m <(sort file45) <(sort file46) ) \
          <(sort -m <(sort file47) <(sort file48) ) ) ) ) )

So WTF is going on here?

We do an initial sort of each of the 48 files. Then we merge-sort a pair of those results. Then we merge-sort a pair of the merged results. And so on, until we only have a single input. All of this is done in parallel when possible.

This takes 60 seconds. But a lot of the time is spent waiting for pipes.
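Writing that 48-way tree by hand is error-prone, so a small script can generate the nested command for any number of files. Here is a minimal sketch in Python; the function name merge_cmd is mine, and splitting the list in half at each level gives a tree whose exact shape may differ slightly from the hand-written command above, while producing the same sorted output:

```python
def merge_cmd(files):
    # A single file: just sort it.
    if len(files) == 1:
        return "sort %s" % files[0]
    # Otherwise merge-sort the two halves with 'sort -m'
    # over bash process substitutions.
    mid = len(files) // 2
    return "sort -m <(%s) <(%s)" % (merge_cmd(files[:mid]),
                                    merge_cmd(files[mid:]))

print(merge_cmd(["file%d" % i for i in range(1, 49)]))
```

The printed string can then be run with bash -c, since process substitution is a bash feature.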
This can be lowered by adding a buffer (mbuffer -m 30M) after each merge:

sort -m \
  <(sort -m \
    <(sort -m \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file1) <(sort file2) | mbuffer -m 30M ) \
          <(sort -m <(sort file3) <(sort file4) | mbuffer -m 30M ) | mbuffer -m 30M ) \
        <(sort -m \
          <(sort -m <(sort file5) <(sort file6) | mbuffer -m 30M ) \
          <(sort -m <(sort file7) <(sort file8) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file9) <(sort file10) | mbuffer -m 30M ) \
          <(sort -m <(sort file11) <(sort file12) | mbuffer -m 30M ) | mbuffer -m 30M ) \
        <(sort -m \
          <(sort -m <(sort file13) <(sort file14) | mbuffer -m 30M ) \
          <(sort -m <(sort file15) <(sort file16) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) \
    <(sort -m \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file17) <(sort file18) | mbuffer -m 30M ) \
          <(sort -m <(sort file19) <(sort file20) | mbuffer -m 30M ) | mbuffer -m 30M ) \
        <(sort -m \
          <(sort -m <(sort file21) <(sort file22) | mbuffer -m 30M ) \
          <(sort -m <(sort file23) <(sort file24) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file25) <(sort file26) | mbuffer -m 30M ) \
          <(sort -m <(sort file27) <(sort file28) | mbuffer -m 30M ) | mbuffer -m 30M ) \
        <(sort -m \
          <(sort -m <(sort file29) <(sort file30) | mbuffer -m 30M ) \
          <(sort -m <(sort file31) <(sort file32) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) \
  <(sort -m \
    <(sort -m \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file33) <(sort file34) | mbuffer -m 30M ) \
          <(sort -m <(sort file35) <(sort file36) | mbuffer -m 30M ) | mbuffer -m 30M ) \
        <(sort -m \
          <(sort -m <(sort file37) <(sort file38) | mbuffer -m 30M ) \
          <(sort -m <(sort file39) <(sort file40) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) \
      <(sort -m \
        <(sort -m \
          <(sort -m <(sort file41) <(sort file42) | mbuffer -m 30M ) \
          <(sort -m <(sort file43) <(sort file44) | mbuffer -m 30M ) | mbuffer -m 30M ) \
        <(sort -m \
          <(sort -m <(sort file45) <(sort file46) | mbuffer -m 30M ) \
          <(sort -m <(sort file47) <(sort file48) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M )

This takes 24 seconds, including the overhead of starting multiple sorts and mbuffers and of moving everything through pipes.

If this were done in a single process with a single address space, I imagine it would be even faster: instead of moving data around, it should be possible to move only pointers around.

It may also be possible to optimize the merge sort when there are only two inputs:

i1 = read(input1)
i2 = read(input2)
while(input1 and input2) {
    while(input1 and i1 <= i2) {
        print i1
        i1 = read(input1)
    }
    while(input2 and i2 <= i1) {
        print i2
        i2 = read(input2)
    }
}
# then copy the rest of whichever input is not yet exhausted
while(input1) { print i1; i1 = read(input1) }
while(input2) { print i2; i2 = read(input2) }

But I cheated a little: I started with 48 evenly sized files. I think this could easily be emulated by having a single reader thread spread blocks of full lines round robin to the 48 threads that do the initial sorting. Something like:

while(read_block()) {
    find_last_newline();
    write_block_to_process(n % 48);
    n++;
}

If we are reading from a disk, the reading may be slow, so if the 48 initial sorters can start sorting before they have received their full input, we will save some wall-clock time.

/Ole
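The two-input merge sketched above can be made concrete. Below is a minimal runnable version in Python (the name merge2 is mine, and it merges in-memory sequences rather than pipes); the point is the same: compare the heads of the two sorted inputs, emit the smaller one, and drain whatever is left when one input runs dry:

```python
def merge2(a, b):
    """Merge two sorted sequences into one sorted list."""
    it_a, it_b = iter(a), iter(b)
    x, y = next(it_a, None), next(it_b, None)
    out = []
    # While both inputs have data, emit the smaller head.
    while x is not None and y is not None:
        if x <= y:
            out.append(x)
            x = next(it_a, None)
        else:
            out.append(y)
            y = next(it_b, None)
    # One input is exhausted; copy the rest of the other.
    while x is not None:
        out.append(x)
        x = next(it_a, None)
    while y is not None:
        out.append(y)
        y = next(it_b, None)
    return out
```

Using <= on ties keeps the merge stable, matching what sort -m does for equal lines.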