Package: coreutils;
View this message in rfc822 format
From: Ole Tange <tange <at> gnu.org> To: 40533 <at> debbugs.gnu.org Subject: bug#40533: GNU sort could be faster on machines with more cores Date: Fri, 10 Apr 2020 13:11:31 +0200
$ sort --version sort (GNU coreutils) 8.28 I am the author of GNU Parallel and my fetish is to try to run more stuff in parallel. I recently sorted a 2.4 GBytes/100 M lines file: export LC_ALL=C time sort --parallel=48 bigfile >/dev/null This takes 87 seconds on my 48 core machine. Everything is done in a RAM disk. By adjusting 48 I find the optimal is --parallel=46: 80 seconds. But it is possible to do better than that. A big part of the final part of the sorting is a single tread. It is likely merging the results of the other threads, but it leaves the other 47 cores idle. So I thought: Can we parallelize that more? The reasoning is: If you have N-1 cores sitting idle, you really do not care if you waste some CPU cycles, if you can get the result faster. So I tested. Let us start by looking at an easier case: We do not have 1 big file, but the bigfile is split into 48 files of similar sizes. In this case we can do: sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file1) <(sort file2) ) <(sort -m <(sort file3) <(sort file4) ) ) <(sort -m <(sort -m <(sort file5) <(sort file6) ) <(sort -m <(sort file7) <(sort file8) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file9) <(sort file10) ) <(sort -m <(sort file11) <(sort file12) ) ) <(sort -m <(sort -m <(sort file13) <(sort file14) ) <(sort -m <(sort file15) <(sort file16) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort file17) <(sort file18) ) <(sort -m <(sort file19) <(sort file20) ) ) <(sort -m <(sort -m <(sort file21) <(sort file22) ) <(sort -m <(sort file23) <(sort file24) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file25) <(sort file26) ) <(sort -m <(sort file27) <(sort file28) ) ) <(sort -m <(sort -m <(sort file29) <(sort file30) ) <(sort -m <(sort file31) <(sort file32) ) ) ) ) ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file33) <(sort file34) ) <(sort -m <(sort file35) <(sort file36) ) ) <(sort -m <(sort -m <(sort file37) <(sort file38) ) <(sort -m <(sort file39) <(sort file40) ) ) ) <(sort -m <(sort -m <(sort -m <(sort file41) <(sort file42) ) <(sort -m <(sort file43) <(sort file44) ) ) <(sort -m <(sort -m <(sort file45) <(sort file46) ) <(sort -m <(sort file47) <(sort file48) ) ) ) ) ) Or indented: sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m <(sort file1) <(sort file2) )\ <(sort -m <(sort file3) <(sort file4) ) )\ <(sort -m \ <(sort -m <(sort file5) <(sort file6) )\ <(sort -m <(sort file7) <(sort file8) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m <(sort file9) <(sort file10) )\ <(sort -m <(sort file11) <(sort file12) ) )\ <(sort -m \ <(sort -m <(sort file13) <(sort file14) )\ <(sort -m <(sort file15) <(sort file16) ) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m <(sort file17) <(sort file18) )\ <(sort -m <(sort file19) <(sort file20) ) )\ <(sort -m <(sort -m <(sort file21) <(sort file22) )\ <(sort -m <(sort file23) <(sort file24) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m <(sort file25) <(sort file26) )\ <(sort -m <(sort file27) <(sort file28) ) )\ <(sort -m \ <(sort -m <(sort file29) <(sort file30) )\ <(sort -m <(sort file31) <(sort file32) ) ) ) ) ) \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m \ <(sort -m <(sort file33) <(sort file34) )\ <(sort -m <(sort file35) <(sort file36) ) )\ <(sort -m <(sort -m <(sort file37) <(sort file38) )\ <(sort -m <(sort file39) <(sort file40) ) ) )\ <(sort -m \ <(sort -m \ <(sort -m <(sort file41) <(sort file42) )\ <(sort -m <(sort file43) <(sort file44) ) )\ <(sort -m \ <(sort -m <(sort file45) <(sort file46) )\ <(sort -m <(sort file47) <(sort file48) ) ) ) ) ) So WTF is going on here? Here we do an initial sort of each of the 48 files. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. Then we do a merge sort of a pair of those. And so on, until we only have a single input. All of this is done in parallel when possible. This takes 60 seconds. But a lot of time is waiting for pipes. This can be lowered by added a buffer after each merge (mbuffer -m 30M): sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file1) <(sort file2) | mbuffer -m 30M ) <(sort -m <(sort file3) <(sort file4) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file5) <(sort file6) | mbuffer -m 30M ) <(sort -m <(sort file7) <(sort file8) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort file9) <(sort file10) | mbuffer -m 30M ) <(sort -m <(sort file11) <(sort file12) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file13) <(sort file14) | mbuffer -m 30M ) <(sort -m <(sort file15) <(sort file16) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort file17) <(sort file18) | mbuffer -m 30M ) <(sort -m <(sort file19) <(sort file20) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file21) <(sort file22) | mbuffer -m 30M ) <(sort -m <(sort file23) <(sort file24) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort file25) <(sort file26) | mbuffer -m 30M ) <(sort -m <(sort file27) <(sort file28) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file29) <(sort file30) | mbuffer -m 30M ) <(sort -m <(sort file31) <(sort file32) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort -m <(sort -m <(sort file33) <(sort file34) | mbuffer -m 30M ) <(sort -m <(sort file35) <(sort file36) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file37) <(sort file38) | mbuffer -m 30M ) <(sort -m <(sort file39) <(sort file40) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort -m <(sort file41) <(sort file42) | mbuffer -m 30M ) <(sort -m <(sort file43) <(sort file44) | mbuffer -m 30M ) | mbuffer -m 30M ) <(sort -m <(sort -m <(sort file45) <(sort file46) | mbuffer -m 30M ) <(sort -m <(sort file47) <(sort file48) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) | mbuffer -m 30M ) This takes 24 seconds. This is including overhead of starting multiple sorts and mbuffers, and moving everything through pipes. If this was in a single process with a single address space, I imagine this would be even faster: Instead of moving data around, it should be possible to move only pointers around. It may also be possible to optimize the merge sort when there are only two inputs: i1 = read(input1) i2 = read(input2) while(input1 and input2) { while(i1 <= i2) { print i1 i1 = read(input1) } while(i2 <= i1) { print i2 i2 = read(input2) } } But I cheated a little: I started with 48 evenly sized files. I think this could easily be emulated by having a single reader thread spread blocks of full lines in a round robin style to the 48 threads that will do the initial sorting. Something like: while(read_block()) { find_last_newline(); write_block_to_process(n%48); n++; } If we are reading from a disk, then the reading may be slow, so if the 48 initial sorters can start sorting without having the full input, then we will save some wall clock time. /Ole
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.