GNU bug report logs -
#13354
sort: File Merge Order is Suboptimal with Many Passes
Previous Next
Reported by: Jason Bucata <jbucata <at> windstream.net>
Date: Fri, 4 Jan 2013 07:46:01 UTC
Severity: normal
Tags: moreinfo
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
Full log
Message #8 received at 13354 <at> debbugs.gnu.org (full text, mbox):
On 01/04/2013 04:07 AM, Jason Bucata wrote:
> Over a year ago, I was working with sort on some large input files. I
> noticed a performance problem that related to the number of temp files being
> used for sort runs. As I recall, I'd also seen the problem prior to that,
> but this was the first time I took an opportunity to diagnose the problem
> in detail.
>
> (As for why I waited so long to finally report it, more below.)
>
>>From what I recall, it tended to want to do 8-way merges on my system. If
> the input file was just big enough, it would wind up with 9 files left over
> toward the end. It would merge the largest 8 files in another pass, and
> then do one final merge to combine the now-huge temp file with the one,
> likely very tiny temp file still remaining.
>
> I found that the extra pass added a lot of extra time to the sort. If I
> bumped up the memory buffer to increase the initial run sizes so that I got
> rid of one temp file, I could shave at least a few minutes off the total run
> time: The little extra memory made a disproportionate difference, and I
> could attribute it directly to the extra merge pass at the end that was
> avoided.
>
> At the time I reasoned that it would make the most sense to merge temp files
> at the very beginning to reduce the number just enough to have a bunch of
> --batch-size merge passes after that until the end. In this case, if the
> extra temp file at the end was 10 MB, then each merge pass would have to
> process 10 MB more. Over 4 merge passes, say, that's 40 MB extra to read.
> But the benefit is saving the final merge pass with 10 MB plus something a
> WHOLE log bigger than 40 MB, if it's big enough to require 4 merge passes to
> combine it all in the first place!
>
> I held off on reporting the bug because I wanted to see what Knuth had to
> say on the subject. Well, tonight I finally went to the library and looked
> through volume 3 to see his advice. :)
>
> In section 5.4.9 he basically echoes my thinking.
>
> In the simple case where all runs were the same length, he recommends a
> queue to track the files to process, with N files pulled from the front of
> the queue and the merged output added to the back. But, he says to add a
> number of "dummy" empty files to the front queue. The net effect is that
> fewer than N files are merged the first time.
>
> Looking at the merge function in sort.c, it appears that it can merge both
> temp files and input files all at the same time. With that, we probably
> want an enhancement to that algorithm, which Knuth called "Huffman's method"
> (referencing something in chapter 2). For files or sort runs of varying
> sizes, Knuth says to use a priority queue, to merge the smallest N files at
> each pass. He also says to add a number of 0-length dummy files at the
> front of the priority queue so that the first merge pass grabs fewer than N
> files.
>
> If I understand his formula correctly, he suggests to add (1-F) % (N-1)
> dummy runs, where N is --batch-size and F is the starting number of files.
>
> To get it ideal, we'd need a priority queue implementation here, maybe a
> heap or something. (Irony of having to maintain a sorted list of files in a
> sort program is duly noted...) It would be an improvement, though, to
> handle merging a smaller batch of files at the beginning, the equivalent of
> adding 0-byte dummy files at the front. That might be quicker to implement,
> unless there's a GPL'd heap library lying around somewhere...
There is a little heap lib already used by sort:
http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/heap.c;hb=HEAD
Would that suffice?
thanks,
Pádraig.
This bug report was last modified 6 years and 215 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.