#16004 - Multicore Core-utils - GNU bug report logs

GNU bug report logs - #16004
Multicore Core-utils

Reported by: CDR <venefax <at> gmail.com>

Date: Fri, 29 Nov 2013 22:20:03 UTC

Severity: wishlist

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

Message #11 received at 16004 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: CDR <venefax <at> gmail.com>
Cc: 16004 <at> debbugs.gnu.org
Subject: Re: bug#16004: Multicore Core-utils
Date: Fri, 29 Nov 2013 23:05:54 +0000

On 11/29/2013 10:18 PM, CDR wrote:
> Dear friends
>
> In case this email is read by Richard M. Stallman and David MacKenzie.
> I need a multi-core version of "comm" and "join". The current version
> only uses one core and it takes hours to process two files, with 4
> columns and 510 million lines. I need to process those files every
> night.
>
> I wonder if any plan exists to jump to multicore. If not, is there a
> volunteer that can do the job, for a reasonable fee? I am one-man
> company but I guess we all need a parallel-processing-capable
> core-utils.

Note comm and join need a sorted file and sort(1) is already
multicore aware. Since sorting needs to implicitly handle all the
input before generating output, it makes sense for sort(1) to handle
that itself. Also the sorting operation itself is relatively expensive
compared to the corresponding I/O involved, which further justifies
the multicore knowledge within sort(1).

So if you're dealing with an already sorted file, it then often
depends on the I/O for that file, which could be a bottleneck.
For example if your data file that "takes hours to process" was on
a mechanical hard disk, then processing with a single thread/process
is probably best, otherwise multiple ones would be just seeking the
disk head and slow things down. The increasing prevalence of SSDs
changes the game here though, so that separate accesses to the same
file could very well be a win.

BTW you haven't said whether you're I/O or CPU bound.
I presume you're CPU bound given you're mentioning multicore,
which is a little surprising given the relatively inexpensive
operations done within comm(1) and join(1). It's worth mentioning
locales here, because if you don't need the relatively expensive
locale matching rules, you can disable those before a run by setting:

  export LC_ALL=C

If that did change things to be I/O bound again then you might
consider putting each file on separate devices, to gain from
parallel I/O operations.
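
As a minimal sketch of the locale suggestion above: the inputs are sorted and joined under the same byte-wise C locale, so join(1) sees the ordering it expects. The file names and sample data are made up for illustration and do not come from the original report.

```shell
# Byte-wise comparison; skips the more expensive locale collation rules.
# sort(1) and join(1) must run under the same locale setting.
export LC_ALL=C

# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf 'b 2\na 1\nc 3\n' > "$workdir/left.txt"
printf 'c z\na x\n'      > "$workdir/right.txt"

# Sorting is the multicore-aware step. GNU sort also accepts
# --parallel=N to set the thread count explicitly.
sort -k1,1 "$workdir/left.txt"  > "$workdir/left.sorted"
sort -k1,1 "$workdir/right.txt" > "$workdir/right.sorted"

# join(1) itself is a single sequential pass over both sorted files.
joined=$(join "$workdir/left.sorted" "$workdir/right.sorted")
printf '%s\n' "$joined"

rm -rf "$workdir"
```

Timing the join step with and without `LC_ALL=C` on a sample of the real data would be one way to see whether the run is CPU bound or I/O bound.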
So if you're still CPU bound, a more general technique to consider
is splitting up the file to be processed by separate _processes_.
Now this is more suited to tools that don't depend on the relative
order of particular lines, which unfortunately comm(1) and join(1) do,
but perhaps there is some way you could split your data into more
files when generating it, which could then be fed to separate
join(1) processes.

thanks,
Pádraig.
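
One possible way to split the data for separate join(1) processes is sketched below. It is not from the original message. It assumes both inputs are already sorted in the C locale and buckets lines by the first character of the join key, so equal keys always land in the same bucket and each bucket stays sorted. The file names, the `split_by_key` helper and the bucketing rule are all hypothetical.

```shell
export LC_ALL=C

# Hypothetical pre-sorted sample inputs.
workdir=$(mktemp -d)
printf 'a1 1\na2 2\nb1 3\nb2 4\n' > "$workdir/left.sorted"
printf 'a2 x\nb1 y\nc1 z\n'       > "$workdir/right.sorted"

# Write each line to <prefix>.<first character of the key>.
# Lines are read in order, so every bucket file remains sorted.
split_by_key() {  # $1 = input file, $2 = output prefix
  awk -v prefix="$2" '{ out = prefix "." substr($1, 1, 1); print > out }' "$1"
}
split_by_key "$workdir/left.sorted"  "$workdir/L"
split_by_key "$workdir/right.sorted" "$workdir/R"

# One background join(1) process per bucket present on both sides.
for left in "$workdir"/L.*; do
  bucket=${left##*.}
  right="$workdir/R.$bucket"
  [ -f "$right" ] || continue
  join "$left" "$right" > "$workdir/out.$bucket" &
done
wait

# Buckets are ordered by first key character, so concatenating them
# in glob order gives the same result as a single join.
joined=$(cat "$workdir"/out.*)
printf '%s\n' "$joined"

rm -rf "$workdir"
```

Whether this helps depends on the storage: on a single mechanical disk the parallel readers may just compete for the disk head, as noted above.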

This bug report was last modified 6 years and 308 days ago.

