GNU bug report logs - #16004: Multicore Core-utils
Reported by: CDR <venefax <at> gmail.com>
Date: Fri, 29 Nov 2013 22:20:03 UTC
Severity: wishlist
Tags: notabug
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
Message #11 received at 16004 <at> debbugs.gnu.org:
On 11/29/2013 10:18 PM, CDR wrote:
> Dear friends
>
> In case this email is read by Richard M. Stallman and David MacKenzie.
> I need a multi-core version of "comm" and "join". The current version
> only uses one core and it takes hours to process two files, with 4
> columns and 510 million lines. I need to process those files every
> night.
>
> I wonder if any plan exists to jump to multicore. If not, is there a
> volunteer that can do the job, for a reasonable fee? I am one-man
> company but I guess we all need a parallel-processing-capable
> core-utils.
Note that comm and join need sorted input, and sort(1)
is already multicore aware. Since sorting implicitly needs
to read all of the input before generating any output,
it makes sense for sort(1) to handle that itself.
Also, the sorting operation itself is relatively expensive
compared to the corresponding I/O involved, which
further justifies the multicore knowledge within sort(1).
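As a rough sketch of letting sort(1) do the parallel part: GNU sort
accepts --parallel=N to cap the number of sort threads and -S to size
its in-memory buffer. The file names, sample data and option values
below are made up for illustration and are not tuned for the files
in this report.

```shell
# Scratch directory with a tiny unsorted sample (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' 'k3 c' 'k1 a' 'k2 b' > "$tmpdir/input.txt"

# --parallel=4 allows up to 4 sort threads; -S 64M sets the main memory buffer.
# Both values are illustrative assumptions, not recommendations.
LC_ALL=C sort --parallel=4 -S 64M -k 1,1 "$tmpdir/input.txt" > "$tmpdir/sorted.txt"

cat "$tmpdir/sorted.txt"
```

For a large input, a bigger -S buffer typically matters at least as
much as the thread count, since it reduces the number of temporary
files that have to be merged.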
So if you're dealing with an already sorted file,
the bottleneck then often depends on the I/O for that file.
For example, if your data file that "takes hours to process"
is on a mechanical hard disk, then processing with a single
thread/process is probably best, since multiple ones would
just keep seeking the disk head and slow things down.
The increasing prevalence of SSDs changes the game here
though, so that separate accesses to the same file
could very well be a win.
BTW you haven't said whether you're I/O or CPU bound.
I presume you're CPU bound given you're mentioning multicore,
which is a little surprising given the relatively inexpensive
operations done within comm(1) and join(1).
It's worth mentioning locales here, because if you don't
need the relatively expensive locale matching rules,
you can disable those before a run by setting:
    export LC_ALL=C
If that changes things so that you're I/O bound again, then
you might consider putting each file on a separate device,
to gain from parallel I/O operations.
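A minimal sketch of a C-locale run follows. One caveat worth keeping
in mind: the inputs have to be sorted under the same locale that
join(1) runs in, so the sort step is repeated here after the export.
The file names and contents are hypothetical.

```shell
# Scratch directory with two small unsorted samples (hypothetical data).
tmpdir=$(mktemp -d)
printf '%s\n' 'b 2' 'a 1' 'c 3' > "$tmpdir/left.txt"
printf '%s\n' 'c z' 'a x' 'd w' > "$tmpdir/right.txt"

# Use plain byte-wise comparison for everything that follows.
export LC_ALL=C

# Sort both inputs on the join field under the same locale join will use.
sort -k 1b,1 "$tmpdir/left.txt"  > "$tmpdir/left.sorted"
sort -k 1b,1 "$tmpdir/right.txt" > "$tmpdir/right.sorted"

# Join on the first field of each file (the default).
join "$tmpdir/left.sorted" "$tmpdir/right.sorted" > "$tmpdir/joined.txt"

cat "$tmpdir/joined.txt"
```

If the existing files were sorted under a different locale, join may
complain that its input is not in sorted order, in which case they
need re-sorting as above.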
If you're still CPU bound, a more general technique to consider
is splitting up the file to be processed by separate _processes_.
Now this is more suited to tools that don't depend on
the relative order of particular lines, which unfortunately
comm(1) and join(1) do, but perhaps there is some way you
could split your data into more files when generating it,
which could then be fed to separate join(1) processes.
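One possible way to do such a split, sketched under assumptions: both
files are already sorted in the C locale, the join key is the first
field, and keys start with a letter or digit so that the first byte
is usable as a file name suffix. Partitioning by that first byte
keeps each bucket sorted, and matching keys always land in the same
bucket on both sides. All file names and data are hypothetical.

```shell
export LC_ALL=C
tmpdir=$(mktemp -d)

# Two small inputs, already sorted on the first field (hypothetical data).
printf '%s\n' 'apple 1' 'avocado 2' 'banana 3' 'cherry 4' > "$tmpdir/left.sorted"
printf '%s\n' 'apple x' 'banana y' 'cherry z' 'date w'    > "$tmpdir/right.sorted"

# Partition each file by the first byte of its key. Lines keep their
# relative order, so every bucket is still sorted.
for side in left right; do
  awk -v prefix="$tmpdir/$side.part." \
      '{ print > (prefix substr($1, 1, 1)) }' "$tmpdir/$side.sorted"
done

# Start one join(1) process per bucket in the background.
for left_part in "$tmpdir"/left.part.*; do
  bucket=${left_part##*.}
  right_part="$tmpdir/right.part.$bucket"
  [ -f "$right_part" ] || continue   # no keys on the right side for this bucket
  join "$left_part" "$right_part" > "$tmpdir/joined.part.$bucket" &
done
wait   # block until every background join has finished

# Glob expansion is byte-ordered under LC_ALL=C, so concatenating the
# buckets reproduces the order of a single join over the whole files.
cat "$tmpdir"/joined.part.* > "$tmpdir/joined.txt"
cat "$tmpdir/joined.txt"
```

Whether this helps depends on how evenly the keys spread across
buckets and on whether the extra partitioning pass costs less than
the time saved, so it is worth measuring on real data first.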
thanks,
Pádraig.
This bug report was last modified 6 years and 308 days ago.