GNU bug report logs - #16004
Multicore Core-utils

Package: coreutils

Reported by: CDR <venefax <at> gmail.com>

Date: Fri, 29 Nov 2013 22:20:03 UTC

Severity: wishlist

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 16004 in the body.
You can then email your comments to 16004 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#16004; Package coreutils. (Fri, 29 Nov 2013 22:20:03 GMT)

Acknowledgement sent to CDR <venefax <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Fri, 29 Nov 2013 22:20:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: CDR <venefax <at> gmail.com>
To: coreutils <at> gnu.org, bug-coreutils <at> gnu.org
Subject: Multicore Core-utils
Date: Fri, 29 Nov 2013 17:18:59 -0500

Dear friends

In case this email is read by Richard M. Stallman and David MacKenzie.
I need a multi-core version of "comm" and "join". The current version
only uses one core and it takes hours to process two files, with 4
columns and 510 million lines. I need to process those files every
night.

I wonder if any  plan exists to jump to multicore. If not, is there a
volunteer that can do the job, for a reasonable fee? I am one-man
company but I guess we all need a parallel-processing-capable
core-utils.

Yours

Philip Orleans




Information forwarded to bug-coreutils <at> gnu.org:
bug#16004; Package coreutils. (Fri, 29 Nov 2013 22:30:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-coreutils <at> gnu.org
Subject: Re: bug#16004: Multicore Core-utils
Date: Fri, 29 Nov 2013 14:28:27 -0800

CDR wrote:
> I wonder if any  plan exists to jump to multicore.

There's no specific plan, and it'd be nice to
have coreutils run faster on multicore machines.




Information forwarded to bug-coreutils <at> gnu.org:
bug#16004; Package coreutils. (Fri, 29 Nov 2013 23:07:01 GMT)

Message #11 received at 16004 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: CDR <venefax <at> gmail.com>
Cc: 16004 <at> debbugs.gnu.org
Subject: Re: bug#16004: Multicore Core-utils
Date: Fri, 29 Nov 2013 23:05:54 +0000

On 11/29/2013 10:18 PM, CDR wrote:
> Dear friends
> 
> In case this email is read by Richard M. Stallman and David MacKenzie.
> I need a multi-core version of "comm" and "join". The current version
> only uses one core and it takes hours to process two files, with 4
> columns and 510 million lines. I need to process those files every
> night.
> 
> I wonder if any  plan exists to jump to multicore. If not, is there a
> volunteer that can do the job, for a reasonable fee? I am one-man
> company but I guess we all need a parallel-processing-capable
> core-utils.

Note that comm and join need sorted input, and sort(1)
is already multicore aware.  Since sorting implicitly needs
to handle all the input before generating output,
it makes sense for sort(1) to handle that itself.
Also the sorting operation itself is relatively expensive
compared to the corresponding I/O involved, which
further justifies the multicore knowledge within sort(1).
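
As a rough illustration of the point about sort(1), the sketch below
sorts a tiny sample file with GNU sort's --parallel and -S options.
The file names, the thread count of 2 and the 64M buffer size are
assumptions made for the example, not values taken from this report.

```shell
# Hypothetical scratch directory and sample data for illustration only.
workdir=$(mktemp -d)
printf '%s\n' 'c 3' 'a 1' 'b 2' > "$workdir/unsorted.txt"

# GNU sort: --parallel sets how many sorts may run concurrently,
# -S sets the main memory buffer size, -k1,1 sorts on the first field.
# LC_ALL=C selects plain byte ordering.
LC_ALL=C sort --parallel=2 -S 64M -k1,1 \
    -o "$workdir/sorted.txt" "$workdir/unsorted.txt"

cat "$workdir/sorted.txt"
```

On real data a larger buffer and a thread count near the number of
available cores (see nproc(1)) would be the values to experiment with.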

So if you're dealing with an already sorted file,
the bottleneck then often depends on the I/O for that file.
For example if your data file that "takes hours to process"
was on a mechanical hard disk, then processing with a single
thread/process is probably best, otherwise multiple ones
would just keep seeking the disk head and slow things down.
The increasing prevalence of SSDs changes the game here though,
so that separate accesses to the same file could very well
be a win.

BTW you haven't said whether you're I/O or CPU bound.
I presume you're CPU bound given you're mentioning multicore,
which is a little surprising given the relatively inexpensive
operations done within comm(1) and join(1).
It's worth mentioning locales here, because if you don't
need the relatively expensive locale matching rules,
you can disable those before a run by setting:
  export LC_ALL=C
If that makes the run I/O bound again, then
you might consider putting each file on a separate device,
to gain from parallel I/O operations.
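
A minimal sketch of the locale suggestion follows, assuming two small
whitespace-separated files keyed on their first column. The file names
and contents are hypothetical. Note that the inputs have to be sorted
in the same locale that join(1) runs in.

```shell
# Hypothetical scratch directory and sample data for illustration only.
workdir=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'c 3' > "$workdir/left.txt"
printf '%s\n' 'a x' 'c z'       > "$workdir/right.txt"

# Setting LC_ALL=C for this one command makes join compare keys as
# plain bytes instead of applying locale collation rules.
joined=$(LC_ALL=C join "$workdir/left.txt" "$workdir/right.txt")
printf '%s\n' "$joined"
```

Prefixing the same command with the shell's time keyword and comparing
"real" against "user" plus "sys" gives a rough hint as to whether a run
is CPU bound or waiting on I/O.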

So if you're still CPU bound, a more general technique to consider
is splitting up the file to be processed by separate _processes_.
Now this is more suited to tools that don't depend on
the relative order of particular lines, which unfortunately
comm(1) and join(1) do, but perhaps there is some way you
could split your data into more files when generating it,
which could then be fed to separate join(1) processes.
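
One possible way to do such a split is sketched below, under several
assumptions: the join key is the first whitespace-separated column,
both inputs are already sorted in the C locale, and two buckets are
enough. The file names, the bucket count and the small hash in awk are
all hypothetical. Each key always lands in the same bucket for both
files, and lines keep their relative order, so every bucket stays
sorted.

```shell
# Hypothetical scratch directory and sample data for illustration only.
workdir=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'c 3' 'd 4' > "$workdir/left.txt"
printf '%s\n' 'a w' 'c y' 'd z'       > "$workdir/right.txt"
buckets=2

# partition FILE PREFIX: write each line of FILE to PREFIX.<bucket>,
# where the bucket is a simple hash of the first field modulo $buckets.
partition() {
    i=0
    while [ "$i" -lt "$buckets" ]; do
        : > "$2.$i"            # make sure every bucket file exists
        i=$((i + 1))
    done
    LC_ALL=C awk -v n="$buckets" -v prefix="$2" '
        BEGIN { chars = "abcdefghijklmnopqrstuvwxyz0123456789" }
        {
            h = 0
            for (i = 1; i <= length($1); i++)
                h = (h * 31 + index(chars, substr($1, i, 1))) % n
            print > (prefix "." h)
        }' "$1"
}

partition "$workdir/left.txt"  "$workdir/left"
partition "$workdir/right.txt" "$workdir/right"

# Run one join(1) process per bucket in the background, then wait.
i=0
while [ "$i" -lt "$buckets" ]; do
    LC_ALL=C join "$workdir/left.$i" "$workdir/right.$i" \
        > "$workdir/out.$i" &
    i=$((i + 1))
done
wait

# Each per-bucket output is sorted by key, so a merge restores the
# overall order without a full re-sort.
LC_ALL=C sort -m -k1,1 "$workdir"/out.* > "$workdir/joined.txt"
cat "$workdir/joined.txt"
```

Whether this helps in practice depends on the partitioning pass itself
not becoming the bottleneck, which is why splitting the data when it is
generated, as suggested above, is likely the cheaper option.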

thanks,
Pádraig.






Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:16:02 GMT)

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:16:03 GMT)

Bug closed; send any further explanations to 16004 <at> debbugs.gnu.org and CDR <venefax <at> gmail.com>. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 22:16:03 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 09 Nov 2018 12:24:05 GMT)

This bug report was last modified 6 years and 308 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.