GNU bug report logs -
#15077
Bug in Join
Reported by: CDR <venefax <at> gmail.com>
Date: Mon, 12 Aug 2013 16:15:02 UTC
Severity: normal
Tags: notabug
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
Message #14 received at 15077 <at> debbugs.gnu.org:
(CC'ing the list so that others could comment)
Hello Federico,
On 08/12/2013 06:50 PM, CDR wrote:
> How do I get latest, latest version, even beta, of join, sort, etc?
I would not recommend using "beta" or "development" versions of GNU coreutils for production code, just to be on the safe side.
The stable releases are available as source code here:
http://ftp.gnu.org/gnu/coreutils/
With more details here:
http://www.gnu.org/software/coreutils/
> One thing that I suggest is to change sort, comm and join to use more
> than one core. I had to use a commercial version of sort because the
> "regular" version takes for ever to sort a 15G file. The commercial
> version is called nsort and it uses all the cores in the machines and
> also you may add a flag to give the program a huge memory block. It
> works like ten times faster than the "regular" sort.
Starting with version 8.6, sort can use multiple cores to improve sorting speed (see the "--parallel" option).
sort also supports the "--buffer-size" option to explicitly specify how much memory to use.
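As a rough illustration of the two options mentioned above, here is a minimal sketch. The file names, the thread count (4) and the buffer size (64M) are placeholders chosen for a tiny demo file, not tuned recommendations; a real 15G input would want values matched to the machine's cores and free memory.

```shell
#!/bin/sh
# Sketch only: file names, thread count and buffer size are assumptions.
set -eu

workdir=$(mktemp -d)
input="$workdir/input.txt"
output="$workdir/sorted.txt"

# Small stand-in for the real multi-gigabyte input.
printf '%s\n' pear apple cherry banana apple > "$input"

# --parallel=N   caps the number of sorts run concurrently (sort 8.6 or later).
# --buffer-size  sets the size of the main memory buffer.
# -o             writes the result to a file instead of standard output.
sort --parallel=4 --buffer-size=64M -o "$output" "$input"

cat "$output"
```

Whether larger values actually help depends on the hardware, so it is worth timing a few combinations on a representative sample of the data.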
I'm not familiar with "nsort" and cannot comment on nsort vs GNU sort's speeds.
I believe that on modern hardware, sorting 15G should take a few minutes at most, not "forever" - but that depends on many factors (e.g. cores, memory, disk, etc.).
"join" operates on sorted input, and as such, requires very little CPU and memory.
I do not think much can be gained from making "join" multi-threaded.
I believe the same applies to "comm".
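To make the "sorted input" point concrete, here is a small sketch of comparing two files with comm. The file names and records are made up for the demo. Setting LC_ALL=C is my own assumption rather than something from the report: it makes sort and comm agree on a plain byte ordering, which is usually also the fastest collation.

```shell
#!/bin/sh
# Sketch only: file names and contents are hypothetical stand-ins
# for the large daily files.
set -eu

workdir=$(mktemp -d)
printf '%s\n' r3 r1 r2 > "$workdir/yesterday.txt"
printf '%s\n' r2 r4 r3 > "$workdir/today.txt"

# Sort each file once, using the same locale that comm will use.
LC_ALL=C sort -o "$workdir/yesterday.sorted" "$workdir/yesterday.txt"
LC_ALL=C sort -o "$workdir/today.sorted" "$workdir/today.txt"

# comm reads both sorted files in a single pass.
# -13 keeps lines unique to the second file, -23 lines unique to the
# first file, -12 lines present in both.
added=$(LC_ALL=C comm -13 "$workdir/yesterday.sorted" "$workdir/today.sorted")
removed=$(LC_ALL=C comm -23 "$workdir/yesterday.sorted" "$workdir/today.sorted")
common=$(LC_ALL=C comm -12 "$workdir/yesterday.sorted" "$workdir/today.sorted")

printf 'added:\n%s\n' "$added"
printf 'removed:\n%s\n' "$removed"
printf 'common:\n%s\n' "$common"
```

In a workflow like this the sort step usually dominates the run time, so keeping the sorted copies around between runs avoids re-sorting the same file twice.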
> I am using "comm" a lot for a business problem that involves comparing
> daily files that have 550 MM records. I find it extremely slow. Do
> you have any suggestions?
>
Others could perhaps comment on ways to improve performance when using GNU coreutils.
I'd assume it very much depends on the technical details of what you're comparing - perhaps there are ways to improve the workflow.
The first step is usually to isolate the real bottleneck (e.g. CPU, memory, disk speed, algorithm, etc.).
regards,
-gordon
This bug report was last modified 6 years and 284 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.