GNU bug report logs - #15077
Bug in Join


Package: coreutils

Reported by: CDR <venefax <at> gmail.com>

Date: Mon, 12 Aug 2013 16:15:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.



Message #14 received at 15077 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: CDR <venefax <at> gmail.com>
Cc: 15077 <at> debbugs.gnu.org, Coreutils <coreutils <at> gnu.org>
Subject: Re: bug#15077: Clarification
Date: Mon, 12 Aug 2013 21:02:32 -0600

(CC'ing the list so that others can comment)

Hello Federico,

On 08/12/2013 06:50 PM, CDR wrote:
> How do I get the latest, latest version, even beta, of join, sort, etc.?

I would not recommend using "beta" or "development" versions of GNU coreutils for production code, just to be on the safe side.
The stable releases are available as source code here:
 http://ftp.gnu.org/gnu/coreutils/
With more details here:
 http://www.gnu.org/software/coreutils/

> One thing that I suggest is to change sort, comm and join to use more
> than one core. I had to use a commercial version of sort because the
> "regular" version takes forever to sort a 15G file. The commercial
> version is called nsort and it uses all the cores in the machine and
> also you may add a flag to give the program a huge memory block. It
> works like ten times faster than the "regular" sort.

Starting with coreutils version 8.6, sort can use multiple cores to improve sorting speed (see the "--parallel" option).
Sort also supports the "--buffer-size" option to explicitly specify how much memory to use.
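
As a rough sketch of how those two options might be combined (the file names, the thread count and the buffer size below are made up for illustration and would need tuning for a real 15G input):

```shell
# Hypothetical demo data; a real input would be the large file to sort.
workdir=$(mktemp -d)
printf '%s\n' pear apple cherry banana > "$workdir/input.txt"

# --parallel=N sets the number of sorts run concurrently (coreutils 8.6+).
# --buffer-size=SIZE sets the size of the main-memory buffer; 1M is only
# suitable for this tiny demo, a large input would want far more.
sort --parallel=2 --buffer-size=1M -o "$workdir/sorted.txt" "$workdir/input.txt"

cat "$workdir/sorted.txt"
```

For large inputs it may also be worth pointing "--temporary-directory" at a fast disk with enough free space, since sort spills to temporary files once the buffer is full.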

I'm not familiar with "nsort" and cannot comment on nsort vs. GNU sort's speeds.
I believe that on modern hardware, sorting 15G should take a few minutes at most, not "forever" - but that depends on many factors (e.g. cores, memory, disk).

"join" operates on sorted input, and as such, requires very little CPU and memory.
I do not think much can be gained from making "join" multi-threaded.
I believe the same applies to "comm".
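
To illustrate why (a minimal sketch with invented sample files, not your actual data): once both inputs are sorted in the same collation order, "join" and "comm" each make a single sequential pass over them and keep only the current lines in memory.

```shell
# Use one fixed locale so sort, join and comm agree on the ordering.
export LC_ALL=C

# Hypothetical sample files keyed on the first field.
workdir=$(mktemp -d)
printf '%s\n' '3 carol' '1 alice' '2 bob' > "$workdir/names.txt"
printf '%s\n' '2 paris' '3 rome' '4 oslo' > "$workdir/cities.txt"

# Sort both inputs on the join field first.
sort -k 1b,1 "$workdir/names.txt"  > "$workdir/names.sorted"
sort -k 1b,1 "$workdir/cities.txt" > "$workdir/cities.sorted"

# join pairs up lines whose first field matches in both sorted files.
join "$workdir/names.sorted" "$workdir/cities.sorted" > "$workdir/joined.txt"

# comm compares whole lines; -12 keeps only lines present in both files.
cut -d' ' -f1 "$workdir/names.sorted"  > "$workdir/ids_a.txt"
cut -d' ' -f1 "$workdir/cities.sorted" > "$workdir/ids_b.txt"
comm -12 "$workdir/ids_a.txt" "$workdir/ids_b.txt" > "$workdir/common_ids.txt"

cat "$workdir/joined.txt" "$workdir/common_ids.txt"
```

In a pipeline like this the sorting step typically dominates the run time, which is where "--parallel" and "--buffer-size" would matter.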

> I am using "comm" a lot for a business problem that involves comparing
> daily files that have 550 MM records. I find it extremely slow. Do
> you have any suggestions?
>

Others could perhaps comment on ways to improve performance when using GNU coreutils.

I'd assume it very much depends on the technical details of what you're comparing - perhaps there are ways to improve the workflow.
The first step is usually to isolate the real bottleneck (e.g. CPU, memory, disk speed, algorithm).


regards,
 -gordon





This bug report was last modified 6 years and 284 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.