GNU bug report logs - #15077
Bug in Join

Reported by: CDR <venefax <at> gmail.com>

Date: Mon, 12 Aug 2013 16:15:02 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 15077 in the body.
You can then email your comments to 15077 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-coreutils <at> gnu.org:
bug#15077; Package coreutils. (Mon, 12 Aug 2013 16:15:02 GMT)

Acknowledgement sent to CDR <venefax <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 12 Aug 2013 16:15:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: CDR <venefax <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: Bug in Join
Date: Mon, 12 Aug 2013 12:13:35 -0400

Dear Friends

The option -a 1 or -a 2 is not correct.
Suppose you have two comma separated value files with 10 records each,
joined on the first column. The second file has 6 records where there is no
match on the joining column (first).
Let's call the files today.txt and yesterday.txt

If you use this command
"join -t, -1 1 -2 1 -a 2 -o 2.1 today.txt yesterday.txt" it should give
ONLY the non pairing lines from yesterday. Instead, it prints all the lines
from yesterday.

I am copying here the contents of both files.

cat today.txt
2012054455,8624520081,6529,0,201
2012067075,2013106025,6214,0,201
2012087388,8623689800,6214,2,201
2012088887,8623689800,6214,0,201
2012120319,9739789996,392A,0,201
2012121177,9739789996,392A,0,201
2012122869,2013700000,6006,0,201
2012140209,2013700000,6006,0,201
2012143002,2012339982,6529,0,201
2012149116,2012339982,6529,0,201


cat yesterday.txt
2012067075,2019269533,6664,0,201
2012087388,2012320000,6006,0,201
2012088887,8624520081,6529,0,201
2012140209,9733360000,392A,0,201
2012204272,2019269533,6664,0,201
2012226151,2018209998,954F,0,201
2012299682,2018209998,954F,0,201
2012324322,9733360000,392A,0,201
2012334444,2017809469,6664,0,201
2012389608,2012320000,6006,0,201

Information forwarded to bug-coreutils <at> gnu.org:
bug#15077; Package coreutils. (Mon, 12 Aug 2013 18:33:01 GMT)

Message #8 received at 15077 <at> debbugs.gnu.org:

From: CDR <venefax <at> gmail.com>
To: 15077 <at> debbugs.gnu.org
Subject: Clarification
Date: Mon, 12 Aug 2013 14:31:55 -0400

I just found out that the "v" option does what I need. So in my opinion,
the "a" option is useless, for it gives you no new information.
In terms of new functionality, the "-o" option, format, should allow to add
arbitrary data, like ",A", "4", etc., in addition to the list of fields
(2.1 1.1 etc.)

Yours

Federico Alves

Information forwarded to bug-coreutils <at> gnu.org:
bug#15077; Package coreutils. (Tue, 13 Aug 2013 00:02:02 GMT)

Message #11 received at 15077 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: CDR <venefax <at> gmail.com>, 15077 <at> debbugs.gnu.org
Subject: Re: bug#15077: Clarification
Date: Mon, 12 Aug 2013 18:04:34 -0600

Hello Federico Alves,

On 08/12/2013 12:31 PM, CDR wrote:
> I just found out that the "v" option does what I need. So in my opinion,
> the "a" option is useless, for it gives you no new information.

I'm glad to hear you found the combination of options that works for you.
I would humbly disagree that the "-a" option is useless - it simply does something different than what you need.
Especially when combined with output specifier ("-o") - the output of "join" is indeed not what you wanted.
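
For reference, the "-v" behaviour mentioned above can be sketched as follows. This is only a minimal sketch: it recreates the two files from the original report in a scratch directory (the directory and the variable name "unpaired" are assumptions made for illustration) and prints just the keys of yesterday.txt that have no match in today.txt.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The two files from the original report, already sorted on the first field.
cat > today.txt <<'EOF'
2012054455,8624520081,6529,0,201
2012067075,2013106025,6214,0,201
2012087388,8623689800,6214,2,201
2012088887,8623689800,6214,0,201
2012120319,9739789996,392A,0,201
2012121177,9739789996,392A,0,201
2012122869,2013700000,6006,0,201
2012140209,2013700000,6006,0,201
2012143002,2012339982,6529,0,201
2012149116,2012339982,6529,0,201
EOF

cat > yesterday.txt <<'EOF'
2012067075,2019269533,6664,0,201
2012087388,2012320000,6006,0,201
2012088887,8624520081,6529,0,201
2012140209,9733360000,392A,0,201
2012204272,2019269533,6664,0,201
2012226151,2018209998,954F,0,201
2012299682,2018209998,954F,0,201
2012324322,9733360000,392A,0,201
2012334444,2017809469,6664,0,201
2012389608,2012320000,6006,0,201
EOF

# "-v 2" prints only the unpairable lines of the second file and
# suppresses the joined lines; "-a 2" would print both kinds.
unpaired=$(join -t, -1 1 -2 1 -v 2 -o 2.1 today.txt yesterday.txt)
printf '%s\n' "$unpaired"
# Prints the six unmatched keys:
# 2012204272
# 2012226151
# 2012299682
# 2012324322
# 2012334444
# 2012389608
```

Replacing "-v 2" with "-a 2" in the same command prints all ten keys of yesterday.txt, which is the output described in the original report.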

When used without "-o", the "-a 1" and "-a 2" options allow you to see which keys are common to both files and which keys are just in one file.

Example:
The following "join" will show which lines are common (will have 9 fields) and which lines are only in the second file ("-a 2"):
---
$ join -t, -1 1 -2 1 -a 2 today.txt yesterday.txt
2012067075,2013106025,6214,0,201,2019269533,6664,0,201
2012087388,8623689800,6214,2,201,2012320000,6006,0,201
2012088887,8623689800,6214,0,201,8624520081,6529,0,201
2012140209,2013700000,6006,0,201,9733360000,392A,0,201
2012204272,2019269533,6664,0,201
2012226151,2018209998,954F,0,201
2012299682,2018209998,954F,0,201
2012324322,9733360000,392A,0,201
2012334444,2017809469,6664,0,201
2012389608,2012320000,6006,0,201
---

If you don't care about the other fields, and just want to see the keys, using "-o 0,1.1,2.1" will give:
----
$ join -t, -1 1 -2 1 -a 2 -o 0,1.1,2.1 today.txt yesterday.txt
2012067075,2012067075,2012067075
2012087388,2012087388,2012087388
2012088887,2012088887,2012088887
2012140209,2012140209,2012140209
2012204272,,2012204272
2012226151,,2012226151
2012299682,,2012299682
2012324322,,2012324322
2012334444,,2012334444
2012389608,,2012389608
----
This again quickly shows that lines with an empty second field have keys that exist only in the second file.

You can combine "-a 1" and "-a 2" to show all combinations of items in both files:
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o 0,1.1,2.1 today.txt yesterday.txt
2012054455,2012054455,
2012067075,2012067075,2012067075
2012087388,2012087388,2012087388
2012088887,2012088887,2012088887
2012120319,2012120319,
2012121177,2012121177,
2012122869,2012122869,
2012140209,2012140209,2012140209
2012143002,2012143002,
2012149116,2012149116,
2012204272,,2012204272
2012226151,,2012226151
2012299682,,2012299682
2012324322,,2012324322
2012334444,,2012334444
2012389608,,2012389608
---
In this example, all lines have three fields:
First field is the combined key, and is always non-empty.
Second field is non-empty if the key exists in the first file.
Third field is non-empty if the key exists in the second file.
(and thus, if both second and third fields are non empty, the key is common to both files).
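
The three-field layout described above can also be summarised with a few counters. The following is a rough sketch rather than part of the original reply; the counter names and the "summary" variable are made up for illustration, and the two files are recreated from the original report so the snippet runs on its own.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

cat > today.txt <<'EOF'
2012054455,8624520081,6529,0,201
2012067075,2013106025,6214,0,201
2012087388,8623689800,6214,2,201
2012088887,8623689800,6214,0,201
2012120319,9739789996,392A,0,201
2012121177,9739789996,392A,0,201
2012122869,2013700000,6006,0,201
2012140209,2013700000,6006,0,201
2012143002,2012339982,6529,0,201
2012149116,2012339982,6529,0,201
EOF

cat > yesterday.txt <<'EOF'
2012067075,2019269533,6664,0,201
2012087388,2012320000,6006,0,201
2012088887,8624520081,6529,0,201
2012140209,9733360000,392A,0,201
2012204272,2019269533,6664,0,201
2012226151,2018209998,954F,0,201
2012299682,2018209998,954F,0,201
2012324322,9733360000,392A,0,201
2012334444,2017809469,6664,0,201
2012389608,2012320000,6006,0,201
EOF

# Field 2 is non-empty when the key is in the first file,
# field 3 is non-empty when the key is in the second file.
summary=$(join -t, -1 1 -2 1 -a 1 -a 2 -o 0,1.1,2.1 today.txt yesterday.txt |
  awk -F, '
    $2 != "" && $3 != "" { both++ }
    $2 != "" && $3 == "" { only_first++ }
    $2 == "" && $3 != "" { only_second++ }
    END { printf "both=%d only_first=%d only_second=%d\n", both, only_first, only_second }')
echo "$summary"
# Prints: both=4 only_first=6 only_second=6
```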


> In terms of new functionality, the "-o" option, format, should allow to add
> arbitrary data, like ",A", "4", etc., in addition to the list of fields
> (2.1 1.1 etc.)

I would suggest using a different program (perhaps awk or sed), down-stream from the "join" program to add any additional information you need.
Consider combining it with "-o auto" (new in join version 8.10) that will maintain the column ordering of the combined input files, and will allow you to easily add information.

Example with "-a 1 -a 2" AND "-o auto":
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o auto today.txt yesterday.txt
2012054455,8624520081,6529,0,201,,,,
2012067075,2013106025,6214,0,201,2019269533,6664,0,201
2012087388,8623689800,6214,2,201,2012320000,6006,0,201
2012088887,8623689800,6214,0,201,8624520081,6529,0,201
2012120319,9739789996,392A,0,201,,,,
2012121177,9739789996,392A,0,201,,,,
2012122869,2013700000,6006,0,201,,,,
2012140209,2013700000,6006,0,201,9733360000,392A,0,201
2012143002,2012339982,6529,0,201,,,,
2012149116,2012339982,6529,0,201,,,,
2012204272,,,,,2019269533,6664,0,201
2012226151,,,,,2018209998,954F,0,201
2012299682,,,,,2018209998,954F,0,201
2012324322,,,,,9733360000,392A,0,201
2012334444,,,,,2017809469,6664,0,201
2012389608,,,,,2012320000,6006,0,201
---

In this example, all lines have nine fields, and are easy to parse:
1. The common key
2-5 - The four fields from the first file (possibly empty)
6-9 - The four fields from the second file (possibly empty).

Post-processing the output of "join" with AWK is now easy, because the fields are in a fixed order.
For example, adding "AA" as the first field and "44" as the last field:
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o auto today.txt yesterday.txt | awk -F, -v OFS=, '{print "AA", $0, "44"}'
AA,2012054455,8624520081,6529,0,201,,,,,44
AA,2012067075,2013106025,6214,0,201,2019269533,6664,0,201,44
AA,2012087388,8623689800,6214,2,201,2012320000,6006,0,201,44
AA,2012088887,8623689800,6214,0,201,8624520081,6529,0,201,44
AA,2012120319,9739789996,392A,0,201,,,,,44
AA,2012121177,9739789996,392A,0,201,,,,,44
AA,2012122869,2013700000,6006,0,201,,,,,44
AA,2012140209,2013700000,6006,0,201,9733360000,392A,0,201,44
AA,2012143002,2012339982,6529,0,201,,,,,44
AA,2012149116,2012339982,6529,0,201,,,,,44
AA,2012204272,,,,,2019269533,6664,0,201,44
AA,2012226151,,,,,2018209998,954F,0,201,44
AA,2012299682,,,,,2018209998,954F,0,201,44
AA,2012324322,,,,,9733360000,392A,0,201,44
AA,2012334444,,,,,2017809469,6664,0,201,44
AA,2012389608,,,,,2012320000,6006,0,201,44
---

Or something a little more informative:
---
$ join -t, -1 1 -2 1 -a 1 -a 2 -o auto today.txt yesterday.txt |
     awk -F, -v OFS=, '$2=="" && $6!="" { print $1, "Yesterday" }
                       $2!="" && $6=="" { print $1, "Today" }
                       $2!="" && $6!="" { print $1, "Both" }'
2012054455,Today
2012067075,Both
2012087388,Both
2012088887,Both
2012120319,Today
2012121177,Today
2012122869,Today
2012140209,Both
2012143002,Today
2012149116,Today
2012204272,Yesterday
2012226151,Yesterday
2012299682,Yesterday
2012324322,Yesterday
2012334444,Yesterday
2012389608,Yesterday
---


Hope this helps,
 -gordon

Information forwarded to bug-coreutils <at> gnu.org:
bug#15077; Package coreutils. (Tue, 13 Aug 2013 03:00:03 GMT)

Message #14 received at 15077 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: CDR <venefax <at> gmail.com>
Cc: 15077 <at> debbugs.gnu.org, Coreutils <coreutils <at> gnu.org>
Subject: Re: bug#15077: Clarification
Date: Mon, 12 Aug 2013 21:02:32 -0600

(CC'ing the list so that others could comment)

Hello Federico,

On 08/12/2013 06:50 PM, CDR wrote:
> How do I get the very latest version, even beta, of join, sort, etc?

I would not recommend using "beta" or "development" versions of GNU coreutils for production code, just to be on the safe side.
The stable releases are available as source code here:
 http://ftp.gnu.org/gnu/coreutils/
With more details here:
 http://www.gnu.org/software/coreutils/

> One thing that I suggest is to change sort, comm and join to use more
> than one core. I had to use a commercial version of sort because the
> "regular" version takes forever to sort a 15G file. The commercial
> version is called nsort and it uses all the cores in the machines and
> also you may add a flag to give the program a huge memory block. It
> works like ten times faster than the "regular" sort.

Starting with coreutils version 8.6, sort can use multiple cores to improve sorting speed (see the "--parallel" option).
Sort also supports the "--buffer-size" option to explicitly specify how much memory to use.
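
As a rough illustration of these two options, here is a small sketch. The file names, the thread count (4) and the buffer size (64M) are arbitrary assumptions chosen so the snippet runs quickly on a tiny generated file; for a multi-gigabyte input the values would need tuning to the machine.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Small stand-in for a large input: 1000 keys in descending order.
seq 1000 -1 1 | awk '{ printf "%010d,payload\n", $1 }' > big.csv

# --parallel sets the number of sorts run concurrently,
# --buffer-size sets the size of the main memory buffer.
sort --parallel=4 --buffer-size=64M -t, -k1,1 -o big.sorted.csv big.csv

# "sort -c" only checks the order and exits non-zero if the file is unsorted.
sort -c -t, -k1,1 big.sorted.csv && echo "sorted"
```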

I'm not familiar with "nsort" and cannot comment on nsort vs GNU sort's speeds.
I believe that on modern hardware, sorting 15G should take a few minutes at most, not "forever" - but that depends on many factors (e.g. cores, memory, disk, etc.).

"join" operates on sorted input, and as such, requires very little CPU and memory.
I do not think much can be gained from making "join" multi-threaded.
I believe the same applies to "comm".
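
Because both tools rely on sorted input, the usual pattern is to sort once and then run "join" or "comm" on the sorted files. The sketch below uses made-up sample data and file names, and it assumes the C locale is acceptable for the data; the point is only that the sort step and the join/comm step should run under the same locale so they agree on the ordering.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Deliberately unsorted inputs (made-up sample data).
printf '%s\n' 'c,3' 'a,1' 'b,2' > left.csv
printf '%s\n' 'b,20' 'd,40' 'a,10' > right.csv

# Use one locale for every step so sort, join and comm agree on ordering.
export LC_ALL=C

sort -t, -k1,1 left.csv > left.sorted.csv
sort -t, -k1,1 right.csv > right.sorted.csv

# join reads both sorted files in a single pass.
join -t, left.sorted.csv right.sorted.csv > joined.csv
cat joined.csv
# Prints:
# a,1,10
# b,2,20

# comm needs whole lines sorted, so compare just the key column.
cut -d, -f1 left.sorted.csv > left.keys
cut -d, -f1 right.sorted.csv > right.keys
comm -13 left.keys right.keys > only_right.keys
cat only_right.keys
# Prints:
# d
```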

> I am using "comm" a lot for business problem that involves comparing
> daily files that have 550 MM records. I find it extremely slow. Do
> you have any suggestions?
>

Others could perhaps comment on ways to improve performance when using GNU coreutils.

I'd assume it very much depends on the technical details of what you're comparing - perhaps there are ways to improve the workflow.
The first step is usually to isolate the real bottleneck (e.g. CPU, memory, disk speed, algorithm, etc.).

regards,
 -gordon

Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 21:38:02 GMT)

bug closed, send any further explanations to 15077 <at> debbugs.gnu.org and CDR <venefax <at> gmail.com> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Thu, 11 Oct 2018 21:38:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 09 Nov 2018 12:24:09 GMT)

This bug report was last modified 6 years and 284 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.