GNU bug report logs - #6903
join: support numeric keys

Package: coreutils

Reported by: Bernhard Schiffner <bernhard <at> schiffner-limbach.de>

Date: Tue, 24 Aug 2010 19:57:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 6903 AT debbugs.gnu.org.


Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6903; Package coreutils. (Tue, 24 Aug 2010 19:57:01 GMT)

Acknowledgement sent to Bernhard Schiffner <bernhard <at> schiffner-limbach.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 24 Aug 2010 19:57:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Bernhard Schiffner <bernhard <at> schiffner-limbach.de>
To: bug-coreutils <at> gnu.org
Subject: join: improve paralleles to sort?
Date: Tue, 24 Aug 2010 21:39:20 +0200
Hi,

I am having problems working with lists that contain large numbers
(e.g. file sizes greater than 2 GB).
sort -n works, but
join doesn't do what a "newcomer" expects.

Because join uses strtoul() before doing the comparison, this is understandable. 
("unpairable" is the result.)

Do you see a chance to extend join with a -n option for numeric 
comparison, as sort already has?

The source and the man page of join already talk about combining it with 
sort. Why not expand on this? :-)

TIA!

Bernhard




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6903; Package coreutils. (Tue, 24 Aug 2010 21:23:01 GMT)

Message #8 received at 6903 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bernhard Schiffner <bernhard <at> schiffner-limbach.de>
Cc: 6903 <at> debbugs.gnu.org
Subject: Re: bug#6903: join: improve paralleles to sort?
Date: Tue, 24 Aug 2010 14:23:55 -0700
On 08/24/2010 12:39 PM, Bernhard Schiffner wrote:
> Because join uses strtoul() before doing the comparison, this is understandable. 
> ("unpairable" is the result.)

No, join doesn't use strtoul.  It compares the numbers as strings.
So if you use plain "sort" on the numbers, join will work, unless the
numbers are numerically equal but textually different (e.g., 0 versus -0).
You can then sort the output of join with "sort -n", if you wish.
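
The difference between the two orderings can be sketched in a few lines of Python. This is only an illustration, not how sort or join are implemented; the keys are taken from the test case posted in this thread, and Python's string comparison is assumed to match plain "sort" for digit-only keys (as it does in the C locale).

```python
# Illustrative keys taken from the test case in this thread.
keys = ["21468", "2146427", "21460", "214618118"]

# Plain string order (assumed to match plain "sort" for digit-only keys,
# at least in the C locale) versus numeric order (what "sort -n" gives).
lexicographic = sorted(keys)
numeric = sorted(keys, key=int)

print(lexicographic)  # ['21460', '214618118', '2146427', '21468']
print(numeric)        # ['21460', '21468', '2146427', '214618118']

# Numerically equal but textually different keys never match as strings.
print(int("0") == int("-0"))  # True
print("0" == "-0")            # False
```

The two lists disagree from the second key on, which is why input prepared with "sort -n" is not in the order a string-comparing join expects.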

> Do you see a chance to extend join with a -n option for numeric 
> comparison, as sort already has?

That would be a nice thing to add, if someone had the time to do it.
Generally speaking, any comparison that "sort" can do, "join" should
do too (except for random comparison I suppose).

The comparison code between sort and join should be shared, of course.
Can you write that?
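
As a rough picture of what a shared comparison could mean, here is a hypothetical merge join in Python whose key comparison is a parameter. It is a sketch under stated assumptions, not the coreutils code: merge_join, its key parameter, and the (key, rest) row layout are all invented for illustration.

```python
from itertools import groupby


def merge_join(left, right, key=str):
    """Join two lists of (key_field, rest) rows.

    Both inputs must already be sorted in the order implied by key().
    key=str compares keys as strings; key=int is a hypothetical numeric mode.
    """
    # Group consecutive rows that share the same (converted) key.
    lgroups = [(k, list(g)) for k, g in groupby(left, key=lambda r: key(r[0]))]
    rgroups = [(k, list(g)) for k, g in groupby(right, key=lambda r: key(r[0]))]

    out = []
    i = j = 0
    while i < len(lgroups) and j < len(rgroups):
        lk, lrows = lgroups[i]
        rk, rrows = rgroups[j]
        if lk < rk:
            i += 1          # left key is unpairable, skip it
        elif lk > rk:
            j += 1          # right key is unpairable, skip it
        else:
            # Emit every combination of rows that share this key.
            out.extend((l[0], l[1], r[1]) for l in lrows for r in rrows)
            i += 1
            j += 1
    return out


# Numerically sorted sample rows (file names are placeholders).
left = [("21460", "a"), ("2146427", "b"), ("214618118", "c")]
right = [("21460", "x"), ("214618118", "y")]

print(merge_join(left, right, key=int))
print(merge_join(left, right, key=str))
```

With key=int both matching keys are paired; with key=str the same numerically sorted input loses the 214618118 row, because the merge has already moved past it.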




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6903; Package coreutils. (Wed, 25 Aug 2010 06:56:02 GMT)

Message #11 received at 6903 <at> debbugs.gnu.org:

From: Bernhard Schiffner <bernhard <at> schiffner-limbach.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 6903 <at> debbugs.gnu.org
Subject: Re: bug#6903: join: improve paralleles to sort?
Date: Wed, 25 Aug 2010 08:57:21 +0200
On Tuesday, 24 August 2010, 23:23:55, Paul Eggert wrote:
> On 08/24/2010 12:39 PM, Bernhard Schiffner wrote:
> > Because join uses strtoul() before doing the comparison, this is
> > understandable. ("unpairable" is the result.)
> 
> No, join doesn't use strtoul. 
I was wrong. (strtoul() is used for the number of the field to join on.)

> It compares the numbers as strings.
> So if you use plain "sort" on the numbers, join will work, unless the
> numbers are numerically equal but textually different (e.g., 0 versus -0).
Not a problem for me.
> You can then sort the output of join with "sort -n", if you wish.

A small test case is included below.
Run
join a  b
and try to understand why the lines
214618118	/temp/marketing_ms/emails.dat
214618118	/temp/bs/marketing_ms/emails.dat
are not in the result.
Do you see any reason?

Perhaps I am misusing join a little here, but so far I don't 
understand why it should be wrong.
Before I blame someone else, I'll try to dig a little deeper 
myself.

TIA!

Bernhard

File a:
21460	/ElsevierDocuments/EWX0886A/09218181/00220001/99000417/main.raw
21460	/ElsevierDocuments/EWX0889A/00319201/01200001/00001461/main.raw
21464	/apache/xerces/dom/DeferredAttrNSImpl.html
21466	/spam/1206882672_000701c89267_03453ee8_21fcd5a0 <at> jlsvsf
21467	/MINING/MIN0002A/03605442/00230009/98000218/main.raw
21468	/___MRA/___sophos_autoupdate1.dir/1207625107/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1208238697/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1208834890/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209153877/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209404409/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209710971/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209737271/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1214978929/encloa-b.ide
21469	/ElsevierDocuments/EWX0886A/09218181/00370003/02001996/main.raw
21469	/ElsevierDocuments/EWX0890A/00335894/00660002/06000846/main.xml
21469	/ElsevierDocuments/MINING/MIN0001A/01968904/00420007/00000911/main.raw
214602	/ElsevierDocuments/EWX0876A/00370738/01710001/04002477/main.xml
214604	/ElsevierDocuments/EWX0881A/00128252/00700001/04001333/main.xml
214614	/ElsevierDocuments/EWX0887A/02773791/00240020/05000223/main.xml
214666	/ElsevierDocuments/EWX0886A/09218181/00600003/07000240/main.xml
214682	/ElsevierDocuments/EWX0879A/0012821X/02430003/06000367/main.xml
2146369	/marketing/diffferent_Berichtsband_Online_Crossmedia_Kampagnen.pdf
2146427	/LBAtoJM/ROOT/WEB-INF/lib/hibernate-3.2.0.cr3.jar
214618118	/temp/marketing_ms/emails.dat
214618118	/temp/bs/marketing_ms/emails.dat
214618120	/temp/marketing_js/emails.dat

File b:
21460
21468
21469
214618118
215777777





Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6903; Package coreutils. (Wed, 25 Aug 2010 16:22:02 GMT)

Message #14 received at 6903 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bernhard Schiffner <bernhard <at> schiffner-limbach.de>
Cc: 6903 <at> debbugs.gnu.org
Subject: Re: bug#6903: join: improve paralleles to sort?
Date: Wed, 25 Aug 2010 09:22:13 -0700
On 08/24/2010 11:57 PM, Bernhard Schiffner wrote:
> 2146427	/LBAtoJM/ROOT/WEB-INF/lib/hibernate-3.2.0.cr3.jar
> 214618118	/temp/marketing_ms/emails.dat

That won't work, because the two lines are not sorted correctly.
Recall that join uses lexicographic comparison, not numeric.
Its input must be sorted lexicographically.

You can sort its output numerically later, if you prefer numeric
order.
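
A small Python check makes the ordering problem with the two quoted keys visible. It is a sketch only: first_unsorted_pair is a hypothetical helper, and Python's string comparison is assumed to match join's comparison for digit-only keys.

```python
def first_unsorted_pair(keys):
    """Return the first adjacent pair that is out of string order, or None."""
    for prev, cur in zip(keys, keys[1:]):
        if prev > cur:
            return (prev, cur)
    return None


# The two keys quoted above, in the order they appear in file a.
quoted = ["2146427", "214618118"]

# As strings, "2146427" sorts after "214618118" ('4' > '1' at the fifth character).
print(first_unsorted_pair(quoted))          # ('2146427', '214618118')

# After a plain string sort the pair is in the order join expects.
print(first_unsorted_pair(sorted(quoted)))  # None

# Numeric order can be restored afterwards, on the joined output.
print(sorted(sorted(quoted), key=int))      # ['2146427', '214618118']
```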




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#6903; Package coreutils. (Thu, 26 Aug 2010 19:08:01 GMT)

Message #17 received at 6903 <at> debbugs.gnu.org:

From: Bernhard Schiffner <bernhard <at> schiffner-limbach.de>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 6903 <at> debbugs.gnu.org
Subject: Re: bug#6903: join: improve paralleles to sort?
Date: Thu, 26 Aug 2010 21:08:29 +0200
On Wednesday, 25 August 2010, 18:22:13, Paul Eggert wrote:
> On 08/24/2010 11:57 PM, Bernhard Schiffner wrote:
> > 2146427	/LBAtoJM/ROOT/WEB-INF/lib/hibernate-3.2.0.cr3.jar
> > 214618118	/temp/marketing_ms/emails.dat
> 
> That won't work, because the two lines are not sorted correctly.
> Recall that join uses lexicographic comparison, not numeric.
> Its input must be sorted lexicographically.

OK.
I solved my problem using the attached patch.

The patch shows that it is possible to use different sort orders for the key 
(the join field) in join.

I copied some, in fact most, of the code from sort.c verbatim in order to see 
what is needed to compile it successfully in join.c.
I did no testing beyond my special use case mentioned earlier.

It's clear that user-friendly key selection needs a lot more work. The same 
goes for a unified version of join and sort.

Thanks to Paul and Christian Perle for their valuable help so far.

The FSF can make any use of the code here. 
It was theirs already  ;-)


Bernhard


[join_proposal_2.diff (text/x-patch, attachment)]

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 05:04:01 GMT)

Changed bug title to 'join: support numeric keys' from 'join: improve paralleles to sort?'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 05:04:01 GMT)



