GNU bug report logs - #6366
join can't join on numeric fields


Package: coreutils

Reported by: Alex Shinn <alexshinn <at> gmail.com>

Date: Mon, 7 Jun 2010 05:24:02 UTC

Severity: wishlist

Tags: patch

Merged with 10924, 12264



Message #18 received at 6366 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Alex Shinn <alexshinn <at> gmail.com>
Cc: Pádraig Brady <P <at> draigbrady.com>, 6366 <at> debbugs.gnu.org
Subject: Re: bug#6366: join can't join on numeric fields
Date: Wed, 09 Jun 2010 08:56:07 +0200

Alex Shinn wrote:

> 2010/6/8 Pádraig Brady <P <at> draigbrady.com>:
>> On 07/06/10 06:19, Alex Shinn wrote:
>>>
>>> Ideally join should be able to handle files sorted in any order
>>> that sort provides, but as a bare minimum it should at least
>>> be able to join files sorted on numeric fields.
>>
>> Well if there were no aliases in the numbers, you could always
>> sort the output numerically after the join if it was important.
>
> By first sorting lexicographically, you mean?
> In the use case I had, the data was already sorted
> numerically.  So whenever I want to join two files,
> currently I have to do:
>
>   sort file1 > file1.tmp
>   sort file2 > file2.tmp
>   join file1.tmp file2.tmp | sort -n > out
>   rm -f file1.tmp file2.tmp
>
> instead of just
>
>   join -n file1 file2 > out
>
> In the small tools philosophy you want to avoid adding
> redundancy, but in this case join isn't doing the same
> thing as sort, it's just working with it better.  Not to mention
> the fact that sort is an expensive operation to have to
> perform multiple times, not just an extra O(n) filter
> to throw in the middle of a pipeline.
>
>> However if you wanted to join "01" and "1" then your patch is required.
>> Are numeric aliases common enough to warrant this? I think so.
>
> Leading zeros may not be so common, but don't forget
> "1.0" and "1" or "1e2" and "100" and "100.0", etc.
>
>> I'd use -g, --general-numeric to correspond with `sort`.
>
> Yes, that's probably better.
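
The workaround quoted above can be extended to cover numeric aliases by
normalizing the join key before sorting. This is a minimal sketch, not part
of the proposed patch: the function names `normalize_key` and `numeric_join`
are hypothetical, and it assumes GNU sort (for -g), an awk that stores
numbers as doubles, a numeric key in field 1 of every line, and that
collapsing runs of whitespace to single spaces is acceptable.

```shell
#!/bin/sh
# normalize_key FILE (hypothetical helper): rewrite field 1 as a canonical
# decimal string, so that "01", "1" and "1.0" all become "1", then sort
# lexicographically in the C locale, as join expects.
normalize_key() {
  awk '{ $1 = sprintf("%.17g", $1 + 0); print }' "$1" | LC_ALL=C sort -k1,1
}

# numeric_join FILE1 FILE2 (hypothetical helper): join on the normalized
# key, then restore numeric order with a final sort.
numeric_join() {
  tmp1=$(mktemp) && tmp2=$(mktemp) || return 1
  normalize_key "$1" > "$tmp1"
  normalize_key "$2" > "$tmp2"
  LC_ALL=C join "$tmp1" "$tmp2" | LC_ALL=C sort -g -k1,1
  rm -f "$tmp1" "$tmp2"
}
```

This still costs three sorts, which is the expense the original report
objects to. It only shows that the aliasing problem can be handled outside
join.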

There may be a fly in the ointment.

When comparing floating point numbers how would join measure equality?
Should it consider 1.000000000000001e2 to be equal to 100.0 ?
What if the maximum precision available does not
allow us to distinguish those two values?

What about -0 and 0? (with IEEE 754, they'll compare equal)
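
These questions can be explored from the shell. The sketch below assumes an
awk that converts numeric strings to IEEE 754 doubles (the usual case). The
helper name `num_eq` is hypothetical; it is not a coreutils interface and
says nothing about how a patched join would actually compare keys.

```shell
#!/bin/sh
# num_eq A B (hypothetical helper): print 1 if the two strings are equal
# after conversion to awk's numeric type (normally a double), else 0.
num_eq() {
  awk -v a="$1" -v b="$2" 'BEGIN { print ((a + 0) == (b + 0)) }'
}

num_eq 01 1                           # prints 1: leading zero is an alias
num_eq 1e2 100.0                      # prints 1: exponent form is an alias
num_eq -0 0                           # prints 1: signed zeros compare equal
num_eq 1.000000000000001e2 100.0      # prints 0: a double can tell these apart
num_eq 1.00000000000000001e2 100.0    # prints 1: the difference is below double precision
```

The last two lines show that whether two keys are "equal" depends on the
precision of the type they are parsed into, so a numeric join option would
need to document which type it uses.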




This bug report was last modified 6 years and 260 days ago.


