GNU bug report logs - #26029
Problems with join

Package: coreutils;

Reported by: "Peter Kluge" <linux-projekt <at> techno.ms>

Date: Wed, 8 Mar 2017 17:43:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Reuti <reuti <at> staff.uni-marburg.de>
Cc: 26029 <at> debbugs.gnu.org
Subject: bug#26029: Problems with join
Date: Thu, 9 Mar 2017 17:20:43 +0000

Hello Reuti and all,

Reuti wrote:
>> […] The strange thing seems to be, that "-j1 2" is handled like "-1
>> 2". 
>
> My investigations revealed: on a Mac the man page of `join` explains
> the behavior. The options -j, -j1 and -j2 are listed with the BSD
> version of `join` as being there for compatibility. This leads to the
> assumption, that nowadays -1 and -2 should better be used.

Thanks for investigating and pointing this out!

Join's manual section was recently expanded; I wish I had been aware
of this nuance before I wrote the patch. I will send a patch with
improved documentation.

On Thu, Mar 09, 2017 at 05:29:13PM +0100, Reuti wrote:
>> On 09.03.2017 at 16:32, Peter Kluge <linux-projekt <at> techno.ms> wrote:
>>
>> I prefer the "POSIX"-Standard teaching to my participants.
>
>Aha, I didn't check this. Then the "-j" option should be moved to a new section "Deprecated" in the man/info page of the coreutils version too. (And mention the special handling of -j1 resp. -j2, while -j3 … works as one expects.)

I would humbly suggest other wording: I'm not sure '-j' is deprecated.
It is useful, and does work as expected in most cases.

But, it should be better documented to warn against this edge-case.

Reuti wrote:
> -j FIELD equivalent to '-1 FIELD -2 FIELD'
> 
> does not work in all cases essentially.

It 'just works' in most cases, but indeed we should improve the 
documentation about edge cases.

First,
this is the relevant section that handles the '-j' parameter:
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/join.c#n1079


Second,
let's confirm that '-jN' works in the common cases,
when it is *not* followed by a number:

Two input files:

   $ cat a.txt
   1 2 3 aaa
   2 3 4 bbb

   $ cat b.txt
   1 2 3 XXX
   2 3 4 YYY

'-j1' alone is equivalent to '-1 1 -2 1':

   $ join -1 1 -2 1 a.txt b.txt
   1 2 3 aaa 2 3 XXX
   2 3 4 bbb 3 4 YYY

   $ join -j1 a.txt b.txt
   1 2 3 aaa 2 3 XXX
   2 3 4 bbb 3 4 YYY

'-j2' alone is equivalent to '-1 2 -2 2':

   $ join -1 2 -2 2 a.txt b.txt
   2 1 3 aaa 1 3 XXX
   3 2 4 bbb 2 4 YYY

   $ join -j2 a.txt b.txt
   2 1 3 aaa 1 3 XXX
   3 2 4 bbb 2 4 YYY

'-j3' alone is equivalent to '-1 3 -2 3':

   $ join -1 3 -2 3 a.txt b.txt
   3 1 2 aaa 1 2 XXX
   4 2 3 bbb 2 3 YYY

   $ join -j3 a.txt b.txt
   3 1 2 aaa 1 2 XXX
   4 2 3 bbb 2 3 YYY

So, in the most common cases, '-jN' works for all Ns
(for "all" being 1,2,3 but really, who needs more than 3 numbers? :) ).
This may differ from BSD's join.
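
The comparisons above can be repeated mechanically. The following is
only a sketch: it recreates a.txt and b.txt in a scratch directory
(the use of mktemp and the variable names are my own choices, not part
of the original examples) and checks that each '-jN' gives the same
output as the explicit '-1 N -2 N' form.

```shell
#!/bin/sh
# Recreate the two sample files from the message in a scratch directory.
tmpdir=$(mktemp -d)
printf '1 2 3 aaa\n2 3 4 bbb\n' > "$tmpdir/a.txt"
printf '1 2 3 XXX\n2 3 4 YYY\n' > "$tmpdir/b.txt"

# For each N, compare the short form (-jN) with the explicit form (-1 N -2 N).
mismatches=0
for n in 1 2 3; do
    short=$(join "-j$n" "$tmpdir/a.txt" "$tmpdir/b.txt")
    explicit=$(join -1 "$n" -2 "$n" "$tmpdir/a.txt" "$tmpdir/b.txt")
    if [ "$short" = "$explicit" ]; then
        echo "-j$n matches -1 $n -2 $n"
    else
        echo "-j$n differs from -1 $n -2 $n"
        mismatches=$((mismatches + 1))
    fi
done

# Clean up the scratch files.
rm -rf "$tmpdir"
```

With GNU join, all three forms should match, as in the transcripts
above. Other implementations may behave differently.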


Now comes the tricky part:
If '-j1' or '-j2' is followed by another parameter,
and that parameter turns out *not* to be a valid field number,
it is treated like '-j 1' or '-j 2' (i.e. '-1 N -2 N'), and join just
"does the right thing":

   $ join -j2 -i a.txt b.txt
   2 1 3 aaa 1 3 XXX
   3 2 4 bbb 2 4 YYY

This is implemented here:
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/join.c#n1171
The result is that most of the time, join "just works" (IMHO, but
other opinions are welcome).
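
A quick way to check this case, again as a sketch only (the scratch
directory and variable names are my own additions): run '-j2 -i' and
the explicit '-1 2 -2 2 -i' on the same sample files and compare the
output.

```shell
#!/bin/sh
# Recreate the two sample files from the message in a scratch directory.
tmpdir=$(mktemp -d)
printf '1 2 3 aaa\n2 3 4 bbb\n' > "$tmpdir/a.txt"
printf '1 2 3 XXX\n2 3 4 YYY\n' > "$tmpdir/b.txt"

# '-j2' followed by another option (-i), as in the example above.
with_option=$(join -j2 -i "$tmpdir/a.txt" "$tmpdir/b.txt")
# The explicit spelling of the same request.
explicit=$(join -1 2 -2 2 -i "$tmpdir/a.txt" "$tmpdir/b.txt")

if [ "$with_option" = "$explicit" ]; then
    echo "same output"
else
    echo "outputs differ"
fi

# Clean up the scratch files.
rm -rf "$tmpdir"
```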


If '-j1' or '-j2' is followed by a number, this is where the
unexpected behaviour occurs, as it sets the key field for that file
alone. E.g. '-j1 2' is equivalent to '-1 2' (and the key for the second
file is not set, thus defaults to 1):

   $ join -j1 2 a.txt b.txt
   2 1 3 aaa 3 4 YYY

   $ join -1 2 a.txt b.txt
   2 1 3 aaa 3 4 YYY
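
To make the difference visible side by side, here is a small sketch
(the scratch directory and variable names are my own additions) that
runs the ambiguous '-j1 2' form next to '-1 2' and next to the fully
explicit '-1 2 -2 2'. It assumes GNU join; other implementations may
parse '-j1 2' differently.

```shell
#!/bin/sh
# Recreate the two sample files from the message in a scratch directory.
tmpdir=$(mktemp -d)
printf '1 2 3 aaa\n2 3 4 bbb\n' > "$tmpdir/a.txt"
printf '1 2 3 XXX\n2 3 4 YYY\n' > "$tmpdir/b.txt"

# The edge case: '-j1' followed by a number.
edge=$(join -j1 2 "$tmpdir/a.txt" "$tmpdir/b.txt")
# Key field 2 for the first file only; the second file defaults to field 1.
file1_only=$(join -1 2 "$tmpdir/a.txt" "$tmpdir/b.txt")
# Key field 2 for both files, spelled out so nothing is left to guess.
both=$(join -1 2 -2 2 "$tmpdir/a.txt" "$tmpdir/b.txt")

echo "-j1 2     : $edge"
echo "-1 2      : $file1_only"
echo "-1 2 -2 2 :"
echo "$both"

# Clean up the scratch files.
rm -rf "$tmpdir"
```

If the intent is to join on field 2 of both files, spelling out
'-1 2 -2 2' sidesteps the ambiguity.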


Is the above a satisfactory explanation?
If so, it'll be more-or-less what I'll add to the manual.

I see that this has been implemented back in 2005, here:
https://git.savannah.gnu.org/cgit/coreutils.git/commit/src/join.c?id=f9118c1c2e35b
with the comment:
 "Parse obsolete options -j1 and -j2
  so that it is a pure extension to POSIX 1003.1-2001."

I can perhaps guesstimate that since this usage is never
mentioned anywhere, it is considered undocumented and discouraged usage
(and indeed, I don't think I've ever encountered it, or previously
seen a bug report or question about it, so it's rather rare).

We could add a warning to the man page - what do others think?

regards,
- assaf









This bug report was last modified 6 years and 202 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.