GNU bug report logs - #16468
join

Package: coreutils;

Reported by: barry kesner <modockesner <at> gmail.com>

Date: Thu, 16 Jan 2014 17:07:01 UTC

Severity: normal

Tags: notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

From: Eric Blake <eblake <at> redhat.com>
To: barry kesner <modockesner <at> gmail.com>, 16468 <at> debbugs.gnu.org
Subject: bug#16468: join
Date: Thu, 16 Jan 2014 11:10:11 -0700
[re-adding the list, with permission]

On 01/16/2014 10:46 AM, barry kesner wrote:
> Eric,
>   Thanks for response.
>  I now realize it wants sorted alpha input not numerical.  999 1000 1001 is
> how it is sorted.

I think there have been requests in the past to enhance 'join' so that
it can have more fine-tuned control over how its fields are selected.
Maybe something like sharing code so that 'join -1 k1,1n' would behave
as if it were using 'sort -k1,1n' ordering on file 1.  But right now,
that functionality doesn't exist.
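
To make the ordering mismatch concrete, here is a small sketch (not part
of the original thread) that sorts the three keys from the report both
byte-wise and numerically.  It assumes a POSIX shell and the C locale;
the variable names are illustrative only.

```sh
# Three keys in ascending numeric order, as quoted in the report.
keys=$(printf '999\n1000\n1001\n')

# Byte-wise (C locale) order compares character by character,
# so "1000" sorts before "999" because '1' < '9'.
lexical=$(printf '%s\n' "$keys" | LC_ALL=C sort)

# Numeric order, for comparison.
numeric=$(printf '%s\n' "$keys" | LC_ALL=C sort -n)

printf 'lexical:\n%s\nnumeric:\n%s\n' "$lexical" "$numeric"
```

The first listing puts 1000 and 1001 ahead of 999, which is the order
join expects in the C locale; the second is the numeric order the files
are actually in.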

> 
>   How do you tell join this without resorting.  The files are huge!

Unfortunately, there isn't any really good way, short of re-processing
the files to make the data appear sorted in the order join expects.
That said, it certainly appears that for your given data, you can write
a sed filter that can reprocess on a line-by-line basis, and feed that
into join, without the penalty of having to re-sort the entire file and
without having to have the processed file stored in your file system all
at once.  It also seems possible to write a post filter that restores
the lines to the style of the original file.  Here, extensions such as
bash's process substitution
  join <(infilter file1) <(infilter file2) | outfilter
make this easier to type (the trick is then to write the correct sed
scripts to serve as infilter and outfilter) than the alternative of
using named FIFOs, which is what you would need if you limited yourself
to POSIX semantics.
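
As a rough sketch of that idea (not taken from the thread), the
following uses the POSIX named-FIFO route with two hypothetical sed
filters.  The filter bodies are assumptions: they suppose field 1 is a
non-negative integer of at most 10 digits followed by a single space,
and that both files are already in numeric order.  Zero-padding the key
to a fixed width makes byte-wise order agree with numeric order, so
join can stream the files without a re-sort.  The sample data and file
names are made up.

```sh
# Hypothetical infilter: zero-pad the leading integer key to 10 digits.
# Assumes the key has at most 10 digits and is followed by one space.
infilter()  { sed -e 's/^/0000000000/' -e 's/^0*\([0-9]\{10\}\) /\1 /'; }

# Hypothetical outfilter: strip the padding again, keeping one digit.
outfilter() { sed -e 's/^0*\([0-9]\)/\1/'; }

# Made-up sample inputs, numerically sorted on field 1.
tmp=$(mktemp -d)
printf '999 a\n1000 b\n1001 c\n' > "$tmp/file1"
printf '999 x\n1001 z\n'         > "$tmp/file2"

# Named FIFOs let join read the filtered streams line by line,
# so no padded copy of either file is written to disk.
mkfifo "$tmp/fifo1" "$tmp/fifo2"
infilter < "$tmp/file1" > "$tmp/fifo1" &
infilter < "$tmp/file2" > "$tmp/fifo2" &

result=$(LC_ALL=C join "$tmp/fifo1" "$tmp/fifo2" | outfilter)
wait
printf '%s\n' "$result"

rm -rf "$tmp"
```

In bash, the FIFO setup collapses to the one-line process substitution
form.  Real data with wider keys, signed numbers, or a different field
separator would need different sed expressions.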

> 
> I can't find LC_COLLATE?

It's an environment variable, like LC_ALL, that affects your locale.
Running 'locale' will show you your current locale settings, including
LC_COLLATE.  Setting LC_ALL in the environment is shorthand that forces
all other categories to behave the same, so it's easier to test whether
'LC_ALL=C command' has an effect than it is to figure out which locale
category(ies) matter.
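
A short sketch of that test (an illustration, not from the original
message): print the collation setting currently in effect, then run a
command with LC_ALL=C for that one invocation.  The sample input is
made up, and the 'locale' utility is assumed to be installed, hence the
guard.

```sh
# Show the collation category currently in effect, if 'locale' exists.
if command -v locale >/dev/null 2>&1; then
    locale | grep '^LC_COLLATE'
fi

# Force the C locale for a single command: sort then compares raw
# bytes, so uppercase ASCII letters come before lowercase ones.
c_order=$(printf 'b\nA\na\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_order"
```

If the same pipeline without LC_ALL=C prints a different order, the
locale is affecting collation, and the same would apply to what join
considers sorted.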

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
