[re-adding the list, with permission]

On 01/16/2014 10:46 AM, barry kesner wrote:
> Eric,
>   Thanks for response.
>  I now realize it wants sorted alpha input not numerical.  999 1000 1001 is
> how it is sorted.

I think there have been requests in the past to enhance 'join' so that
it has finer-grained control over how its fields are selected.  Maybe
something like sharing code with 'sort', so that 'join -1 k1,1n' would
behave as if it were using 'sort -k1,1n' ordering on file 1.  But right
now, that functionality doesn't exist.
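
To illustrate the mismatch, here is a small sketch using only 'sort' on
the three keys from your report.  It shows that numeric order and the
byte-wise order used for comparison in the C locale are not the same,
which is why join treats numerically sorted input as unsorted:

```shell
# Numeric order, which is how the files are currently sorted.
printf '1000\n999\n1001\n' | sort -n
# prints 999, 1000, 1001 (one per line)

# Byte-wise order in the C locale: '1' sorts before '9', so 999 comes last.
printf '1000\n999\n1001\n' | LC_ALL=C sort
# prints 1000, 1001, 999 (one per line)
```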

> 
>   How do you tell join this without resorting.  The files are huge!

Unfortunately, there isn't any really good way, short of re-processing
the files to make the data appear sorted in the order join expects.
That said, it certainly appears that for your given data, you can write
a sed filter that reprocesses the input line by line and feeds the
result into join, without the penalty of re-sorting the entire file and
without having to store the processed file in your file system all at
once.  It also seems possible to write a post filter to get back to the
style of the line in the original file.  Here, extensions such as bash's
process substitution:
  join <(infilter file1) <(infilter file2) | outfilter
make it easier to type than the alternative of named fifos, which is
what you are limited to if you stick to just POSIX semantics.  The trick
is now to write the correct sed scripts to serve as infilter and
outfilter.
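
As a rough sketch of what such filters might look like (I have not seen
your real data, so treat every detail as an assumption): suppose field 1
is a non-negative integer of at most 10 digits, followed by a space, and
both files are already in numeric order.  Zero-padding the key to a
fixed width makes numeric order and byte-wise order agree, and the post
filter strips the padding again.  The names infilter, outfilter and
join_numeric are made up for this example, and the fifo variant is used
so that it runs under a plain POSIX shell (mktemp is not in POSIX, but
is widely available):

```shell
# infilter FILE: zero-pad the leading integer key to exactly 10 digits.
# Assumes every line starts with a key of at most 10 digits and a space.
infilter() {
    sed -e 's/^/000000000/' -e 's/^0*\([0-9]\{10\}\) /\1 /' "$1"
}

# outfilter: strip the padding from the key, keeping at least one digit.
outfilter() {
    sed 's/^0*\([0-9]\)/\1/'
}

# join_numeric FILE1 FILE2: join two numerically sorted files on field 1
# without re-sorting them and without storing the padded copies on disk.
join_numeric() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/f1" "$dir/f2" || return 1
    infilter "$1" > "$dir/f1" &
    infilter "$2" > "$dir/f2" &
    LC_ALL=C join "$dir/f1" "$dir/f2" | outfilter
    wait
    rm -rf "$dir"
}
```

With bash, the body of join_numeric collapses to
'LC_ALL=C join <(infilter "$1") <(infilter "$2") | outfilter'.  If your
keys can be wider than 10 digits, or are not followed by a space, the
sed expressions need adjusting.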

> 
> I can't find LC_COLLATE?

It's an environment variable, like LC_ALL, that affects your locale.
Running 'locale' will show you your current locale settings, including
LC_COLLATE.  Setting LC_ALL in the environment is shorthand that forces
all other categories to behave the same, so it's easier to test whether
'LC_ALL=C command' has an effect than it is to figure out which locale
category(ies) matter.
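
A quick way to see this for yourself (a sketch; en_US.UTF-8 is only an
example locale name and may not be installed on your system, in which
case the result is the same anyway):

```shell
# Show the effective locale categories, including LC_COLLATE.
# (Guarded, since some minimal systems lack the 'locale' utility.)
locale 2>/dev/null || echo 'locale utility not available'

# LC_ALL overrides every individual category, so the LC_COLLATE
# setting here is ignored and the sort is byte-wise.
printf 'b\nA\na\nB\n' | LC_COLLATE=en_US.UTF-8 LC_ALL=C sort
# prints A, B, a, b (one per line)
```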

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org