GNU bug report logs - #20511
split : does not account for --numeric-suffixes=FROM in calculation of suffix length?


Package: coreutils;

Reported by: Ben Rusholme <rusholme <at> caltech.edu>

Date: Tue, 5 May 2015 20:45:03 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.



Message #14 received at 20511 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Ben Rusholme <rusholme <at> caltech.edu>, 20511 <at> debbugs.gnu.org
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Wed, 06 May 2015 11:53:23 +0100

On 06/05/15 05:29, Ben Rusholme wrote:
> As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs).
> 
>> Now you could bump the suffix length based on the start number,
>> though I don't think we should as that would impact on future
>> processing (ordering) of the resultant files.  I.E. specifying
>> a FROM value to --numeric-suffixes should only impact the
>> start value, rather than the width.
> 
> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?

There are two use cases supported by specifying FROM.
1. Setting the start for a single run (FROM is usually 1 in this case)
2. Setting the offset for multiple independent split runs.
In the second case we can't infer the size of the total set
in any particular run, and thus require that --suffix-length is specified appropriately.
I.E. for multiple independent runs, the suffix length needs to be
fixed width across the entire set for the total ordering to be correct.
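
To illustrate the ordering point, here is a small Python sketch (not part of split itself; the helper name `suffixed_names` and the `x` prefix are just for illustration). It shows that names from independent runs only sort correctly when every run uses the same suffix width:

```python
def suffixed_names(start, count, width, prefix="x"):
    """Return zero-padded output names for one hypothetical split run.

    Numbers wider than `width` are simply printed in full, which is
    what makes mixed widths sort badly.
    """
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + count)]


# Two independent runs whose widths differ: the second run was given
# a start value that needs 3 digits, the first only needed 2.
mixed = sorted(suffixed_names(98, 2, width=2) + suffixed_names(100, 2, width=3))

# The same two runs with a fixed width (as with an explicit -a 3).
fixed = sorted(suffixed_names(98, 2, width=3) + suffixed_names(100, 2, width=3))

print(mixed)  # lexical order puts x100 before x98
print(fixed)  # lexical order matches numeric order
```

With mixed widths the lexical order is x100, x101, x98, x99; with a fixed width of 3 it is x098, x099, x100, x101.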


Things we could change are...

1. Special case FROM=1 to assume a single run and thus
enable auto suffix expansion or appropriately sized suffix with CHUNK.
This would be a backwards-incompatible change, and FROM=1 would
not guarantee a single run anyway, so I'm reluctant to do that.

2. Give an early error with specified FROM and CHUNK
that would overflow the suffix size for CHUNK.
This would save some processing, though doesn't add
any protections against latent issues. I.E. you still get
the error which is dependent on the parameters rather than the input data size.
Therefore it's probably not worth the complication.

3. Leave suffix length at 2 when both FROM and CHUNK are specified.
In retrospect, this would probably have been the best option
to avoid ambiguities like this. However, we would now be breaking
compatibility with scripts that use FROM=1 and CHUNK=200 etc.,
although CHUNK values > 100 would be unusual.

4. Auto set the suffix len based on FROM + CHUNK.
That would support use case 1 (single run),
but _silently_ break subsequent processing order
of outputs from multiple split runs
(as FROM is increased in multiples of CHUNK size).
We could mitigate the _silent_ breakage though
by limiting this change to when FROM < CHUNK.

5. Document in the man page, and in more detail in the info docs,
that -a is recommended when specifying FROM.
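
As a rough Python sketch of options 2 and 4 (this is not the split.c code; the function names are hypothetical, and using the default width of 2 as a floor is an assumption), the suffix width could be derived from the last number emitted, counting FROM only when FROM < CHUNK:

```python
def digits_needed(last_number):
    """Decimal digits needed to print last_number (a non-negative int)."""
    return len(str(last_number))


def auto_suffix_length(chunks, start=0, default=2):
    """Option 4 sketch: widen the suffix for FROM only when FROM < CHUNK.

    When start >= chunks we assume one of several independent runs and
    leave the width alone, so the user has to pass -a explicitly.
    """
    last = chunks - 1
    if start < chunks:
        last += start
    return max(default, digits_needed(last))


def overflows(start, chunks, width):
    """Option 2 sketch: would the last number not fit in `width` digits?"""
    return start + chunks - 1 >= 10 ** width


print(auto_suffix_length(100))             # numbers 0..99
print(auto_suffix_length(100, start=1))    # numbers 1..100
print(auto_suffix_length(100, start=100))  # treated as a later run
print(overflows(1, 100, 2))                # FROM=1, CHUNK=100, width 2
```

Under these assumptions FROM=1 with CHUNK=100 gets a width of 3, while FROM=100 with CHUNK=100 stays at 2 and still relies on an explicit --suffix-length.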

So I'll do 4 and 5 I think.

cheers,
Pádraig.




This bug report was last modified 10 years and 8 days ago.


