GNU bug report logs - #20511
split : does not account for --numeric-suffixes=FROM in calculation of suffix length?

Package: coreutils;

Reported by: Ben Rusholme <rusholme <at> caltech.edu>

Date: Tue, 5 May 2015 20:45:03 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Message #25 received at 20511-done <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Ben Rusholme <rusholme <at> caltech.edu>, 20511-done <at> debbugs.gnu.org
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Wed, 13 May 2015 02:20:27 +0100
[Message part 1 (text/plain, inline)]
On 06/05/15 11:53, Pádraig Brady wrote:
> On 06/05/15 05:29, Ben Rusholme wrote:
>> As you say, this can always be fixed with the "--suffix-length" argument, but that is only required for certain combinations of FROM and CHUNK (and “split” already has all the information it needs).
>>
>>> Now you could bump the suffix length based on the start number,
>>> though I don't think we should as that would impact on future
>>> processing (ordering) of the resultant files.  I.E. specifying
>>> a FROM value to --numeric-suffixes should only impact the
>>> start value, rather than the width.
>>
>> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?
> 
> There are two use cases supported by specifying FROM.
> 1. Setting the start for a single run (FROM is usually 1 in this case)
> 2. Setting the offset for multiple independent split runs.
> In the second case we can't infer the size of the total set
> in any particular run, and thus require that --suffix-length is specified appropriately.
> I.E. for multiple independent runs, the suffix length needs to be
> fixed width across the entire set for the total ordering to be correct.
> 
> 
> Things we could change are...
> 
> 1. Special case FROM=1 to assume a single run and thus
> enable auto suffix expansion or appropriately sized suffix with CHUNK.
> This would be a backwards-incompatible change, and FROM=1 does not
> guarantee a single run, so I'm reluctant to do that.
> 
> 2. Give an early error with specified FROM and CHUNK
> that would overflow the suffix size for CHUNK.
> This would save some processing, though doesn't add
> any protections against latent issues. I.E. you still get
> the error which is dependent on the parameters rather than the input data size.
> Therefore it's probably not worth the complication.
> 
> 3. Leave suffix length at 2 when both FROM and CHUNK are specified.
> In retrospect, this would probably have been the best option
> to avoid ambiguities like this. However now we'd be breaking
> compat with scripts with FROM=1 and CHUNK=200 etc.,
> though CHUNK values > 100 would be unusual.
> 
> 4. Auto set the suffix len based on FROM + CHUNK.
> That would support use case 1 (single run),
> but _silently_ break subsequent processing order
> of outputs from multiple split runs
> (as FROM is increased in multiples of CHUNK size).
> We could mitigate the _silent_ breakage though
> by limiting this change to when FROM < CHUNK.
> 
> 5. Document in man page and with more detail in info docs
> that -a is recommended when specifying FROM.
> 
> So I'll do 4 and 5 I think.

Attached.

cheers,
Pádraig

[split-from-width.patch (text/x-patch, attachment)]
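
The trade-off behind option 4 can be illustrated with a small Python sketch. This is not the attached patch and does not reproduce split's internals: the function names are hypothetical, the default width of 2 and the "x" prefix follow split's defaults, and the assumption that the last suffix is FROM + CHUNK - 1 is made for illustration only. It shows the width growing only when FROM < CHUNK, and why files from independent runs with differing widths stop sorting in creation order.

```python
def suffix_width(start: int, chunks: int, default: int = 2) -> int:
    """Suffix width under the rule sketched in option 4 (illustrative only)."""
    # Only fold FROM into the width when FROM < CHUNK, i.e. when this
    # looks like a single run rather than one run in an offset series.
    last = chunks - 1
    if start < chunks:
        last += start
    return max(default, len(str(last)))


def output_names(start: int, chunks: int, width: int, prefix: str = "x") -> list[str]:
    """Zero-padded output names for one run, in creation order."""
    return [f"{prefix}{n:0{width}d}" for n in range(start, start + chunks)]


# Single run: FROM=1, CHUNK=200 needs three digits (last suffix is 200).
print(suffix_width(1, 200))    # 3

# Two independent runs whose widths differ no longer sort in creation
# order: "x100" sorts between "x10" and "x11".
mixed = output_names(0, 100, 2) + output_names(100, 100, 3)
print(sorted(mixed) == mixed)  # False

# The same two runs with one fixed width (as with -a 3) keep their order.
fixed = output_names(0, 100, 3) + output_names(100, 100, 3)
print(sorted(fixed) == fixed)  # True
```

With FROM=100 and CHUNK=100 the sketch leaves the width at 2, matching the point that offset runs still need an explicit --suffix-length.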

This bug report was last modified 10 years and 8 days ago.

