On 06/05/15 11:53, Pádraig Brady wrote:
> On 06/05/15 05:29, Ben Rusholme wrote:
>> As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs).
>>
>>> Now you could bump the suffix length based on the start number,
>>> though I don't think we should as that would impact on future
>>> processing (ordering) of the resultant files. I.E. specifying
>>> a FROM value to --numeric-suffixes should only impact the
>>> start value, rather than the width.
>>
>> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?
>
> There are two use cases supported by specifying FROM.
> 1. Setting the start for a single run (FROM is usually 1 in this case)
> 2. Setting the offset for multiple independent split runs.
> In the second case we can't infer the size of the total set
> in any particular run, and thus require that --suffix-length is specified appropriately.
> I.E. for multiple independent runs, the suffix length needs to be
> fixed width across the entire set for the total ordering to be correct.
>
>
> Things we could change are...
>
> 1. Special case FROM=1 to assume a single run and thus
>    enable auto suffix expansion or appropriately sized suffix with CHUNK.
>    This would be a backwards incompat change and also not
>    guaranteed a single run, so I'm reluctant to do that.
>
> 2. Give an early error with specified FROM and CHUNK
>    that would overflow the suffix size for CHUNK.
>    This would save some processing, though doesn't add
>    any protections against latent issues. I.E. you still get
>    the error which is dependent on the parameters rather than the input data size.
>    Therefore it's probably not worth the complication.
>
> 3. Leave suffix length at 2 when both FROM and CHUNK are specified.
>    In retrospect, this would probably have been the best option
>    to avoid ambiguities like this. However now we'd be breaking
>    compat with scripts with FROM=1 and CHUNK=200 etc.
>    While CHUNK values > 100 would be unusual
>
> 4. Auto set the suffix len based on FROM + CHUNK.
>    That would support use case 1 (single run),
>    but _silently_ break subsequent processing order
>    of outputs from multiple split runs
>    (as FROM is increased in multiples of CHUNK size).
>    We could mitigate the _silent_ breakage though
>    by limiting this change to when FROM < CHUNK.
>
> 5. Document in man page and with more detail in info docs
>    that -a is recommended when specifying FROM
>
> So I'll do 4 and 5 I think.

Attached.

cheers,
Pádraig
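
For illustration only (this is not the attached patch): a minimal Python sketch of the width rule described in option 4, assuming the last file of a run is numbered FROM + CHUNK - 1 and that the default suffix length is 2. The helper name auto_suffix_length and its parameters are hypothetical and do not come from split's source.

```python
def auto_suffix_length(start: int, chunks: int, default: int = 2) -> int:
    """Hypothetical helper sketching option 4; not split's actual code.

    start  -- the FROM value given to --numeric-suffixes
    chunks -- the CHUNK count (number of output files in this run)
    """
    if start < chunks:
        # Looks like a single run: size the suffix so that the last
        # number used (assumed to be start + chunks - 1) still fits.
        last = start + chunks - 1
        return max(default, len(str(last)))
    # FROM >= CHUNK looks like one of several independent runs, so the
    # width is left alone and the caller is expected to pass -a.
    return default


if __name__ == "__main__":
    for start, chunks in [(1, 50), (1, 200), (200, 200)]:
        print(start, chunks, auto_suffix_length(start, chunks))
```

With FROM=1 and CHUNK=200 the sketch widens the suffix to 3, while FROM=200 and CHUNK=200 keeps the default of 2, which is the case where -a would still have to be given by hand.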