GNU bug report logs - #20511
split : does not account for --numeric-suffixes=FROM in calculation of suffix length?


Package: coreutils;

Reported by: Ben Rusholme <rusholme <at> caltech.edu>

Date: Tue, 5 May 2015 20:45:03 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20511 in the body.
You can then email your comments to 20511 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#20511; Package coreutils. (Tue, 05 May 2015 20:45:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ben Rusholme <rusholme <at> caltech.edu>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 05 May 2015 20:45:04 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ben Rusholme <rusholme <at> caltech.edu>
To: bug-coreutils <at> gnu.org
Subject: split : does not account for --numeric-suffixes=FROM in calculation
 of suffix length?
Date: Tue, 5 May 2015 13:42:12 -0700
Hi,

“split” (in the current GNU coreutils 8.23 release) does not account for the optional start index (“split --numeric-suffixes=FROM”) when calculating suffix length.

I couldn’t find any prior reference to this problem in either the bug tracker or mailing list archive.

Thanks, Ben



$ seq 100 >& input.txt
$ split --numeric-suffixes --number=l/100 input.txt
$ ls
input.txt  x06  x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
x00        x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96


$ rm x*
$ split --numeric-suffixes=1 --number=l/100 input.txt
split: output file suffixes exhausted
$ ls
input.txt  x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
x06        x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
$ # Should run from x001 to x100!


$ rm x*
$ split --numeric-suffixes=1 --number=l/101 input.txt
$ ls
input.txt  x008  x016  x024  x032  x040  x048  x056  x064  x072  x080  x088  x096
x001       x009  x017  x025  x033  x041  x049  x057  x065  x073  x081  x089  x097
x002       x010  x018  x026  x034  x042  x050  x058  x066  x074  x082  x090  x098
x003       x011  x019  x027  x035  x043  x051  x059  x067  x075  x083  x091  x099
x004       x012  x020  x028  x036  x044  x052  x060  x068  x076  x084  x092  x100
x005       x013  x021  x029  x037  x045  x053  x061  x069  x077  x085  x093  x101
x006       x014  x022  x030  x038  x046  x054  x062  x070  x078  x086  x094
x007       x015  x023  x031  x039  x047  x055  x063  x071  x079  x087  x095


Information forwarded to bug-coreutils <at> gnu.org:
bug#20511; Package coreutils. (Tue, 05 May 2015 21:59:02 GMT) Full text and rfc822 format available.

Message #8 received at 20511 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Ben Rusholme <rusholme <at> caltech.edu>, 20511 <at> debbugs.gnu.org
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Tue, 05 May 2015 22:58:31 +0100
On 05/05/15 21:42, Ben Rusholme wrote:
> Hi,
> 
> “split” (in the current GNU coreutils 8.23 release) does not account for the optional start index (“split --numeric-suffixes=FROM”) when calculating suffix length.
> 
> I couldn’t find any prior reference to this problem in either the bug tracker or mailing list archive.
> 
> Thanks, Ben
> 
> 
> 
> $ seq 100 >& input.txt
> $ split --numeric-suffixes --number=l/100 input.txt
> $ ls
> input.txt  x06  x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
> x00        x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
> x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
> x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
> x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
> x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
> x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
> 
> 
> $ rm x*
> $ split --numeric-suffixes=1 --number=l/100 input.txt
> split: output file suffixes exhausted
> $ ls
> input.txt  x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
> x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
> x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
> x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
> x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
> x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
> x06        x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
> $ # Should run from x001 to x100!
> 
> 
> $ rm x*
> $ split --numeric-suffixes=1 --number=l/101 input.txt
> $ ls
> input.txt  x008  x016  x024  x032  x040  x048  x056  x064  x072  x080  x088  x096
> x001       x009  x017  x025  x033  x041  x049  x057  x065  x073  x081  x089  x097
> x002       x010  x018  x026  x034  x042  x050  x058  x066  x074  x082  x090  x098
> x003       x011  x019  x027  x035  x043  x051  x059  x067  x075  x083  x091  x099
> x004       x012  x020  x028  x036  x044  x052  x060  x068  x076  x084  x092  x100
> x005       x013  x021  x029  x037  x045  x053  x061  x069  x077  x085  x093  x101
> x006       x014  x022  x030  x038  x046  x054  x062  x070  x078  x086  x094
> x007       x015  x023  x031  x039  x047  x055  x063  x071  x079  x087  x095

The info docs say about the --numeric-suffixes option:

  Note specifying a FROM value also disables the default auto suffix
  length expansion described above, and so you may also want to
  specify ‘-a’ to allow suffixes beyond ‘99’.

Now, also specifying a fixed number of files with --number
automatically sets the suffix length based on that number, i.e. when
you specified -nl/101 it bumped the suffix length to 3.
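
The arithmetic behind the two runs in the report can be sketched in shell. This is only an illustration: `width_for_count` and `last_suffix` are hypothetical helper names, not part of split, and the sketch assumes the width is the number of decimal digits needed to number the requested files starting from 0, with a minimum of 2.

```shell
# Hypothetical helpers for illustration; this is not split's own code.

# Digits needed to number COUNT files starting at 0 (assumed minimum of 2).
width_for_count() {
  wfc_last=$(( $1 - 1 ))
  wfc_width=${#wfc_last}
  if [ "$wfc_width" -lt 2 ]; then
    wfc_width=2
  fi
  echo "$wfc_width"
}

# Largest suffix value written when numbering COUNT files from FROM.
# $1 = FROM, $2 = COUNT
last_suffix() {
  echo $(( $1 + $2 - 1 ))
}

width_for_count 100   # prints 2
last_suffix 1 100     # prints 100, which does not fit in 2 digits
width_for_count 101   # prints 3
last_suffix 1 101     # prints 101, which fits in 3 digits
```

Under that assumption, `--numeric-suffixes=1 --number=l/100` runs out of suffixes at the final file, while `--number=l/101` happens to get a width that is large enough.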

Now you could bump the suffix length based on the start number,
though I don't think we should as that would impact on future
processing (ordering) of the resultant files.  I.E. specifying
a FROM value to --numeric-suffixes should only impact the
start value, rather than the width.

In other words if you were to split 2 files into 200 parts like:
  split                        --number=l/100 input1.txt
  split --numeric-suffixes=100 --number=l/100 input2.txt
Then you really need to be specifying -a3 to set
the suffix length appropriately.

We might be able to give an earlier error in this case,
and we should probably clarify the info docs a bit more.
I'll think about it.

cheers,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#20511; Package coreutils. (Wed, 06 May 2015 04:30:06 GMT) Full text and rfc822 format available.

Message #11 received at 20511 <at> debbugs.gnu.org (full text, mbox):

From: Ben Rusholme <rusholme <at> caltech.edu>
To: 20511 <at> debbugs.gnu.org
Cc: Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Tue, 5 May 2015 21:29:19 -0700
Hi,

> The info docs say about the --numeric-suffixes option:
> 
>  Note specifying a FROM value also disables the default auto suffix
>  length expansion described above, and so you may also want to
>  specify ‘-a’ to allow suffixes beyond ‘99’.

This does not seem to be the case: auto suffix works fine beyond 99 (in the current 8.23 release).

$ seq 1000000 >& input.txt
$ split --numeric-suffixes=1234 --number=l/5678 input.txt
$ ls | tail
x6902
x6903
x6904
x6905
x6906
x6907
x6908
x6909
x6910
x6911

It just fails wherever FROM pushes CHUNKS over a power of 10:

$ rm x*
$ split --numeric-suffixes --number=l/10000 input.txt
$ ls | tail -n 3
x9997
x9998
x9999
$
$ rm x*
$ split --numeric-suffixes=1 --number=l/10000 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0001
x0002
$
$ rm x*
$ split --numeric-suffixes=2 --number=l/9999 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0002
x0003

As you say, this can always be fixed with the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK (and “split” already has all the information it needs).


> Now you could bump the suffix length based on the start number,
> though I don't think we should as that would impact on future
> processing (ordering) of the resultant files.  I.E. specifying
> a FROM value to --numeric-suffixes should only impact the
> start value, rather than the width.

Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?  I assume the crucial test is the inverse operation:

$ cat x* >& output.txt
$ diff input.txt output.txt
$

Thanks, Ben





Information forwarded to bug-coreutils <at> gnu.org:
bug#20511; Package coreutils. (Wed, 06 May 2015 10:54:02 GMT) Full text and rfc822 format available.

Message #14 received at 20511 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Ben Rusholme <rusholme <at> caltech.edu>, 20511 <at> debbugs.gnu.org
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Wed, 06 May 2015 11:53:23 +0100
On 06/05/15 05:29, Ben Rusholme wrote:
> As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs).
> 
>> Now you could bump the suffix length based on the start number,
>> though I don't think we should as that would impact on future
>> processing (ordering) of the resultant files.  I.E. specifying
>> a FROM value to --numeric-suffixes should only impact the
>> start value, rather than the width.
> 
> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?

There are two use cases supported by specifying FROM.
1. Setting the start for a single run (FROM is usually 1 in this case)
2. Setting the offset for multiple independent split runs.
In the second case we can't infer the size of the total set
in any particular run, and thus require that --suffix-length is specified appropriately.
I.E. for multiple independent runs, the suffix length needs to be
fixed width across the entire set for the total ordering to be correct.
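
The ordering concern can be seen with plain `sort`. The following is a minimal sketch; the file names are made up for illustration and stand in for the outputs of two independent runs, one that used 2-digit suffixes and one that used 3-digit suffixes.

```shell
# Names as two independent runs might produce them with differing widths.
mixed=$(printf '%s\n' x98 x99 x100 x101 | LC_ALL=C sort)

# The same logical sequence with a fixed width of 3 (as with -a3).
fixed=$(printf '%s\n' x098 x099 x100 x101 | LC_ALL=C sort)

printf 'mixed widths:\n%s\n' "$mixed"   # x100 x101 x98 x99, one per line
printf 'fixed width:\n%s\n' "$fixed"    # x098 x099 x100 x101, one per line
```

With mixed widths the later files sort ahead of the earlier ones, so anything that relies on sorted names (a `cat x*` glob, for example) would reassemble the pieces out of order.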


Things we could change are...

1. Special case FROM=1 to assume a single run and thus
enable auto suffix expansion or appropriately sized suffix with CHUNK.
This would be a backwards-incompatible change and also would not
guarantee a single run, so I'm reluctant to do that.

2. Give an early error with specified FROM and CHUNK
that would overflow the suffix size for CHUNK.
This would save some processing, though doesn't add
any protections against latent issues. I.E. you still get
the error which is dependent on the parameters rather than the input data size.
Therefore it's probably not worth the complication.

3. Leave suffix length at 2 when both FROM and CHUNK are specified.
In retrospect, this would probably have been the best option
to avoid ambiguities like this. However now we'd be breaking
compat with scripts with FROM=1 and CHUNK=200 etc.,
even though CHUNK values > 100 would be unusual.

4. Auto set the suffix len based on FROM + CHUNK.
That would support use case 1 (single run),
but _silently_ break subsequent processing order
of outputs from multiple split runs
(as FROM is increased in multiples of CHUNK size).
We could mitigate the _silent_ breakage though
by limiting this change to when FROM < CHUNK.

5. Document in man page and with more detail in info docs
that -a is recommended when specifying FROM

So I'll do 4 and 5 I think.
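
The rule in option 4, including the FROM < CHUNK restriction, might look roughly like the shell sketch below. It is not the code from any patch: `auto_suffix_width` is a hypothetical name, and the minimum width of 2 and the digit-counting approach are assumptions made for illustration.

```shell
# Hypothetical sketch of option 4; not the code from the eventual patch.
# $1 = FROM (start value), $2 = CHUNK (count given to --number).
auto_suffix_width() {
  asw_end=$2
  # Only let FROM widen the suffix when FROM < CHUNK, so that runs
  # offset by multiples of CHUNK keep the width implied by CHUNK alone.
  if [ "$1" -lt "$2" ]; then
    asw_end=$(( $1 + $2 ))
  fi
  asw_last=$(( asw_end - 1 ))
  asw_width=${#asw_last}
  if [ "$asw_width" -lt 2 ]; then
    asw_width=2
  fi
  echo "$asw_width"
}

auto_suffix_width 0 100     # prints 2 (x00 .. x99)
auto_suffix_width 1 100     # prints 3 (x001 .. x100)
auto_suffix_width 100 100   # prints 2 (unchanged; -a3 is still needed here)
```

In the last case the width is deliberately left alone, which is why option 5 (recommending -a whenever FROM is given) still matters for multiple runs.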

cheers,
Pádraig.




Information forwarded to bug-coreutils <at> gnu.org:
bug#20511; Package coreutils. (Wed, 06 May 2015 17:38:02 GMT) Full text and rfc822 format available.

Message #17 received at 20511 <at> debbugs.gnu.org (full text, mbox):

From: Ben Rusholme <rusholme <at> caltech.edu>
To: 20511 <at> debbugs.gnu.org
Cc: Pádraig Brady <P <at> draigBrady.com>
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Wed, 6 May 2015 10:37:41 -0700
Hi,

> 4. Auto set the suffix len based on FROM + CHUNK.
> That would support use case 1 (single run),
> but _silently_ break subsequent processing order
> of outputs from multiple split runs
> (as FROM is increased in multiples of CHUNK size).
> We could mitigate the _silent_ breakage though
> by limiting this change to when FROM < CHUNK.
> 
> 5. Document in man page and with more detail in info docs
> that -a is recommended when specifying FROM
> 
> So I'll do 4 and 5 I think.

Thanks, that would solve the problem I was having.

Please feel free to end this conversation here, but if you can spare the time I’d be very interested in an example of a multiple split run, for my own education and curiosity. I assume you mean processing subsets of the input, but I can’t see how to do that (after experimenting on the command line and searching the documentation) except with --number=l/k/n, which does know the size of the total set?

Thanks again, Ben





Information forwarded to bug-coreutils <at> gnu.org:
bug#20511; Package coreutils. (Wed, 06 May 2015 17:49:02 GMT) Full text and rfc822 format available.

Message #20 received at 20511 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Ben Rusholme <rusholme <at> caltech.edu>, 20511 <at> debbugs.gnu.org
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Wed, 06 May 2015 18:48:08 +0100
On 06/05/15 18:37, Ben Rusholme wrote:
> Hi,
> 
>> 4. Auto set the suffix len based on FROM + CHUNK.
>> That would support use case 1 (single run),
>> but _silently_ break subsequent processing order
>> of outputs from multiple split runs
>> (as FROM is increased in multiples of CHUNK size).
>> We could mitigate the _silent_ breakage though
>> by limiting this change to when FROM < CHUNK.
>>
>> 5. Document in man page and with more detail in info docs
>> that -a is recommended when specifying FROM
>>
>> So I'll do 4 and 5 I think.
> 
> Thanks, that would solve the problem I was having.
> 
> Please feel free to end this conversation here, but if you can spare the time I’d be very interested in an example of a multiple split run for my own education/understanding/curiosity? I assume you mean processing subsets of the input, but can’t see how to do that (after experimenting on the command line and searching the documentation) except --number=l/k/n which does know the size of the total set?

Well, you could process subsets, but even more simply,
consider splitting each of a set of input files in 2,
producing a single set of output files.

  i=0
  for f in *.dat; do
    # Each input yields 2 parts, so advance the start value by 2.
    split -a4 --numeric-suffixes=$i -n2 "$f"; i=$((i+2))
  done

(to be truly generic you would set the -a parameter
 based on the number of files and -n).
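
One way to derive the -a value is sketched below. The helper names `suffix_len_for` and `split_all_in_two` are hypothetical, the minimum width of 2 is an assumption, and the loop assumes GNU split.

```shell
# Hypothetical helper: digits needed to number NFILES * NCHUNKS outputs
# starting at 0 (assumed minimum of 2).
# $1 = NFILES, $2 = NCHUNKS
suffix_len_for() {
  slf_last=$(( $1 * $2 - 1 ))
  slf_len=${#slf_last}
  if [ "$slf_len" -lt 2 ]; then
    slf_len=2
  fi
  echo "$slf_len"
}

# Split every file given as an argument into 2 parts, using one
# suffix width shared by all runs so the outputs sort correctly.
split_all_in_two() {
  sa_len=$(suffix_len_for "$#" 2)
  sa_i=0
  for sa_f in "$@"; do
    split -a"$sa_len" --numeric-suffixes="$sa_i" -n2 "$sa_f"
    sa_i=$(( sa_i + 2 ))
  done
}

# Usage: split_all_in_two *.dat
```

For example, 50 input files need suffixes 00 to 99 and so a width of 2, while 60 input files need suffixes up to 119 and so a width of 3.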

cheers,
Pádraig.




Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Wed, 13 May 2015 01:22:02 GMT) Full text and rfc822 format available.

Notification sent to Ben Rusholme <rusholme <at> caltech.edu>:
bug acknowledged by developer. (Wed, 13 May 2015 01:22:02 GMT) Full text and rfc822 format available.

Message #25 received at 20511-done <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Ben Rusholme <rusholme <at> caltech.edu>, 20511-done <at> debbugs.gnu.org
Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM
 in calculation of suffix length?
Date: Wed, 13 May 2015 02:20:27 +0100
On 06/05/15 11:53, Pádraig Brady wrote:
> On 06/05/15 05:29, Ben Rusholme wrote:
>> As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs).
>>
>>> Now you could bump the suffix length based on the start number,
>>> though I don't think we should as that would impact on future
>>> processing (ordering) of the resultant files.  I.E. specifying
>>> a FROM value to --numeric-suffixes should only impact the
>>> start value, rather than the width.
>>
>> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?
> 
> There are two use cases supported by specifying FROM.
> 1. Setting the start for a single run (FROM is usually 1 in this case)
> 2. Setting the offset for multiple independent split runs.
> In the second case we can't infer the size of the total set
> in any particular run, and thus require that --suffix-length is specified appropriately.
> I.E. for multiple independent runs, the suffix length needs to be
> fixed width across the entire set for the total ordering to be correct.
> 
> 
> Things we could change are...
> 
> 1. Special case FROM=1 to assume a single run and thus
> enable auto suffix expansion or appropriately sized suffix with CHUNK.
> This would be a backwards incompat change and also not
> guaranteed a single run, so I'm reluctant to do that.
> 
> 2. Give an early error with specified FROM and CHUNK
> that would overflow the suffix size for CHUNK.
> This would save some processing, though doesn't add
> any protections against latent issues. I.E. you still get
> the error which is dependent on the parameters rather than the input data size.
> Therefore it's probably not worth the complication.
> 
> 3. Leave suffix length at 2 when both FROM and CHUNK are specified.
> In retrospect, this would probably have been the best option
> to avoid ambiguities like this. However now we'd be breaking
> compat with scripts with FROM=1 and CHUNK=200 etc.
> While CHUNK values > 100 would be unusual
> 
> 4. Auto set the suffix len based on FROM + CHUNK.
> That would support use case 1 (single run),
> but _silently_ break subsequent processing order
> of outputs from multiple split runs
> (as FROM is increased in multiples of CHUNK size).
> We could mitigate the _silent_ breakage though
> by limiting this change to when FROM < CHUNK.
> 
> 5. Document in man page and with more detail in info docs
> that -a is recommended when specifying FROM
> 
> So I'll do 4 and 5 I think.

Attached.

cheers,
Pádraig

[split-from-width.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 10 Jun 2015 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 6 days ago.
