GNU bug report logs -
#20511
split : does not account for --numeric-suffixes=FROM in calculation of suffix length?
Report forwarded to bug-coreutils <at> gnu.org: bug#20511; Package coreutils. (Tue, 05 May 2015 20:45:03 GMT)
Acknowledgement sent to Ben Rusholme <rusholme <at> caltech.edu>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Tue, 05 May 2015 20:45:04 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi,
“split” (in the current GNU coreutils 8.23 release) does not account for the optional start index (“split --numeric-suffixes=FROM”) when calculating suffix length.
I couldn’t find any prior reference to this problem in either the bug tracker or mailing list archive.
Thanks, Ben
$ seq 100 >& input.txt
$ split --numeric-suffixes --number=l/100 input.txt
$ ls
input.txt x06 x13 x20 x27 x34 x41 x48 x55 x62 x69 x76 x83 x90 x97
x00 x07 x14 x21 x28 x35 x42 x49 x56 x63 x70 x77 x84 x91 x98
x01 x08 x15 x22 x29 x36 x43 x50 x57 x64 x71 x78 x85 x92 x99
x02 x09 x16 x23 x30 x37 x44 x51 x58 x65 x72 x79 x86 x93
x03 x10 x17 x24 x31 x38 x45 x52 x59 x66 x73 x80 x87 x94
x04 x11 x18 x25 x32 x39 x46 x53 x60 x67 x74 x81 x88 x95
x05 x12 x19 x26 x33 x40 x47 x54 x61 x68 x75 x82 x89 x96
$ rm x*
$ split --numeric-suffixes=1 --number=l/100 input.txt
split: output file suffixes exhausted
$ ls
input.txt x07 x14 x21 x28 x35 x42 x49 x56 x63 x70 x77 x84 x91 x98
x01 x08 x15 x22 x29 x36 x43 x50 x57 x64 x71 x78 x85 x92 x99
x02 x09 x16 x23 x30 x37 x44 x51 x58 x65 x72 x79 x86 x93
x03 x10 x17 x24 x31 x38 x45 x52 x59 x66 x73 x80 x87 x94
x04 x11 x18 x25 x32 x39 x46 x53 x60 x67 x74 x81 x88 x95
x05 x12 x19 x26 x33 x40 x47 x54 x61 x68 x75 x82 x89 x96
x06 x13 x20 x27 x34 x41 x48 x55 x62 x69 x76 x83 x90 x97
$ # Should run from x001 to x100!
$ rm x*
$ split --numeric-suffixes=1 --number=l/101 input.txt
$ ls
input.txt x008 x016 x024 x032 x040 x048 x056 x064 x072 x080 x088 x096
x001 x009 x017 x025 x033 x041 x049 x057 x065 x073 x081 x089 x097
x002 x010 x018 x026 x034 x042 x050 x058 x066 x074 x082 x090 x098
x003 x011 x019 x027 x035 x043 x051 x059 x067 x075 x083 x091 x099
x004 x012 x020 x028 x036 x044 x052 x060 x068 x076 x084 x092 x100
x005 x013 x021 x029 x037 x045 x053 x061 x069 x077 x085 x093 x101
x006 x014 x022 x030 x038 x046 x054 x062 x070 x078 x086 x094
x007 x015 x023 x031 x039 x047 x055 x063 x071 x079 x087 x095
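The three runs above are consistent with the suffix width being derived from the chunk count alone. A minimal sketch of that arithmetic follows. It assumes, from the transcripts rather than from the split source, that 8.23 picks the number of decimal digits in CHUNKS - 1 (with a minimum of 2), while the largest suffix actually needed is FROM + CHUNKS - 1. All function names are hypothetical.

```shell
# Number of decimal digits in a non-negative integer.
digits() {
  n=$1 d=1
  while [ "$n" -ge 10 ]; do n=$((n / 10)); d=$((d + 1)); done
  echo "$d"
}

# chosen_width CHUNKS: width apparently picked from CHUNKS alone
# (assumption inferred from the output above; minimum of 2 also assumed).
chosen_width() {
  w=$(digits $(( $1 - 1 )))
  if [ "$w" -lt 2 ]; then w=2; fi
  echo "$w"
}

# needed_width FROM CHUNKS: width required by the last suffix, FROM + CHUNKS - 1.
needed_width() {
  digits $(( $1 + $2 - 1 ))
}

# exhausted FROM CHUNKS: succeeds if the suffixes would run out.
exhausted() {
  [ "$(needed_width "$1" "$2")" -gt "$(chosen_width "$2")" ]
}
```

Under these assumptions `exhausted 1 100` succeeds (width 3 needed, width 2 chosen), while `exhausted 0 100` and `exhausted 1 101` do not, matching the three transcripts.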
Information forwarded to bug-coreutils <at> gnu.org: bug#20511; Package coreutils. (Tue, 05 May 2015 21:59:02 GMT)
Message #8 received at 20511 <at> debbugs.gnu.org (full text, mbox):
The info docs say about the --numeric-suffixes option:
Note specifying a FROM value also disables the default auto suffix
length expansion described above, and so you may also want to
specify ‘-a’ to allow suffixes beyond ‘99’.
Separately, specifying a fixed number of files with --number
automatically sets the suffix length based on that number, i.e. when
you specified -nl/101 it bumped the suffix length to 3.
Now you could bump the suffix length based on the start number,
though I don't think we should as that would impact on future
processing (ordering) of the resultant files. I.E. specifying
a FROM value to --numeric-suffixes should only impact the
start value, rather than the width.
In other words, if you were to split 2 files into 200 parts like:
  split --numeric-suffixes --number=l/100 input1.txt
  split --numeric-suffixes=100 --number=l/100 input2.txt
then you really need to specify -a3 to set
the suffix length appropriately.
We might be able to give an earlier error in this case,
and we should probably clarify the info docs a bit more.
I'll think about it.
cheers,
Pádraig.
Information forwarded to bug-coreutils <at> gnu.org: bug#20511; Package coreutils. (Wed, 06 May 2015 04:30:06 GMT)
Message #11 received at 20511 <at> debbugs.gnu.org (full text, mbox):
Hi,
> The info docs say about the --numeric-suffixes option:
>
> Note specifying a FROM value also disables the default auto suffix
> length expansion described above, and so you may also want to
> specify ‘-a’ to allow suffixes beyond ‘99’.
This does not seem to be the case: auto suffix works fine beyond 99 (in the current 8.23 release).
$ seq 1000000 >& input.txt
$ split --numeric-suffixes=1234 --number=l/5678 input.txt
$ ls | tail
x6902
x6903
x6904
x6905
x6906
x6907
x6908
x6909
x6910
x6911
It just fails wherever FROM pushes CHUNKS over a power of 10:
$ rm x*
$ split --numeric-suffixes --number=l/10000 input.txt
$ ls | tail -n 3
x9997
x9998
x9999
$
$ rm x*
$ split --numeric-suffixes=1 --number=l/10000 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0001
x0002
$
$ rm x*
$ split --numeric-suffixes=2 --number=l/9999 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0002
x0003
As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK (and “split” already has all the information it needs).
> Now you could bump the suffix length based on the start number,
> though I don't think we should as that would impact on future
> processing (ordering) of the resultant files. I.E. specifying
> a FROM value to --numeric-suffixes should only impact the
> start value, rather than the width.
Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order? I assume the crucial test is the inverse operation:
$ cat x* >& output.txt
$ diff input.txt output.txt
$
Thanks, Ben
Information forwarded to bug-coreutils <at> gnu.org: bug#20511; Package coreutils. (Wed, 06 May 2015 10:54:02 GMT)
Message #14 received at 20511 <at> debbugs.gnu.org (full text, mbox):
On 06/05/15 05:29, Ben Rusholme wrote:
> As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs).
>
>> Now you could bump the suffix length based on the start number,
>> though I don't think we should as that would impact on future
>> processing (ordering) of the resultant files. I.E. specifying
>> a FROM value to --numeric-suffixes should only impact the
>> start value, rather than the width.
>
> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order?
There are two use cases supported by specifying FROM.
1. Setting the start for a single run (FROM is usually 1 in this case)
2. Setting the offset for multiple independent split runs.
In the second case we can't infer the size of the total set
in any particular run, and thus require that --suffix-length is specified appropriately.
I.E. for multiple independent runs, the suffix length needs to be
fixed width across the entire set for the total ordering to be correct.
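The ordering concern can be seen without running split at all. The sketch below assumes two independent runs whose outputs ended up with different suffix widths (the file names are made up for illustration) and compares them with the same suffix values padded to a fixed width of 3.

```shell
# Names from two hypothetical runs: width 2 in the first, width 3 in the second.
mixed=$(printf 'x%02d\n' 98 99; printf 'x%03d\n' 100 101)

# The same four suffix values, all padded to a fixed width of 3.
fixed=$(printf 'x%03d\n' 98 99 100 101)

# Sort lexically, which is how a glob such as x* orders its matches.
echo "mixed widths:"; printf '%s\n' "$mixed" | LC_ALL=C sort
echo "fixed width:";  printf '%s\n' "$fixed" | LC_ALL=C sort
```

With mixed widths the lexical order is x100, x101, x98, x99, so `cat x*` would concatenate the pieces out of sequence. With a fixed width the order is x098, x099, x100, x101.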
Things we could change are...
1. Special case FROM=1 to assume a single run and thus
enable auto suffix expansion or appropriately sized suffix with CHUNK.
This would be a backwards-incompatible change, and FROM=1 does not
guarantee a single run, so I'm reluctant to do that.
2. Give an early error with specified FROM and CHUNK
that would overflow the suffix size for CHUNK.
This would save some processing, though doesn't add
any protections against latent issues. I.E. you still get
the error which is dependent on the parameters rather than the input data size.
Therefore it's probably not worth the complication.
3. Leave suffix length at 2 when both FROM and CHUNK are specified.
In retrospect, this would probably have been the best option
to avoid ambiguities like this. However, we'd now be breaking
compatibility with scripts that use FROM=1 and CHUNK=200 etc.,
although CHUNK values > 100 would be unusual.
4. Auto set the suffix len based on FROM + CHUNK.
That would support use case 1 (single run),
but _silently_ break subsequent processing order
of outputs from multiple split runs
(as FROM is increased in multiples of CHUNK size).
We could mitigate the _silent_ breakage though
by limiting this change to when FROM < CHUNK.
5. Document in man page and with more detail in info docs
that -a is recommended when specifying FROM
So I'll do 4 and 5 I think.
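Option 4, with the FROM < CHUNK mitigation, might look roughly like the following. This is only a sketch of the rule as described in this message, not the code from the eventual patch. The function names and the minimum width of 2 are assumptions.

```shell
# Number of decimal digits in a non-negative integer.
digits() {
  n=$1 d=1
  while [ "$n" -ge 10 ]; do n=$((n / 10)); d=$((d + 1)); done
  echo "$d"
}

# auto_width FROM CHUNKS: suffix width under the proposed rule.
auto_width() {
  from=$1 chunks=$2
  last=$((chunks - 1))
  # Only widen for FROM when it looks like a single run (FROM < CHUNK).
  if [ "$from" -lt "$chunks" ]; then last=$((last + from)); fi
  w=$(digits "$last")
  if [ "$w" -lt 2 ]; then w=2; fi   # assumed default minimum width
  echo "$w"
}
```

Under this rule `auto_width 1 100` gives 3, so the reported case would run from x001 to x100, while `auto_width 100 100` stays at 2 and an explicit -a is still needed for multiple independent runs.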
cheers,
Pádraig.
Information forwarded to bug-coreutils <at> gnu.org: bug#20511; Package coreutils. (Wed, 06 May 2015 17:38:02 GMT)
Message #17 received at 20511 <at> debbugs.gnu.org (full text, mbox):
Hi,
> 4. Auto set the suffix len based on FROM + CHUNK.
> That would support use case 1 (single run),
> but _silently_ break subsequent processing order
> of outputs from multiple split runs
> (as FROM is increased in multiples of CHUNK size).
> We could mitigate the _silent_ breakage though
> by limiting this change to when FROM < CHUNK.
>
> 5. Document in man page and with more detail in info docs
> that -a is recommended when specifying FROM
>
> So I'll do 4 and 5 I think.
Thanks, that would solve the problem I was having.
Please feel free to end this conversation here, but if you can spare the time I’d be very interested in an example of a multiple split run for my own education/understanding/curiosity? I assume you mean processing subsets of the input, but can’t see how to do that (after experimenting on the command line and searching the documentation) except --number=l/k/n which does know the size of the total set?
Thanks again, Ben
Information forwarded to bug-coreutils <at> gnu.org: bug#20511; Package coreutils. (Wed, 06 May 2015 17:49:02 GMT)
Message #20 received at 20511 <at> debbugs.gnu.org (full text, mbox):
Well you could process subsets but even more simply
consider splitting a set of input files in 2,
to a set of output files.
# Split every .dat file in two, numbering the outputs consecutively.
i=0
for f in *.dat; do
  split -a4 --numeric-suffixes=$i -n2 "$f"; i=$((i+2))
done
(to be truly generic you would set the -a parameter
based on the number of files and -n).
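One way to derive -a for the loop above is sketched below. It assumes suffixes start at 0 and every input is split into the same number of pieces. The function and variable names are illustrative only.

```shell
# width_for NFILES PIECES: digits needed for the last suffix, NFILES*PIECES - 1.
width_for() {
  last=$(( $1 * $2 - 1 ))
  w=1
  while [ "$last" -ge 10 ]; do last=$((last / 10)); w=$((w + 1)); done
  echo "$w"
}

pieces=2
set -- *.dat                      # positional parameters = input files
width=$(width_for "$#" "$pieces")

i=0
for f in "$@"; do
  [ -e "$f" ] || continue         # skip the literal pattern if nothing matched
  split -a "$width" --numeric-suffixes="$i" -n "$pieces" "$f"
  i=$((i + pieces))
done
```

For 50 input files split in two, the last suffix is 99 and the width is 2. A 51st file pushes the last suffix to 101 and the width to 3.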
cheers,
Pádraig.
Reply sent to Pádraig Brady <P <at> draigBrady.com>: You have taken responsibility. (Wed, 13 May 2015 01:22:02 GMT)
Notification sent to Ben Rusholme <rusholme <at> caltech.edu>: bug acknowledged by developer. (Wed, 13 May 2015 01:22:02 GMT)
Message #25 received at 20511-done <at> debbugs.gnu.org (full text, mbox):
On 06/05/15 11:53, Pádraig Brady wrote:
> So I'll do 4 and 5 I think.
Attached.
cheers,
Pádraig
[split-from-width.patch (text/x-patch, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 10 Jun 2015 11:24:05 GMT)