From debbugs-submit-bounces@debbugs.gnu.org Tue May 05 16:44:49 2015 Received: (at submit) by debbugs.gnu.org; 5 May 2015 20:44:49 +0000 Received: from localhost ([127.0.0.1]:35623 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ypjhv-0003f2-J0 for submit@debbugs.gnu.org; Tue, 05 May 2015 16:44:48 -0400 Received: from eggs.gnu.org ([208.118.235.92]:41749) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ypjfg-0003b9-Vg for submit@debbugs.gnu.org; Tue, 05 May 2015 16:42:30 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Ypjfa-0005JI-NE for submit@debbugs.gnu.org; Tue, 05 May 2015 16:42:23 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,HTML_MESSAGE autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:44714) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Ypjfa-0005JE-KI for submit@debbugs.gnu.org; Tue, 05 May 2015 16:42:22 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55256) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YpjfZ-00037P-E3 for bug-coreutils@gnu.org; Tue, 05 May 2015 16:42:22 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YpjfU-0005Fu-S4 for bug-coreutils@gnu.org; Tue, 05 May 2015 16:42:21 -0400 Received: from outgoing-mail.its.caltech.edu ([131.215.239.19]:45781) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YpjfU-0005En-KM for bug-coreutils@gnu.org; Tue, 05 May 2015 16:42:16 -0400 Received: from smtp02.caltech.edu (localhost [127.0.0.1]) by filter-return (Postfix) with ESMTP id 82FD26C0700 for ; Tue, 5 May 2015 13:42:13 -0700 (PDT) Received: from planck32.ipac.caltech.edu (unknown [134.4.75.112]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: rusholme) by 
smtp-server.its.caltech.edu (Postfix) with ESMTPSA id 0C3F26C0E25 for ; Tue, 5 May 2015 13:42:12 -0700 (PDT) From: Ben Rusholme Content-Type: multipart/alternative; boundary="Apple-Mail=_168272A4-481B-4933-BD0D-6A04807D87EE" Subject: split : does not account for --numeric-suffixes=FROM in calculation of suffix length? Message-Id: Date: Tue, 5 May 2015 13:42:12 -0700 To: bug-coreutils@gnu.org Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) X-Mailer: Apple Mail (2.1878.6) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [generic] X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Tue, 05 May 2015 16:44:46 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) --Apple-Mail=_168272A4-481B-4933-BD0D-6A04807D87EE Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=windows-1252

Hi,

“split” (in the current GNU coreutils 8.23 release) does not account for the optional start index (“split --numeric-suffixes=FROM”) when calculating suffix length.

I couldn’t find any prior reference to this problem in either the bug tracker or mailing list archive.

Thanks, Ben

$ seq 100 >& input.txt
$ split --numeric-suffixes --number=l/100 input.txt
$ ls
input.txt  x06  x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
x00        x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96

$ rm x*
$ split --numeric-suffixes=1 --number=l/100 input.txt
split: output file suffixes exhausted
$ ls
input.txt  x07  x14  x21  x28  x35  x42  x49  x56  x63  x70  x77  x84  x91  x98
x01        x08  x15  x22  x29  x36  x43  x50  x57  x64  x71  x78  x85  x92  x99
x02        x09  x16  x23  x30  x37  x44  x51  x58  x65  x72  x79  x86  x93
x03        x10  x17  x24  x31  x38  x45  x52  x59  x66  x73  x80  x87  x94
x04        x11  x18  x25  x32  x39  x46  x53  x60  x67  x74  x81  x88  x95
x05        x12  x19  x26  x33  x40  x47  x54  x61  x68  x75  x82  x89  x96
x06        x13  x20  x27  x34  x41  x48  x55  x62  x69  x76  x83  x90  x97
$ # Should run from x001 to x100!

$ rm x*
$ split --numeric-suffixes=1 --number=l/101 input.txt
$ ls
input.txt  x008  x016  x024  x032  x040  x048  x056  x064  x072  x080  x088  x096
x001       x009  x017  x025  x033  x041  x049  x057  x065  x073  x081  x089  x097
x002       x010  x018  x026  x034  x042  x050  x058  x066  x074  x082  x090  x098
x003       x011  x019  x027  x035  x043  x051  x059  x067  x075  x083  x091  x099
x004       x012  x020  x028  x036  x044  x052  x060  x068  x076  x084  x092  x100
x005       x013  x021  x029  x037  x045  x053  x061  x069  x077  x085  x093  x101
x006       x014  x022  x030  x038  x046  x054  x062  x070  x078  x086  x094
x007       x015  x023  x031  x039  x047  x055  x063  x071  x079  x087  x095

--Apple-Mail=_168272A4-481B-4933-BD0D-6A04807D87EE Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=windows-1252 Hi,

= --Apple-Mail=_168272A4-481B-4933-BD0D-6A04807D87EE-- From debbugs-submit-bounces@debbugs.gnu.org Tue May 05 17:58:41 2015 Received: (at 20511) by debbugs.gnu.org; 5 May 2015 21:58:41 +0000 Received: from localhost ([127.0.0.1]:35677 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YpkrQ-0005QW-Tc for submit@debbugs.gnu.org; Tue, 05 May 2015 17:58:41 -0400 Received: from mail2.vodafone.ie ([213.233.128.44]:65198) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YpkrO-0005QF-B2 for 20511@debbugs.gnu.org; Tue, 05 May 2015 17:58:39 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: ArsbAHo8SVVtTISR/2dsb2JhbABcDoJ+zDqCXgKBQEwBAQEBAQGBC0EBAgKDWwEBBDIBOxsLDQsJDAoECwkDAgECAUUGAQwIAQGILAG2a44CAQEIAiCLOYUMCoQjAQSkV45LI2CBBVN/Pj2CdgEBAQ Received: from unknown (HELO localhost.localdomain) ([109.76.132.145]) by mail2.vodafone.ie with ESMTP; 05 May 2015 22:58:31 +0100 Message-ID: <55493D07.3030108@draigBrady.com> Date: Tue, 05 May 2015 22:58:31 +0100 From: =?windows-1252?Q?P=E1draig_Brady?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Ben Rusholme , 20511@debbugs.gnu.org Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length? References: In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 20511 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) On 05/05/15 21:42, Ben Rusholme wrote: > Hi, > > “split” (in the current GNU coreutils 8.23 release) does not account for the optional start index (“split --numeric-suffixes=FROM”) when calculating suffix length. 
> > I couldn’t find any prior reference to this problem in either the bug tracker or mailing list archive. > > Thanks, Ben > > > > $ seq 100 >& input.txt > $ split --numeric-suffixes --number=l/100 input.txt > $ ls > input.txt x06 x13 x20 x27 x34 x41 x48 x55 x62 x69 x76 x83 x90 x97 > x00 x07 x14 x21 x28 x35 x42 x49 x56 x63 x70 x77 x84 x91 x98 > x01 x08 x15 x22 x29 x36 x43 x50 x57 x64 x71 x78 x85 x92 x99 > x02 x09 x16 x23 x30 x37 x44 x51 x58 x65 x72 x79 x86 x93 > x03 x10 x17 x24 x31 x38 x45 x52 x59 x66 x73 x80 x87 x94 > x04 x11 x18 x25 x32 x39 x46 x53 x60 x67 x74 x81 x88 x95 > x05 x12 x19 x26 x33 x40 x47 x54 x61 x68 x75 x82 x89 x96 > > > $ rm x* > $ split --numeric-suffixes=1 --number=l/100 input.txt > split: output file suffixes exhausted > $ ls > input.txt x07 x14 x21 x28 x35 x42 x49 x56 x63 x70 x77 x84 x91 x98 > x01 x08 x15 x22 x29 x36 x43 x50 x57 x64 x71 x78 x85 x92 x99 > x02 x09 x16 x23 x30 x37 x44 x51 x58 x65 x72 x79 x86 x93 > x03 x10 x17 x24 x31 x38 x45 x52 x59 x66 x73 x80 x87 x94 > x04 x11 x18 x25 x32 x39 x46 x53 x60 x67 x74 x81 x88 x95 > x05 x12 x19 x26 x33 x40 x47 x54 x61 x68 x75 x82 x89 x96 > x06 x13 x20 x27 x34 x41 x48 x55 x62 x69 x76 x83 x90 x97 > $ # Should run from x001 to x100! 
> > > $ rm x* > $ split --numeric-suffixes=1 --number=l/101 input.txt > $ ls > input.txt x008 x016 x024 x032 x040 x048 x056 x064 x072 x080 x088 x096 > x001 x009 x017 x025 x033 x041 x049 x057 x065 x073 x081 x089 x097 > x002 x010 x018 x026 x034 x042 x050 x058 x066 x074 x082 x090 x098 > x003 x011 x019 x027 x035 x043 x051 x059 x067 x075 x083 x091 x099 > x004 x012 x020 x028 x036 x044 x052 x060 x068 x076 x084 x092 x100 > x005 x013 x021 x029 x037 x045 x053 x061 x069 x077 x085 x093 x101 > x006 x014 x022 x030 x038 x046 x054 x062 x070 x078 x086 x094 > x007 x015 x023 x031 x039 x047 x055 x063 x071 x079 x087 x095 The info docs say about the --numeric-suffixes option: Note specifying a FROM value also disables the default auto suffix length expansion described above, and so you may also want to specify ‘-a’ to allow suffixes beyond ‘99’. Now also specifying the fixed number of files with --number auto sets the suffix length based on the number. I.E. when you specified -nl/101 it bumped the suffix length to 3 Now you could bump the suffix length based on the start number, though I don't think we should as that would impact on future processing (ordering) of the resultant files. I.E. specifying a FROM value to --numeric-suffixes should only impact the start value, rather than the width. In other words if you were to split 2 files into 200 parts like: split --number=l/100 input1.txt split --numeric-suffixes=100 --number=l/100 input2.txt Then you really need to be specifying -a3 to set the suffix length appropriately. We might be able to give an earlier error in this case, and we should probably clarify the info docs a bit more. I'll think about it. cheers, Pádraig. 
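The ordering concern in the reply above can be illustrated without running split at all. The following is a minimal sketch, not coreutils code: the helper `names` is hypothetical and only prints the file names that a run starting at a given FROM would produce with a given suffix width, so that the sort order of two combined runs can be inspected. It assumes a plain byte-order sort (`LC_ALL=C`), which is how a shell glob such as `x*` would typically order the files.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): print the names a run would
# create for suffixes FROM .. FROM+COUNT-1, zero-padded to WIDTH digits.
names() {
  from=$1 count=$2 width=$3
  i=$from
  end=$((from + count))
  while [ "$i" -lt "$end" ]; do
    printf "x%0${width}d\n" "$i"
    i=$((i + 1))
  done
}

# Two independent runs with different widths (2 digits for 0..99,
# 3 digits for 100..199): the sorted listing interleaves the two sets.
mixed_order() {
  { names 0 100 2; names 100 100 3; } | LC_ALL=C sort
}

# The same two runs with one fixed width (as if -a3 were given to both):
# the sorted listing matches the numeric order.
fixed_order() {
  { names 0 100 3; names 100 100 3; } | LC_ALL=C sort
}

mixed_order | sed -n '11,13p'    # prints x10, x100, x101
fixed_order | sed -n '100,101p'  # prints x099, x100
```

With mixed widths, `x100` sorts directly after `x10`, so concatenating the files with `cat x*` would no longer reproduce the inputs in order. With a fixed width across both runs the order is preserved, which is the reason given above for requiring `-a3`.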
From debbugs-submit-bounces@debbugs.gnu.org Wed May 06 00:29:29 2015 Received: (at 20511) by debbugs.gnu.org; 6 May 2015 04:29:29 +0000 Received: from localhost ([127.0.0.1]:35811 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ypqxd-0000yS-4i for submit@debbugs.gnu.org; Wed, 06 May 2015 00:29:29 -0400 Received: from outgoing-mail.its.caltech.edu ([131.215.239.19]:21700) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ypqxa-0000yA-OT for 20511@debbugs.gnu.org; Wed, 06 May 2015 00:29:27 -0400 Received: from smtp02.caltech.edu (localhost [127.0.0.1]) by filter-return (Postfix) with ESMTP id 94AE66C0B54; Tue, 5 May 2015 21:29:20 -0700 (PDT) X-Spam-Scanned: at Caltech-IMSS on smtp02.caltech.edu by amavisd-new Received: from [10.0.1.18] (cpe-172-250-54-101.socal.res.rr.com [172.250.54.101]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: rusholme) by smtp-server.its.caltech.edu (Postfix) with ESMTPSA id 1795D6C0B4B; Tue, 5 May 2015 21:29:20 -0700 (PDT) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length? 
From: Ben Rusholme In-Reply-To: <55493D07.3030108@draigBrady.com> Date: Tue, 5 May 2015 21:29:19 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: <30C3273B-F301-4756-86AA-268BDFAA1111@caltech.edu> References: <55493D07.3030108@draigBrady.com> To: 20511@debbugs.gnu.org X-Mailer: Apple Mail (2.1878.6) X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 20511 Cc: =?windows-1252?Q?P=E1draig_Brady?= X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

Hi,

> The info docs say about the --numeric-suffixes option:
>
> Note specifying a FROM value also disables the default auto suffix
> length expansion described above, and so you may also want to
> specify ‘-a’ to allow suffixes beyond ‘99’.

This does not seem to be the case; auto suffix works fine beyond 99 (in the current 8.23 release)?

$ seq 1000000 >& input.txt
$ split --numeric-suffixes=1234 --number=l/5678 input.txt
$ ls | tail
x6902
x6903
x6904
x6905
x6906
x6907
x6908
x6909
x6910
x6911

It just fails wherever FROM pushes the last suffix past a power of 10:

$ rm x*
$ split --numeric-suffixes --number=l/10000 input.txt
$ ls | tail -n 3
x9997
x9998
x9999
$
$ rm x*
$ split --numeric-suffixes=1 --number=l/10000 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0001
x0002
$
$ rm x*
$ split --numeric-suffixes=2 --number=l/9999 input.txt
split: output file suffixes exhausted
$ ls | tail -n 3
x9997
x9998
x9999
$ ls | head -n 3
input.txt
x0002
x0003

As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK (and “split” already has all the information it needs).

> Now you could bump the suffix length based on the start number,
> though I don't think we should as that would impact on future
> processing (ordering) of the resultant files. I.E. specifying
> a FROM value to --numeric-suffixes should only impact the
> start value, rather than the width.

Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order? I assume the crucial test is the inverse operation:

$ cat x* >& output.txt
$ diff input.txt output.txt
$

Thanks, Ben

From debbugs-submit-bounces@debbugs.gnu.org Wed May 06 06:53:34 2015 Received: (at 20511) by debbugs.gnu.org; 6 May 2015 10:53:34 +0000 Received: from localhost ([127.0.0.1]:36009 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YpwxK-0003zy-7I for submit@debbugs.gnu.org; Wed, 06 May 2015 06:53:34 -0400 Received: from mail3.vodafone.ie ([213.233.128.45]:56160) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YpwxH-0003zk-EZ for 20511@debbugs.gnu.org; Wed, 06 May 2015 06:53:32 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Av0bAAXySVVtTkk8/2dsb2JhbABWBg6CfoN+yBiCXgKBK0wBAQEBAQGBC0EBAgKDWwEBBDIBVgsNCwkMCgQLCQMCAQIBRQYBDAgBAYgsAbY3jhwBK4s5hEJKCoQjAQSYf4tljlEjYIEFU38/PYJ2AQEB Received: from unknown (HELO localhost.localdomain) ([109.78.73.60]) by mail3.vodafone.ie with ESMTP; 06 May 2015 11:53:23 +0100 Message-ID: <5549F2A3.8090800@draigBrady.com> Date: Wed, 06 May 2015 11:53:23 +0100 From: =?windows-1252?Q?P=E1draig_Brady?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Ben Rusholme , 20511@debbugs.gnu.org Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?
References: <55493D07.3030108@draigBrady.com> <30C3273B-F301-4756-86AA-268BDFAA1111@caltech.edu> In-Reply-To: <30C3273B-F301-4756-86AA-268BDFAA1111@caltech.edu> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 20511 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) On 06/05/15 05:29, Ben Rusholme wrote: > As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs). > >> Now you could bump the suffix length based on the start number, >> though I don't think we should as that would impact on future >> processing (ordering) of the resultant files. I.E. specifying >> a FROM value to --numeric-suffixes should only impact the >> start value, rather than the width. > > Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order? There are two use cases supported by specifying FROM. 1. Setting the start for a single run (FROM is usually 1 in this case) 2. Setting the offset for multiple independent split runs. In the second case we can't infer the size of the total set in any particular run, and thus require that --suffix-length is specified appropriately. I.E. for multiple independent runs, the suffix length needs to be fixed width across the entire set for the total ordering to be correct. Things we could change are... 1. Special case FROM=1 to assume a single run and thus enable auto suffix expansion or appropriately sized suffix with CHUNK. This would be a backwards incompat change and also not guaranteed a single run, so I'm reluctant to do that. 2. 
Give an early error with specified FROM and CHUNK that would overflow the suffix size for CHUNK. This would save some processing, though doesn't add any protections against latent issues. I.E. you still get the error which is dependent on the parameters rather than the input data size. Therefore it's probably not worth the complication. 3. Leave suffix length at 2 when both FROM and CHUNK are specified. In retrospect, this would probably have been the best option to avoid ambiguities like this. However now we'd be breaking compat with scripts with FROM=1 and CHUNK=200 etc. While CHUNK values > 100 would be unusual 4. Auto set the suffix len based on FROM + CHUNK. That would support use case 1 (single run), but _silently_ break subsequent processing order of outputs from multiple split runs (as FROM is increased in multiples of CHUNK size). We could mitigate the _silent_ breakage though by limiting this change to when FROM < CHUNK. 5. Document in man page and with more detail in info docs that -a is recommended when specifying FROM So I'll do 4 and 5 I think. cheers, Pádraig. 
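Option 4 above, together with its FROM < CHUNK mitigation, can be sketched as a small width calculation. This is only an illustration of the rule as described in the message, not the patch that was later attached: the function names `digits` and `auto_width` are hypothetical, and the sketch assumes the suffix width is the number of decimal digits of the largest suffix, with a minimum of 2.

```shell
#!/bin/sh
# Hypothetical helper: number of decimal digits in a non-negative integer.
digits() {
  n=$1
  d=1
  while [ "$n" -ge 10 ]; do
    n=$((n / 10))
    d=$((d + 1))
  done
  echo "$d"
}

# Sketch of option 4: choose the suffix width from FROM and CHUNK.
# When FROM < CHUNK, assume a single run and size for FROM+CHUNK-1.
# Otherwise assume one of several runs and keep the CHUNK-based width,
# so the width never changes silently between runs.
auto_width() {
  from=$1
  chunks=$2
  if [ "$from" -lt "$chunks" ]; then
    last=$((from + chunks - 1))
  else
    last=$((chunks - 1))
  fi
  w=$(digits "$last")
  if [ "$w" -lt 2 ]; then
    w=2
  fi
  echo "$w"
}

auto_width 1 100     # prints 3: suffixes 001..100
auto_width 2 9999    # prints 5: suffixes 00002..10000
auto_width 0 100     # prints 2: suffixes 00..99
auto_width 100 100   # prints 2: multi-run case, width left alone
```

Under this rule the cases reported earlier in the thread (FROM=1 with 100 chunks, FROM=2 with 9999 chunks) get a wide enough suffix, while a run such as FROM=100 with 100 chunks keeps the 2-digit width and would still need an explicit `-a`.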
From debbugs-submit-bounces@debbugs.gnu.org Wed May 06 13:37:52 2015 Received: (at 20511) by debbugs.gnu.org; 6 May 2015 17:37:52 +0000 Received: from localhost ([127.0.0.1]:36495 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yq3GZ-0008Ve-EX for submit@debbugs.gnu.org; Wed, 06 May 2015 13:37:51 -0400 Received: from outgoing-mail.its.caltech.edu ([131.215.239.19]:41055) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yq3GW-0008VK-Jy for 20511@debbugs.gnu.org; Wed, 06 May 2015 13:37:49 -0400 Received: from smtp01.caltech.edu (localhost [127.0.0.1]) by filter-return (Postfix) with ESMTP id 58AAAA01EA; Wed, 6 May 2015 10:37:42 -0700 (PDT) X-Spam-Scanned: at Caltech-IMSS on smtp01.caltech.edu by amavisd-new Received: from planck32.ipac.caltech.edu (unknown [134.4.75.112]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: rusholme) by smtp-server.its.caltech.edu (Postfix) with ESMTPSA id EDA44A0EB4; Wed, 6 May 2015 10:37:41 -0700 (PDT) Content-Type: text/plain; charset=windows-1252 Mime-Version: 1.0 (Mac OS X Mail 7.3 \(1878.6\)) Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length? From: Ben Rusholme In-Reply-To: <5549F2A3.8090800@draigBrady.com> Date: Wed, 6 May 2015 10:37:41 -0700 Content-Transfer-Encoding: quoted-printable Message-Id: References: <55493D07.3030108@draigBrady.com> <30C3273B-F301-4756-86AA-268BDFAA1111@caltech.edu> <5549F2A3.8090800@draigBrady.com> To: 20511@debbugs.gnu.org X-Mailer: Apple Mail (2.1878.6) X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 20511 Cc: =?windows-1252?Q?P=E1draig_Brady?= X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) Hi, > 4. 
Auto set the suffix len based on FROM + CHUNK.
>    That would support use case 1 (single run),
>    but _silently_ break subsequent processing order
>    of outputs from multiple split runs
>    (as FROM is increased in multiples of CHUNK size).
>    We could mitigate the _silent_ breakage though
>    by limiting this change to when FROM < CHUNK.
>
> 5. Document in man page and with more detail in info docs
>    that -a is recommended when specifying FROM
>
> So I'll do 4 and 5 I think.

Thanks, that would solve the problem I was having.

Please feel free to end this conversation here, but if you can spare the time I’d be very interested in an example of a multiple split run for my own education/understanding/curiosity. I assume you mean processing subsets of the input, but can’t see how to do that (after experimenting on the command line and searching the documentation) except --number=l/k/n, which does know the size of the total set?

Thanks again, Ben

From debbugs-submit-bounces@debbugs.gnu.org Wed May 06 13:48:33 2015 Received: (at 20511) by debbugs.gnu.org; 6 May 2015 17:48:33 +0000 Received: from localhost ([127.0.0.1]:36499 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yq3Qv-0000KP-1e for submit@debbugs.gnu.org; Wed, 06 May 2015 13:48:33 -0400 Received: from mail5.vodafone.ie ([213.233.128.176]:59330) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yq3Qs-0000KB-Ev for 20511@debbugs.gnu.org; Wed, 06 May 2015 13:48:31 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AusdAKRTSlVtTkk8/2dsb2JhbABcDoJ+zB+CXgKBJkwBAQEBAQGBC0EBAgKDWwEBBDIBVgsNCwkMCgQLCQMCAQIBRQYBDAgBARKIGgG3Y412AQEIAiCLOYUMCoQjAQSZAoUig1aCbY5SI2GBBVN/Pz2CdgEBAQ Received: from unknown (HELO localhost.localdomain) ([109.78.73.60]) by mail3.vodafone.ie with ESMTP; 06 May 2015 18:48:09 +0100 Message-ID: <554A53D8.4030107@draigBrady.com> Date: Wed, 06 May 2015 18:48:08 +0100 From: =?windows-1252?Q?P=E1draig_Brady?= User-Agent:
Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Ben Rusholme , 20511@debbugs.gnu.org Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length? References: <55493D07.3030108@draigBrady.com> <30C3273B-F301-4756-86AA-268BDFAA1111@caltech.edu> <5549F2A3.8090800@draigBrady.com> In-Reply-To: Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 20511 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) On 06/05/15 18:37, Ben Rusholme wrote: > Hi, > >> 4. Auto set the suffix len based on FROM + CHUNK. >> That would support use case 1 (single run), >> but _silently_ break subsequent processing order >> of outputs from multiple split runs >> (as FROM is increased in multiples of CHUNK size). >> We could mitigate the _silent_ breakage though >> by limiting this change to when FROM < CHUNK. >> >> 5. Document in man page and with more detail in info docs >> that -a is recommended when specifying FROM >> >> So I'll do 4 and 5 I think. > > Thanks, that would solve the problem I was having. > > Please feel free to end this conversation here, but if you can spare the time I’d be very interested in an example of a multiple split run for my own education/understanding/curiosity? I assume you mean processing subsets of the input, but can’t see how to do that (after experimenting on the command line and searching the documentation) except —number=l/k/n which does know the size of the total set? Well you could process subsets but even more simply consider splitting a set of input files in 2, to a set of output files. 
i=0
for f in *.dat; do
  split -a4 --numeric=$i "$f" -n2; i=$((i+2))
done

(to be truly generic you would set the -a parameter based on the number of files and -n).

cheers, Pádraig.

From debbugs-submit-bounces@debbugs.gnu.org Tue May 12 21:21:09 2015 Received: (at 20511-done) by debbugs.gnu.org; 13 May 2015 01:21:09 +0000 Received: from localhost ([127.0.0.1]:42816 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YsLMB-0007Up-O0 for submit@debbugs.gnu.org; Tue, 12 May 2015 21:21:09 -0400 Received: from mail5.vodafone.ie ([213.233.128.176]:25980) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YsLM7-0007Te-HO for 20511-done@debbugs.gnu.org; Tue, 12 May 2015 21:21:05 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AogQAIymUlVtT2sf/2dsb2JhbABWBg6DAVRZBccGhgYBAgKBOUwBAQEBAQGBC0EBBINaAQEBBIEJCw0EAwECAQkMCAIECwkDAgECAT0IBgEMBgIBAYgsAQO7Vo0vAQEBAQEBBAEBAQEBARyLOYRCMgwMCoQjBYs4hGKCDoIwgT5dgkeFYYYUiAuGbCNhgQVUfz89MYJGAQEB Received: from unknown (HELO localhost.localdomain) ([109.79.107.31]) by mail3.vodafone.ie with ESMTP; 13 May 2015 02:20:28 +0100 Message-ID: <5552A6DB.2000602@draigBrady.com> Date: Wed, 13 May 2015 02:20:27 +0100 From: =?windows-1252?Q?P=E1draig_Brady?= User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0 MIME-Version: 1.0 To: Ben Rusholme , 20511-done@debbugs.gnu.org Subject: Re: bug#20511: split : does not account for --numeric-suffixes=FROM in calculation of suffix length?
References: <55493D07.3030108@draigBrady.com> <30C3273B-F301-4756-86AA-268BDFAA1111@caltech.edu> <5549F2A3.8090800@draigBrady.com> In-Reply-To: <5549F2A3.8090800@draigBrady.com> Content-Type: multipart/mixed; boundary="------------070109050306060206060104" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 20511-done X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) This is a multi-part message in MIME format. --------------070109050306060206060104 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit On 06/05/15 11:53, Pádraig Brady wrote: > On 06/05/15 05:29, Ben Rusholme wrote: >> As you say, this can always be fixed by the "--suffix-length" argument, but it’s only required for certain combinations of FROM and CHUNK, (and “split” already has all the information it needs). >> >>> Now you could bump the suffix length based on the start number, >>> though I don't think we should as that would impact on future >>> processing (ordering) of the resultant files. I.E. specifying >>> a FROM value to --numeric-suffixes should only impact the >>> start value, rather than the width. >> >> Could you clarify this for me? Doesn’t the zero-padding ensure correct processing order? > > There are two use cases supported by specifying FROM. > 1. Setting the start for a single run (FROM is usually 1 in this case) > 2. Setting the offset for multiple independent split runs. > In the second case we can't infer the size of the total set > in any particular run, and thus require that --suffix-length is specified appropriately. > I.E. for multiple independent runs, the suffix length needs to be > fixed width across the entire set for the total ordering to be correct. > > > Things we could change are... > > 1. 
Special case FROM=1 to assume a single run and thus > enable auto suffix expansion or appropriately sized suffix with CHUNK. > This would be a backwards incompat change and also not > guaranteed a single run, so I'm reluctant to do that. > > 2. Give an early error with specified FROM and CHUNK > that would overflow the suffix size for CHUNK. > This would save some processing, though doesn't add > any protections against latent issues. I.E. you still get > the error which is dependent on the parameters rather than the input data size. > Therefore it's probably not worth the complication. > > 3. Leave suffix length at 2 when both FROM and CHUNK are specified. > In retrospect, this would probably have been the best option > to avoid ambiguities like this. However now we'd be breaking > compat with scripts with FROM=1 and CHUNK=200 etc. > While CHUNK values > 100 would be unusual > > 4. Auto set the suffix len based on FROM + CHUNK. > That would support use case 1 (single run), > but _silently_ break subsequent processing order > of outputs from multiple split runs > (as FROM is increased in multiples of CHUNK size). > We could mitigate the _silent_ breakage though > by limiting this change to when FROM < CHUNK. > > 5. Document in man page and with more detail in info docs > that -a is recommended when specifying FROM > > So I'll do 4 and 5 I think. Attached. 
cheers, Pádraig --------------070109050306060206060104 Content-Type: text/x-patch; name="split-from-width.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="split-from-width.patch" >From 4d5e6c4f4a2ba8407420e56282c0d4e37b2691ee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= Date: Wed, 6 May 2015 01:48:40 +0100 Subject: [PATCH] split: auto set suffix len for --numeric-suffixes== number of files +# That's the multi run use case which is invalid to adjust suffix len +# as that would result in an incorrect order for the total output file set +returns_ 1 split --numeric-suffixes=100 --number=r/100 file.in || fail=1 + Exit $fail -- 2.3.4 --------------070109050306060206060104-- From unknown Sat Jun 14 14:29:13 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Wed, 10 Jun 2015 11:24:05 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator