GNU bug report logs - #67593
`split --number=l/N` no longer splits evenly


Package: coreutils

Reported by: Victor Engmark <victor <at> engmark.name>

Date: Sun, 3 Dec 2023 00:26:01 UTC

Severity: normal

Tags: notabug

To reply to this bug, email your comments to 67593 AT debbugs.gnu.org.



Report forwarded to bug-coreutils <at> gnu.org:
bug#67593; Package coreutils. (Sun, 03 Dec 2023 00:26:01 GMT)

Acknowledgement sent to Victor Engmark <victor <at> engmark.name>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sun, 03 Dec 2023 00:26:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Victor Engmark <victor <at> engmark.name>
To: bug-coreutils <at> gnu.org
Subject: `split --number=l/N` no longer splits evenly
Date: Sun, 03 Dec 2023 13:25:01 +1300

Hi all

Commit fb6fc7f3ce6b0b70a5df7f605e71c4f8541e256b (part of v9.2)
introduced a regression in how `split --number=l/N` works.

Test script `tests/split/l-chunk2.sh`:

```
#!/bin/sh

. "${srcdir=.}/tests/init.sh"; path_prepend_ ./src
print_ver_ split

printf 'first\n' > exp1 || framework_failure_
printf 'second\n' > exp2 || framework_failure_
cat exp1 exp2 > in || framework_failure_
split -e -n l/2 in || framework_failure_
compare exp1 xaa || fail=1
compare exp2 xab || fail=1

Exit $fail
```

Relevant test output:

```
+ diff -u exp1 xaa
--- exp1	2023-12-03 12:42:50.511334991 +1300
+++ xaa	2023-12-03 12:42:50.513334908 +1300
@@ -1 +1,2 @@
 first
+second
```

and

```
+ diff -u exp2 xab
diff: xab: No such file or directory
```

In other words, it doesn't split the file at all, despite it containing
two lines of content.

The bug is still present in current master (commit
73d119f4f8052a9fb6cef13cd9e75d5a4e23311a).

Bisected on NixOS 23.11 using the following script:

```
#!/bin/sh

set -e

export CFLAGS=-w # Avoid build failure

git submodule update
git clean -fdx --exclude=bisect.sh --exclude=tests/split/l-chunk2.sh

./bootstrap
autoconf
./configure
make
make check TESTS=tests/split/l-chunk2.sh SUBDIRS=.
```

and these commands:

```
git bisect start master v9.1
git bisect run ./bisect.sh
```

Cheers
Victor




Information forwarded to bug-coreutils <at> gnu.org:
bug#67593; Package coreutils. (Sun, 03 Dec 2023 09:38:01 GMT)

Message #8 received at 67593 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Victor Engmark <victor <at> engmark.name>
Cc: 67593 <at> debbugs.gnu.org
Subject: Re: bug#67593: `split --number=l/N` no longer splits evenly
Date: Sun, 3 Dec 2023 01:37:37 -0800

That's not a bug, in that 'split' is behaving as documented. The first
input line is one byte shorter than the second one. 'Split' divides the 
input into two regions, and because the first region happens to be one 
byte longer than the second region both input lines are sent to the 
first output file.

In older coreutils, 'split' used a different algorithm to compute region 
sizes, which worked better for your test case but considerably worse in 
others. For example, in older coreutils:

    seq 50 >in
    split -n l/71 in

created 43 files of size 0, 9 files of size 2, 18 files of size 3, and 
one file of size 69. Current coreutils splits much better: it creates 21 
files of size 0, 9 files of size 2, and 41 files of size 3.
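
The region arithmetic described above can be checked with a small sketch. It assumes a simplified model that is consistent with the numbers in this message but is not taken from the coreutils source: the input is divided into N byte regions, the first `size % N` regions are one byte longer than the rest, and each line goes to the region that contains its first byte. The helper name `region_of` is hypothetical.

```sh
#!/bin/sh
# Simplified model (an assumption, not the coreutils implementation):
# print the 1-based region that holds byte OFFSET (0 <= OFFSET < SIZE)
# when SIZE bytes are divided into N regions and the first SIZE % N
# regions get one extra byte.
region_of() {
  r_offset=$1 r_size=$2 r_n=$3
  r_base=$((r_size / r_n))
  r_extra=$((r_size % r_n))
  # Total bytes covered by the longer (r_base + 1 byte) regions.
  r_long_span=$((r_extra * (r_base + 1)))
  if [ "$r_offset" -lt "$r_long_span" ]; then
    echo $((r_offset / (r_base + 1) + 1))
  else
    echo $((r_extra + (r_offset - r_long_span) / r_base + 1))
  fi
}

# 'first\nsecond\n' is 13 bytes: the lines start at offsets 0 and 6.
echo "line 1 -> region $(region_of 0 13 2)"
echo "line 2 -> region $(region_of 6 13 2)"

# `seq 50` is 141 bytes; count the regions (of 71) in which no line starts.
offset=0 prev=0 used=0 i=1
while [ "$i" -le 50 ]; do
  region=$(region_of "$offset" 141 71)
  if [ "$region" -ne "$prev" ]; then used=$((used + 1)); fi
  prev=$region
  offset=$((offset + ${#i} + 1))   # digits plus the trailing newline
  i=$((i + 1))
done
echo "regions with no line: $((71 - used))"
```

Under this model both lines of the two-line input start in region 1 (bytes 0 to 6), and the `seq 50` case leaves 21 regions without a line, which agrees with the counts quoted above.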




Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sun, 03 Dec 2023 09:39:02 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#67593; Package coreutils. (Sun, 03 Dec 2023 13:18:01 GMT)

Message #13 received at 67593 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, Victor Engmark <victor <at> engmark.name>
Cc: 67593 <at> debbugs.gnu.org
Subject: Re: bug#67593: `split --number=l/N` no longer splits evenly
Date: Sun, 3 Dec 2023 13:17:33 +0000

On 03/12/2023 09:37, Paul Eggert wrote:
> That's not a bug, in that 'split' is behaving as documented. The first
> input line is one byte shorter than the second one. 'Split' divides the
> input into two regions, and because the first region happens to be one
> byte longer than the second region both input lines are sent to the
> first output file.
> 
> In older coreutils, 'split' used a different algorithm to compute region
> sizes, which worked better for your test case but considerably worse in
> others. For example, in older coreutils:
> 
>     seq 50 >in
>     split -n l/71 in
> 
> created 43 files of size 0, 9 files of size 2, 18 files of size 3, and
> one file of size 69. Current coreutils splits much better: it creates 21
> files of size 0, 9 files of size 2, and 41 files of size 3.

Related to this, I think it would be useful to add a new
`split --number=L/N` mode (note the capital L), which tries harder
to distribute lines evenly.
It would only be supported when we can determine the number of lines up front,
and so wouldn't be supported when reading from a pipe, for example.

cheers,
Pádraig.
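
Independently of any such mode, a similar effect can be approximated with options that `split` already has, by counting lines first and then splitting by line count. The sketch below only illustrates that idea and is not the proposed `L/N` implementation; `split_lines_evenly` is a hypothetical helper name. Like the proposal, it needs the line count up front, so it takes a regular file rather than a pipe, and it may produce fewer than N files when the line count does not divide evenly.

```sh
#!/bin/sh
# split_lines_evenly FILE N [PREFIX]
# Hypothetical helper, not part of coreutils: split FILE into at most N
# pieces that each hold the same number of lines (the last may be shorter).
split_lines_evenly() {
  s_file=$1 s_n=$2 s_prefix=${3:-x}
  # Count lines up front; this needs a re-readable file, so a pipe will not work.
  s_lines=$(wc -l < "$s_file") || return 1
  # Ceiling division, with a floor of 1 so that split -l gets a valid count.
  s_per_file=$(( (s_lines + s_n - 1) / s_n ))
  [ "$s_per_file" -gt 0 ] || s_per_file=1
  split -l "$s_per_file" "$s_file" "$s_prefix"
}

# Example: split_lines_evenly in 2
```

For the two-line input from the original report, this puts `first` in `xaa` and `second` in `xab`, because each output file receives exactly one line regardless of line length.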




This bug report was last modified 1 year and 195 days ago.
