GNU bug report logs - #51433
cp 9.0 sometimes fails with SEEK_DATA/SEEK_HOLE


Package: coreutils

Reported by: Janne Heß <janne+coreutils <at> hess.ooo>

Date: Wed, 27 Oct 2021 11:56:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Janne Heß <janne+coreutils <at> hess.ooo>
Cc: 51433 <at> debbugs.gnu.org
Subject: bug#51433: cp 9.0 sometimes fails with SEEK_DATA/SEEK_HOLE
Date: Thu, 28 Oct 2021 00:56:11 -0700
On 10/27/21 03:00, Janne Heß wrote:
> Building another package (peertube) on x86_64-linux on ext4 also fails with strange errors in the
> test suite, something about "Error: The service is no longer running". This does not happen when the mentioned
> coreutils commit is undone by replacing #ifdef with #if 0 [3].

So the problem is not limited to ZFS? Which means that even if we 
implemented Pádraig's suggestion and disabled SEEK_HOLE on zfs, we'd 
still run into problems? That's really puzzling. Particularly since it's 
not clear what program is generating the diagnostic "The service is no 
longer running", or how it's related to GNU cp.

Anyway, the ZFS issue sounds like a serious bug in lseek+SEEK_DATA that 
really needs to be fixed. This is not just a coreutils issue, as other 
programs use SEEK_DATA.

I assume the ZFS bug (if the bug is related to ZFS, anyway) is a race 
condition of some sort; at least, that's what the trace in 
<https://github.com/openzfs/zfs/issues/11900> suggests.

In particular, I was struck that the depthcharge.config file that 'cp' 
was reading from was created by some other process, this way:

[pid 3014182] openat(AT_FDCWD, "/build/guybrush/tmp/portage/sys-boot/depthcharge-0.0.1-r3237/image/firmware/guybrush/depthcharge/depthcharge.config", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 4
[pid 3014182] fstat(4, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 3014182] ioctl(4, TCGETS, 0x7ffd919d61c0) = -1 ENOTTY (Inappropriate ioctl for device)
[pid 3014182] lseek(3, 0, SEEK_CUR)     = 0
[pid 3014182] lseek(3, 0, SEEK_DATA)    = 0
[pid 3014182] lseek(3, 0, SEEK_HOLE)    = 9608
[pid 3014182] copy_file_range(3, [0], 4, [0], 9608, 0) = 9608
[pid 3014182] lseek(3, 0, SEEK_CUR)     = 9608
[pid 3014182] lseek(3, 9608, SEEK_DATA) = -1 ENXIO (No such device or address)
[pid 3014182] lseek(3, 0, SEEK_END)     = 9608
[pid 3014182] ftruncate(4, 9608)        = 0
[pid 3014182] close(4)                  = 0
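
The lseek calls in the trace follow the usual pattern for walking a file's data regions: ask for the next data offset, then for the next hole, and repeat until SEEK_DATA fails with ENXIO. As a rough illustration only (this is not cp's code, and the function name `data_extents` is made up for the example), the pattern can be sketched in Python on Linux like this:

```python
import errno
import os


def data_extents(fd):
    """Return [(start, end), ...] for the regions the filesystem reports as data.

    Hypothetical helper for illustration; it moves the file offset of fd.
    """
    extents = []
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            # Next offset >= `offset` that the filesystem says holds data.
            start = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:
                break  # no data at or after `offset`: the rest is a hole
            raise
        # Next hole after that data; end of file counts as an implicit hole.
        end = os.lseek(fd, start, os.SEEK_HOLE)
        if end <= start:
            break  # defensive: avoid looping if the file changed under us
        extents.append((start, end))
        offset = end
    return extents
```

A copier built on this pattern copies only the reported extents, so if the filesystem wrongly reports a region as a hole, the corresponding bytes in the destination end up as zeros.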

So, one hypothesis is that ZFS's implementation of copy_file_range does 
not set up data structures appropriately for cp's later use of 
lseek+SEEK_DATA when reading depthcharge.config. That is, from cp's 
point of view, the ftruncate(4, 9608) has been executed but the 
copy_file_range(3, [0], 4, [0], 9608, 0) has not been executed yet (it's 
cached somewhere, no doubt).
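
One way to look for the symptom this hypothesis predicts is to compare what SEEK_DATA reports with what a plain read returns. The sketch below is only an illustration under assumptions: the function name is invented, it runs on Linux, and it assumes that an ordinary read would already see the written bytes at the moment SEEK_DATA reports none, which the thread does not establish.

```python
import errno
import os


def reported_hole_but_readable_data(path, chunk=65536):
    """Return True if SEEK_DATA reports no data at all, yet reading the
    file yields at least one non-zero byte. Hypothetical diagnostic helper."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return False
        try:
            os.lseek(fd, 0, os.SEEK_DATA)
            return False  # the filesystem reports some data; nothing odd
        except OSError as e:
            if e.errno != errno.ENXIO:
                raise
        # The filesystem claims the whole file is a hole. Read it anyway
        # and look for any non-zero byte.
        offset = 0
        while offset < size:
            buf = os.pread(fd, min(chunk, size - offset), offset)
            if not buf:
                break
            if any(buf):
                return True
            offset += len(buf)
        return False
    finally:
        os.close(fd)
```

A True result would mean the hole information and the file contents disagree. A False result proves little, since the suspected race may simply not have been hit.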

If my guess is right, then an fdatasync or fsync on cp's input might 
work around common instances of this ZFS bug. Could you try the 
attached coreutils patch, and see whether it works around the bug? Or 
perhaps replace 'fdatasync' with 'fsync' in the attached patch? Thanks.
[0001-cp-attempt-to-work-around-ZFS-bug.patch (text/x-patch, attachment)]
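
The attached patch is not reproduced on this page. Purely to illustrate the idea described in the message (flush the input before asking where its data is), here is a minimal Python sketch for Linux; the function name and the `full_sync` parameter are invented for the example and do not come from coreutils.

```python
import errno
import os


def first_data_offset_after_sync(fd, full_sync=False):
    """Flush pending writes on fd, then return the first data offset
    reported by SEEK_DATA, or None if the file is reported as all hole.

    Hypothetical helper; full_sync=True uses fsync instead of fdatasync.
    """
    if full_sync:
        os.fsync(fd)
    else:
        os.fdatasync(fd)
    try:
        return os.lseek(fd, 0, os.SEEK_DATA)
    except OSError as e:
        if e.errno == errno.ENXIO:
            return None  # no data anywhere in the file
        raise
```

Whether such a flush actually avoids the ZFS behaviour is exactly the question the message asks the reporter to test, and an extra sync on every input file has a performance cost.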

