From unknown Sat Jun 21 10:46:21 2025
From: bug#69982 <69982@debbugs.gnu.org>
To: bug#69982 <69982@debbugs.gnu.org>
Subject: Status: Setting inodes to 0 leads to incorrect output when extracting with GNU cpio
Reply-To: bug#69982 <69982@debbugs.gnu.org>
Date: Sat, 21 Jun 2025 17:46:21 +0000

retitle 69982 Setting inodes to 0 leads to incorrect output when extracting with GNU cpio
reassign 69982 guix
submitter 69982 Skyler Ferris
severity 69982 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sun Mar 24 12:18:45 2024
Date: Sun, 24 Mar 2024 16:17:22 +0000
To: bug-guix@gnu.org
From: Skyler Ferris
Subject: Setting inodes to 0 leads to incorrect output when extracting with GNU cpio
Message-ID: <83c29759-4e54-40ef-a9d3-b27c4774cd02@protonmail.com>

Hello,

I have encountered a bug that is caused by the interaction between
write-cpio-archive from (gnu build linux-initrd), which writes all
inodes as 0, and the way that GNU cpio processes file headers. I
observed this bug while creating a custom initramfs where init is based
on a bash script used by another distribution (but I will provide a
minimal reproducer below). This bug only manifests when the input
directory contains more than one distinct set of hard-linked files.
This email contains a short set of reproduction steps, an explanation
of what I understand the cause of the bug to be, some possible paths
forward, and a disclaimer about my limitations due to my background.

To reproduce this bug, run the following commands:

```shell
$ mkdir /tmp/source
$ cd /tmp/source
$ echo contents1 > file1.txt
$ ln file1.txt link1.txt
$ echo contents2 > file2.txt
$ echo contents3 > file3.txt
$ ln file3.txt link3.txt
$ guix repl
> (use-modules (gnu build linux-initrd))
> ; disable compression so we don't waste time on it while debugging, it does not impact reproduction
> (write-cpio-archive "." "../archive.cpio" #:compress? #f)
> ,q
$ cd ..
$ mkdir out
$ cd out
$ cat ../archive.cpio | cpio -i
$ cat *
```

After running the final step you will see that file1.txt, link1.txt,
file3.txt, and link3.txt all have the contents "contents1": the files
which should contain "contents3" have been created incorrectly.

Now I will list the steps the relevant programs performed which caused
this error, followed by a more verbose explanation with references to
source code:

1. Guix creates the archive with the inode and major & minor device
numbers set to 0. The number of hard links is reported accurately.
2. cpio reads the archive and hard links files when the header
indicates that there are multiple links. It uses the inode and major &
minor device numbers to find the correct file to hard link to.
3. As file3.txt and link3.txt both have multiple links and share their
inode and major & minor device numbers with file1.txt, they are all
linked to file1.txt.

This error occurs when the cpio utility processes files with hard
links. In `copyin_regular_file`, there is a code block which only runs
if the file has multiple hard links and the newascii (or checksummed
new ascii) format is in use (1).
Within that code block there is a conditional that checks whether the
file size is 0, with a comment explaining that the newascii format only
records the data for the final file pointing to the relevant inode
rather than repeating the data each time. The code in guix/cpio.scm
does not actually do this (it stores the data for every link), so the
zero-size branch never executes. Instead, the other code block runs,
which simply calls `link_to_maj_min_ino` (and checks for an error code)
(2). This uses `find_inode_file`, which references a hash table that
associates the inode/major device/minor device with a file path, and if
it finds a match then it creates a hard link on the target file system.
However, Guix's `file->cpio-header*` sets all of the inode and device
numbers to 0 for reproducibility. This causes cpio to hard link every
file with multiple links to the first file that has multiple links.

I see 3 possible paths forward to address this issue:

1. Provide spoofed inode numbers, tracking hard link data. In (gnu
build linux-initrd), the `write-cpio-archive` procedure sorts the files
by name so we can provide inode numbers that increase sequentially.
However, in order to make sure that the correct hard links are findable
by the cpio utility we would need to track the real inode numbers as
well and use the same spoofed number everywhere a given real inode
appears. This would noticeably increase the complexity of the code.
2. Provide spoofed inode numbers and spoofed hard link data. In order
to avoid tracking the real hard link numbers we can just report all
files as having only a single link, and still provide sequential inode
numbers as above. This will not increase the size of the cpio archives
we generate compared to current output because we are storing the data
for each link anyway. This will add some complexity to the cpio code,
but less than option 1.
3. Don't support inputs with multiple hard links and require callers to
work around this issue.
This avoids any changes to the cpio code.

I am in favor of option 2 because I think it strikes a good balance
between keeping the cpio code stable and supporting reasonable use
cases. The cpio code is used to build the initramfs in Guix systems so
a bug here could make some systems unbootable. Guix does provide
transactional rollbacks which is helpful but it is still a frustrating
experience to reboot and immediately see a crash; debugging issues in
this early environment is significantly more difficult than debugging
post-boot issues. Hard links are not common on many systems because
they add complexity to filesystem analysis, but Guix makes good use of
them to save space in the store, where it is common for many files to
share data and creating symlinks would prevent the garbage collector
from deleting otherwise unused outputs.

The limitations I referred to in the beginning of the email are that I
am inexperienced in this domain. I have only recently (over the past
month or so) started looking at building a custom initramfs, and I have
never worked with CPIO archives before. I think that my analysis makes
sense based on the code I have read and the behavior I have observed,
but take everything I say with a grain of salt.

I would appreciate any thoughts that anyone has on this matter.

Regards,
Skyler

(1) https://git.savannah.gnu.org/cgit/cpio.git/tree/src/copyin.c?id=900bab656ff24db5e3099941fb909c79c07962ed#n400
(2) https://git.savannah.gnu.org/cgit/cpio.git/tree/src/copypass.c?id=900bab656ff24db5e3099941fb909c79c07962ed#n341
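
P.S. To make the lookup behavior described above easier to follow, here
is a minimal sketch of it as a simplified Python model. It is not GNU
cpio's actual C code and not Guix's Scheme code: `Entry`, `extract`,
`current_archive`, and `option2` are hypothetical names introduced only
for illustration, and the model assumes that a linked entry's stored
data is skipped once a matching (inode, major, minor) key is found,
which is consistent with the output observed in the reproducer.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """Simplified stand-in for one archive header plus its file data."""
    name: str
    ino: int
    dev: tuple  # (major, minor)
    nlink: int
    data: str


def extract(entries):
    """Model of the extraction step: an entry with nlink > 1 is linked to
    the first earlier multi-link entry with the same (ino, major, minor)."""
    seen = {}    # (ino, major, minor) -> name of first file with that key
    output = {}  # name -> contents after extraction
    for e in entries:
        key = (e.ino, *e.dev)
        if e.nlink > 1 and key in seen:
            # Hard link: reuse the earlier file, ignore this entry's data.
            output[e.name] = output[seen[key]]
            continue
        output[e.name] = e.data
        if e.nlink > 1:
            seen[key] = e.name
    return output


def current_archive():
    """The reproducer's files, sorted by name, with inode and device
    numbers zeroed and accurate link counts (assumed current behavior)."""
    return [
        Entry("file1.txt", 0, (0, 0), 2, "contents1"),
        Entry("file2.txt", 0, (0, 0), 1, "contents2"),
        Entry("file3.txt", 0, (0, 0), 2, "contents3"),
        Entry("link1.txt", 0, (0, 0), 2, "contents1"),
        Entry("link3.txt", 0, (0, 0), 2, "contents3"),
    ]


def option2(entries):
    """Option 2: sequential spoofed inodes, every link count set to 1."""
    return [Entry(e.name, i + 1, (0, 0), 1, e.data)
            for i, e in enumerate(entries)]


if __name__ == "__main__":
    print(extract(current_archive()))
    print(extract(option2(current_archive())))
```

In this model the first call maps file3.txt and link3.txt to
"contents1", matching what the reproducer shows, while the second call
gives every file its own contents, at the cost of no longer preserving
the hard links on extraction.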