Package: guix-patches;
Reported by: Ludovic Courtès <ludo <at> gnu.org>
Date: Fri, 24 Jan 2025 17:24:02 UTC
Severity: normal
Tags: patch
Done: Ludovic Courtès <ludo <at> gnu.org>
Bug is archived. No further changes may be made.
View this message in rfc822 format
From: Reepca Russelstein <reepca <at> russelstein.xyz> To: Ludovic Courtès <ludo <at> gnu.org> Cc: 75810 <at> debbugs.gnu.org Subject: [bug#75810] [PATCH 0/6] Rootless guix-daemon Date: Wed, 29 Jan 2025 01:51:33 -0600
[Message part 1 (text/plain, inline)]
Ludovic Courtès <ludo <at> gnu.org> writes: > Hello Reepca, > > Reepca Russelstein <reepca <at> russelstein.xyz> skribis: > >> user <at> debian:~$ /gnu/store/8bjy9g0cssjrw9ljz2r8ww1sma95isfj-hello-2.12.1/bin/hello >> GOOOOOD BYYEEEEEE > > This particular issue is fixed with read-only mounts: > > diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc > index c95bd2821f..e8e4a56e2d 100644 > --- a/nix/libstore/build.cc > +++ b/nix/libstore/build.cc > @@ -2175,7 +2175,7 @@ void DerivationGoal::runChild() > createDirs(dirOf(target)); > writeFile(target, ""); > } > - if (mount(source.c_str(), target.c_str(), "", MS_BIND, 0) == -1) > + if (mount(source.c_str(), target.c_str(), "", MS_BIND | MS_RDONLY, 0) == -1) > throw SysError(format("bind mount from `%1%' to `%2%' failed") % source % target); > } > > > > (I checked that it does the right thing.) > > The fix is trivial, but I’m glad you found the bug in the first place; > it does stress that we have to be careful here. Not quite trivial, consider this section from mount(2): Creating a bind mount If mountflags includes MS_BIND (available since Linux 2.4), then per‐ form a bind mount. A bind mount makes a file or a directory subtree visible at another point within the single directory hierarchy. Bind mounts may cross filesystem boundaries and span chroot(2) jails. The filesystemtype and data arguments are ignored. The remaining bits (other than MS_REC, described below) in the mount‐ flags argument are also ignored. (The bind mount has the same mount options as the underlying mount.) However, see the discussion of re‐ mounting above, for a method of making an existing bind mount read- only. If you run my sneakysneaky example from before, you'll find that it still succeeds at replacing the "hello" binary because of this, even with your MS_RDONLY patch. This can be resolved by instead using MS_RDONLY with a followup mount call using MS_REMOUNT. Note also that store items that are files instead of directories (e.g. source tarballs) are hardlinked if possible. This seems to stem from an old misconception that only directories can be bind-mounted. The hardlinks, of course, do not have any write-protection on them aside from their permission bits. This can be resolved by always bind-mounting them instead. Despite the name, there is actually already support for bind-mounting non-directory files that are listed in dirsInChroot. >> I suppose we could try to perform these bind-mounts with the MS_RDONLY >> flag, but we would need some way to ensure that the builder can't just >> remount them read-write > > The example below tests that; ‘mount’ fails with EPERM when using the > unprivileged daemon (‘./test-env guix build -f …’): > > (use-modules (guix) > (guix modules) > (gnu packages bootstrap)) > > (computed-file "try-to-remount-input-read-write" > (with-imported-modules (source-module-closure > '((guix build syscalls))) > #~(begin > (use-modules (guix build syscalls)) > > (let ((input #$(plain-file "input-that-might-be-tampered-with" > "All good!"))) > (mount "none" input "none" (logior MS_BIND MS_REMOUNT)) > (call-with-output-file input > (lambda (port) > (display "BAD!" port))) > (mkdir #$output)))) > #:guile %bootstrap-guile) > > > This is similar to: > > guix shell -C guile guix -- \ > guile -c '(use-modules (guix build syscalls)) (mount "none" (getenv "GUIX_ENVIRONMENT") "none" (logior MS_BIND MS_REMOUNT))' > > mount(2) has this: > > EPERM An attempt was made to modify (MS_REMOUNT) the MS_RDONLY, MS_NO‐ > SUID, or MS_NOEXEC flag, or one of the "atime" flags (MS_NOAT‐ > IME, MS_NODIRATIME, MS_RELATIME) of an existing mount, but the > mount is locked; see mount_namespaces(7). > > I couldn’t find the definite answer in mount_namespaces(7) as to whether > this applies in this case (same namespace but after chroot); I can only > tell empirically that it does apply. I don't think that's why we're getting EPERM. I think we're running into this, from user_namespaces(7): Note that a call to execve(2) will cause a process's capabilities to be recalculated in the usual way (see capabilities(7)). Consequently, unless the process has a user ID of 0 within the namespace, or the executable file has a nonempty inheritable capabilities mask, the process will lose all capabilities. See the discussion of user and group ID mappings, below. As the builder is in the store, it can't have any associated capability masks, and your added call to prctl to drop ambient capabilities, together with the fact that the mapped UID inside the container is nonzero, should make it so that it therefore wouldn't be able to inherit any. On a tangentially-related note, the ambient capability set didn't come into being until Linux 4.3 (around 2016), which is a fair bit newer than unprivileged user namespaces. Take that for what you will. Now, according to capabilities(7): Per-user-namespace "set-user-ID-root" programs A set-user-ID program whose UID matches the UID that created a user namespace will confer capabilities in the process's permitted and ef‐ fective sets when executed by any process inside that namespace or any descendant user namespace. The rules about the transformation of the process's capabilities during the execve(2) are exactly as described in Transformation of capabili‐ ties during execve() and Capabilities and execution of programs by root above, with the difference that, in the latter subsection, "root" is the UID of the creator of the user namespace. This would seem to suggest that the capabilities within the user namespace could be regained by creating a setuid binary and executing it, but experimentally this doesn't happen, and I am unsure whether this is a bug in the documentation, kernel, or my reading comprehension. At any rate, I am less than confident in relying on this behavior. I think it would be a good idea to, in the no-build-user case, add an extra call to unshare right at the point where the user and group would be changed in the build-user case. This extra call would create a fresh user and mount namespace, ensuring that the mount-locking behavior you referenced applies. My understanding is that the setuid behavior documented above only grants capabilities, it doesn't change the user namespace that the process is in, so it should be impossible for the builder to gain capabilities inside the user namespace owning the bind-mounted store items, even if it somehow gained full capabilities within this fresh user namespace. > - pid = clone(childEntry, > + pid = clone(childEntry, > (char *)(((uintptr_t)stack + sizeof(stack) - 8) & ~(uintptr_t)0xf), > flags, this); > - if (pid == -1) > - throw SysError("cloning builder process"); > + if (pid == -1) { > + if ((flags & CLONE_NEWUSER) != 0 && getuid() != 0) > + /* 'clone' fails with EPERM on distros where unprivileged user > + namespaces are disabled. Error out instead of giving up on > + isolation. */ > + throw SysError("cannot create process in unprivileged user namespace"); > + else > + throw SysError("cloning builder process"); > + } > + > + if ((flags & CLONE_NEWUSER) != 0) { > + /* Initialize the UID/GID mapping of the guest. */ > + if (pid == 0) { > + char str[20] = { '\0' }; > + readFull(readiness.readSide, (unsigned char*)str, 3); > + if (strcmp(str, "go\n") != 0) > + throw Error("failed to initialize process in unprivileged user namespace"); > + } else { > + initializeUserNamespace(pid); > + writeFull(readiness.writeSide, (unsigned char*)"go\n", 3); > + } This doesn't actually do any synchronizing with the child process, because clone never returns 0. It's not like fork where it returns twice with a different return value each time, control in the new thread instead goes straight to childEntry. The parent doesn't get stuck and hang when writing because PIPE_BUF > 3. >> This does raise the question, though, of how these failed build >> directories would get deleted, aside from rebooting the system. > > Note that in the early days (and in current Nix actually), build trees > were not chowned. That’s OK: they’re deleted upon reboot or by the > system administrator. > > Current Nix has this: > > void DerivationGoal::deleteTmpDir(bool force) > { > if (tmpDir != "") { > /* Don't keep temporary directories for builtins because they > might have privileged stuff (like a copy of netrc). */ > if (settings.keepFailed && !force && !drv->isBuiltin()) { > printError("note: keeping build directory '%s'", tmpDir); > chmod(tmpDir.c_str(), 0755); > } > else > deletePath(tmpDir); > tmpDir = ""; > } > } > > We could go back to this. It’s less convenient, but okay. > > In this patch series, it attempts to chown the tree; if it fails to do > so (because it lacks CAP_CHOWN), it prints a warning and keeps going. My concern comes from knowing that I've at times gone through 100 sequential failed builds while trying to package something tricky, and I tend to keep my disk on the low end of free space to minimize how often I need to rebuild stuff. That and the one time I tried tinkering with ungoogled-chromium. I know I'd probably cause a lot of trouble if I tried doing that stuff on a shared system I didn't have administrative access to. A best-effort chown attempt should do fine for now, though. >> We currently remount /gnu/store read-write at LocalStore-creation-time, >> which happens in the newly-forked guix-daemon process at the start of a >> connection. I don't think there's any particularly elevated risk from >> instead doing that before the per-connection process is forked. There >> are a number of ways we could do this: we could make it the >> responsibility of the init system to create the mount namespace and do >> the remounting, or we could have guix-daemon do it immediately on >> startup and subsequently switch its uid and gid to >> guix-daemon:guix-daemon. These lack the slick appeal of "see, you never >> have to give it root, and you can prove it just by looking at the >> service file", but realistically should be just as secure. It may be >> useful to provide a small wrapper around guix-daemon that does the >> remount and privilege-dropping, to more succinctly express this to >> anybody wishing to see for themselves. > > I think I’d prefer to have a systemd (or other) service make a > read-write bind-mount at /gnu/store/.rw-store, and then we’d run > ‘guix-daemon --backing-store=/gnu/store/.rw-store’. > > WDYT? So if I understand correctly, we would have /gnu/store hold all of its usual contents in the usual manner, and a service would bind-mount /gnu/store to /gnu/store/.rw-store without MS_RDONLY, and then it (or another service that depends on it) would bind-mount /gnu/store to itself with MS_RDONLY, and then guix-daemon would, in its own mount namespace, bind-mount /gnu/store/.rw-store to /gnu/store, again without MS_RDONLY. I assume that making /gnu/store read-only wouldn't make the already-bind-mounted /gnu/store/.rw-store read-only too? If it does, it's not going to work, and if it doesn't, it's going to remain writable for footgun appreciators. But I suppose it's at least a little more out-of-the-way. I think it might be simpler to integrate the change if we instead made it /gnu/.rw-store or something like that, since that way we don't have to worry about updating the garbage collector and such to treat it specially. Actually, now that I think about it, another possibility would be having a service that the read-only store-mount service depends on that first creates a persistent user+mount namespace combo which saves a view of the writable store (I don't recall exactly how creating the persistent namespace works, but I know the 'ip netns ...' commands can do something similar to create named network namespaces). The process that creates this namespace would run as the guix-daemon user, and therefore when guix-daemon starts it would have full capabilities within that user namespace, and could setns straight into it. This would leave no writable store in the root mount namespace. >> Personally, I think that if a guix-daemon can use privilege separation >> users, it would probably be a good idea to. We're certainly going to >> need to support them on non-linux systems either way. Could it be >> possible to have guix-install.sh modify /etc/sudoers on systems that use >> it to allow the guix-daemon user to run processes under guix builder >> users? I am currently less worried about arbitrary code execution >> vulnerabilities being found in the daemon than about the possibility of >> malicious builders (but it is possible I am underexposed to the ways >> those can happen in C++). > > What would you put in /etc/sudoers? I’m not sure what you had in > mind. I'm not sure what I had in mind either, I've only seen some opine that it's usually better to configure sudo than to write your own setuid programs, which was the first thing that came to mind for how to use dedicated build users without needing the entire daemon running as root. I recall reading somewhere that it could be configured to allow certain users to run certain commands as certain other users? So maybe it could be configured to allow the guix-daemon user to run any command as any of the guixbuilder users. Although granted, the way that container setup is currently done wouldn't work very well with that, since by the time we're ready to execute the builder we're already fully in the container, where setuid-root binaries should probably not be. I know that "how to use dedicated build users without root" probably isn't what you were asking for feedback on, but it did show up in my thoughts quite a bit. - reepca
[signature.asc (application/pgp-signature, inline)]
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.