GNU bug report logs - #75810
[PATCH 0/6] Rootless guix-daemon

Previous Next

Package: guix-patches;

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Fri, 24 Jan 2025 17:24:02 UTC

Severity: normal

Tags: patch

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Reepca Russelstein <reepca <at> russelstein.xyz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 75810 <at> debbugs.gnu.org
Subject: [bug#75810] [PATCH 0/6] Rootless guix-daemon
Date: Wed, 29 Jan 2025 01:51:33 -0600
[Message part 1 (text/plain, inline)]
Ludovic Courtès <ludo <at> gnu.org> writes:

> Hello Reepca,
>
> Reepca Russelstein <reepca <at> russelstein.xyz> skribis:
>
>> user <at> debian:~$ /gnu/store/8bjy9g0cssjrw9ljz2r8ww1sma95isfj-hello-2.12.1/bin/hello
>> GOOOOOD BYYEEEEEE
>
> This particular issue is fixed with read-only mounts:
>
> diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc
> index c95bd2821f..e8e4a56e2d 100644
> --- a/nix/libstore/build.cc
> +++ b/nix/libstore/build.cc
> @@ -2175,7 +2175,7 @@ void DerivationGoal::runChild()
>                      createDirs(dirOf(target));
>                      writeFile(target, "");
>                  }
> -                if (mount(source.c_str(), target.c_str(), "", MS_BIND, 0) == -1)
> +                if (mount(source.c_str(), target.c_str(), "", MS_BIND | MS_RDONLY, 0) == -1)
>                      throw SysError(format("bind mount from `%1%' to `%2%' failed") % source % target);
>              }
>  
>
>
> (I checked that it does the right thing.)
>
> The fix is trivial, but I’m glad you found the bug in the first place;
> it does stress that we have to be careful here.

Not quite trivial, consider this section from mount(2):

    Creating a bind mount
       If mountflags includes MS_BIND (available since Linux 2.4),  then  per‐
       form  a  bind  mount.  A bind mount makes a file or a directory subtree
       visible at another point within the single directory  hierarchy.   Bind
       mounts may cross filesystem boundaries and span chroot(2) jails.

       The filesystemtype and data arguments are ignored.

       The  remaining  bits (other than MS_REC, described below) in the mount‐
       flags argument are also ignored.  (The bind mount has  the  same  mount
       options  as  the underlying mount.)  However, see the discussion of re‐
       mounting above, for a method of making an  existing  bind  mount  read-
       only.

If you run my sneakysneaky example from before, you'll find that it
still succeeds at replacing the "hello" binary because of this, even
with your MS_RDONLY patch.  This can be resolved by instead using
MS_RDONLY with a followup mount call using MS_REMOUNT.

Note also that store items that are files instead of directories (e.g. source
tarballs) are hardlinked if possible.  This seems to stem from an old
misconception that only directories can be bind-mounted.  The hardlinks,
of course, do not have any write-protection on them aside from their
permission bits.

This can be resolved by always bind-mounting them instead.  Despite the
name, there is actually already support for bind-mounting non-directory
files that are listed in dirsInChroot.

>> I suppose we could try to perform these bind-mounts with the MS_RDONLY
>> flag, but we would need some way to ensure that the builder can't just
>> remount them read-write
>
> The example below tests that; ‘mount’ fails with EPERM when using the
> unprivileged daemon (‘./test-env guix build -f …’):
>
> (use-modules (guix)
>              (guix modules)
>              (gnu packages bootstrap))
>
> (computed-file "try-to-remount-input-read-write"
>                (with-imported-modules (source-module-closure
>                                        '((guix build syscalls)))
>                  #~(begin
>                      (use-modules (guix build syscalls))
>
>                      (let ((input #$(plain-file "input-that-might-be-tampered-with"
>                                                 "All good!")))
>                        (mount "none" input "none" (logior MS_BIND MS_REMOUNT))
>                        (call-with-output-file input
>                          (lambda (port)
>                            (display "BAD!" port)))
>                        (mkdir #$output))))
>                #:guile %bootstrap-guile)
>
>
> This is similar to:
>
>   guix shell -C guile guix -- \
>     guile -c '(use-modules (guix build syscalls)) (mount "none" (getenv "GUIX_ENVIRONMENT") "none" (logior MS_BIND MS_REMOUNT))'
>
> mount(2) has this:
>
>    EPERM  An attempt was made to modify (MS_REMOUNT) the MS_RDONLY, MS_NO‐
>           SUID, or MS_NOEXEC flag, or one of the "atime"  flags  (MS_NOAT‐
>           IME,  MS_NODIRATIME,  MS_RELATIME) of an existing mount, but the
>           mount is locked; see mount_namespaces(7).
>
> I couldn’t find the definite answer in mount_namespaces(7) as to whether
> this applies in this case (same namespace but after chroot); I can only
> tell empirically that it does apply.

I don't think that's why we're getting EPERM.  I think we're running
into this, from user_namespaces(7):

   Note that a call to execve(2) will cause a process's capabilities to
   be recalculated in the usual way (see capabilities(7)).
   Consequently, unless the process has a user ID of 0 within the
   namespace, or the executable file has a nonempty inheritable
   capabilities mask, the process will lose all capabilities.  See the
   discussion of user and group ID mappings, below.

As the builder is in the store, it can't have any associated capability
masks, and your added call to prctl to drop ambient capabilities,
together with the fact that the mapped UID inside the container is
nonzero, should make it so that it therefore wouldn't be able to inherit
any.

On a tangentially-related note, the ambient capability set didn't come
into being until Linux 4.3 (around 2016), which is a fair bit newer than
unprivileged user namespaces.  Take that for what you will.

Now, according to capabilities(7):

    Per-user-namespace "set-user-ID-root" programs
       A  set-user-ID  program  whose  UID matches the UID that created a user
       namespace will confer capabilities in the process's permitted  and  ef‐
       fective  sets when executed by any process inside that namespace or any
       descendant user namespace.

       The rules about the transformation of the process's capabilities during
       the  execve(2)  are exactly as described in Transformation of capabili‐
       ties during execve() and Capabilities and execution of programs by root
       above,  with  the  difference that, in the latter subsection, "root" is
       the UID of the creator of the user namespace.

This would seem to suggest that the capabilities within the user
namespace could be regained by creating a setuid binary and executing
it, but experimentally this doesn't happen, and I am unsure whether this
is a bug in the documentation, kernel, or my reading comprehension.  At
any rate, I am less than confident in relying on this behavior.

I think it would be a good idea to, in the no-build-user case, add an
extra call to unshare right at the point where the user and group would
be changed in the build-user case.  This extra call would create a fresh
user and mount namespace, ensuring that the mount-locking behavior you
referenced applies.  My understanding is that the setuid behavior
documented above only grants capabilities, it doesn't change the user
namespace that the process is in, so it should be impossible for the
builder to gain capabilities inside the user namespace owning the
bind-mounted store items, even if it somehow gained full capabilities
within this fresh user namespace.


> -	pid = clone(childEntry,
> + 	pid = clone(childEntry,
>  		    (char *)(((uintptr_t)stack + sizeof(stack) - 8) & ~(uintptr_t)0xf),
>  		    flags, this);
> -	if (pid == -1)
> -	    throw SysError("cloning builder process");
> +	if (pid == -1) {
> +	    if ((flags & CLONE_NEWUSER) != 0 && getuid() != 0)
> +		/* 'clone' fails with EPERM on distros where unprivileged user
> +		   namespaces are disabled.  Error out instead of giving up on
> +		   isolation.  */
> +		throw SysError("cannot create process in unprivileged user namespace");
> +	    else
> +		throw SysError("cloning builder process");
> +	}
> +
> +	if ((flags & CLONE_NEWUSER) != 0) {
> +	    /* Initialize the UID/GID mapping of the guest.  */
> +	    if (pid == 0) {
> +		char str[20] = { '\0' };
> +		readFull(readiness.readSide, (unsigned char*)str, 3);
> +		if (strcmp(str, "go\n") != 0)
> +		    throw Error("failed to initialize process in unprivileged user namespace");
> +	    } else {
> +		initializeUserNamespace(pid);
> +		writeFull(readiness.writeSide, (unsigned char*)"go\n", 3);
> +	    }

This doesn't actually do any synchronizing with the child process,
because clone never returns 0.  It's not like fork where it returns
twice with a different return value each time, control in the new thread
instead goes straight to childEntry.  The parent doesn't get stuck and
hang when writing because PIPE_BUF > 3.

>> This does raise the question, though, of how these failed build
>> directories would get deleted, aside from rebooting the system.
>
> Note that in the early days (and in current Nix actually), build trees
> were not chowned.  That’s OK: they’re deleted upon reboot or by the
> system administrator.
>
> Current Nix has this:
>
> void DerivationGoal::deleteTmpDir(bool force)
> {
>     if (tmpDir != "") {
>         /* Don't keep temporary directories for builtins because they
>            might have privileged stuff (like a copy of netrc). */
>         if (settings.keepFailed && !force && !drv->isBuiltin()) {
>             printError("note: keeping build directory '%s'", tmpDir);
>             chmod(tmpDir.c_str(), 0755);
>         }
>         else
>             deletePath(tmpDir);
>         tmpDir = "";
>     }
> }
>
> We could go back to this.  It’s less convenient, but okay.
>
> In this patch series, it attempts to chown the tree; if it fails to do
> so (because it lacks CAP_CHOWN), it prints a warning and keeps going.

My concern comes from knowing that I've at times gone through 100
sequential failed builds while trying to package something tricky, and I
tend to keep my disk on the low end of free space to minimize how often
I need to rebuild stuff.  That and the one time I tried tinkering with
ungoogled-chromium.  I know I'd probably cause a lot of trouble if I
tried doing that stuff on a shared system I didn't have administrative
access to.

A best-effort chown attempt should do fine for now, though.

>> We currently remount /gnu/store read-write at LocalStore-creation-time,
>> which happens in the newly-forked guix-daemon process at the start of a
>> connection.  I don't think there's any particularly elevated risk from
>> instead doing that before the per-connection process is forked.  There
>> are a number of ways we could do this: we could make it the
>> responsibility of the init system to create the mount namespace and do
>> the remounting, or we could have guix-daemon do it immediately on
>> startup and subsequently switch its uid and gid to
>> guix-daemon:guix-daemon.  These lack the slick appeal of "see, you never
>> have to give it root, and you can prove it just by looking at the
>> service file", but realistically should be just as secure.  It may be
>> useful to provide a small wrapper around guix-daemon that does the
>> remount and privilege-dropping, to more succinctly express this to
>> anybody wishing to see for themselves.
>
> I think I’d prefer to have a systemd (or other) service make a
> read-write bind-mount at /gnu/store/.rw-store, and then we’d run
> ‘guix-daemon --backing-store=/gnu/store/.rw-store’.
>
> WDYT?

So if I understand correctly, we would have /gnu/store hold all of its
usual contents in the usual manner, and a service would bind-mount
/gnu/store to /gnu/store/.rw-store without MS_RDONLY, and then it (or
another service that depends on it) would bind-mount /gnu/store to
itself with MS_RDONLY, and then guix-daemon would, in its own mount
namespace, bind-mount /gnu/store/.rw-store to /gnu/store, again without
MS_RDONLY.

I assume that making /gnu/store read-only wouldn't make the
already-bind-mounted /gnu/store/.rw-store read-only too?  If it does,
it's not going to work, and if it doesn't, it's going to remain writable
for footgun appreciators.  But I suppose it's at least a little more
out-of-the-way.

I think it might be simpler to integrate the change if we instead made
it /gnu/.rw-store or something like that, since that way we don't have
to worry about updating the garbage collector and such to treat it
specially.

Actually, now that I think about it, another possibility would be having
a service that the read-only store-mount service depends on that first
creates a persistent user+mount namespace combo which saves a view of
the writable store (I don't recall exactly how creating the persistent
namespace works, but I know the 'ip netns ...' commands can do something
similar to create named network namespaces).  The process that creates
this namespace would run as the guix-daemon user, and therefore when
guix-daemon starts it would have full capabilities within that user
namespace, and could setns straight into it.  This would leave no
writable store in the root mount namespace.

>> Personally, I think that if a guix-daemon can use privilege separation
>> users, it would probably be a good idea to.  We're certainly going to
>> need to support them on non-linux systems either way.  Could it be
>> possible to have guix-install.sh modify /etc/sudoers on systems that use
>> it to allow the guix-daemon user to run processes under guix builder
>> users?  I am currently less worried about arbitrary code execution
>> vulnerabilities being found in the daemon than about the possibility of
>> malicious builders (but it is possible I am underexposed to the ways
>> those can happen in C++).
>
> What would you put in /etc/sudoers?  I’m not sure what you had in
> mind.

I'm not sure what I had in mind either, I've only seen some opine that
it's usually better to configure sudo than to write your own setuid
programs, which was the first thing that came to mind for how to use
dedicated build users without needing the entire daemon running as root.
I recall reading somewhere that it could be configured to allow certain
users to run certain commands as certain other users?  So maybe it could
be configured to allow the guix-daemon user to run any command as any of
the guixbuilder users.  Although granted, the way that container setup
is currently done wouldn't work very well with that, since by the time
we're ready to execute the builder we're already fully in the container,
where setuid-root binaries should probably not be.

I know that "how to use dedicated build users without root" probably
isn't what you were asking for feedback on, but it did show up in my
thoughts quite a bit.

- reepca
[signature.asc (application/pgp-signature, inline)]

This bug report was last modified 56 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.