#77862 - guix-daemon run as non-root sets up /etc/group incorrectly in build container

GNU bug report logs - #77862

Package: guix

Reported by: keinflue <keinflue <at> posteo.net>

Date: Thu, 17 Apr 2025 11:22:03 UTC

Severity: important

Message #64 received at 77862 <at> debbugs.gnu.org (full text, mbox):

From: Reepca Russelstein <reepca <at> russelstein.xyz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: keinflue <keinflue <at> posteo.net>, 77862 <at> debbugs.gnu.org
Subject: Re: guix-daemon run as non-root sets up /etc/group incorrectly in build container
Date: Tue, 03 Jun 2025 08:05:21 -0500


Ludovic Courtès <ludo <at> gnu.org> writes:

> Hey Reepca,
>
> Reepca Russelstein <reepca <at> russelstein.xyz> writes:
>
>> Ludovic Courtès <ludo <at> gnu.org> writes:
>
> [...]
>
>>> The attached patch tries to do that, by calling out to ‘newuidmap’, and
>>> under the assumption that /etc/subgid allows mapping the ‘kvm’ group.
>>>
>>> It does the job (a build process can chown to ‘kvm’), but I couldn’t get
>>> the GID mapping preserved across the ‘unshare’ call (the call that is
>>> made to “lock” mounts), hence the “#if 0” there.
>>>
>>> The problem is that when we call ‘unshare’, the ‘newgidmap’ setuid
>>> binary is no longer accessible because we’re already in a chroot, so it
>>> seems that we cannot preserve the GID map.
>>
>> ... and even if we had the setuid binary accessible (for example via a
>> saved file descriptor that could be used with execveat, or a
>> bind-mount), it wouldn't be of any use at this point because (man
>> user_namespaces):
>>
>> "if either the user or the group ID of the file has no mapping inside
>> the namespace, the set-user-ID (set-group-ID) bit is silently ignored:
>> the new program is executed, but the process's effective user (group) ID
>> is left unchanged."
>>
>> Naturally, uid 0 isn't going to be mapped! In fact, more generally,
>> newuidmap and newgidmap can't ever be used from within an uninitialized
>> user namespace, since by definition uid 0 isn't yet mapped in it.
>>
>> So it falls to the parent process to do the initialization - that is, it
>> now has to do the initialization twice. Of course, it's going to need
>> some way of knowing when the second user namespace has been created, and
>> the child is going to need some way of knowing when it's been
>> initialized, so we'll need to either use two pipes or switch to using a
>> socketpair.
>
> That makes sense, thanks for explaining!
>
> However, I think part of the rules still don’t fit in my head. Here’s
> what I have now (parent is 6929):
>
> 6929 openat(AT_FDCWD, "/proc/6938/uid_map", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929 write(17, "30001 1000 1", 12) = 12
> 6929 close(17) = 0
> 6929 openat(AT_FDCWD, "/proc/6938/setgroups", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929 write(17, "deny", 4) = 4
> 6929 close(17) = 0
> 6929 openat(AT_FDCWD, "/proc/6938/gid_map", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929 write(17, "30000 998 1", 11) = 11
> 6929 close(17) = 0
> […]
> 6938 unshare(CLONE_NEWNS|CLONE_NEWUSER) = 0
> 6938 write(20, "reinit\n", 7) = 7
> 6929 <... read resumed>"reinit\n", 20) = 7
> 6938 read(17, <unfinished ...>
> 6929 write(4, "gmlo\0\0\0\0+\0\0\0\0\0\0\0| reinitializing UID/GID mapping of 6938\n\0\0\0\0\0", 64) = 64
> 6929 getgid() = 998
> 6929 getuid() = 1000
> 6929 openat(AT_FDCWD, "/proc/6938/uid_map", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929 write(17, "30001 1000 1", 12) = -1 EPERM (Operation not permitted)
>
> Why oh why would we get EPERM the second time?
>
> These restrictions appear to be respected:
>
> • The data written to uid_map (gid_map) must consist of a single
>   line that maps the writing process's effective user ID
>   (group ID) in the parent user namespace to a user ID (group
>   ID) in the user namespace.
>
> • The writing process must have the same effective user ID as
>   the process that created the user namespace.
>
> Or is the user namespace of the parent (6929) not the same as “the
> parent user namespace”?

I'm afraid that is indeed what we're running into: "the parent user
namespace" is the one created by clone. And because the parent user
namespace is not the root user namespace, we couldn't run newgidmap from
within it, even if we still had a process around that was in it.

However, at this point, we actually don't need to use newgidmap, because
we have CAP_SETGID in the current user namespace... but in order to use
the "no restrictions" version of writing to gid_map, it's necessary for
the writing process to have CAP_SETGID in the *parent* user namespace
(in hindsight, this is what the actual problem was with the original
patch - it specified "true" for haveCapSetGID, so it shouldn't have been
using newgidmap anyway). This necessarily requires that the writing
process is not in the user namespace being initialized, and because it
must either be in the namespace being initialized or its parent user
namespace, that means it must be in the parent user namespace. Of
course, at this point, no processes exist in the parent user namespace.

So if you'll bear with the extreme awkwardness, we could fork a helper
process immediately prior to calling unshare, which, upon receiving a
notification, will initialize the parent process's user namespace.

Note that the naming here is going to be inverted for process ancestry
and user namespace ancestry: the child process is in the parent user
namespace, and the parent process is in the child user namespace.

> Thanks for your patience. :-)
>
> Ludo’.
>
> PS: The more I use it, the less I can stand this user namespace soup
> presented as an “API”.

It certainly has a lot of clauses. I will say, I did some reading of the
GNU Mach manual and was awed by how much simpler things could be if
every system call dealing with processes just took an explicit task port
argument like it does there. The requirement that so many things
associated with processes can only be manipulated indirectly, and
usually only by the process itself, has caused no end of troubles, and
as the number of process-associated attributes in Linux continues to
grow, the interactions will likely only get more complicated. It's not
the kind of job security that gives any satisfaction.

- reepca
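
For readers following the thread, the ordering of the helper-process arrangement described in the message above can be sketched roughly as follows. This is an illustration only, written in Python rather than the daemon's own C++, and it is not the patch under discussion. The names `run_handshake` and `map_line` and the "done" reply are hypothetical. The "reinit" message mirrors the one visible in the strace output, and the namespace operations themselves are passed in as callbacks so that the sketch runs without any privileges.

```python
import os
import socket


def map_line(inside_id: int, outside_id: int, count: int = 1) -> str:
    """Format one uid_map/gid_map line: '<inside> <outside> <count>'."""
    return f"{inside_id} {outside_id} {count}"


def run_handshake(enter_namespace, write_maps) -> bytes:
    """Fork a helper *before* the main process enters its new namespace.

    enter_namespace(): run in the main process; a stand-in for the
        unshare call (hypothetical callback).
    write_maps(pid): run in the helper; a stand-in for writing
        /proc/<pid>/uid_map, setgroups and gid_map (hypothetical callback).
    Returns whatever the main process received back from the helper.
    """
    main_sock, helper_sock = socket.socketpair()
    main_pid = os.getpid()

    helper_pid = os.fork()
    if helper_pid == 0:
        # Helper process: it never calls enter_namespace(), so it stays in
        # what becomes the *parent* user namespace of the main process.
        main_sock.close()
        status = 1
        try:
            if helper_sock.recv(16) == b"reinit\n":
                write_maps(main_pid)
                helper_sock.sendall(b"done\n")  # assumed reply format
                status = 0
        finally:
            os._exit(status)  # never fall back into the caller's code

    # Main process: enter the new namespace, then ask the helper to
    # initialize it and wait for confirmation.
    helper_sock.close()
    enter_namespace()
    main_sock.sendall(b"reinit\n")
    reply = main_sock.recv(16)  # b"" if the helper failed and exited
    main_sock.close()
    os.waitpid(helper_pid, 0)
    return reply


if __name__ == "__main__":
    # Stubs only: nothing is unshared and no map file is written here.
    print(run_handshake(lambda: None, lambda pid: None))
```

In a real setting, `enter_namespace` would perform the unshare with CLONE_NEWNS|CLONE_NEWUSER and `write_maps` would perform the three writes seen in the strace output (uid_map, then "deny" to setgroups, then gid_map). Whether those writes are permitted depends on the capability rules discussed above, which this sketch does not model.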


This bug report was last modified 67 days ago.

