GNU bug report logs - #77862
guix-daemon run as non-root sets up /etc/group incorrectly in build container


Package: guix;

Reported by: keinflue <keinflue <at> posteo.net>

Date: Thu, 17 Apr 2025 11:22:03 UTC

Severity: important



Message #64 received at 77862 <at> debbugs.gnu.org (full text, mbox):

From: Reepca Russelstein <reepca <at> russelstein.xyz>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: keinflue <keinflue <at> posteo.net>, 77862 <at> debbugs.gnu.org
Subject: Re: guix-daemon run as non-root sets up /etc/group incorrectly in
 build container
Date: Tue, 03 Jun 2025 08:05:21 -0500
Ludovic Courtès <ludo <at> gnu.org> writes:

> Hey Reepca,
>
> Reepca Russelstein <reepca <at> russelstein.xyz> writes:
>
>> Ludovic Courtès <ludo <at> gnu.org> writes:
>
> [...]
>
>>> The attached patch tries to do that, by calling out to ‘newuidmap’, and
>>> under the assumption that /etc/subgid allows mapping the ‘kvm’ group.
>>>
>>> It does the job (a build process can chown to ‘kvm’), but I couldn’t get
>>> the GID mapping preserved across the ‘unshare’ call (the call that is
>>> made to “lock” mounts), hence the “#if 0” there.
>>>
>>> The problem is that when we call ‘unshare’, the ‘newgidmap’ setuid
>>> binary is no longer accessible because we’re already in a chroot, so it
>>> seems that we cannot preserve the GID map.
>>
>> ... and even if we had the setuid binary accessible (for example via a
>> saved file descriptor that could be used with execveat, or a
>> bind-mount), it wouldn't be of any use at this point because (man
>> user_namespaces):
>>
>> "if either the user or the group ID of the file has no mapping inside
>> the namespace, the set-user-ID (set-group-ID) bit is silently ignored:
>> the new program is executed, but the process's effective user (group) ID
>> is left unchanged."
>>
>> Naturally, uid 0 isn't going to be mapped!  In fact, more generally,
>> newuidmap and newgidmap can't ever be used from within an uninitialized
>> user namespace, since by definition uid 0 isn't yet mapped in it.
>>
>> So it falls to the parent process to do the initialization - that is, it
>> now has to do the initialization twice.  Of course, it's going to need
>> some way of knowing when the second user namespace has been created, and
>> the child is going to need some way of knowing when it's been
>> initialized, so we'll need to either use two pipes or switch to using a
>> socketpair.
>
> That makes sense, thanks for explaining!
>
> However, I think part of the rules still don’t fit in my head.  Here’s
> what I have now (parent is 6929):
>
> 6929  openat(AT_FDCWD, "/proc/6938/uid_map", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929  write(17, "30001 1000 1", 12)     = 12
> 6929  close(17)                         = 0
> 6929  openat(AT_FDCWD, "/proc/6938/setgroups", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929  write(17, "deny", 4)              = 4
> 6929  close(17)                         = 0
> 6929  openat(AT_FDCWD, "/proc/6938/gid_map", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929  write(17, "30000 998 1", 11)      = 11
> 6929  close(17)                         = 0
> […]
> 6938  unshare(CLONE_NEWNS|CLONE_NEWUSER) = 0
> 6938  write(20, "reinit\n", 7)          = 7
> 6929  <... read resumed>"reinit\n", 20) = 7
> 6938  read(17,  <unfinished ...>
> 6929  write(4, "gmlo\0\0\0\0+\0\0\0\0\0\0\0|   reinitializing UID/GID mapping of 6938\n\0\0\0\0\0", 64) = 64
> 6929  getgid()                          = 998
> 6929  getuid()                          = 1000
> 6929  openat(AT_FDCWD, "/proc/6938/uid_map", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 17
> 6929  write(17, "30001 1000 1", 12)     = -1 EPERM (Operation not permitted)
>
> Why oh why would we get EPERM the second time?
>
> These restrictions appear to be respected:
>
>    •  The data written to uid_map (gid_map) must consist of a single
>       line that maps the writing process's effective user ID (group
>       ID) in the parent user namespace to a user ID (group ID) in the
>       user namespace.
>
>    •  The writing process must have the same effective user ID as
>       the process that created the user namespace.
>
> Or is the user namespace of the parent (6929) not the same as “the
> parent user namespace”?

I'm afraid that is indeed what we're running into: "the parent user
namespace" is the one created by clone.  And because the parent user
namespace is not the root user namespace, we couldn't run newgidmap from
within it, even if we still had a process around that was in it.

However, at this point, we actually don't need to use newgidmap, because
we have CAP_SETGID in the current user namespace... but in order to use
the "no restrictions" version of writing to gid_map, it's necessary for
the writing process to have CAP_SETGID in the *parent* user namespace
(in hindsight, this is what the actual problem was with the original
patch - it specified "true" for haveCapSetGID, so it shouldn't have been
using newgidmap anyway).  This necessarily requires that the writing
process is not in the user namespace being initialized, and because it
must either be in the namespace being initialized or its parent user
namespace, that means it must be in the parent user namespace.  Of
course, at this point, no processes exist in the parent user namespace.

So if you'll bear with the extreme awkwardness, we could fork a helper
process immediately prior to calling unshare, which, upon receiving a
notification, will initialize the parent process's user namespace.  Note
that the naming here is going to be inverted for process ancestry and
user namespace ancestry: the child process is in the parent user
namespace, and the parent process is in the child user namespace.
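
The helper arrangement described above can be sketched roughly as
follows.  This is an illustration in Python, not the daemon's code: the
names map_line, write_maps, unshare_with_helper and demo are
hypothetical, as are the one-byte "r"/"k" messages exchanged over the
socketpair.  The IDs 30001 and 30000 are the ones from the strace
output quoted earlier.  It assumes Linux with unprivileged user
namespaces enabled and a single-threaded process.

```python
import ctypes
import os
import socket

CLONE_NEWNS = 0x00020000    # value from <sched.h>
CLONE_NEWUSER = 0x10000000  # value from <sched.h>


def unshare(flags):
    """Call unshare(2) through libc; raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def map_line(inside_id, outside_id, count=1):
    """One uid_map/gid_map line: ID inside, ID in the parent namespace, length."""
    return f"{inside_id} {outside_id} {count}"


def write_maps(pid, uid_line, gid_line):
    """Write uid_map, then gid_map, of PID's user namespace (hypothetical helper)."""
    for name, data in (("uid_map", uid_line), ("gid_map", gid_line)):
        with open(f"/proc/{pid}/{name}", "w") as f:
            f.write(data)


def unshare_with_helper(do_unshare, init_maps):
    """Fork a helper, call do_unshare(), then let the helper run init_maps(pid)."""
    pid = os.getpid()
    ours, theirs = socket.socketpair()
    helper = os.fork()
    if helper == 0:
        # Helper: child process, but it stays in the *parent* user namespace.
        ours.close()
        status = 1
        try:
            if theirs.recv(1) == b"r":   # the other side has unshared
                init_maps(pid)
                theirs.sendall(b"k")     # maps are written
                status = 0
        finally:
            os._exit(status)
    theirs.close()
    try:
        do_unshare()                     # now in the *child* user namespace
        ours.sendall(b"r")
        ok = ours.recv(1) == b"k"        # EOF (b"") means the helper failed
    finally:
        ours.close()                     # on error, EOF unblocks the helper
        os.waitpid(helper, 0)
    return ok


def demo():
    """Create two nested user namespaces; the helper initializes the second."""
    uid, gid = os.getuid(), os.getgid()
    # First namespace: unprivileged self-mapping, so "deny" must be written
    # to setgroups before gid_map.
    unshare(CLONE_NEWUSER)
    with open("/proc/self/setgroups", "w") as f:
        f.write("deny")
    write_maps("self", map_line(30001, uid), map_line(30000, gid))
    # Second namespace: the helper is forked while we still hold CAP_SETUID
    # and CAP_SETGID in the first namespace, and it inherits them.
    return unshare_with_helper(
        lambda: unshare(CLONE_NEWUSER | CLONE_NEWNS),
        lambda pid: write_maps(pid, map_line(30001, 30001),
                               map_line(30000, 30000)))
```

Note that the second pair of map lines uses 30001 and 30000 on both
sides: the "outside" IDs are interpreted in the parent user namespace,
which is the first namespace rather than the root one.  Since the
helper has CAP_SETGID there, it could in principle write more than one
gid_map line, subject to those IDs being mapped in the first namespace.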

> Thanks for your patience.  :-)
>
> Ludo’.
>
> PS: The more I use it, the less I can stand this user namespace soup
>     presented as an “API”.

It certainly has a lot of clauses.  I will say, I did some reading of
the GNU Mach manual and was awed by how much simpler things could be if
every system call dealing with processes just took an explicit task port
argument like it does there.  The requirement that so many things
associated with processes can only be manipulated indirectly, and
usually only by the process itself, has caused no end of troubles, and
as the number of process-associated attributes in Linux continues to
grow, the interactions will likely only get more complicated.

It's not the kind of job security that gives any satisfaction.

- reepca

This bug report was last modified 9 days ago.
