Ludovic Courtès <ludo@gnu.org> writes:

> I’ve sent a v2 addressing some of the issues you mentioned before.
>
> Crucially, this one remains:
>
>>       #~(let ((guile (string-append (assoc-ref %guile-build-info
>>                                                'bindir)
>>                                     "/guile")))
>>           (chmod "/" #o777)
>>           (copy-file guile "/guile")
>>           (chmod "/guile" #o6755)
>>           (sleep 1000)
>
> That is, / is currently writable inside the build environment, and
> that’s:
>
>   1. a security issue, but it could be addressed with a /top
>      sub-directory as you wrote;
>
>   2. a reproducibility issue because a build process now be able to
>      create/modify files anywhere.
>
> I looked for solutions to this and couldn’t find anything so far.
>
> In particular, re-mounting / read-only makes everything beneath it
> read-only, including mount points that were initially read-write.  It
> might be that the wealth of MS_ options could be used to address that,
> but honestly, it’s a mess and a maze (“shared subtrees”?).

(Note: I've since seen your followup email on this, but I think there's
still some interesting ideas in what I wrote before then)

Unless there is special behavior for /, I don't see this (every mount
point beneath it becoming read-only) happening.  When a bind-mount is
created, it inherits its options from the filesystem that the source is
on ("The bind mount has the same mount options as the underlying mount"
in mount(2)).  This does not prevent MS_REMOUNT from being used with the
MS_RDONLY bit zeroed to subsequently make the newly-created mount point
writable, nor, to my knowledge, does it modify the flags of any existing
mount points underneath the bind-mount when MS_REC is used with MS_BIND.

I expect that it should work to:
1. go through the entire normal chroot setup
2. bind-mount /gnu/store and /tmp to themselves within the chroot using
   MS_REC so that they are treated as distinct filesystems but also
   still have their existing bind-mounts underneath them
3. bind-mount / to itself using MS_REC
4. remount / read-only using MS_RDONLY | MS_REMOUNT | MS_BIND

This should ensure that the only writable files in the chroot are those
either in /tmp, /gnu/store, or in another filesystem inside the chroot
(e.g. /dev, /proc, any of the bind mounts in /gnu/store if we were to
forget to remount them MS_RDONLY, etc).

But note that this will cause open(2) and chmod(2) for filenames in the
same filesystem as / to return EROFS instead of EACCES, and it will
still be visible to builders that it's owned by the build user.  For
that matter the same difference will be observable for bind-mounted
store items, but this should matter less because we are already in the
practice of registered store items being in a store mounted read-only in
practical usage.

We could try setting the user-writable permission bit to 0 for /, so
that it will give EACCES, which might avoid some of the worst of the
unreproducibility.

Another option would be to use a root-owned "template" root directory
that just contains the (empty) subdirectories gnu, gnu/store, tmp, proc,
and dev.  This template directory would become the root directory used
by pivot_root, with individual filesystems and bind mounts created on
top of its subdirectories inside the container's mount namespace.  This
requires no special permissions, the template directory just has to
exist and be publicly-visible.

It does occur to me now, though, that we wouldn't be able to actually
map any other uids within the container to anything without CAP_SETUID,
so / would end up appearing as being owned by the overflow uid.  Aside
from the actual number, though, it should behave like it's owned by
root, EACCES and all.  I suppose the same behavior would also be
observed if the template were owned by any user other than the build
user, not just root.

> Alternatively, I wondered if we could make / owned by the overflow user,
> but that’s probably not possible.
>
> Perhaps yet another option would be to use subordinate IDs to map two
> different users in the container, but that sounds more involved and I’m
> not sure how to get that done.

My still-young understanding of subordinate IDs is that they're not
really a kernel thing, but rather are honored by two setuid programs
from the shadow package, newuidmap and newgidmap, so that would be a bit
like using a configured sudo, albeit probably easier to integrate with
the daemon since they basically just replace the initializeUserNamespace
procedure with running a command.

We would basically just pick a uid and gid for a guixbuild user (there's
no reason not to use the regular user-and-group-adding processes for
this), then add entries in /etc/subuid and /etc/subgid indicating that
guix-daemon is allowed to map exactly that user and exactly that group,
as well as its own user and group.  We would then add a case in
initializeUserNamespace that would fork+exec+wait calls to newuidmap and
newgidmap that map two uids and gids: uid and gid 0 map to the
guix-daemon user and group, and guestUID and guestGID are mapped to the
guixbuild user and group.

In the child, we initially have CAP_SETUID within the user namespace,
and can therefore set our user and group ids to the newly-mapped
guixbuild user / group.  The directories created during the container
setup will all appear to be owned by uid and gid 0.  Note that when
creating the chroot store we'll need to make sure that its group is
guixbuild so that the builder can write to it, and I'm not sure how to
handle chown'ing of build directories in this case (is it even possible
for two cooperating unprivileged users to transfer ownership of an
inode?).

The earliest reference I can find to new*map in the shadow changelog is
from 2013, so it's at least that old.  We should probably keep the
map-single-id case around in initializeUserNamespace as a fallback for
the fully-unprivileged use case, e.g. test-env.

While this adds an external dependency on a setuid program, it is at
least a setuid program that should be fairly common and have a lot of
security-minded attention on it, and be less complex than something like
sudo.  In exchange, we would get the cleanest rootless-with-an-asterisk
daemon configuration I can think of, with no known reproducibility
issues, little modification to the daemon required, and the extra safety
net of a dedicated build user.

It sounds like a pretty decent route to take for the
privileged-but-rootless case.


> @@ -2707,8 +2769,25 @@ void DerivationGoal::deleteTmpDir(bool force)
>              // Change the ownership if clientUid is set. Never change the
>              // ownership or the group to "root" for security reasons.
>              if (settings.clientUid != (uid_t) -1 && settings.clientUid != 0) {
> -                _chown(tmpDir, settings.clientUid,
> -                       settings.clientGid != 0 ? settings.clientGid : -1);
> +		uid_t uid = settings.clientUid;
> +		gid_t gid = settings.clientGid != 0 ? settings.clientGid : -1;
> +		try {
> +		    _chown(tmpDir, uid, gid);
> +
> +		    if (getuid() != 0) {
> +			/* If, without being root, the '_chown' call above
> +			   succeeded, then it means we have CAP_CHOWN.  Retake
> +			   ownership of tmpDir itself so it can be renamed
> +			   below.  */
> +			chown(tmpDir.c_str(), getuid(), getgid());
> +		    }
> +		} catch (SysError & e) {
> +		    /* When running as an unprivileged user and without
> +		       CAP_CHOWN, we cannot chown the build tree.  Print a
> +		       message and keep going.  */
> +		    printMsg(lvlInfo, format("cannot change ownership of build directory '%1%': %2%")
> +			     % tmpDir % strerror(e.errNo));
> +		}
>  
>  		if (top != tmpDir) {
>  		    // Rename tmpDir to its parent, with an intermediate step.

(Note: pedantic aside here, there aren't currently issues with what is
written immediately below as long as the top directory is used to shield
the tmpDir - the top directory is doing a LOT of heavy lifting)

It shouldn't be a problem in practice due to top only being
owner-accessible, but I feel like I should still note that the second
chown here would be of a file that was previously owned by the client
user, and as such could, in the most general case, have been replaced
with anything, such as a setuid program or symlink.  While chown(2)
resets setuid and setgid bits for unprivileged users, it's unspecified
by posix whether this occurs for privileged users.  Linux currently does
this permission resetting for privileged users, but it wouldn't surprise
me if there's still ways to screw things up by chown'ing a symlink.

Also, _chown does the actual chown on descent, not on return, so it
first chowns a directory and then goes through its contents.  This means
that, again, if there weren't the top directory there to block access,
it would be possible to access a setuid program before it was chown'ed.

We seem to be relying entirely on the Linux behavior of chown to reset
setuid / setgid bits.  And the man page isn't even entirely clear on
this: it says those bits are cleared for "an unprivileged user", that a
"privileged" user here means one with CAP_CHOWN, and that in Linux since
2.2.13 "root" is treated like other users.  This doesn't answer the
question of what happens for privileged non-root users.  It's also not
clear what happens when a user chown's a file with a uid and gid that
aren't -1, but are the same as the current owner and group of the file
(experimentally, it still resets the setid bits).  It would probably be
a good idea to explicitly reset these bits in _chown, and perhaps modify
_chown to operate bottom-up instead of top-down.  Alternatively, we
could use secureFilePerms before calling _chown.

Also, not shown here, but there's a chmod(tmpDir.c_str(), 0755) shortly
before all of this, which means there's a window before _chown could be
called in which a setuid program could be exposed, if not for the top
directory shielding tmpDir.  And if settings.clientUid is -1 or 0, then
that window has no end.

Just something to keep in mind.

(End pedantic aside)


Now, for the non-pedantic, significant issue that I came across while
writing all that: previously, it was not possible for the
tmpDir-exposing code to be reached without doing the _chown that also
reset setuid and setgid bits.  But with this patch, in the non-root,
non-CAP_CHOWN case (which is what is currently proposed for Guix
System), it can be reached through the catch clause.  In that case it
will expose tmpDir without changing any permission bits of files beneath
it, allowing anybody who can access a setuid program in tmpDir (which,
due to that 0755 chmod, is "everybody") to take control of the build
user (which in this case would be guix-daemon).

Going by the "Running unprivileged but with CAP_CHOWN" comment, it would
seem that code is meant to only be reached by reaching the end of the
"try" block, not by reaching the end of the "catch" block.  I think it
would be a good idea to call secureFilePerms(tmpDir) before any attempt
at chown'ing.


I still think it would be a good idea to call unshare to create an extra
user and mount namespace just before executing the builder in the
unprivileged case, just to be sure that the mount-locking behavior is
triggered in a way that is documented.

- reepca