Ludovic Courtès writes: > I’ve sent a v2 addressing some of the issues you mentioned before. > > Crucially, this one remains: > >> #~(let ((guile (string-append (assoc-ref %guile-build-info >> 'bindir) >> "/guile"))) >> (chmod "/" #o777) >> (copy-file guile "/guile") >> (chmod "/guile" #o6755) >> (sleep 1000) > > That is, / is currently writable inside the build environment, and > that’s: > > 1. a security issue, but it could be addressed with a /top > sub-directory as you wrote; > > 2. a reproducibility issue because a build process now be able to > create/modify files anywhere. > > I looked for solutions to this and couldn’t find anything so far. > > In particular, re-mounting / read-only makes everything beneath it > read-only, including mount points that were initially read-write. It > might be that the wealth of MS_ options could be used to address that, > but honestly, it’s a mess and a maze (“shared subtrees”?). (Note: I've since seen your followup email on this, but I think there's still some interesting ideas in what I wrote before then) Unless there is special behavior for /, I don't see this (every mount point beneath it becoming read-only) happening. When a bind-mount is created, it inherits its options from the filesystem that the source is on ("The bind mount has the same mount options as the underlying mount" in mount(2)). This does not prevent MS_REMOUNT from being used with the MS_RDONLY bit zeroed to subsequently make the newly-created mount point writable, nor, to my knowledge, does it modify the flags of any existing mount points underneath the bind-mount when MS_REC is used with MS_BIND. I expect that it should work to: 1. go through the entire normal chroot setup 2. bind-mount /gnu/store and /tmp to themselves within the chroot using MS_REC so that they are treated as distinct filesystems but also still have their existing bind-mounts underneath them 3. bind-mount / to itself using MS_REC 4. remount / read-only using MS_RDONLY | MS_REMOUNT | MS_BIND This should ensure that the only writable files in the chroot are those either in /tmp, /gnu/store, or in another filesystem inside the chroot (e.g. /dev, /proc, any of the bind mounts in /gnu/store if we were to forget to remount them MS_RDONLY, etc). But note that this will cause open(2) and chmod(2) for filenames in the same filesystem as / to return EROFS instead of EACCES, and it will still be visible to builders that it's owned by the build user. For that matter the same difference will be observable for bind-mounted store items, but this should matter less because we are already in the practice of registered store items being in a store mounted read-only in practical usage. We could try setting the user-writable permission bit to 0 for /, so that it will give EACCES, which might avoid some of the worst of the unreproducibility. Another option would be to use a root-owned "template" root directory that just contains the (empty) subdirectories gnu, gnu/store, tmp, proc, and dev. This template directory would become the root directory used by pivot_root, with individual filesystems and bind mounts created on top of its subdirectories inside the container's mount namespace. This requires no special permissions, the template directory just has to exist and be publicly-visible. It does occur to me now, though, that we wouldn't be able to actually map any other uids within the container to anything without CAP_SETUID, so / would end up appearing as being owned by the overflow uid. Aside from the actual number, though, it should behave like it's owned by root, EACCES and all. I suppose the same behavior would also be observed if the template were owned by any user other than the build user, not just root. > Alternatively, I wondered if we could make / owned by the overflow user, > but that’s probably not possible. > > Perhaps yet another option would be to use subordinate IDs to map two > different users in the container, but that sounds more involved and I’m > not sure how to get that done. My still-young understanding of subordinate IDs is that they're not really a kernel thing, but rather are honored by two setuid programs from the shadow package, newuidmap and newgidmap, so that would be a bit like using a configured sudo, albeit probably easier to integrate with the daemon since they basically just replace the initializeUserNamespace procedure with running a command. We would basically just pick a uid and gid for a guixbuild user (there's no reason not to use the regular user-and-group-adding processes for this), then add entries in /etc/subuid and /etc/subgid indicating that guix-daemon is allowed to map exactly that user and exactly that group, as well as its own user and group. We would then add a case in initializeUserNamespace that would fork+exec+wait calls to newuidmap and newgidmap that map two uids and gids: uid and gid 0 map to the guix-daemon user and group, and guestUID and guestGID are mapped to the guixbuild user and group. In the child, we initially have CAP_SETUID within the user namespace, and can therefore set our user and group ids to the newly-mapped guixbuild user / group. The directories created during the container setup will all appear to be owned by uid and gid 0. Note that when creating the chroot store we'll need to make sure that its group is guixbuild so that the builder can write to it, and I'm not sure how to handle chown'ing of build directories in this case (is it even possible for two cooperating unprivileged users to transfer ownership of an inode?). The earliest reference I can find to new*map in the shadow changelog is from 2013, so it's at least that old. We should probably keep the map-single-id case around in initializeUserNamespace as a fallback for the fully-unprivileged use case, e.g. test-env. While this adds an external dependency on a setuid program, it is at least a setuid program that should be fairly common and have a lot of security-minded attention on it, and be less complex than something like sudo. In exchange, we would get the cleanest rootless-with-an-asterisk daemon configuration I can think of, with no known reproducibility issues, little modification to the daemon required, and the extra safety net of a dedicated build user. It sounds like a pretty decent route to take for the privileged-but-rootless case. > @@ -2707,8 +2769,25 @@ void DerivationGoal::deleteTmpDir(bool force) > // Change the ownership if clientUid is set. Never change the > // ownership or the group to "root" for security reasons. > if (settings.clientUid != (uid_t) -1 && settings.clientUid != 0) { > - _chown(tmpDir, settings.clientUid, > - settings.clientGid != 0 ? settings.clientGid : -1); > + uid_t uid = settings.clientUid; > + gid_t gid = settings.clientGid != 0 ? settings.clientGid : -1; > + try { > + _chown(tmpDir, uid, gid); > + > + if (getuid() != 0) { > + /* If, without being root, the '_chown' call above > + succeeded, then it means we have CAP_CHOWN. Retake > + ownership of tmpDir itself so it can be renamed > + below. */ > + chown(tmpDir.c_str(), getuid(), getgid()); > + } > + } catch (SysError & e) { > + /* When running as an unprivileged user and without > + CAP_CHOWN, we cannot chown the build tree. Print a > + message and keep going. */ > + printMsg(lvlInfo, format("cannot change ownership of build directory '%1%': %2%") > + % tmpDir % strerror(e.errNo)); > + } > > if (top != tmpDir) { > // Rename tmpDir to its parent, with an intermediate step. (Note: pedantic aside here, there aren't currently issues with what is written immediately below as long as the top directory is used to shield the tmpDir - the top directory is doing a LOT of heavy lifting) It shouldn't be a problem in practice due to top only being owner-accessible, but I feel like I should still note that the second chown here would be of a file that was previously owned by the client user, and as such could, in the most general case, have been replaced with anything, such as a setuid program or symlink. While chown(2) resets setuid and setgid bits for unprivileged users, it's unspecified by posix whether this occurs for privileged users. Linux currently does this permission resetting for privileged users, but it wouldn't surprise me if there's still ways to screw things up by chown'ing a symlink. Also, _chown does the actual chown on descent, not on return, so it first chowns a directory and then goes through its contents. This means that, again, if there weren't the top directory there to block access, it would be possible to access a setuid program before it was chown'ed. We seem to be relying entirely on the Linux behavior of chown to reset setuid / setgid bits. And the man page isn't even entirely clear on this: it says those bits are cleared for "an unprivileged user", that a "privileged" user here means one with CAP_CHOWN, and that in Linux since 2.2.13 "root" is treated like other users. This doesn't answer the question of what happens for privileged non-root users. It's also not clear what happens when a user chown's a file with a uid and gid that aren't -1, but are the same as the current owner and group of the file (experimentally, it still resets the setid bits). It would probably be a good idea to explicitly reset these bits in _chown, and perhaps modify _chown to operate bottom-up instead of top-down. Alternatively, we could use secureFilePerms before calling _chown. Also, not shown here, but there's a chmod(tmpDir.c_str(), 0755) shortly before all of this, which means there's a window before _chown could be called in which a setuid program could be exposed, if not for the top directory shielding tmpDir. And if settings.clientUid is -1 or 0, then that window has no end. Just something to keep in mind. (End pedantic aside) Now, for the non-pedantic, significant issue that I came across while writing all that: previously, it was not possible for the tmpDir-exposing code to be reached without doing the _chown that also reset setuid and setgid bits. But with this patch, in the non-root, non-CAP_CHOWN case (which is what is currently proposed for Guix System), it can be reached through the catch clause. In that case it will expose tmpDir without changing any permission bits of files beneath it, allowing anybody who can access a setuid program in tmpDir (which, due to that 0755 chmod, is "everybody") to take control of the build user (which in this case would be guix-daemon). Going by the "Running unprivileged but with CAP_CHOWN" comment, it would seem that code is meant to only be reached by reaching the end of the "try" block, not by reaching the end of the "catch" block. I think it would be a good idea to call secureFilePerms(tmpDir) before any attempt at chown'ing. I still think it would be a good idea to call unshare to create an extra user and mount namespace just before executing the builder in the unprivileged case, just to be sure that the mount-locking behavior is triggered in a way that is documented. - reepca