Package: guix-patches;
Reported by: Carlo Zancanaro <carlo <at> zancanaro.id.au>
Date: Tue, 27 Feb 2018 21:58:02 UTC
Severity: normal
Done: ludo <at> gnu.org (Ludovic Courtès)
Bug is archived. No further changes may be made.
Message #26 received at 30637 <at> debbugs.gnu.org (full text, mbox):
From: ludo <at> gnu.org (Ludovic Courtès) To: Carlo Zancanaro <carlo <at> zancanaro.id.au> Cc: 30637 <at> debbugs.gnu.org Subject: Re: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Date: Sat, 03 Mar 2018 16:21:07 +0100
Hi Carlo, Overall LGTM! It’s a long reply though, but that’s because there are lots of details to pay attention to in this Unix quagmire. :-) Carlo Zancanaro <carlo <at> zancanaro.id.au> skribis: > I've re-written my patch, and it's attached in two commits. The first > one adds the necessary calls to prctl, and the second adds the > fallback to polling. If possible I would prefer a commit that only adds prctl, and the next one would actually use it. I find it clearer and more convenient if we need to bisect or revert. > On Fri, Mar 02 2018, Ludovic Courtès wrote: >> The ‘prctl’ procedure itself should simply throw to 'system-error on >> GNU/Hurd. But then, in (shepherd), we could add the polling thing >> when (not (string-contains %host-type "linux")). >> >> WDYT? > > I don't like the idea of doing this based on the host type. In my > patch I've done it based on whether the prctl call succeeded. Right, I actually agree with feature-based checks. :-) More on that inline below. > The fallback code still fails in the guix build environment (as my > previous patch did), but when it's using prctl it works properly. This > means that a build on Linux pre-3.4, or on Hurd, will fail, which > probably isn't acceptable given that shepherd is a hard dependency for > starting a GuixSD system. As far as I can tell the test fails because > the processes stick around as zombies, If they’re zombies, that means nobody called waitpid(2). Presumably the polling code would need to do that. So I suppose ‘check-for-dead-services’ should do something like: (when (integer? running) (catch 'system-error (lambda () (match (waitpid* running WNOHANG) (#f #f) ;uh, not a PID? ((0 . _) #f) ;ditto? ((pid . _) (local-output "PID ~a (~a) is dead" running (canonical-name service)) (respawn-service service)))) (lambda args (or (= ECHILD (system-error-errno args)) ;wrong PID? (= EPERM (system-error-errno args)) ;not a child (apply throw args))))) Does that make sense? Please check waitpid(2) carefully though, because it’s pretty gnarly and I might have forgotten or misinterpreted something here. >> We want to set the “reaper” of child processes to Shepherd itself, >> so we must do that in child processes. The shepherd process cannot >> be its own reaper I suppose. > > Reading the manpage, and then running some code, I think you're wrong > about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling > process as a child subreaper. That means that any processes that are > orphaned below the current process get reparented under the current > process (or a closer child subreaper, if there's one further down). If > there are no processes marked as child subreapers, then the orphaned > process is reparented under pid 1. We should only need to call prctl > in two places: when shepherd initially starts, or when we fork for > daemonize. Oh you’re right, sorry for the confusion! > From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro <carlo <at> zancanaro.id.au> > Date: Sat, 3 Mar 2018 17:26:05 +1100 > Subject: [PATCH 1/2] Handle forked process SIGCHLD signals > > * Makefile.am (TESTS): Add tests/forking-service.sh. > * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER. > * modules/shepherd.scm: Set the child subreaper attribute of main shepherd > process (as long as we're not pid 1). > * modules/shepherd/service.scm (root-service)[daemonize]: Set the child > subreaper attribute of newly forked shepherd process. > * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable > and export it. > (prctl): Add new procedure and export it. [...] > --- a/modules/shepherd.scm > +++ b/modules/shepherd.scm > @@ -50,6 +50,8 @@ > ;; Main program. > (define (main . args) > (initialize-cli) > + (when (not (= 1 (getpid))) > + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) I think it’s a good idea to add a comment, like: ;; Register ourselves to get SIGCHLD when child processes terminate. ;; This is necessary for daemons for which we’d otherwise never get ;; SIGCHLD. > +(define prctl > + (let ((proc (syscall->procedure long "prctl" (list int int)))) On GNU/Hurd libc doesn’t have a “prctl” symbol. So you need something like: (if (dynamic-func "prctl" (dynamic-link)) (let ((proc …)) …) (lambda (process operation) ;; We’re running on an OS that lacks ‘prctl’, such as GNU/Hurd. (throw 'system-error "prctl" "~A" (list (strerror ENOSYS)) (list ENOSYS)))) > +function cleanup() { You need either () or “function” but not both (shells other than Bash might complain…). > +shepherd_pid="$(cat $pid)" Likewise, we should use `foo` instead of $(foo) to suppose non-Bash shells (there are several occurrences of $(foo) here.) > From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro <carlo <at> zancanaro.id.au> > Date: Wed, 21 Feb 2018 22:57:59 +1100 > Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services > > * modules/shepherd.scm (open-server-socket): Set socket to be > non-blocking. > (main): Use select with a timeout. If prctl failed when shepherd started > then call check-for-dead-services between connections/timeouts. > * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD as > signal handler. > (respawn-service): Separate logic for respawning services from handling > SIGCHLD. > (handle-SIGCHLD, check-for-dead-services): New exported procedures. > * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with > symbols. [...] > + (define poll-services > + (if (= 1 (getpid)) > + (lambda () #f) > + (catch 'system-error > + (lambda () > + (prctl PR_SET_CHILD_SUBREAPER 1) > + (lambda () #f)) > + (lambda (key . args) > + check-for-dead-services)))) Please add a comment in the handler saying that we resort to polling on OSes that do not support ‘prctl’. However, perhaps we should do: (lambda args (let ((errno (system-error-errno args))) (if (= ENOSYS errno) check-for-dead-services (apply throw args)))) so that important/unexpected errors are not silently ignored. > +(define (respawn-service serv) > + (when serv Please add a docstring and move ‘when’ to the caller. > +(define (check-for-dead-services) Docstring as well :-), and also a comment explaining that this is a last resort for prctl-less OSes. > (register-services > (make <service> > #:provides '(test-loaded) > - #:start (const 42) > + #:start (const 'abc) Perhaps with the ‘check-for-dead-services’ use of ‘waitpid’ I outlined above we can even keep 42 here? If not, we should add in shepherd.texi, under “Slots of services”, a few words saying that when ‘running’ is an integer it is assumed to be a PID. Could you send an updated patch? Thanks for working on this! Ludo’.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.