From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Tue, 27 Feb 2018 21:58:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: 30637@debbugs.gnu.org X-Debbugs-Original-To: guix-patches@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.15197686316362 (code B ref -1); Tue, 27 Feb 2018 21:58:02 +0000 Received: (at submit) by debbugs.gnu.org; 27 Feb 2018 21:57:11 +0000 Received: from localhost ([127.0.0.1]:36067 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eqnFA-0001eQ-7n for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:57:11 -0500 Received: from eggs.gnu.org ([208.118.235.92]:48619) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eqnF4-0001dt-Bc for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:57:03 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eqnEw-0004a8-Se for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:56:53 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:54148) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1eqnEw-0004Zs-NQ for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:56:50 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:39578) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eqnEu-0006eU-5d for guix-patches@gnu.org; Tue, 27 Feb 2018 16:56:50 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eqnEs-0004Y4-2i for guix-patches@gnu.org; Tue, 27 Feb 2018 16:56:48 -0500 Received: from mail-wr0-x22a.google.com ([2a00:1450:400c:c0c::22a]:37299) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1eqnEr-0004WN-Mm for guix-patches@gnu.org; Tue, 27 Feb 2018 16:56:46 -0500 Received: by mail-wr0-x22a.google.com with SMTP id z12so331776wrg.4 for ; Tue, 27 Feb 2018 13:56:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:user-agent:from:to:subject:date:message-id:mime-version; bh=9py7Z0DpH6wyAuDMs0n7X1ExAtz8dmoBmitkvrBiU2M=; b=A/ZlqJrJG006k/35y9qHW3KEfNICyUoBjPL2lJT2Uwx6vHEyJuzwe8ShRVwNX3/TsT sysB/8o6LV9girkRw4h1ETrEzFPXvGI9h+Gxp3c710QnNLgFwuA1ynB5tk0RPc0m0A5y NDVddaxXogVwLHMoH0O+fwLzK7bqkOYSJ2VhOvZLhni8krKjzjibDg3JWz/U509euCCz yB9ZP2pb5TkwrQIsow4QDTs3UHxcdEsABMZohK4IQZ7zPS5QXgmTNXfT7cybuYNglR4W xnpBsIgtJenNtSzFRRXvvw8o83MxmuaDZHc9dUHmcQxb8oNqxk5ygcY8VGceuHwiBC1j bG0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:user-agent:from:to:subject:date :message-id:mime-version; bh=9py7Z0DpH6wyAuDMs0n7X1ExAtz8dmoBmitkvrBiU2M=; b=XNoHAz33FpqV2xcyfvBx7knzTNUeYgdBpN+Iq1I3jbUlQdBaOrR0GIfg5a5qMTxu/H EH033wuHZz8eDQLohF2lmGJPpXb+H4F+yHsAS528XYkucN59VF0V3BlQuWhHv7w6vxyL WBFZP/bKy0z4j2qN6p0pJCny5pkDqs5ofI36CfsFX9EDDQsR05EQDb72hLA0SZBZEju5 NqRXHfT5oPJSER4mfn58Uq4mXmw00LcqNKFvHw5UExHMRLQb5tBU5etG/tblT6GPMmhm KU6yvy2pLvleMc+wBlsb2IHjv/Wdr24T6o6NGYoP1G7cK+TZnyjx6u0UrRMpqoFsmHVt sA2w== X-Gm-Message-State: APf1xPA9YPv1qNxpq5DiLztbJv41ypEGrSA5FNs1OAWwQyjpV3zqCgL8 idcYyZ3BBu0aTFKqhAWRF9Z2W66N X-Google-Smtp-Source: AH8x227IJBfYS3+HsrBbvfdSpU+f9EIRCAM71EavblfoROWzdE78qLjmXqODy60oybEt0qCAlbSnlg== X-Received: by 10.223.190.134 with SMTP id i6mr13899355wrh.157.1519768603861; Tue, 27 Feb 2018 13:56:43 -0800 (PST) Received: from pidgey ([2a00:23c0:4e80:d501:8f8:e119:5f6d:87e3]) by smtp.gmail.com with ESMTPSA id p104sm161020wrb.47.2018.02.27.13.56.41 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 27 Feb 2018 13:56:43 -0800 (PST) User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro Date: Wed, 28 Feb 2018 08:56:34 +1100 Message-ID: <878tbe9jvx.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -3.5 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Another patch for shepherd! This one I'm much less happy about, but I'm sending it for feedback. I'll explain the problem it solves, then the (hacky) way that it solves it for now. I'm sure there's a better way to solve this, but it was annoying me enough for me to do this now. [...] Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.2 HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (carlozancanaro[at]gmail.com) 1.0 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.0 T_DKIM_INVALID DKIM-Signature header exists but is not valid 0.2 FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; format=flowed Another patch for shepherd! This one I'm much less happy about, but I'm sending it for feedback. I'll explain the problem it solves, then the (hacky) way that it solves it for now. I'm sure there's a better way to solve this, but it was annoying me enough for me to do this now. The problem is that shepherd, when run as a user process, can "lose" services which fork away. Shepherd can still kill them, but a SIGCHLD won't be delivered if they die, so shepherd can't restart/disable them. My prime example is emacs, which I run with --daemon. If I then kill emacs, shepherd will still think that it is running. To solve this problem, I have modified shepherd to poll each process that it thinks is running. I think this is approximately every 0.5s, but I don't quite understand guile's behaviour when it comes to socket timeouts. This involves a subtle change in the meaning of the running field in objects. My patch treats any `integer?` as a pid, and if a corresponding process is not found it will attempt to restart the service. I considered creating a new "pid" datatype to make it clear when a number represents a pid, but didn't want to go overboard without further feedback. So, thoughts? Can anyone suggest a better way to solve this problem? I'm also confused by my new test. If I run it myself then it passes fine, but in a guix build container there is something different which means that the processes don't get reaped properly, so the test doesn't terminate. I'll keep trying to work it out, but if anyone else has an idea I'd love to hear it. Carlo --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-shepherd-Poll-every-0.5s-to-find-dead-forked-service.patch Content-Transfer-Encoding: quoted-printable From=20ac87535ebdcc7490d9bbd552c99337917e42a6e8 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH] shepherd: Poll every 0.5s to find dead forked services * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): Use select with a timeout, and call check-for-dead-services betwe= en connections/timeouts. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. * tests/forking-service.sh: New file. * Makefile.am: Add tests/forking-service.sh. =2D-- Makefile.am | 3 +- modules/shepherd.scm | 19 ++++++-- modules/shepherd/service.scm | 78 ++++++++++++++++++------------ tests/basic.sh | 4 +- tests/forking-service.sh | 111 +++++++++++++++++++++++++++++++++++++++= ++++ tests/status-sexp.sh | 4 +- 6 files changed, 179 insertions(+), 40 deletions(-) create mode 100644 tests/forking-service.sh diff --git a/Makefile.am b/Makefile.am index 1c394e1..5eb058f 100644 =2D-- a/Makefile.am +++ b/Makefile.am @@ -189,7 +189,8 @@ TESTS =3D \ tests/no-home.sh \ tests/pid-file.sh \ tests/status-sexp.sh \ =2D tests/signals.sh + tests/signals.sh \ + tests/forking-service.sh =20 TEST_EXTENSIONS =3D .sh EXTRA_DIST +=3D $(TESTS) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index 650ba63..cdcd328 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -218,11 +220,18 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) 0.5) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (check-for-dead-services) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index 83600e4..22972c5 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1009,39 +1012,42 @@ otherwise by updating its state." =20 ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. =2D (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + (when serv + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f))))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1177,18 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/forking-service.sh b/tests/forking-service.sh new file mode 100644 index 0000000..90c684a =2D-- /dev/null +++ b/tests/forking-service.sh @@ -0,0 +1,111 @@ +# GNU Shepherd --- Test detecting a forked process' termination +# Copyright =C2=A9 2016 Ludovic Court=C3=A8s +# Copyright =C2=A9 2018 Carlo Zancanaro +# +# This file is part of the GNU Shepherd. +# +# The GNU Shepherd is free software; you can redistribute it and/or modify= it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3 of the License, or (at +# your option) any later version. +# +# The GNU Shepherd is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with the GNU Shepherd. If not, see . + +shepherd --version +herd --version + +socket=3D"t-socket-$$" +conf=3D"t-conf-$$" +log=3D"t-log-$$" +pid=3D"t-pid-$$" +service_pid=3D"t-service-pid-$$" +service2_pid=3D"t-service2-pid-$$" +service2_started=3D"t-service2-starts-$$" + +herd=3D"herd -s $socket" + +function cleanup() { + cat $log || true + rm -f $socket $conf $log $service2_started + test -f $pid && kill "$(cat $pid)" || true + rm -f $pid + test -f $service_pid && kill "$(cat $service_pid)" || true + rm -f $service_pid + test -f $service2_pid && kill "$(cat $service2_pid)" || true + rm -f $service2_pid +} + +trap cleanup EXIT + +cat > "$conf"< $PWD/$service_pid")) + +(register-services + (make + ;; A service that forks into a different process. + #:provides '(test) + #:start (make-forkexec-constructor %command + #:pid-file "$PWD/$service_pid") + #:stop (make-kill-destructor) + #:respawn? #f)) + +(define %command2 + '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & ech= o \$! > $PWD/$service2_pid")) + +(register-services + (make + ;; A service that forks into a different process. + #:provides '(test2) + #:start (make-forkexec-constructor %command2 + #:pid-file "$PWD/$service2_pid") + #:stop (make-kill-destructor) + #:respawn? #t)) +EOF +cat $conf + +rm -f "$pid" +shepherd -I -s "$socket" -c "$conf" -l "$log" --pid=3D"$pid" & + +# Wait till it's ready. +while ! test -f "$pid" ; do sleep 0.3 ; done + +shepherd_pid=3D"$(cat $pid)" + +# start both of the services +$herd start test +$herd start test2 + +# make sure "test" is started +until $herd status test | grep started; do sleep 0.3; done +test -f "$service_pid" +service_pid_value=3D"$(cat $service_pid)" +# now kill it +kill "$service_pid_value" +while kill -0 "$service_pid_value"; do sleep 0.3; done +# shepherd should notice that the service has stopped within one second +sleep 1 +$herd status test | grep stopped + + + +# make sure "test2" has started +until $herd status test2 | grep started; do sleep 0.3; done +test -f "$service2_pid" +service2_pid_value=3D"$(cat $service2_pid)" +test "$(cat $PWD/$service2_started)" =3D "started" +# now kill it +rm -f "$service2_pid" +kill $service2_pid_value +while kill -0 "$service2_pid_value"; do sleep 0.3; done +# shepherd should notice that the service has stopped, and restart it, wit= hin one second +sleep 1; +$herd status test2 | grep started +test "$(cat $PWD/$service2_started)" =3D "started +started" diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqV1BIACgkQqdyPv9aw Ibz/Nw/5AdjbhliDDcHtRm/WW9Mf/cMZdbNQU2Dl2+/Ig/yQ4KgtZcIEtGxQV49E 0TYIXnbD0lJ7KWR/+4bttlzBcXe9xZSNObMC5OcM68Kvuikmrp9DsjjttNJ6kLcx 056Meze8bCoY9V9C+3MyESXhkipVA6AdvqSfR+z+0bsbl2xaebZyufMdbooEjTAg Ift3U8A073NyGKnMtEn5JAcVbyn8djFx37tZkBbB+nYRGhzy5sodsMRb0EOREo/8 Wxfrpg8fIxzNE+vppEBPbk5E2F823j+cIVgQmztiT3Ia1epYlKR6ooj6sIzoFI2T U9LT43swEFcsk0VazvjBx1ADGjtdT3cNhVXfdnCrHvBOhHsDg9mA9rJ4/ILpEosP Wbc2W3Yym8N5DKc3yqDgSsJ1enwUhW6XP/RN6p9kdxOBbykhStp0s1xsILmXNzsW 3fDi9h14WVhq+ZbyWbaoHRQZ/hNKBsOxoOAjimI5/NjfGzIMWJWUcwyTSe1HN/MK v3a4Uo04UEMva/CX+B6+1oi6O7X068elROfPmn1BhZkhRrUV9eDugaGqgasGRpgV QEWc5rLidIwuSnXWeJqfMbi8wldcrWh5joqIchW0dKC8kuE4t2iXL8RYzSkEepkA 1Lde+ema8r1VA971yE4xNuftH38AGBhBtWdQ1PseMXaGS7uoCBk= =yPdk -----END PGP SIGNATURE----- --==-=-=-- From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Wed, 28 Feb 2018 22:07:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.151985557529953 (code B ref 30637); Wed, 28 Feb 2018 22:07:01 +0000 Received: (at 30637) by debbugs.gnu.org; 28 Feb 2018 22:06:15 +0000 Received: from localhost ([127.0.0.1]:37985 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1er9ra-0007n3-Mg for submit@debbugs.gnu.org; Wed, 28 Feb 2018 17:06:14 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:33086) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1er9rV-0007mp-Te for 30637@debbugs.gnu.org; Wed, 28 Feb 2018 17:06:10 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id D6CE210C6B; Wed, 28 Feb 2018 23:06:08 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id pfIjDq3VtS_w; Wed, 28 Feb 2018 23:06:07 +0100 (CET) Received: from ribbon (unknown [IPv6:2a01:e0a:1d:7270:af76:b9b:ca24:c465]) by hera.aquilenet.fr (Postfix) with ESMTPSA id D8EBA10157; Wed, 28 Feb 2018 23:06:06 +0100 (CET) From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 10 =?UTF-8?Q?Vent=C3=B4se?= an 226 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Wed, 28 Feb 2018 23:06:06 +0100 In-Reply-To: <878tbe9jvx.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Wed, 28 Feb 2018 08:56:34 +1100") Message-ID: <87y3jcu5v5.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Hi Carlo, Carlo Zancanaro skribis: > Another patch for shepherd! Always welcome! :-) > The problem is that shepherd, when run as a user process, can "lose" > services which fork away. Shepherd can still kill them, but a SIGCHLD > won't be delivered if they die, so shepherd can't restart/disable > them. My prime example is emacs, which I run with --daemon. If I then > kill emacs, shepherd will still think that it is running. There are two issues here, I think. 1. shepherd cannot lose SIGCHLD: if a process dies immediately once it=E2=80=99s been spawned, as is the case with =E2=80=9Cemacs --daemon= =E2=80=9D or any other daemon-style program, it should receive SIGCHLD and process it. However, there could be a race condition: if SIGCHLD is handled before the =E2=80=98running=E2=80=99 value has been set, then we=E2=80= =99ll still get the non-#f =E2=80=98running=E2=80=99 value even though the process died in= the meantime. The code tries to prevent that (see (shepherd service) around line 320), but looking more closely, I think the race is still there. Namely, the whole =E2=80=98let=E2=80=99 block, including the call to = =E2=80=98start=E2=80=99, should be in =E2=80=98call-with-blocked-asyncs=E2=80=99, I think. Cou= ld you check if that helps for you? 2. shepherd currently can=E2=80=99t do much with real daemons. So what w= e do in GuixSD is to either start programs in non-daemon mode, when that=E2=80=99s an option, or pass #:pid-file to retrieve the forked pr= ocess PID. I think you should do one of these as well. WDYT? Thanks, Ludo=E2=80=99. From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Thu, 01 Mar 2018 22:39:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.151994389113434 (code B ref 30637); Thu, 01 Mar 2018 22:39:02 +0000 Received: (at 30637) by debbugs.gnu.org; 1 Mar 2018 22:38:11 +0000 Received: from localhost ([127.0.0.1]:39832 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erWq3-0003Uc-IF for submit@debbugs.gnu.org; Thu, 01 Mar 2018 17:38:11 -0500 Received: from mail-pg0-f49.google.com ([74.125.83.49]:44915) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erWq2-0003UP-ES for 30637@debbugs.gnu.org; Thu, 01 Mar 2018 17:38:11 -0500 Received: by mail-pg0-f49.google.com with SMTP id l4so2925695pgp.11 for <30637@debbugs.gnu.org>; Thu, 01 Mar 2018 14:38:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=DCVqRBozyINNuvxf2SBrm77cC2G13govzoheAYqv+3c=; b=G+SSVGemhBHyB1N0PxwuofRfXG/CBnwhV8GuTwxk+4LKLctn+XkuIYV63oEgLVUvgi pT1+IPeHgef3pJgKZeKJjz+IGT92158PJMM8pxJmQArXzTWU265ygN1Dn90E3//9PLhR e9r8HAPAZDDRbbDAmbt8It1zjEFrkrR5yfhNIaBN9ojnBsh4aNgBZdCMgac0NaywUv36 AzCLnQw4oTHjTbTr6AuazrxvYNx8O5YGKtF1zooq7DYCJjA9B+W2D1exoD8tIhw58Evq nCA/ZcrnNDtrCuIMtuXsVBV0fr073OA87ZO0ZF70U053ykhz8tK3RrlII0+NrE7uNZVo G1rQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=DCVqRBozyINNuvxf2SBrm77cC2G13govzoheAYqv+3c=; b=OJKl8/D6p68aDrkB2Qc4cYxU1LOeQt/OqK3RFQVeiILNZ6z9DExXQb3xu4FVRGapGw C/kuitwiPTb6Mt8tFOKR+IOQ0QjPXzDEw+3Q/HGIsry6mejb75MSersU/CrEFW9DfTx2 RtpUwm3rbAJmAnHkm2KoLI01GU8FxPjirP3PHttzfb+fWqRGvRe2iPnoj+Mm1/A2WZom 3WXYZSJQ0ijRtwRQM9HaMp8Ye47qCy08nun2jbBiFbJzhPvoo94RwMWEsj0KmpI4w1qp LtZcW83McNUJRHMKSa5EC2aEJTJOLPVuemLRN2CyXGES9jAxeHUs39tB48s1eATXZX9s A1UQ== X-Gm-Message-State: APf1xPAYKS3uCnBP6QnhGJOdrgGTjfbEmyLaRXawpbmobr1zgye0ToPx DHTyNZfn1wI0MTGnq/L0gJMjZUpN X-Google-Smtp-Source: AG47ELsQrmC7BYOI74a57xGFQ90DqAwW3TqQ8LLFmT9cTc4qpo0gxp7KyXQ9ow2Jdicf4IDq0dmwDA== X-Received: by 10.101.90.10 with SMTP id y10mr2837243pgs.34.1519943883924; Thu, 01 Mar 2018 14:38:03 -0800 (PST) Received: from pidgey (pa49-195-79-156.pa.nsw.optusnet.com.au. [49.195.79.156]) by smtp.gmail.com with ESMTPSA id v11sm8684731pgf.83.2018.03.01.14.37.59 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 01 Mar 2018 14:38:03 -0800 (PST) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro In-reply-to: <87y3jcu5v5.fsf@gnu.org> Date: Fri, 02 Mar 2018 09:37:50 +1100 Message-ID: <87d10nwhfl.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Hey Ludo, On Wed, Feb 28 2018, Ludovic Court=C3=A8s wrote: >> The problem is that shepherd, when run as a user process, can=20 >> "lose" >> services which fork away. Shepherd can still kill them, but a=20 >> SIGCHLD >> won't be delivered if they die, so shepherd can't=20 >> restart/disable >> them. My prime example is emacs, which I run with --daemon. If=20 >> I then >> kill emacs, shepherd will still think that it is running. > > There are two issues here, I think. > > 1. shepherd cannot lose SIGCHLD: if a process dies immediately=20 > once > it=E2=80=99s been spawned, as is the case with =E2=80=9Cemacs --daem= on=E2=80=9D or=20 > any > other daemon-style program, it should receive SIGCHLD and=20 > process > it. Yeah, that's true, but the problem is that shepherd only processes=20 the SIGCHLD if there is a service with its `running` slot set to=20 the pid. When emacs forks, the original process may have its=20 SIGCHLD handled, but that doesn't affect shepherd's service state=20 (as it shouldn't, because it's using #:pid-file to track the=20 forked process). > 2. shepherd currently can=E2=80=99t do much with real daemons. So=20 > what we do > in GuixSD is to either start programs in non-daemon mode,=20 > when > that=E2=80=99s an option, or pass #:pid-file to retrieve the forked= =20 > process > PID. I think you should do one of these as well. I am doing that. The problem is that when a service dies (crashes,=20 quits, etc.) the `respawn?` option cannot be honoured because=20 shepherd is not notified that the process has terminated (because=20 it never receives a SIGCHLD for the forked pid). My patch polls=20 for the processes we expect, to make up for the lack of=20 notification. I would much rather it receive an event/signal to=20 notify that the forked process has died, but I don't know how to=20 do that in a robust, portable way so I chose to poll instead. If you look at my test case in tests/respawn-service.sh (which can=20 be read in its entirety in the diff attached to my previous email)=20 you can see the problem that this patch solves. The test will fail=20 without the rest of my patch, but will pass with them (guix build=20 container issue notwithstanding). Carlo --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqYgL8ACgkQqdyPv9aw IbzJhxAAoJLq07/eAo2bipD6ie8voBqtBfrlqn5PbwcAiHUtN1WEJQ8xyl0fW3bt 7XNbH8DNIkHWXj1KTrCVRVgz8rOtCav1APoHh8n3CMx+jhqQV2dtAON/Fizj1zHB rKYooZt1WprbWzDm6KRm4NokwIFNAYG2vN19DNqtuhHj9rC+tv7ZIN4jSJKb6vdY 5mwWwy1nE1yNI6CUl6x/oIQL0oNfnKKFM0SDLYKihgbjTIjykRVbMOJnDdze+MYI rbrsWmrZaese2Jj2NC2ZiEBFrKx2igvieu2HEI+zijV07QiQG+XPR0pdnf/lCng4 b/wVOZBunW4ouSVvOIaSqo3adOyu6J1mZEK7MXf1+DbpZB8TYB5SeSkEJ0Qlhn8U Chq/Xipggpf4Wwpd6Fu6jrNwQEAk3F+pFqqpDrQLo3bFGlb26hhJpyAScDQ3UGAY ktCPwEyAsIXoiTQtsYEoVaysnmxW3ZnCaSF/zwB56gV50bD45TUGgTcf2R7W1TkH etUHfUQ74P8G2OMuvF1VkuuUFCxjq1bDUgZkxNWeQhBqjhnZy9B6XtUTcas1We/h bMlllsBU3JFVPbGtLbsTTC2FJof7yzyrcJ3XCMew5xutY4GsZKRauNNdYR/5ybEW z9X38N2oWQ8l6U9t4leKjrxsEk5AvFdwBnfUj8CoEFOFRaD1VaI= =TndF -----END PGP SIGNATURE----- --=-=-=-- From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Fri, 02 Mar 2018 09:45:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.151998385830400 (code B ref 30637); Fri, 02 Mar 2018 09:45:02 +0000 Received: (at 30637) by debbugs.gnu.org; 2 Mar 2018 09:44:18 +0000 Received: from localhost ([127.0.0.1]:40099 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erhEg-0007uG-Hz for submit@debbugs.gnu.org; Fri, 02 Mar 2018 04:44:18 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:44374) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erhEe-0007u6-JE for 30637@debbugs.gnu.org; Fri, 02 Mar 2018 04:44:17 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id 69763FB15; Fri, 2 Mar 2018 10:44:15 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 3jTY2OS77ses; Fri, 2 Mar 2018 10:44:13 +0100 (CET) Received: from ribbon (unknown [193.50.110.134]) by hera.aquilenet.fr (Postfix) with ESMTPSA id 61EF940E; Fri, 2 Mar 2018 10:44:13 +0100 (CET) From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 12 =?UTF-8?Q?Vent=C3=B4se?= an 226 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Fri, 02 Mar 2018 10:44:12 +0100 In-Reply-To: <87d10nwhfl.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Fri, 02 Mar 2018 09:37:50 +1100") Message-ID: <87r2p2izgz.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Hi Carlo, Carlo Zancanaro skribis: > On Wed, Feb 28 2018, Ludovic Court=C3=A8s wrote: >>> The problem is that shepherd, when run as a user process, can >>> "lose" >>> services which fork away. Shepherd can still kill them, but a >>> SIGCHLD >>> won't be delivered if they die, so shepherd can't restart/disable >>> them. My prime example is emacs, which I run with --daemon. If I >>> then >>> kill emacs, shepherd will still think that it is running. >> >> There are two issues here, I think. >> >> 1. shepherd cannot lose SIGCHLD: if a process dies immediately >> once >> it=E2=80=99s been spawned, as is the case with =E2=80=9Cemacs --dae= mon=E2=80=9D or >> any >> other daemon-style program, it should receive SIGCHLD and >> process >> it. > > Yeah, that's true, but the problem is that shepherd only processes the > SIGCHLD if there is a service with its `running` slot set to the > pid. Well, it does call =E2=80=98waitpid=E2=80=99 every time it gets a SIGCHLD, = but it=E2=80=99s true that it doesn=E2=80=99t do anything beyond that if it doesn=E2=80=99t know = what service a PID corresponds to. > When emacs forks, the original process may have its SIGCHLD handled, > but that doesn't affect shepherd's service state (as it shouldn't, > because it's using #:pid-file to track the forked process). > >> 2. shepherd currently can=E2=80=99t do much with real daemons. So w= hat >> we do >> in GuixSD is to either start programs in non-daemon mode, >> when >> that=E2=80=99s an option, or pass #:pid-file to retrieve the forked >> process >> PID. I think you should do one of these as well. > > I am doing that. The problem is that when a service dies (crashes, > quits, etc.) the `respawn?` option cannot be honoured because shepherd > is not notified that the process has terminated (because it never > receives a SIGCHLD for the forked pid). My patch polls for the > processes we expect, to make up for the lack of notification. I see. Actually, thinking more about it, we should be using PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly for that. So what about this plan: 1. Add FFI bindings in (shepherd system) for prctl(2). We should arrange for it to throw to 'system-error when the =E2=80=98prctl=E2=80= =99 symbol is missing, as is the case on GNU/Hurd. 2. Use prctl/PR_SET_CHILD_SUBREAPER in =E2=80=98exec-command=E2=80=99. H= ere we must =E2=80=98catch-system-error=E2=80=99 around that call to cater to GNU/= Hurd. That would address the main issue without having to resort to polling. Respawning will work only when #:pid-file is used though, but that=E2=80=99s already an improvement. Thoughts? Thanks, Ludo=E2=80=99. From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Fri, 02 Mar 2018 10:15:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.1519985655575 (code B ref 30637); Fri, 02 Mar 2018 10:15:02 +0000 Received: (at 30637) by debbugs.gnu.org; 2 Mar 2018 10:14:15 +0000 Received: from localhost ([127.0.0.1]:40111 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erhhf-00009D-HS for submit@debbugs.gnu.org; Fri, 02 Mar 2018 05:14:15 -0500 Received: from mail-pf0-f181.google.com ([209.85.192.181]:41188) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erhhe-00008z-2f for 30637@debbugs.gnu.org; Fri, 02 Mar 2018 05:14:14 -0500 Received: by mail-pf0-f181.google.com with SMTP id f80so3809500pfa.8 for <30637@debbugs.gnu.org>; Fri, 02 Mar 2018 02:14:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=AgabSCwBa0d51ihrzICaHmIDpNzlBgP7iLFnTiVslzU=; b=BG+Q3WvW6iEc5KJrU42uMwxG0JWBTQcm1RSwfGFLmAF4Rv/hVXjB1YtxMoF253DFR3 QDlRgkeGGqIS2gxHbWXj84uni9Ybng9ja4Rru6t99nJpKCU0HLMBSSUu/zfpvh9vyv/h 6vZ/gfSPI5ZdMUzXkfEmm6hwSA/OnbldpM/M3pRDF51BSMK6dPuZ218knFf86m4TLlfq TJ8EDHBUnXZPAdJY1lmYDzLyKVXFqKpeZ/fIVessIpctddncyFOOvUO5sojmkOhO7AAK Uh77CWTjHFD7BQdNkXtSayNzOme9IUzb1dBtOF8G2uGbJVBqWIH4Kt5ENtmHoPMXRnz7 Uaww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=AgabSCwBa0d51ihrzICaHmIDpNzlBgP7iLFnTiVslzU=; b=c0B4oMwT17rQLa+8vS+6RXuRjhcFJeqXd7y18vqm6dIrDLTTRQ8GjHsAm0iIGOKcr9 tJ8cRB4qz0r+CsAdHlj6e/qjpxuBl8goKv9hb4KOzfAbOlKjflwL0xju+OFhe8J6yRP2 DHh0I3hwtBqMnmr0I8vhD6RMfhhpKU33aFs50F7fkSuPvbu1WqJ4fcubHBirBvu3tZI/ sTvOLgpGIiELEKXEE5TaNK90ZbxK0QjbSJJd8c9QYuDBOKHpeUTOm2oCjlpWbU/kj0cS tpS7iAejEpDMIEhYlNd7mbjmQxaElguGtJNrJq8dxLdiW4p0mkPg3DDTOJq8w87jxAdh 7zUw== X-Gm-Message-State: APf1xPCY9qWYRPNhV5pTBGOiaCr0OgbgnogkvegEbHJn6y8FK3r6X46r liiX0PAz5vOCyTgIAwHTLytHqnsj X-Google-Smtp-Source: AG47ELvN1UIZRj3VpGhG7CD3tnNrEXuDfGSkTA8eCQ8SnCUi2RGzCZirC/Bw8uIS82pEGZ1Ndr8FWQ== X-Received: by 10.101.76.13 with SMTP id u13mr4035659pgq.287.1519985647671; Fri, 02 Mar 2018 02:14:07 -0800 (PST) Received: from pidgey (110-174-2-124.tpgi.com.au. [110.174.2.124]) by smtp.gmail.com with ESMTPSA id q2sm10962883pfh.103.2018.03.02.02.14.04 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 02 Mar 2018 02:14:06 -0800 (PST) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro In-reply-to: <87r2p2izgz.fsf@gnu.org> Date: Fri, 02 Mar 2018 21:13:53 +1100 Message-ID: <87371ihjj2.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Hey Ludo, On Fri, Mar 02 2018, Ludovic Court=C3=A8s wrote: >> I am doing that. The problem is that when a service dies=20 >> (crashes, quits, etc.) the `respawn?` option cannot be honoured=20 >> because shepherd is not notified that the process has=20 >> terminated (because it never receives a SIGCHLD for the forked=20 >> pid). My patch polls for the processes we expect, to make up=20 >> for the lack of notification. > > I see. > > Actually, thinking more about it, we should be using=20 > PR_SET_CHILD_SUBREAPER from prctl(2), which is designed exactly=20 > for that. Excellent! This is exactly the information that I needed. This is=20 what I've been looking for, but without enough knowledge to be=20 able to find it. Thanks! > So what about this plan: > > 1. Add FFI bindings in (shepherd system) for prctl(2). We=20 > should arrange for it to throw to 'system-error when the=20 > =E2=80=98prctl=E2=80=99 symbol is missing, as is the case on GNU/Hurd. Are we okay with having this just not work on GNU/Hurd (or kernels=20 older than 3.4, according to the prctl manpage)? We could fall=20 back to a polling approach if prctl isn't available? I don't=20 really like the idea of this working on some kernels but not=20 others, given that process supervision is one of the main jobs of=20 shepherd. > 2. Use prctl/PR_SET_CHILD_SUBREAPER in =E2=80=98exec-command=E2=80=99. = Here we=20 > must =E2=80=98catch-system-error=E2=80=99 around that call to cater to= =20 > GNU/Hurd. Why would we need to set it in exec-command? It looks like it=20 modifies the state of the calling process, which means we'd want=20 to set it in the shepherd service, not in each of the child=20 processes. > That would address the main issue without having to resort to=20 > polling. Respawning will work only when #:pid-file is used=20 > though, but that=E2=80=99s already an improvement. > > Thoughts? I'll try to get this working in the next few days. Hopefully=20 you'll see a patch from me soon. Carlo --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqZI+IACgkQqdyPv9aw Iby6/A/+OoaURMIb/e8YD50Iz/jnEC/KFy8hw629phIITlVHwGIglx25aWdcTgha zuJj+mNBuT09QaX+HY8lyhOiSGQaLvphdXHPykr7CjNsDYES7dwyYTSIuteeYvI0 0Srg1KE44jumPmTdK0jTu4EdSLnfzVRKVyb3UzUGM6GarjRM0DDYYM8dpQ0Tde7P F5YzJ6LvG+T638hDqpz8oideSl9+FWTOJ81TDHbXMHir76JuYhf4ir7JZUQovRTt 1pVa9S0d7L7HYwRun5NWI7v5EvRdAWSiUl6XKjzG25T/YfGEfmHW4yxdSnmGfIit Q9K4vw7U2MYq68DBbPIIt63NUyA5sxpfY5I+sH+9pfqU2cl9ZhiS/y4CkippFASl dYRkYDRnHyUySL3kWtRJNwibK/SdKi7GOKcfJftJuW5tON40OgKGKyUwiDh8Po1y QS/p88uLz6FfsiiEvgVQ+rLF2yIRM489TPZq5vfi7yGyens6eOJ+d7qdYsTGomP8 kI+3+Ul+lFi5BMxlQfvkvaGSc0OqSvf/S9Ith333UzSpWsxIqaO3ZenrVurC7oT1 /d41BLDpy7IAf7jdT6SJRHLVQr6V7bOZ87FwPn217QW6tOXHRnCoLkqSJhURu6VS opTUOp+U95nVBqkCOnHG+iNdW/Uy4pkb/CNoxHjwTHM6EU+Hz2Y= =pyZA -----END PGP SIGNATURE----- --=-=-=-- From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Fri, 02 Mar 2018 12:43:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.151999456621787 (code B ref 30637); Fri, 02 Mar 2018 12:43:01 +0000 Received: (at 30637) by debbugs.gnu.org; 2 Mar 2018 12:42:46 +0000 Received: from localhost ([127.0.0.1]:40178 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erk1O-0005fL-Ib for submit@debbugs.gnu.org; Fri, 02 Mar 2018 07:42:46 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:45354) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1erk1N-0005fC-0A for 30637@debbugs.gnu.org; Fri, 02 Mar 2018 07:42:46 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id C4FC5114CB; Fri, 2 Mar 2018 13:42:43 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 9PvETW6b9r2J; Fri, 2 Mar 2018 13:42:42 +0100 (CET) Received: from ribbon (unknown [193.50.110.134]) by hera.aquilenet.fr (Postfix) with ESMTPSA id EC795718B; Fri, 2 Mar 2018 13:42:41 +0100 (CET) From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 12 =?UTF-8?Q?Vent=C3=B4se?= an 226 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Fri, 02 Mar 2018 13:42:41 +0100 In-Reply-To: <87371ihjj2.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Fri, 02 Mar 2018 21:13:53 +1100") Message-ID: <87po4mhcn2.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Hello! Carlo Zancanaro skribis: > On Fri, Mar 02 2018, Ludovic Court=C3=A8s wrote: [...] >> So what about this plan: >> >> 1. Add FFI bindings in (shepherd system) for prctl(2). We should >> arrange for it to throw to 'system-error when the =E2=80=98prctl=E2=80= =99 symbol >> is missing, as is the case on GNU/Hurd. > > Are we okay with having this just not work on GNU/Hurd (or kernels > older than 3.4, according to the prctl manpage)? We could fall back to > a polling approach if prctl isn't available? I don't really like the > idea of this working on some kernels but not others, given that > process supervision is one of the main jobs of shepherd. Yeah, I agree. The =E2=80=98prctl=E2=80=99 procedure itself should simply throw to 'system= -error on GNU/Hurd. But then, in (shepherd), we could add the polling thing when (not (string-contains %host-type "linux")). WDYT? >> 2. Use prctl/PR_SET_CHILD_SUBREAPER in =E2=80=98exec-command=E2=80=99.= Here we >> must =E2=80=98catch-system-error=E2=80=99 around that call to cater to = GNU/Hurd. Actually this should be done in =E2=80=98fork+exec-command=E2=80=99 in the = child process. > Why would we need to set it in exec-command? It looks like it modifies > the state of the calling process, which means we'd want to set it in > the shepherd service, not in each of the child processes. We want to set the =E2=80=9Creaper=E2=80=9D of child processes to Shepherd = itself, so we must do that in child processes. The shepherd process cannot be its own reaper I suppose. BTW, we should do PR_SET_CHILD_SUBREAPER only when (not (=3D 1 (getpid))). > I'll try to get this working in the next few days. Hopefully you'll > see a patch from me soon. Awesome, thank you! Ludo=E2=80=99. From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sat, 03 Mar 2018 08:00:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.15200639631915 (code B ref 30637); Sat, 03 Mar 2018 08:00:02 +0000 Received: (at 30637) by debbugs.gnu.org; 3 Mar 2018 07:59:23 +0000 Received: from localhost ([127.0.0.1]:41781 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1es24c-0000Ug-4A for submit@debbugs.gnu.org; Sat, 03 Mar 2018 02:59:23 -0500 Received: from mail-pl0-f54.google.com ([209.85.160.54]:42847) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1es24S-0000UJ-Vh for 30637@debbugs.gnu.org; Sat, 03 Mar 2018 02:59:16 -0500 Received: by mail-pl0-f54.google.com with SMTP id 93-v6so7017482plc.9 for <30637@debbugs.gnu.org>; Fri, 02 Mar 2018 23:59:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=LYla9KvFCHu1mSovpHLmq1gE8mIkqBaypz7bN/P9tqQ=; b=upsT/OW9KPTsGoaM+Yq3liFYMRo9VmSBrnb6guXtoEqX6XO6hnVglZPoJf/Hftlg2O RpDXW/3OhNIzUZ/fNZUjBEhe3c1ErlOzsiVNpdL1nplX+md3e/dUGyGNHZR9pF6ptM09 m2duiAXX+P0UMLQT4Nn/PrfjccmC8+FezCrbDY5ZjMYq84qvhNx3XOykqh8XOSZWMHKx lGoxVrvxZUKBUQ4WZo63NTkrDXuXRu/xcSaYekAHQRmJJKsp4Q/GKCjOM4rnsGkTdV3J 3E8ELyWqIcLPFm39q4zU7h2d/aq7haRdTjv2uXAkrGXg5KOnFjeXVMDBeh6ZF32sV4gc NRZQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=LYla9KvFCHu1mSovpHLmq1gE8mIkqBaypz7bN/P9tqQ=; b=HtAUjEXekBq4KxjTL7qXJ304+CQmnJ2TbPTSX6qCiDAIZtnngBIiz5736tXSEti+oW HH7jIz3KeOaGh9dspeXnY/cvlHD/ku46qsjtX3Zq1RdXa4uih+1cw10ODxJm+UZ7AwU3 sQdBaetLMgzaUglVd4F6qsoAmmxAObxduUgO4vmZvaRHSd8EmzOFuTKS1vGimHpWQ/Hk ZhhnHgRxoWIoOQIc1x0v49asCYjHHMrfDwnyW9PMm+f8R6wvWMW1b4C46JrrtnyLjmzV JOgLgPGcCOorFc3ek4lYNFDMxWwx6GbE8YA7DpFmywG1lHzlSjyLOi9ogx3Ml2RYZs1K hEaQ== X-Gm-Message-State: APf1xPBRD29tkzxwVHATHtes66hn6cAIaC8ROhYKF+DlgFek+DPDbHeD d3KVyAofI5ub3jc0wlxCTmqRYAk0 X-Google-Smtp-Source: AG47ELuMhZOwirwuASsDZHKXdr9iZHHEESmx8Rq7YBdWMhZda/IdpMFfJKLqH8FwxLgQJyuZoyncNw== X-Received: by 2002:a17:902:c5:: with SMTP id a63-v6mr7771099pla.391.1520063942205; Fri, 02 Mar 2018 23:59:02 -0800 (PST) Received: from pidgey (110-174-2-124.tpgi.com.au. [110.174.2.124]) by smtp.gmail.com with ESMTPSA id u28sm16663507pfl.19.2018.03.02.23.58.59 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Fri, 02 Mar 2018 23:59:01 -0800 (PST) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro In-reply-to: <87po4mhcn2.fsf@gnu.org> Date: Sat, 03 Mar 2018 18:58:50 +1100 Message-ID: <87inadr3np.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Hey Ludo, I've re-written my patch, and it's attached in two commits. The=20 first one adds the necessary calls to prctl, and the second adds=20 the fallback to polling. On Fri, Mar 02 2018, Ludovic Court=C3=A8s wrote: > The =E2=80=98prctl=E2=80=99 procedure itself should simply throw to=20 > 'system-error on GNU/Hurd. But then, in (shepherd), we could add=20 > the polling thing when (not (string-contains %host-type=20 > "linux")). > > WDYT? I don't like the idea of doing this based on the host type. In my=20 patch I've done it based on whether the prctl call succeeded. If=20 the prctl call throws a system-error then we poll, otherwise we=20 rely on SIGCHLD. I don't have a system set up with another kernel,=20 though, so I don't know how I can easily test whether the fallback=20 logic is working properly. When I replaced the prctl call with=20 (throw 'system-error) it seemed to work. The fallback code still fails in the guix build environment (as my=20 previous patch did), but when it's using prctl it works properly.=20 This means that a build on Linux pre-3.4, or on Hurd, will fail,=20 which probably isn't acceptable given that shepherd is a hard=20 dependency for starting a GuixSD system. As far as I can tell the=20 test fails because the processes stick around as zombies, so I=20 assume that pid 1 in the build container isn't properly reaping=20 zombie processes. Do you have any ideas how I can force it to do=20 so? > We want to set the =E2=80=9Creaper=E2=80=9D of child processes to Shepher= d=20 > itself, so we must do that in child processes. The shepherd=20 > process cannot be its own reaper I suppose. Reading the manpage, and then running some code, I think you're=20 wrong about this. Using prctl with PR_SET_CHILD_SUBREAPER marks=20 the calling process as a child subreaper. That means that any=20 processes that are orphaned below the current process get=20 reparented under the current process (or a closer child subreaper,=20 if there's one further down). If there are no processes marked as=20 child subreapers, then the orphaned process is reparented under=20 pid 1. We should only need to call prctl in two places: when=20 shepherd initially starts, or when we fork for daemonize. Carlo --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-Handle-forked-process-SIGCHLD-signals.patch Content-Transfer-Encoding: quoted-printable From=205f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Sat, 3 Mar 2018 17:26:05 +1100 Subject: [PATCH 1/2] Handle forked process SIGCHLD signals * Makefile.am (TESTS): Add tests/forking-service.sh. * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER. * modules/shepherd.scm: Set the child subreaper attribute of main shepherd process (as long as we're not pid 1). * modules/shepherd/service.scm (root-service)[daemonize]: Set the child subreaper attribute of newly forked shepherd process. * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable and export it. (prctl): Add new procedure and export it. =2D-- Makefile.am | 1 + configure.ac | 4 ++ modules/shepherd.scm | 2 + modules/shepherd/service.scm | 4 +- modules/shepherd/system.scm.in | 17 ++++++- tests/forking-service.sh | 111 +++++++++++++++++++++++++++++++++++++= ++++ 6 files changed, 137 insertions(+), 2 deletions(-) create mode 100644 tests/forking-service.sh diff --git a/Makefile.am b/Makefile.am index eafa308..8dad006 100644 =2D-- a/Makefile.am +++ b/Makefile.am @@ -190,6 +190,7 @@ TESTS =3D \ tests/no-home.sh \ tests/pid-file.sh \ tests/status-sexp.sh \ + tests/forking-service.sh \ tests/signals.sh =20 TEST_EXTENSIONS =3D .sh diff --git a/configure.ac b/configure.ac index bb5058d..fbe16f4 100644 =2D-- a/configure.ac +++ b/configure.ac @@ -72,7 +72,11 @@ esac AC_SUBST([RB_AUTOBOOT]) AC_SUBST([RB_HALT_SYSTEM]) AC_SUBST([RB_POWER_OFF]) +AC_MSG_RESULT([done]) =20 +AC_MSG_CHECKING([ constants]) +AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#inclu= de ]) +AC_SUBST([PR_SET_CHILD_SUBREAPER]) AC_MSG_RESULT([done]) =20 dnl Manual pages. diff --git a/modules/shepherd.scm b/modules/shepherd.scm index df5420f..ab59e08 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -50,6 +50,8 @@ ;; Main program. (define (main . args) (initialize-cli) + (when (not (=3D 1 (getpid))) + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) =20 (let ((config-file #f) (socket-file default-socket-file) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index 2224932..b6394f2 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -1274,7 +1274,9 @@ we want to receive these signals." (local-output "Running as PID 1, so not daemonizing.")) (else (if (zero? (primitive-fork)) =2D #t + (begin + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)) + #t) (primitive-exit 0)))))) (persistency "Safe the current state of running and non-running services. diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in index a54dca7..55806cb 100644 =2D-- a/modules/shepherd/system.scm.in +++ b/modules/shepherd/system.scm.in @@ -23,7 +23,9 @@ #:export (reboot halt power-off =2D max-file-descriptors)) + max-file-descriptors + prctl + PR_SET_CHILD_SUBREAPER)) =20 ;; The constants. (define RB_AUTOBOOT @RB_AUTOBOOT@) @@ -130,6 +132,19 @@ the returned procedure is called." (list err)) result))))) =20 +(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@) + +(define prctl + (let ((proc (syscall->procedure long "prctl" (list int int)))) + (lambda (process operation) + "Perform an operation on the given process" + (let-values (((result err) (proc process operation))) + (if (=3D -1 result) + (throw 'system-error "prctl" "~A: ~S" + (list (strerror err) name) + (list err)) + result))))) + (define (max-file-descriptors) "Return the maximum number of open file descriptors allowed." (sysconf _SC_OPEN_MAX)) diff --git a/tests/forking-service.sh b/tests/forking-service.sh new file mode 100644 index 0000000..90c684a =2D-- /dev/null +++ b/tests/forking-service.sh @@ -0,0 +1,111 @@ +# GNU Shepherd --- Test detecting a forked process' termination +# Copyright =C2=A9 2016 Ludovic Court=C3=A8s +# Copyright =C2=A9 2018 Carlo Zancanaro +# +# This file is part of the GNU Shepherd. +# +# The GNU Shepherd is free software; you can redistribute it and/or modify= it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3 of the License, or (at +# your option) any later version. +# +# The GNU Shepherd is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with the GNU Shepherd. If not, see . + +shepherd --version +herd --version + +socket=3D"t-socket-$$" +conf=3D"t-conf-$$" +log=3D"t-log-$$" +pid=3D"t-pid-$$" +service_pid=3D"t-service-pid-$$" +service2_pid=3D"t-service2-pid-$$" +service2_started=3D"t-service2-starts-$$" + +herd=3D"herd -s $socket" + +function cleanup() { + cat $log || true + rm -f $socket $conf $log $service2_started + test -f $pid && kill "$(cat $pid)" || true + rm -f $pid + test -f $service_pid && kill "$(cat $service_pid)" || true + rm -f $service_pid + test -f $service2_pid && kill "$(cat $service2_pid)" || true + rm -f $service2_pid +} + +trap cleanup EXIT + +cat > "$conf"< $PWD/$service_pid")) + +(register-services + (make + ;; A service that forks into a different process. + #:provides '(test) + #:start (make-forkexec-constructor %command + #:pid-file "$PWD/$service_pid") + #:stop (make-kill-destructor) + #:respawn? #f)) + +(define %command2 + '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & ech= o \$! > $PWD/$service2_pid")) + +(register-services + (make + ;; A service that forks into a different process. + #:provides '(test2) + #:start (make-forkexec-constructor %command2 + #:pid-file "$PWD/$service2_pid") + #:stop (make-kill-destructor) + #:respawn? #t)) +EOF +cat $conf + +rm -f "$pid" +shepherd -I -s "$socket" -c "$conf" -l "$log" --pid=3D"$pid" & + +# Wait till it's ready. +while ! test -f "$pid" ; do sleep 0.3 ; done + +shepherd_pid=3D"$(cat $pid)" + +# start both of the services +$herd start test +$herd start test2 + +# make sure "test" is started +until $herd status test | grep started; do sleep 0.3; done +test -f "$service_pid" +service_pid_value=3D"$(cat $service_pid)" +# now kill it +kill "$service_pid_value" +while kill -0 "$service_pid_value"; do sleep 0.3; done +# shepherd should notice that the service has stopped within one second +sleep 1 +$herd status test | grep stopped + + + +# make sure "test2" has started +until $herd status test2 | grep started; do sleep 0.3; done +test -f "$service2_pid" +service2_pid_value=3D"$(cat $service2_pid)" +test "$(cat $PWD/$service2_started)" =3D "started" +# now kill it +rm -f "$service2_pid" +kill $service2_pid_value +while kill -0 "$service2_pid_value"; do sleep 0.3; done +# shepherd should notice that the service has stopped, and restart it, wit= hin one second +sleep 1; +$herd status test2 | grep started +test "$(cat $PWD/$service2_started)" =3D "started +started" =2D-=20 2.16.1 --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0002-Poll-every-0.5s-to-find-dead-forked-services.patch Content-Transfer-Encoding: quoted-printable From=20ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): Use select with a timeout. If prctl failed when shepherd started then call check-for-dead-services between connections/timeouts. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. =2D-- modules/shepherd.scm | 31 ++++++++++++++---- modules/shepherd/service.scm | 78 +++++++++++++++++++++++++++-------------= ---- tests/basic.sh | 4 +-- tests/status-sexp.sh | 4 +-- 4 files changed, 76 insertions(+), 41 deletions(-) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index ab59e08..b824546 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -49,9 +51,17 @@ ;; Main program. (define (main . args) + (define poll-services + (if (=3D 1 (getpid)) + (lambda () #f) + (catch 'system-error + (lambda () + (prctl PR_SET_CHILD_SUBREAPER 1) + (lambda () #f)) + (lambda (key . args) + check-for-dead-services)))) + (initialize-cli) =2D (when (not (=3D 1 (getpid))) =2D (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) =20 (let ((config-file #f) (socket-file default-socket-file) @@ -220,11 +230,18 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) 0.5) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (poll-services) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index b6394f2..fc53d76 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1009,39 +1012,42 @@ otherwise by updating its state." =20 ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. =2D (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + (when serv + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f))))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1177,18 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqaVbsACgkQqdyPv9aw IbwCGg//RSh6Wfle+Rmt4NpdKbr/qstpFYk0KWawZbWGxrWAjyFVYx3b4V3nNGt5 Lol8VwTUeNyeV88iE/7wL5HUEF+P/tzExMUzr6TWCJVGNHYUrjAYMG6FV0nXX6dz 2tHUeCUHBlicYIgqN6XyK3vkccv/hTRANq3W4ZEI5Jrwf43FV9CHZ8aXsG4tRoq9 nQGS84OS//vGOBE7kvS4o/+IAvVAn10Uuz41EYhSFpsiJGtgSO+IA8EIHWmL8lZn /6jPPZ7P6WQSOeDf7P5/3mcvrXP7S9azIed6q8/EYz+75GkV/EeQbR6uCYuqP+WN nSg6Et/FhACgxIRmvLqkVxeC+ijHRP9SvtYeg5Fx1H6yz9HV6bHMZa35FrN7HrGC WjmL6l+0+xRBMC0DuOiOk3W0qdRDR8c88tWD7OCopfsv8Lcx3Ft0YNTOYoV6Lver u33m7KCfEqOo8l5XQv9QXBzJt6Kh7cDEPi+rW02x6LLXTHg0vdr72qhpZMaXrTaD J6WSYpBTFW408SqIYvJ23dmxNMB47cDs5zeowQZH/scNE7VHvo88GA7LGz8Yg0qz /p7bpm4V3kWC++df5yahwoNnfEB9Gmp555t1Ag5JHY0PI3KJbuu0FG6M0xbZI/Fo LyBoLtXvTO7VBOhvs4WkluzYiQxdR2BfcPPjHODIZnSV7qNbCBs= =pSfE -----END PGP SIGNATURE----- --==-=-=-- From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sat, 03 Mar 2018 15:22:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.1520090476335 (code B ref 30637); Sat, 03 Mar 2018 15:22:02 +0000 Received: (at 30637) by debbugs.gnu.org; 3 Mar 2018 15:21:16 +0000 Received: from localhost ([127.0.0.1]:42946 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1es8yJ-00005J-GT for submit@debbugs.gnu.org; Sat, 03 Mar 2018 10:21:15 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:52690) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1es8yE-00004z-I8 for 30637@debbugs.gnu.org; Sat, 03 Mar 2018 10:21:11 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id 04F2311B37; Sat, 3 Mar 2018 16:21:10 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id fZSBEDQTQhHo; Sat, 3 Mar 2018 16:21:08 +0100 (CET) Received: from ribbon (unknown [IPv6:2a01:e0a:1d:7270:af76:b9b:ca24:c465]) by hera.aquilenet.fr (Postfix) with ESMTPSA id 62D1410495; Sat, 3 Mar 2018 16:21:08 +0100 (CET) From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 13 =?UTF-8?Q?Vent=C3=B4se?= an 226 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Sat, 03 Mar 2018 16:21:07 +0100 In-Reply-To: <87inadr3np.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Sat, 03 Mar 2018 18:58:50 +1100") Message-ID: <87h8px9od8.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Hi Carlo, Overall LGTM! It=E2=80=99s a long reply though, but that=E2=80=99s because= there are lots of details to pay attention to in this Unix quagmire. :-) Carlo Zancanaro skribis: > I've re-written my patch, and it's attached in two commits. The first > one adds the necessary calls to prctl, and the second adds the > fallback to polling. If possible I would prefer a commit that only adds prctl, and the next one would actually use it. I find it clearer and more convenient if we need to bisect or revert. > On Fri, Mar 02 2018, Ludovic Court=C3=A8s wrote: >> The =E2=80=98prctl=E2=80=99 procedure itself should simply throw to 'sys= tem-error on >> GNU/Hurd. But then, in (shepherd), we could add the polling thing >> when (not (string-contains %host-type "linux")). >> >> WDYT? > > I don't like the idea of doing this based on the host type. In my > patch I've done it based on whether the prctl call succeeded. Right, I actually agree with feature-based checks. :-) More on that inline below. > The fallback code still fails in the guix build environment (as my > previous patch did), but when it's using prctl it works properly. This > means that a build on Linux pre-3.4, or on Hurd, will fail, which > probably isn't acceptable given that shepherd is a hard dependency for > starting a GuixSD system. As far as I can tell the test fails because > the processes stick around as zombies, If they=E2=80=99re zombies, that means nobody called waitpid(2). Presumabl= y the polling code would need to do that. So I suppose =E2=80=98check-for-dead-services=E2=80=99 should do something = like: (when (integer? running) (catch 'system-error (lambda () (match (waitpid* running WNOHANG) (#f #f) ;uh, not a PID? ((0 . _) #f) ;ditto? ((pid . _) (local-output "PID ~a (~a) is dead" running (canonical-n= ame service)) (respawn-service service)))) (lambda args (or (=3D ECHILD (system-error-errno args)) ;wrong PID? (=3D EPERM (system-error-errno args)) ;not a child (apply throw args))))) Does that make sense? Please check waitpid(2) carefully though, because it=E2=80=99s pretty gnarly and I might have forgotten or misinterpreted something here. >> We want to set the =E2=80=9Creaper=E2=80=9D of child processes to Shephe= rd itself, >> so we must do that in child processes. The shepherd process cannot >> be its own reaper I suppose. > > Reading the manpage, and then running some code, I think you're wrong > about this. Using prctl with PR_SET_CHILD_SUBREAPER marks the calling > process as a child subreaper. That means that any processes that are > orphaned below the current process get reparented under the current > process (or a closer child subreaper, if there's one further down). If > there are no processes marked as child subreapers, then the orphaned > process is reparented under pid 1. We should only need to call prctl > in two places: when shepherd initially starts, or when we fork for > daemonize. Oh you=E2=80=99re right, sorry for the confusion! > From 5f26da2ce6a26c8412368900987ac5438f3e70cd Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Sat, 3 Mar 2018 17:26:05 +1100 > Subject: [PATCH 1/2] Handle forked process SIGCHLD signals > > * Makefile.am (TESTS): Add tests/forking-service.sh. > * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER. > * modules/shepherd.scm: Set the child subreaper attribute of main shepherd > process (as long as we're not pid 1). > * modules/shepherd/service.scm (root-service)[daemonize]: Set the child > subreaper attribute of newly forked shepherd process. > * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variab= le > and export it. > (prctl): Add new procedure and export it. [...] > --- a/modules/shepherd.scm > +++ b/modules/shepherd.scm > @@ -50,6 +50,8 @@ > ;; Main program. > (define (main . args) > (initialize-cli) > + (when (not (=3D 1 (getpid))) > + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) I think it=E2=80=99s a good idea to add a comment, like: ;; Register ourselves to get SIGCHLD when child processes terminate. ;; This is necessary for daemons for which we=E2=80=99d otherwise never g= et ;; SIGCHLD. > +(define prctl > + (let ((proc (syscall->procedure long "prctl" (list int int)))) On GNU/Hurd libc doesn=E2=80=99t have a =E2=80=9Cprctl=E2=80=9D symbol. So= you need something like: (if (dynamic-func "prctl" (dynamic-link)) (let ((proc =E2=80=A6)) =E2=80=A6) (lambda (process operation) ;; We=E2=80=99re running on an OS that lacks =E2=80=98prctl=E2=80= =99, such as GNU/Hurd. (throw 'system-error "prctl" "~A" (list (strerror ENOSYS)) (list ENOSYS)))) > +function cleanup() { You need either () or =E2=80=9Cfunction=E2=80=9D but not both (shells other= than Bash might complain=E2=80=A6). > +shepherd_pid=3D"$(cat $pid)" Likewise, we should use `foo` instead of $(foo) to suppose non-Bash shells (there are several occurrences of $(foo) here.) > From ec47fa189c7d47f1d9444d939b084850f0a7186c Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Wed, 21 Feb 2018 22:57:59 +1100 > Subject: [PATCH 2/2] Poll every 0.5s to find dead forked services > > * modules/shepherd.scm (open-server-socket): Set socket to be > non-blocking. > (main): Use select with a timeout. If prctl failed when shepherd started > then call check-for-dead-services between connections/timeouts. > * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHL= D as > signal handler. > (respawn-service): Separate logic for respawning services from handling > SIGCHLD. > (handle-SIGCHLD, check-for-dead-services): New exported procedures. > * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with > symbols. [...] > + (define poll-services > + (if (=3D 1 (getpid)) > + (lambda () #f) > + (catch 'system-error > + (lambda () > + (prctl PR_SET_CHILD_SUBREAPER 1) > + (lambda () #f)) > + (lambda (key . args) > + check-for-dead-services)))) Please add a comment in the handler saying that we resort to polling on OSes that do not support =E2=80=98prctl=E2=80=99. However, perhaps we should do: (lambda args (let ((errno (system-error-errno args))) (if (=3D ENOSYS errno) check-for-dead-services (apply throw args)))) so that important/unexpected errors are not silently ignored. > +(define (respawn-service serv) > + (when serv Please add a docstring and move =E2=80=98when=E2=80=99 to the caller. > +(define (check-for-dead-services) Docstring as well :-), and also a comment explaining that this is a last resort for prctl-less OSes. > (register-services > (make > #:provides '(test-loaded) > - #:start (const 42) > + #:start (const 'abc) Perhaps with the =E2=80=98check-for-dead-services=E2=80=99 use of =E2=80=98= waitpid=E2=80=99 I outlined above we can even keep 42 here? If not, we should add in shepherd.texi, under =E2=80=9CSlots of services=E2= =80=9D, a few words saying that when =E2=80=98running=E2=80=99 is an integer it is assume= d to be a PID. Could you send an updated patch? Thanks for working on this! Ludo=E2=80=99. From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sat, 03 Mar 2018 20:50:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.152011019811366 (code B ref 30637); Sat, 03 Mar 2018 20:50:01 +0000 Received: (at 30637) by debbugs.gnu.org; 3 Mar 2018 20:49:58 +0000 Received: from localhost ([127.0.0.1]:43036 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1esE6K-0002xA-Uh for submit@debbugs.gnu.org; Sat, 03 Mar 2018 15:49:58 -0500 Received: from mail-pf0-f175.google.com ([209.85.192.175]:36638) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1esE6B-0002wr-VV for 30637@debbugs.gnu.org; Sat, 03 Mar 2018 15:49:51 -0500 Received: by mail-pf0-f175.google.com with SMTP id 68so5517263pfx.3 for <30637@debbugs.gnu.org>; Sat, 03 Mar 2018 12:49:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=TXE49f2k8k5lDtZgpLiNsXHOWkc3pk+lH/0orUPrxUk=; b=DoiAnIerB+sHY/+cVadRcxisuHqNXLNPOrcfXgdc9inrWt4sivLB1A4URtqH/Z3SCQ Y1DYDTcrSgpm89XodazUoH90cAW88juP/b5/hCvq0iIl+06lkmftxznsyqbc/FlwUYQ+ PQCl8jff8zQIGUT5PoIbucwJ6zSD/MDZ5i8dowzzK9SUhiPY5VUlf9G6UxZeHFE7YzC1 ORZyxOS0IkeasqxzuHfGxSktzi1tBlO2ZV3PfypIvZU7xeg/+mmp1h2PjtdgO6vX2ju2 SJWo19O/2l2FKXdxDsUvpt9hZHlt6y/Vy7fk/DvClBbuXBj/eQwwqFSh3tQzXQAKiEZg Wi3A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=TXE49f2k8k5lDtZgpLiNsXHOWkc3pk+lH/0orUPrxUk=; b=otl54+CDJV5F83axP9U8yxPoxbKFJk2jP9nuN64LXUBTDljlkJwN1kLCZoI5FwXA7m 1Zdfn9KABhtl21a2wam1y8gzxbpjDiqHHVaA7m3VMwhetHiMchlcN9FY+cXYl8tR50We vnbgjSL1zkbqmuX3Hjl1IsGFgkOh/0Z+ysmP/YmXmhGDJjf1IbwuyXuS71cHFDTuRrgK oJCIrn+h/q2xq5ged7lWZwN0YOMJNPXk14gwvYQLnCXqplt8LQdiuazZXruWgqgEL5mN ibZronOf/GI4MslNrmQegwi/oXQgnBLGPMyVtIF+YR1xCnoPC09hdQ/lEPbt5watjo3S S5mw== X-Gm-Message-State: APf1xPCgH2WIYpAT5UrWEKW6herRciFwKe/cobin10N+HW1161pVqjNY r5Y4/HWoBq98ztT2/uVKm5s4+b1a X-Google-Smtp-Source: AG47ELtWffm+Wda8dcnTO03uLZMyI7UrNQ2aQq3mGf7Ds+ewcbDBS4oM6dycx+vwV6KLEoMAmZd42Q== X-Received: by 10.101.92.66 with SMTP id v2mr8106355pgr.341.1520110177507; Sat, 03 Mar 2018 12:49:37 -0800 (PST) Received: from pidgey (110-174-2-124.tpgi.com.au. [110.174.2.124]) by smtp.gmail.com with ESMTPSA id h69sm19631575pfe.97.2018.03.03.12.49.33 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Sat, 03 Mar 2018 12:49:36 -0800 (PST) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro In-reply-to: <87h8px9od8.fsf@gnu.org> Date: Sun, 04 Mar 2018 07:49:26 +1100 Message-ID: <87woys9961.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Hey Ludo, On Sat, Mar 03 2018, Ludovic Court=C3=A8s wrote: > If they=E2=80=99re zombies, that means nobody called waitpid(2).=20 > Presumably the > polling code would need to do that. > > So I suppose =E2=80=98check-for-dead-services=E2=80=99 should do somethin= g like: > > [ ... ] > > Does that make sense? Please check waitpid(2) carefully though,=20 > because > it=E2=80=99s pretty gnarly and I might have forgotten or misinterpreted > something here. Unfortunately we can't do that. We fall back to the polling=20 approach to handle the fact that the processes that we care about=20 aren't our children (hence we don't get SIGCHLD). The waitpid=20 system call only waits for processes which are children of the=20 calling process. I looked into the zombie problem a bit more, and I found what the=20 problem actually is. In the build environment a guile process is=20 running as pid 1 (the *-guile-builder script for that job). This=20 guile process never handles SIGCHLD, and never calls wait/waitpid,=20 so any orphaned processes become zombies. I tried modifying=20 derivations.scm, but it wanted to rebuild a lot of things, so I=20 gave up. I think we need to add something like this to the=20 *-guile-builder script: (sigaction SIGCHLD (lambda () (let loop () (match (waitpid WAIT_ANY WNOHANG) ((0 . _) #f) ((pid . _) (loop)) (_ #f)))) SA_NOCLDSTOP) I've attached the output of `ps axjf` inside the build container,=20 so you can see why I think that this is the problem. It's a bit of=20 a shame that this is different to `guix environment --container`,=20 where /bin/sh is pid 1, because it meant that it would build=20 successfully in my container, but would fail in the build=20 container (which is a confusing experience). --=-=-= Content-Type: text/plain Content-Disposition: attachment; filename=process-tree.txt Content-Transfer-Encoding: quoted-printable PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND 0 1 1 1 ? -1 Ssl 30001 0:00 guile --no-auto-co= mpile -L /gnu/store/71d3rwa514j7vy5l4vfivf68g5yxibvl-module-import /gnu/sto= re/nln71c8hv82c4vkrssi12qmapp1ryk58-shepherd-0.0.0-guile-builder 1 989 1 1 ? -1 S 30001 0:00 make check -j 4 989 990 1 1 ? -1 S 30001 0:00 \_ make check-rec= ursive 990 991 1 1 ? -1 S 30001 0:00 \_ /gnu/store= /icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -c fail=3D; = \ if (target_option=3Dk; case ${target_option-} in ?) ;; *) echo "am__make_= running_with_option: internal error: invalid" "target option '${target_opti= on-}' specified" >&2; exit 1;; esac; has_opt=3Dno; sane_makeflags=3D$MAKEFL= AGS; if { if test -z '1'; then false; elif test -n 'x86_64-unknown-linux-gn= u'; then true; elif test -n '4.2.1' && test -n '/tmp/guix-build-shepherd-0.= 0.0.drv-0/source'; then true; else false; fi; }; then sane_makeflags=3D$MFL= AGS; else case $MAKEFLAGS in *\\[\ \.]*) bs=3D\\; sane_makeflags=3D`printf = '%s\n' "$MAKEFLAGS" | sed "s/$bs$bs[$bs $bs.]*//g"`;; esac; fi; skip_next= =3Dno; strip_trailopt () { flg=3D`printf '%s\n' "$flg" | sed "s/$1.*$//"`; = }; for flg in $sane_makeflags; do test $skip_next =3D yes && { skip_next=3D= no; continue; }; case $flg in *=3D*|--*) continue;; -*I) strip_trailopt 'I'= ; skip_next=3Dyes;; -*I?*) strip_trailopt 'I';; -*O) strip_trailopt 'O'; sk= ip_next=3Dyes;; -*O?*) strip_trailopt 'O';; -*l) strip_trailopt 'l'; skip_n= ext=3Dyes;; -*l?*) strip_trailopt 'l';; -[dEDm]) skip_next=3Dyes;; -[JT]) s= kip_next=3Dyes;; esac; case $flg in *$target_option*) has_opt=3Dyes; break;= ; esac; done; test $has_opt =3D yes); then \ failcom=3D'fail=3Dyes'; \ el= se \ failcom=3D'exit 1'; \ fi; \ dot_seen=3Dno; \ target=3D`echo check-re= cursive | sed s/-recursive//`; \ case "check-recursive" in \ distclean-* = | maintainer-clean-*) list=3D'po' ;; \ *) list=3D'po' ;; \ esac; \ for su= bdir in $list; do \ echo "Making $target in $subdir"; \ if test "$subdi= r" =3D "."; then \ dot_seen=3Dyes; \ local_target=3D"$target-am"; \= else \ local_target=3D"$target"; \ fi; \ (CDPATH=3D"${ZSH_VERSIO= N+.}:" && cd $subdir && make $local_target) \ || eval $failcom; \ done; = \ if test "$dot_seen" =3D "no"; then \ make "$target-am" || exit 1; \ fi= ; test -z "$fail" 991 998 1 1 ? -1 S 30001 0:00 \_ make c= heck-am 998 999 1 1 ? -1 S 30001 0:00 \_ ma= ke check-TESTS 999 1005 1 1 ? -1 S 30001 0:00 \= _ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash = -c set +e; bases=3D'tests/basic.log tests/respawn.log tests/respawn-throttl= ing.log tests/misbehaved-client.log tests/no-home.log tests/pid-file.log te= sts/status-sexp.log tests/forking-service.log tests/signals.log'; bases=3D`= for i in $bases; do echo $i; done | sed 's/\.log$//'`; bases=3D`echo $bases= `; \ log_list=3D`for i in $bases; do echo $i.log; done`; \ trs_list=3D`for = i in $bases; do echo $i.trs; done`; \ log_list=3D`echo $log_list`; trs_list= =3D`echo $trs_list`; \ make test-suite.log TEST_LOGS=3D"$log_list"; \ exit= $?; 1005 1014 1 1 ? -1 S 30001 0:00 = \_ make test-suite.log TEST_LOGS=3Dtests/basic.log tests/respawn.log tes= ts/respawn-throttling.log tests/misbehaved-client.log tests/no-home.log tes= ts/pid-file.log tests/status-sexp.log tests/forking-service.log tests/signa= ls.log 1014 1554 1 1 ? -1 S 30001 0:00 = \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/b= in/bash -c p=3D'tests/forking-service.sh'; \ case 'tests/forking-service.lo= g' in */*) case 'tests/forking-service' in */*) b=3D'tests/forking-service'= ;; *) b=3D`echo 'tests/forking-service.log' | sed 's/\.log$//'`; esac;; *) = b=3D'tests/forking-service';; esac; \ case $- in *e*) set +e;; esac; srcdir= strip=3D`echo "." | sed 's|.|.|g'`; case $p in ./*) f=3D`echo "$p" | sed "s= |^$srcdirstrip/||"`;; *) f=3D$p;; esac; { mgn=3D red=3D grn=3D lgn=3D blu= =3D brg=3D std=3D; am__color_tests=3Dno; if test "X" =3D Xno; then am__colo= r_tests=3Dno; elif test "X" =3D Xalways; then am__color_tests=3Dyes; elif t= est "X$TERM" !=3D Xdumb && { test -t 1; } 2>/dev/null; then am__color_tests= =3Dyes; fi; if test $am__color_tests =3D yes; then red=3D'.[0;31m'; grn=3D'= .[0;32m'; lgn=3D'.[1;32m'; blu=3D'.[1;34m'; mgn=3D'.[0;35m'; brg=3D'.[1m'; = std=3D'.[m'; fi; }; srcdir=3D.; export srcdir; case "tests/forking-service.= log" in */*) am__odir=3D`echo "./tests/forking-service.log" | sed 's|/[^/]*= $||'`;; *) am__odir=3D.;; esac; test "x$am__odir" =3D x"." || test -d "$am_= _odir" || /gnu/store/6i33ik7haav0hd5a797l3llkq04ghx6g-coreutils-8.28/bin/mk= dir -p "$am__odir" || exit $?; if test -f "./$f"; then dir=3D./; elif test = -f "$f"; then dir=3D; else dir=3D"./"; fi; tst=3D$dir$f; log=3D'tests/forki= ng-service.log'; if test -n ''; then am__enable_hard_errors=3Dno; else am__= enable_hard_errors=3Dyes; fi; case " " in *[\ \.]$f[\ \.]* | *[\ \.]$dir$f= [\ \.]*) am__expect_failure=3Dyes;; *) am__expect_failure=3Dno;; esac; unse= t XDG_CONFIG_HOME; unset LANGUAGE; LC_ALL=3DC LC_MESSAGES=3DC PATH=3D"/tmp/= guix-build-shepherd-0.0.0.drv-0/source:$PATH" SHELL=3D"/gnu/store/icz3hd36a= qpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash" GUILE=3D"/gnu/store/3= 8553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2.3/bin/guile" GUILE_LOAD_PATH=3D"/= tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:/tmp/guix-build-shepherd= -0.0.0.drv-0/source/modules:$GUILE_LOAD_PATH" GUILE_LOAD_COMPILED_PATH=3D"/= tmp/guix-build-shepherd-0.0.0.drv-0/source/modules:/tmp/guix-build-shepherd= -0.0.0.drv-0/source/modules:$GUILE_LOAD_COMPILED_PATH" /gnu/store/icz3hd36= aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash ./build-aux/test-driv= er --test-name "$f" \ --log-file $b.log --trs-file $b.trs \ --color-tests "= $am__color_tests" --enable-hard-errors "$am__enable_hard_errors" --expect-f= ailure "$am__expect_failure" -- /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgim= l1-bash-minimal-4.4.12/bin/bash -x -e \ "$tst"=20 1554 1561 1 1 ? -1 S 30001 0:00 = \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.= 12/bin/bash ./build-aux/test-driver --test-name tests/forking-service.sh --= log-file tests/forking-service.log --trs-file tests/forking-service.trs --c= olor-tests no --enable-hard-errors yes --expect-failure no -- /gnu/store/ic= z3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-4.4.12/bin/bash -x -e ./tests/f= orking-service.sh 1561 1562 1 1 ? -1 S 30001 0:00 = \_ /gnu/store/icz3hd36aqpjz5slyp4hhr8wsfbgiml1-bash-minimal-= 4.4.12/bin/bash -x -e ./tests/forking-service.sh 1562 1594 1 1 ? -1 Sl 30001 0:00 = \_ /gnu/store/38553wfz0jwlgbw13pk99xl79pbfx58d-guile-2.2= .3/bin/guile --no-auto-compile /tmp/guix-build-shepherd-0.0.0.drv-0/source/= shepherd -I -s t-socket-1562 -c t-conf-1562 -l t-log-1562 --pid=3Dt-pid-1562 1562 2005 1 1 ? -1 R 30001 0:00 = \_ ps axjf 1 1091 1 1 ? -1 Z 30001 0:00 [shepherd] 1 1114 1 1 ? -1 Z 30001 0:00 [shepherd] 1 1403 1123 1123 ? -1 Z 30001 0:00 [sleep] 1 1415 1 1 ? -1 Z 30001 0:00 [shepherd] 1 1433 1193 1193 ? -1 Z 30001 0:00 [sleep] 1 1480 1479 1479 ? -1 S 30001 0:00 sleep 600 1 1541 1427 1427 ? -1 Z 30001 0:00 [sleep] 1 1544 1471 1471 ? -1 Z 30001 0:00 [sleep] 1 1644 1638 1638 ? -1 Z 30001 0:00 [sleep] 1 1659 1658 1658 ? -1 S 30001 0:00 sleep 600 1 1705 1577 1577 ? -1 Z 30001 0:00 [sleep] 1 1707 1618 1618 ? -1 Z 30001 0:00 [sleep] 1 1770 1770 1770 ? -1 Zs 30001 0:00 [bash] 1 1826 1736 1736 ? -1 Z 30001 0:00 [sleep] 1 1888 1770 1770 ? -1 Z 30001 0:00 [sleep] --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable > Please add a comment in the handler saying that we resort to=20 > polling on > OSes that do not support =E2=80=98prctl=E2=80=99. > > However, perhaps we should do: > > (lambda args > (let ((errno (system-error-errno args))) > (if (=3D ENOSYS errno) > check-for-dead-services > (apply throw args)))) > > so that important/unexpected errors are not silently ignored. I had quite liked the idea that it would just ignore any error and=20 do the fallback, because really all we care about is "prctl=20 failed" when deciding on our fallback logic. I've decided to just=20 handle ENOSYS (prctl not available) and EINVAL (which is returned=20 when PR_SET_CHILD_SUBREAPER not available), and throw for=20 everything else. I'd love to be able to test this on platforms=20 where prctl will actually fail, though, because I don't like the=20 idea of committing code that I haven't actually been able to run. > If not, we should add in shepherd.texi, under =E2=80=9CSlots of=20 > services=E2=80=9D, a few > words saying that when =E2=80=98running=E2=80=99 is an integer it is assu= med to=20 > be a > PID. I've done this, but while doing it I realised that this has always=20 been true. The SIGCHLD handler has always assumed that a number=20 indicates a running process, my modifications haven't changed the=20 assumption, they've just widened its scope. Carlo --=-=-= Content-Type: text/x-patch Content-Disposition: inline; filename=0001-Add-prctl-syscall-wrapper-along-with-with-PR_SET_CHI.patch Content-Transfer-Encoding: quoted-printable From=20e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Sun, 4 Mar 2018 07:01:30 +1100 Subject: [PATCH 1/3] Add prctl syscall wrapper along with with PR_SET_CHILD_SUBREAPER. * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER. * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variable and export it. (prctl): Add new procedure and export it. =2D-- configure.ac | 4 ++++ modules/shepherd/system.scm.in | 21 ++++++++++++++++++++- 2 files changed, 24 insertions(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index bb5058d..fbe16f4 100644 =2D-- a/configure.ac +++ b/configure.ac @@ -72,7 +72,11 @@ esac AC_SUBST([RB_AUTOBOOT]) AC_SUBST([RB_HALT_SYSTEM]) AC_SUBST([RB_POWER_OFF]) +AC_MSG_RESULT([done]) =20 +AC_MSG_CHECKING([ constants]) +AC_COMPUTE_INT([PR_SET_CHILD_SUBREAPER], [PR_SET_CHILD_SUBREAPER], [#inclu= de ]) +AC_SUBST([PR_SET_CHILD_SUBREAPER]) AC_MSG_RESULT([done]) =20 dnl Manual pages. diff --git a/modules/shepherd/system.scm.in b/modules/shepherd/system.scm.in index a54dca7..09d45bf 100644 =2D-- a/modules/shepherd/system.scm.in +++ b/modules/shepherd/system.scm.in @@ -23,7 +23,9 @@ #:export (reboot halt power-off =2D max-file-descriptors)) + max-file-descriptors + prctl + PR_SET_CHILD_SUBREAPER)) =20 ;; The constants. (define RB_AUTOBOOT @RB_AUTOBOOT@) @@ -130,6 +132,23 @@ the returned procedure is called." (list err)) result))))) =20 +(define PR_SET_CHILD_SUBREAPER @PR_SET_CHILD_SUBREAPER@) + +(define prctl + (if (dynamic-func "prctl" (dynamic-link)) + (let ((proc (syscall->procedure long "prctl" (list int int)))) + (lambda (process operation) + "Perform an operation on the given process" + (let-values (((result err) (proc process operation))) + (if (=3D -1 result) + (throw 'system-error "prctl" "~A: ~S" + (list (strerror err) name) + (list err)) + result)))) + (lambda (process operation) + (throw 'system-error "prctl" "~A" (list strerror ENOSYS) + (list ENOSYS))))) + (define (max-file-descriptors) "Return the maximum number of open file descriptors allowed." (sysconf _SC_OPEN_MAX)) =2D-=20 2.16.1 --=-=-= Content-Type: text/x-patch Content-Disposition: inline; filename=0002-Handle-forked-process-SIGCHLD-signals.patch Content-Transfer-Encoding: quoted-printable From=20b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Sun, 4 Mar 2018 07:46:13 +1100 Subject: [PATCH 2/3] Handle forked process SIGCHLD signals * Makefile.am (TESTS): Add tests/forking-service.sh. * modules/shepherd.scm: Set the child subreaper attribute of main shepherd process (as long as we're not pid 1). * modules/shepherd/service.scm (root-service)[daemonize]: Set the child subreaper attribute of newly forked shepherd process. =2D-- Makefile.am | 1 + modules/shepherd.scm | 7 +++++++ modules/shepherd/service.scm | 4 +++- 3 files changed, 11 insertions(+), 1 deletion(-) diff --git a/Makefile.am b/Makefile.am index eafa308..8dad006 100644 =2D-- a/Makefile.am +++ b/Makefile.am @@ -190,6 +190,7 @@ TESTS =3D \ tests/no-home.sh \ tests/pid-file.sh \ tests/status-sexp.sh \ + tests/forking-service.sh \ tests/signals.sh =20 TEST_EXTENSIONS =3D .sh diff --git a/modules/shepherd.scm b/modules/shepherd.scm index df5420f..faa1e47 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -51,6 +51,13 @@ (define (main . args) (initialize-cli) =20 + (when (not (=3D 1 (getpid))) + ;; Register for orphaned processes to be reparented onto us when their + ;; original parent dies. This lets us handle SIGCHLD from daemon proce= sses + ;; that would otherwise have been reparented under pid 1. This is + ;; unnecessary when we are pid 1. + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) + (let ((config-file #f) (socket-file default-socket-file) (pid-file #f) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index 2224932..b6394f2 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -1274,7 +1274,9 @@ we want to receive these signals." (local-output "Running as PID 1, so not daemonizing.")) (else (if (zero? (primitive-fork)) =2D #t + (begin + (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1)) + #t) (primitive-exit 0)))))) (persistency "Safe the current state of running and non-running services. =2D-=20 2.16.1 --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0003-Poll-every-0.5s-to-find-dead-forked-services.patch Content-Transfer-Encoding: quoted-printable From=203d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): Use select with a timeout. If prctl failed when shepherd started then call check-for-dead-services between connections/timeouts. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. * doc/shepherd.texi (Slots of services): Add note about service running slot being a process id. =2D-- doc/shepherd.texi | 4 ++- modules/shepherd.scm | 47 ++++++++++++++++++------- modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++------------= ---- tests/basic.sh | 4 +-- tests/status-sexp.sh | 4 +-- 5 files changed, 95 insertions(+), 46 deletions(-) diff --git a/doc/shepherd.texi b/doc/shepherd.texi index 815091f..47005d5 100644 =2D-- a/doc/shepherd.texi +++ b/doc/shepherd.texi @@ -608,7 +608,9 @@ way. The default value is @code{#f}, which indicates t= hat the service is not running. When an attempt is made to start the service, it will be set to the return value of the procedure in the @code{start} slot. It will also be passed as an argument to the procedure in the =2D@code{stop} slot. This slot can not be initialized with a keyword. +@code{stop} slot. If it is set a value that is an integer, it is +assumed to be a process id, and shepherd will monitor the process for +unexpected exits. This slot can not be initialized with a keyword. =20 @item @vindex respawn? (slot of ) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index faa1e47..e912d21 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -49,14 +51,28 @@ ;; Main program. (define (main . args) =2D (initialize-cli) + (define poll-services + (if (=3D 1 (getpid)) + (lambda () #f) ;; If we're pid 1 then we don't need to set + ;; PR_SET_CHILD_SUBREAPER + (catch 'system-error + (lambda () + ;; Register for orphaned processes to be reparented onto us wh= en + ;; their original parent dies. This lets us handle SIGCHLD from + ;; daemon processes that would otherwise have been reparented + ;; under pid 1. This is unnecessary when we are pid 1. + (prctl PR_SET_CHILD_SUBREAPER 1) + (lambda () #f)) + (lambda args + ;; We fall back to polling for services on systems that don't + ;; support prctl/PR_SET_CHILD_SUBREAPER + (let ((errno (system-error-errno args))) + (if (or (=3D ENOSYS errno) ;; prctl not available + (=3D EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not av= ailable + check-for-dead-services ;; poll + (apply throw args))))))) =20 =2D (when (not (=3D 1 (getpid))) =2D ;; Register for orphaned processes to be reparented onto us when the= ir =2D ;; original parent dies. This lets us handle SIGCHLD from daemon pro= cesses =2D ;; that would otherwise have been reparented under pid 1. This is =2D ;; unnecessary when we are pid 1. =2D (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) + (initialize-cli) =20 (let ((config-file #f) (socket-file default-socket-file) @@ -225,11 +241,18 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) 0.5) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (poll-services) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index b6394f2..056483a 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1010,38 +1013,44 @@ otherwise by updating its state." ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv)) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + "Respawn a service that has stopped running unexpectedly. If we have +attempted to respawn the service a number of times already and it keeps dy= ing, +then disable it." + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f)))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1180,21 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + "Poll each process that we expect to be running, and respawn any which h= ave +unexpectedly stopped running. This procedure is used as a fallback on syst= ems +where prctl/PR_SET_CHILD_SUBREAPER is unsupported." + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqbClYACgkQqdyPv9aw Ibx21Q//fnbHM9kvJxr+wOxD5Lgqta9xifp0H6XSIxXziEKcqW0Lcvc6fqlZ8Nue hv+nRY/UZCcPOO9NYEaI2ut0jsebnEcuVskgB8H0bmzOntlLQVhhsYpYDusIH8ZI KdQHQd1gr3lDbPm7ABuz5SJqgNYbgAr9qyEPrdbC8BOmkMNUy5XQXzk2c65Ftxlp iQUJYdJ7i4YOgDTi9Mu24f1OZLqaenOpJ36y0S+j0rGS0Wt01c7dPBz+eIfaY1Us VwrxhSPlnvVsv8wpqy66zkvdsFz1nnO5xIr7TvteNuii2bXE5T5bKpN2OqS+Qf5Y txLrXmqz2AGykQsvs1xjsC7ygk0qjTfB682eXQ5bNCdgs88w0V+vTrEpEGAskeok EehE0h/m/PX9cDxAANuiJmOI1wA1tAAHrLLA0QQL1ZKDCwSJpemqPVSpiJXK/1GZ GBN9z6GLUR/Nt3M8iQCUiu9VnsenEq6QZ9LnklJ7YsbyElPbIGJppsc1kEVO9WQS xJ8dlCeq1NOP83kxfTEwPCiz84JRnuXPZmuZmtjMnxbMhsmyM8mOxj5W+gnm7kdX VifbAG4EBxRGv/u9kftrB0uN36r/ckVcXfry8hinwRmKi4p+bJhY9HgpRU06xK1A GigSQ4Efnnm0L6vsrUyX1i9Hju9YQgjTLh/maZkBj3sMJe2uCd0= =j13L -----END PGP SIGNATURE----- --==-=-=-- From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sun, 04 Mar 2018 22:12:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.1520201494777 (code B ref 30637); Sun, 04 Mar 2018 22:12:02 +0000 Received: (at 30637) by debbugs.gnu.org; 4 Mar 2018 22:11:34 +0000 Received: from localhost ([127.0.0.1]:44781 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1esbqv-0000CS-JU for submit@debbugs.gnu.org; Sun, 04 Mar 2018 17:11:33 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:33094) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1esbqt-0000CK-Re for 30637@debbugs.gnu.org; Sun, 04 Mar 2018 17:11:32 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id 4831211D75; Sun, 4 Mar 2018 23:11:31 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id jqMsctwHQnBC; Sun, 4 Mar 2018 23:11:29 +0100 (CET) Received: from ribbon (unknown [IPv6:2a01:e0a:1d:7270:af76:b9b:ca24:c465]) by hera.aquilenet.fr (Postfix) with ESMTPSA id 11C7510A74; Sun, 4 Mar 2018 23:11:28 +0100 (CET) From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> <87woys9961.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 14 =?UTF-8?Q?Vent=C3=B4se?= an 226 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Sun, 04 Mar 2018 23:11:28 +0100 In-Reply-To: <87woys9961.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Sun, 04 Mar 2018 07:49:26 +1100") Message-ID: <87371f4hkf.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Hello, Carlo Zancanaro skribis: > On Sat, Mar 03 2018, Ludovic Court=C3=A8s wrote: >> If they=E2=80=99re zombies, that means nobody called waitpid(2). Presuma= bly >> the >> polling code would need to do that. >> >> So I suppose =E2=80=98check-for-dead-services=E2=80=99 should do somethi= ng like: >> >> [ ... ] >> >> Does that make sense? Please check waitpid(2) carefully though, >> because >> it=E2=80=99s pretty gnarly and I might have forgotten or misinterpreted >> something here. > > Unfortunately we can't do that. We fall back to the polling approach > to handle the fact that the processes that we care about aren't our > children (hence we don't get SIGCHLD). The waitpid system call only > waits for processes which are children of the calling process. Oh right! > I looked into the zombie problem a bit more, and I found what the > problem actually is. In the build environment a guile process is > running as pid 1 (the *-guile-builder script for that job). This guile > process never handles SIGCHLD, and never calls wait/waitpid, so any > orphaned processes become zombies. I tried modifying derivations.scm, > but it wanted to rebuild a lot of things, so I gave up. I think we > need to add something like this to the *-guile-builder script: > > (sigaction SIGCHLD > (lambda () > (let loop () > (match (waitpid WAIT_ANY WNOHANG) > ((0 . _) #f) > ((pid . _) (loop)) > (_ #f)))) > SA_NOCLDSTOP) Good catch. We could add this in gnu-build-system.scm in core-updates, though it=E2=80=99s no big deal anyway since these are throw-away environme= nts. Thoughts? > From e529e4035eec147f448804dd10fdbca13a17f523 Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Sun, 4 Mar 2018 07:01:30 +1100 > Subject: [PATCH 1/3] Add prctl syscall wrapper along with with > PR_SET_CHILD_SUBREAPER. > > * configure.ac: Detect and substitute PR_SET_CHILD_SUBREAPER. > * modules/shepherd/system.scm.in (PR_SET_CHILD_SUBREAPER): Add new variab= le > and export it. > (prctl): Add new procedure and export it. Applied with a copyright line for you. > From b43c128d8a175a9a123eb7b1af465fb3747a5393 Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Sun, 4 Mar 2018 07:46:13 +1100 > Subject: [PATCH 2/3] Handle forked process SIGCHLD signals > > * Makefile.am (TESTS): Add tests/forking-service.sh. > * modules/shepherd.scm: Set the child subreaper attribute of main shepherd > process (as long as we're not pid 1). > * modules/shepherd/service.scm (root-service)[daemonize]: Set the child > subreaper attribute of newly forked shepherd process. Applied. However tests/forking-service.sh was missing, so I took it from the previous version of this patch and added it. I also changed the =E2=80=9CBashishms=E2=80=9D that were in that file, as discussed earlie= r. Let me know if anything=E2=80=99s wrong! > From 3d3c091660bbbd529af0058b0ba9b5ddbfc6b481 Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Wed, 21 Feb 2018 22:57:59 +1100 > Subject: [PATCH 3/3] Poll every 0.5s to find dead forked services > > * modules/shepherd.scm (open-server-socket): Set socket to be > non-blocking. > (main): Use select with a timeout. If prctl failed when shepherd started > then call check-for-dead-services between connections/timeouts. > * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHL= D as > signal handler. > (respawn-service): Separate logic for respawning services from handling > SIGCHLD. > (handle-SIGCHLD, check-for-dead-services): New exported procedures. > * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with > symbols. > * doc/shepherd.texi (Slots of services): Add note about service running s= lot > being a process id. [...] > --- a/modules/shepherd.scm > +++ b/modules/shepherd.scm > @@ -42,6 +42,8 @@ > (with-fluids ((%default-port-encoding "UTF-8")) > (let ((sock (socket PF_UNIX SOCK_STREAM 0)) > (address (make-socket-address AF_UNIX file-name))) > + (fcntl sock F_SETFL (logior O_NONBLOCK > + (fcntl sock F_GETFL))) > (bind sock address) > (listen sock 10) > sock))) > @@ -49,14 +51,28 @@ > > ;; Main program. > (define (main . args) > - (initialize-cli) > + (define poll-services > + (if (=3D 1 (getpid)) > + (lambda () #f) ;; If we're pid 1 then we don't need to set > + ;; PR_SET_CHILD_SUBREAPER [...] > + (match (select (list sock) (list) (list) 0.5) > + (((sock) _ _) > + (read-from sock)) > + (_ > + #f)) > + (poll-services) Here everyone ends up paying some overhead (the 0.5 second timeout), which isn=E2=80=99t great. How about something like: (define poll-services (and (not (=3D 1 (getpid))) =E2=80=A6)) (match (select (list sock) '() '() (if poll-services 0.5 0)) =E2=80=A6) ? Thanks, Ludo=E2=80=99. From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sun, 04 Mar 2018 22:37:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.15202029773065 (code B ref 30637); Sun, 04 Mar 2018 22:37:02 +0000 Received: (at 30637) by debbugs.gnu.org; 4 Mar 2018 22:36:17 +0000 Received: from localhost ([127.0.0.1]:44798 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1escEq-0000nM-Oa for submit@debbugs.gnu.org; Sun, 04 Mar 2018 17:36:17 -0500 Received: from mail-pg0-f44.google.com ([74.125.83.44]:41467) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1escEo-0000n9-OY for 30637@debbugs.gnu.org; Sun, 04 Mar 2018 17:36:15 -0500 Received: by mail-pg0-f44.google.com with SMTP id q27so6073471pgn.8 for <30637@debbugs.gnu.org>; Sun, 04 Mar 2018 14:36:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=sV9TKbbzrWm2QFfmr251bkIv5dB5idyItWT1d7vf2FA=; b=CwkgqWh3l78TRIKrIKazNw2gW6XpCRx+08JTgcZkNb0Vv5RGYe+VONlLvcHvDOMsFx EVN8MSdXkVHWkSktENCUL5JE2svEM5auGlE1JwO3wFCObpfL0bqpOndX3SkT/q1qUeh2 zySz+JmzU+k44aVAqbISiP5COjyX4/Nu+zSk1Tnl15zOjLjb2TuxwA+3WVOJWp363hZP oGBQvl99HEYuAltp6bPl4pI0b4MwSeD8fwyNF1u3D6t/qSSnM82yt0haIPx0XnRS0uXa Ob+Fq8RaqrANDYakibT5GtBA2VDfk9wFjktpc9pzxsEE62Xe0oH13qpg06C2ff5BuP+s B5XQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=sV9TKbbzrWm2QFfmr251bkIv5dB5idyItWT1d7vf2FA=; b=p1dWphzsau/5HdsUVtEuojbDJDahK7HfscrQKbRdunO0I9Tb3jvjFzld20X/4qRZ/j R8I8iYVeCz5pTQgGQThZqEnC/tnHn1e2RknzkJH+OLplQu9oE9dhD284AXhsqaZxRG8H 3VL4uiJBzyneZSjkb4sQDiKPeQI9Tq6F07Vsjcc33RASbd9ZTLOSBhNi1e4Z+iVzYgBY jFKJl1bGX6xjpXFk68DEll11mSlyW0zJZvJYYrG81Mrlj1uW521nYsj/qpbpx48EVsS+ AbEOpp/TZP0IWmqjJQWGMxOFuISbB4hkGc2ELLpyHoL+shXJIP5B7gTKipZuSyrmL0kP EsGA== X-Gm-Message-State: APf1xPACUDTSbmG3eGWdG9kPzAGZuYaxTXCg47mOsUOckzWDnM/+1QdL mAM5ya17ztEAZapPeBpdwyCpDFDz X-Google-Smtp-Source: AG47ELueddMz1E5bDhGagaiMkhyt+SXrGtkL6QW7JH7g7QzIeMvZ80ly6TA7lQLJ0nXdkGFc0sjufQ== X-Received: by 10.98.231.26 with SMTP id s26mr13230836pfh.210.1520202968114; Sun, 04 Mar 2018 14:36:08 -0800 (PST) Received: from pidgey (110-174-2-124.tpgi.com.au. [110.174.2.124]) by smtp.gmail.com with ESMTPSA id a21sm24205085pfl.137.2018.03.04.14.36.05 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Sun, 04 Mar 2018 14:36:07 -0800 (PST) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> <87woys9961.fsf@zancanaro.id.au> <87371f4hkf.fsf@gnu.org> User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro In-reply-to: <87371f4hkf.fsf@gnu.org> Date: Mon, 05 Mar 2018 09:35:58 +1100 Message-ID: <871sgzpiy9.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable On Sun, Mar 04 2018, Ludovic Court=C3=A8s wrote: > Good catch. We could add this in gnu-build-system.scm in=20 > core-updates, though it=E2=80=99s no big deal anyway since these are=20 > throw-away environments. > > Thoughts? The current forking-service.sh test fails in that environment, so=20 we won't be able to build shepherd on Hurd, or systems with Linux=20 pre 3.4. This is already the case without my third commit, though,=20 because the prctl fallback logic isn't in place yet. I think we should add it in core-updates. It does affect the=20 behaviour of processes within the build environment, and can lead=20 to test failures if people rely on pid 1 to reap zombie processes=20 (which, from what I understand, they should be able to). This=20 could even be leading to test failures in other packages which we=20 have just disabled. >> + (match (select (list sock) (list) (list) 0.5) >> + (((sock) _ _) >> + (read-from sock)) >> + (_ >> + #f)) >> + (poll-services) > > Here everyone ends up paying some overhead (the 0.5 second=20 > timeout), > which isn=E2=80=99t great. > > How about something like: > > (define poll-services > (and (not (=3D 1 (getpid))) > =E2=80=A6)) > > (match (select (list sock) '() '() (if poll-services 0.5 0)) > =E2=80=A6) The wait for 0.5 seconds is only an upper-bound for the timeout.=20 Changing it to a 0 would actually be worse, because it would spend=20 longer polling for running services. The `select` procedure waits=20 for `sock` to be ready to read from. When it's ready it returns=20 immediately, but if `sock` takes more than 0.5 seconds to be ready=20 then it will return anyway (and take the second branch in the=20 match, which does nothing). This should incur no (or extremely minuscule) overhead in how long=20 it takes to respond to a socket, but provides an opportunity every=20 half a second (at most) for shepherd to poll the running services. On reflection, we should also change the commit message for this=20 commit. I have attached a patch with a more accurate commit=20 message. Carlo --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-Poll-every-0.5s-to-find-dead-forked-services-if-prct.patch Content-Transfer-Encoding: quoted-printable From=205b01f79522c815dd8277298e87eef0506c2e8612 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fail= s. * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): Use select with a timeout. If prctl failed when shepherd started then call check-for-dead-services between connections/timeouts. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. * doc/shepherd.texi (Slots of services): Add note about service running slot being a process id. =2D-- doc/shepherd.texi | 4 ++- modules/shepherd.scm | 47 ++++++++++++++++++------- modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++------------= ---- tests/basic.sh | 4 +-- tests/status-sexp.sh | 4 +-- 5 files changed, 95 insertions(+), 46 deletions(-) diff --git a/doc/shepherd.texi b/doc/shepherd.texi index 815091f..47005d5 100644 =2D-- a/doc/shepherd.texi +++ b/doc/shepherd.texi @@ -608,7 +608,9 @@ way. The default value is @code{#f}, which indicates t= hat the service is not running. When an attempt is made to start the service, it will be set to the return value of the procedure in the @code{start} slot. It will also be passed as an argument to the procedure in the =2D@code{stop} slot. This slot can not be initialized with a keyword. +@code{stop} slot. If it is set a value that is an integer, it is +assumed to be a process id, and shepherd will monitor the process for +unexpected exits. This slot can not be initialized with a keyword. =20 @item @vindex respawn? (slot of ) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index faa1e47..e912d21 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -49,14 +51,28 @@ ;; Main program. (define (main . args) =2D (initialize-cli) + (define poll-services + (if (=3D 1 (getpid)) + (lambda () #f) ;; If we're pid 1 then we don't need to set + ;; PR_SET_CHILD_SUBREAPER + (catch 'system-error + (lambda () + ;; Register for orphaned processes to be reparented onto us wh= en + ;; their original parent dies. This lets us handle SIGCHLD from + ;; daemon processes that would otherwise have been reparented + ;; under pid 1. This is unnecessary when we are pid 1. + (prctl PR_SET_CHILD_SUBREAPER 1) + (lambda () #f)) + (lambda args + ;; We fall back to polling for services on systems that don't + ;; support prctl/PR_SET_CHILD_SUBREAPER + (let ((errno (system-error-errno args))) + (if (or (=3D ENOSYS errno) ;; prctl not available + (=3D EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not av= ailable + check-for-dead-services ;; poll + (apply throw args))))))) =20 =2D (when (not (=3D 1 (getpid))) =2D ;; Register for orphaned processes to be reparented onto us when the= ir =2D ;; original parent dies. This lets us handle SIGCHLD from daemon pro= cesses =2D ;; that would otherwise have been reparented under pid 1. This is =2D ;; unnecessary when we are pid 1. =2D (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) + (initialize-cli) =20 (let ((config-file #f) (socket-file default-socket-file) @@ -225,11 +241,18 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) 0.5) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (poll-services) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index b6394f2..056483a 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1010,38 +1013,44 @@ otherwise by updating its state." ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv)) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + "Respawn a service that has stopped running unexpectedly. If we have +attempted to respawn the service a number of times already and it keeps dy= ing, +then disable it." + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f)))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1180,21 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + "Poll each process that we expect to be running, and respawn any which h= ave +unexpectedly stopped running. This procedure is used as a fallback on syst= ems +where prctl/PR_SET_CHILD_SUBREAPER is unsupported." + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqcdM4ACgkQqdyPv9aw Ibwy9BAAqFvXHBS+P7tUIBC+Espr9Sk3pc3M4Aw4jTxSmtCeE9W6RJnBEu/gRpf8 q8f7mHaWKDaJ4yUvrfudO/OqVVgbuacuqHxaZg5/hVV9O7vw45wnVE/OkgdJNILT hrMGiETzy8nOperMcGgo9Al5zxxO19vKUtXWCi8J11+y57xhlcgvGb+xlNX/rcNu hpnR0leS7zmepkQ6BzNEe5Byn+YqqnUxk7MM4pE/hUlTzQiJW4RR/6lUvEkke1ya KBZp8GRCkpwNSs6OdnU3i0XYhicVZWoLjBHS43vE8xLtdgJcEh8iQQgAqLF6nPvx lYInJKjAyUYuN4eBcdcZ7tMY19brp1aud5X/AW9YQmJZSpHyVMNOP8FTwKfYJ//N 04h1JTqE9YP5Uu8asGeGcllSvxF4po2TCELAj6e2DZ2+IX+4WQ4LWGo+pF1Uifr4 ut5nREw5tvWA9Q6Bmexkvw5jvUpVjk/xYXtp4+m2mXWN9NBlnW6Z/f1sTWQucnbN u7YCET2Jf8oDy7VF5w1iL8OXfQIunpy+1Qf4GVADV9zgVvvOa7Eugd5oG+WGeLYA 44MpMqFRjl9ByjF1O5BhqDa9m/JaDwIG3pJyiYq0Ynn/Py6yPzsHkXIDs5y/HLGU AOtVMrpQX0ngETtsDeq8ZGuJpd44F6Xn2pV70ZTlnb5ciUZBj7k= =zUD8 -----END PGP SIGNATURE----- --==-=-=-- From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sun, 04 Mar 2018 22:50:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Carlo Zancanaro Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.15202037504237 (code B ref 30637); Sun, 04 Mar 2018 22:50:02 +0000 Received: (at 30637) by debbugs.gnu.org; 4 Mar 2018 22:49:10 +0000 Received: from localhost ([127.0.0.1]:44816 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1escRK-00016H-Ax for submit@debbugs.gnu.org; Sun, 04 Mar 2018 17:49:10 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:33240) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1escRI-000168-KK for 30637@debbugs.gnu.org; Sun, 04 Mar 2018 17:49:08 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id CE32B11ED0; Sun, 4 Mar 2018 23:49:07 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id YvHPHQXnxw3j; Sun, 4 Mar 2018 23:49:07 +0100 (CET) Received: from ribbon (unknown [IPv6:2a01:e0a:1d:7270:af76:b9b:ca24:c465]) by hera.aquilenet.fr (Postfix) with ESMTPSA id D340710A74; Sun, 4 Mar 2018 23:49:06 +0100 (CET) From: ludo@gnu.org (Ludovic =?UTF-8?Q?Court=C3=A8s?=) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> <87woys9961.fsf@zancanaro.id.au> <87371f4hkf.fsf@gnu.org> <871sgzpiy9.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 14 =?UTF-8?Q?Vent=C3=B4se?= an 226 de la =?UTF-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Sun, 04 Mar 2018 23:49:06 +0100 In-Reply-To: <871sgzpiy9.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Mon, 05 Mar 2018 09:35:58 +1100") Message-ID: <87o9k33199.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Carlo Zancanaro skribis: > On Sun, Mar 04 2018, Ludovic Court=C3=A8s wrote: >> Good catch. We could add this in gnu-build-system.scm in >> core-updates, though it=E2=80=99s no big deal anyway since these are >> throw-away environments. >> >> Thoughts? > > The current forking-service.sh test fails in that environment, so we > won't be able to build shepherd on Hurd, or systems with Linux pre > 3.4. This is already the case without my third commit, though, because > the prctl fallback logic isn't in place yet. > > I think we should add it in core-updates. It does affect the behaviour > of processes within the build environment, and can lead to test > failures if people rely on pid 1 to reap zombie processes (which, from > what I understand, they should be able to). This could even be leading > to test failures in other packages which we have just disabled. Yeah, makes sense. >>> + (match (select (list sock) (list) (list) 0.5) >>> + (((sock) _ _) >>> + (read-from sock)) >>> + (_ >>> + #f)) >>> + (poll-services) >> >> Here everyone ends up paying some overhead (the 0.5 second timeout), >> which isn=E2=80=99t great. >> >> How about something like: >> >> (define poll-services >> (and (not (=3D 1 (getpid))) >> =E2=80=A6)) >> >> (match (select (list sock) '() '() (if poll-services 0.5 0)) >> =E2=80=A6) > > The wait for 0.5 seconds is only an upper-bound for the > timeout. Changing it to a 0 would actually be worse, because it would > spend longer polling for running services. The `select` procedure > waits for `sock` to be ready to read from. When it's ready it returns > immediately, but if `sock` takes more than 0.5 seconds to be ready > then it will return anyway (and take the second branch in the match, > which does nothing). Sorry, I didn=E2=80=99t mean 0 but rather #f (indefinite wait). My point is: we shouldn=E2=80=99t wake up every 0.5 seconds for no reason. = IOW, we should wake up periodically only in the non-pid-1-no-prctl case. Does that make sense? Ludo=E2=80=99. From unknown Fri Aug 15 15:38:34 2025 X-Loop: help-debbugs@gnu.org Subject: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services Resent-From: Carlo Zancanaro Original-Sender: "Debbugs-submit" Resent-CC: guix-patches@gnu.org Resent-Date: Sun, 04 Mar 2018 23:09:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 30637 X-GNU-PR-Package: guix-patches X-GNU-PR-Keywords: To: Ludovic =?UTF-8?Q?Court=C3=A8s?= Cc: 30637@debbugs.gnu.org Received: via spool by 30637-submit@debbugs.gnu.org id=B30637.15202049025979 (code B ref 30637); Sun, 04 Mar 2018 23:09:03 +0000 Received: (at 30637) by debbugs.gnu.org; 4 Mar 2018 23:08:22 +0000 Received: from localhost ([127.0.0.1]:44826 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1escjs-0001YM-9q for submit@debbugs.gnu.org; Sun, 04 Mar 2018 18:08:22 -0500 Received: from mail-pg0-f53.google.com ([74.125.83.53]:44909) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1escjq-0001Y8-FB for 30637@debbugs.gnu.org; Sun, 04 Mar 2018 18:08:19 -0500 Received: by mail-pg0-f53.google.com with SMTP id l4so6092695pgp.11 for <30637@debbugs.gnu.org>; Sun, 04 Mar 2018 15:08:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:references:user-agent:from:to:cc:subject:in-reply-to:date :message-id:mime-version; bh=qTOkvvM3Isxjnr+lA8MqxLBSva3p0yecxMe7nr2T9ig=; b=JduqzKuUqeJj2du4DoavnRDcYInRfBBgjB0c1PpGDbTMCdlnFtnUKRtJiilNWImWFX mzTSUbNQ7uwpa4pr2wc83YsH+AjJOS7bV3Yg6+hNqglofJK1htPogzQVEQN6h+seZyLg QoBRy5gIMROYrG6Tzvrao0jw69rcT+/N5Q8h+iJH/1qe8RysH/O5rPTlpFRaWajpJAYo MOKEkqlNxGMk+HIFKqxku0j8dqDot5rt7N7qQLBo4DsoaTzY3uzuCjS6o2/8wj8p6s0Q 8zTjj2enRjEj/Jwuji4EWsPQr749Ky2lL7vjJhD3pC2JTUzDmycsOeI8b2B7D8jRi6LZ tpJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=qTOkvvM3Isxjnr+lA8MqxLBSva3p0yecxMe7nr2T9ig=; b=Yl6k1lp346pfLOqZzHYGG/tjkutVlcZEt6gxH4sFUCfY8QM26QqT50ZFXE3HxmH4mf 7fHciI+XycAVnd5xLLJtYGzyJ2j1FsjMuFYHiRkFPPVXEcoEhLGvFKOPK1W2nW1AkTCY i5zdt+bGn21SejgIH+J5mogYQUdagGqyyaQAQBoOVN5e4taum2uvxZflzSLIaNoMYkt+ Mep0NZjFqu23HV5d59yDqjqoCipWt+WXpS7trdyMMFmZeqee2cvNBVWkSf48YdZowlvu kDhXo3+dwf1LJKheEF+z/UttR+oEXjGCZ6opXZCkIwhO+AnExMhoVtpNjEzDNbd7UyaZ 7Vdg== X-Gm-Message-State: APf1xPCXGuDEgWYXYc7dlqKj8V8NFm7liLGzYZxk+ILhou9wQPno8os7 39QDHit2elq74Si/UtPqJom2h2ej X-Google-Smtp-Source: AG47ELsW4kFRMMuohQGMdKr5QyE3AE1WFKtSCYxy9Fy1bXyEY0BUM9R9B5zOHyfICrb2wh4wv79yGw== X-Received: by 10.99.110.70 with SMTP id j67mr10590122pgc.202.1520204892374; Sun, 04 Mar 2018 15:08:12 -0800 (PST) Received: from pidgey (110-174-2-124.tpgi.com.au. [110.174.2.124]) by smtp.gmail.com with ESMTPSA id v12sm25644353pfd.141.2018.03.04.15.08.10 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Sun, 04 Mar 2018 15:08:11 -0800 (PST) References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> <87woys9961.fsf@zancanaro.id.au> <87371f4hkf.fsf@gnu.org> <871sgzpiy9.fsf@zancanaro.id.au> <87o9k33199.fsf@gnu.org> User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro In-reply-to: <87o9k33199.fsf@gnu.org> Date: Mon, 05 Mar 2018 10:08:02 +1100 Message-ID: <87y3j7o2wd.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable On Sun, Mar 04 2018, Ludovic Court=C3=A8s wrote: > Sorry, I didn=E2=80=99t mean 0 but rather #f (indefinite wait). > > My point is: we shouldn=E2=80=99t wake up every 0.5 seconds for no=20 > reason. IOW, > we should wake up periodically only in the non-pid-1-no-prctl=20 > case. > > Does that make sense? Yep! That makes a lot more sense. Patch attached. Carlo --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-Poll-every-0.5s-to-find-dead-forked-services-if-prct.patch Content-Transfer-Encoding: quoted-printable From=20be442ea64e4fd8e235378a5f04d38296c0af9cf3 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fail= s. * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): If we are unable to use prctl/PR_SET_CHILD_SUBREAPER, then poll f= or service processes between client connections, or every 0.5 seconds. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. * doc/shepherd.texi (Slots of services): Add note about service running slot being a process id. =2D-- doc/shepherd.texi | 4 ++- modules/shepherd.scm | 46 ++++++++++++++++++------- modules/shepherd/service.scm | 82 ++++++++++++++++++++++++++++------------= ---- tests/basic.sh | 4 +-- tests/status-sexp.sh | 4 +-- 5 files changed, 94 insertions(+), 46 deletions(-) diff --git a/doc/shepherd.texi b/doc/shepherd.texi index 815091f..47005d5 100644 =2D-- a/doc/shepherd.texi +++ b/doc/shepherd.texi @@ -608,7 +608,9 @@ way. The default value is @code{#f}, which indicates t= hat the service is not running. When an attempt is made to start the service, it will be set to the return value of the procedure in the @code{start} slot. It will also be passed as an argument to the procedure in the =2D@code{stop} slot. This slot can not be initialized with a keyword. +@code{stop} slot. If it is set a value that is an integer, it is +assumed to be a process id, and shepherd will monitor the process for +unexpected exits. This slot can not be initialized with a keyword. =20 @item @vindex respawn? (slot of ) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index faa1e47..9d94881 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -49,14 +51,26 @@ ;; Main program. (define (main . args) =2D (initialize-cli) + (define poll-services? + (and (not (=3D 1 (getpid))) ;; if we're pid 1 we don't need to do anyt= hing + (catch 'system-error + (lambda () + ;; Register for orphaned processes to be reparented onto us w= hen + ;; their original parent dies. This lets us handle SIGCHLD fr= om + ;; daemon processes that would otherwise have been reparented + ;; under pid 1. Obviously this is unnecessary when we are pid= 1. + (prctl PR_SET_CHILD_SUBREAPER 1) + #f) ;; don't poll + (lambda args + ;; We fall back to polling for services on systems that don't + ;; support prctl/PR_SET_CHILD_SUBREAPER + (let ((errno (system-error-errno args))) + (if (or (=3D ENOSYS errno) ;; prctl not available + (=3D EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not a= vailable + #t ;; poll + (apply throw args))))))) =20 =2D (when (not (=3D 1 (getpid))) =2D ;; Register for orphaned processes to be reparented onto us when the= ir =2D ;; original parent dies. This lets us handle SIGCHLD from daemon pro= cesses =2D ;; that would otherwise have been reparented under pid 1. This is =2D ;; unnecessary when we are pid 1. =2D (catch-system-error (prctl PR_SET_CHILD_SUBREAPER 1))) + (initialize-cli) =20 (let ((config-file #f) (socket-file default-socket-file) @@ -225,11 +239,19 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) (if poll-services? 0.= 5 #f)) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (when poll-services? + (check-for-dead-services)) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index b6394f2..056483a 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1010,38 +1013,44 @@ otherwise by updating its state." ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv)) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + "Respawn a service that has stopped running unexpectedly. If we have +attempted to respawn the service a number of times already and it keeps dy= ing, +then disable it." + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f)))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1180,21 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + "Poll each process that we expect to be running, and respawn any which h= ave +unexpectedly stopped running. This procedure is used as a fallback on syst= ems +where prctl/PR_SET_CHILD_SUBREAPER is unsupported." + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqcfFIACgkQqdyPv9aw IbxffQ//SgthHJ/omxGO/YFzMtxFsg7tfrovt/8+coizpTKmnZKUTfmzh9WXxM1J zZ0efczqW6/0guxFBWZjtHKDkP7LSmXWxNuUq144R7aiubDF+mCBL3KnvaYKqR8E jnJoq1z3XUOKkaTsmo4SJqajkeANCPhX+1Qdb8A0GqDlrBJwElqm1cS1R91b8Joy nHVMc0LmQp7HXM+5x63hHFm3GChgFFN9l49azdmnSDIDGOQXXSRHnLcBEUVqBTVI CokCqXKYJlVs1NXoHXz2d6GBLKLtMw3BPTlbRCtMmQW8imGTeRbrJbNgMa3ao9fJ MJ2wAWLO6cLgoyx8TcZwheari985J/wQ0+s6oZCQPquEiU+2ealJLN14jMRz3Z+2 n4RWBLhW3+viHrZn84UL69+Q+KdK6g8KWICR2qcWUOiVQ7HyINf9V35mXjRj1X6D ARCQFDDsa8mXJQyP7df0yA8rgmBMPYYdvwozNeeo6fIzgRJp0/3qYv6AuA8eTFPL veikvZiI6/DTZNvSE5Z5VDKxBRnZjd3wAPweL0l4Mb5U3KsR/qk6TfQyImF1b0KN Le6QHQ0frj6BVBSRpT31KFeIFpUlO+OhbJUj+oDnLUrAO+zLvEEh+BeaOjchAW/Y FtgOcgPO5PJF1DJvqW8ZBtGYknNFlaveHZJoG9Yg/gSUvD9Q8Ic= =gmH2 -----END PGP SIGNATURE----- --==-=-=-- From unknown Fri Aug 15 15:38:34 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: Carlo Zancanaro Subject: bug#30637: closed (Re: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services) Message-ID: References: <87d10ia9sh.fsf@gnu.org> <878tbe9jvx.fsf@zancanaro.id.au> X-Gnu-PR-Message: they-closed 30637 X-Gnu-PR-Package: guix-patches Reply-To: 30637@debbugs.gnu.org Date: Mon, 05 Mar 2018 14:16:02 +0000 Content-Type: multipart/mixed; boundary="----------=_1520259362-570-1" This is a multi-part message in MIME format... ------------=_1520259362-570-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #30637: [WIP] shepherd: Poll every 0.5s to find dead forked services which was filed against the guix-patches package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 30637@debbugs.gnu.org. --=20 30637: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D30637 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1520259362-570-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 30637-done) by debbugs.gnu.org; 5 Mar 2018 14:15:17 +0000 Received: from localhost ([127.0.0.1]:45258 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1esqtZ-00008G-A7 for submit@debbugs.gnu.org; Mon, 05 Mar 2018 09:15:17 -0500 Received: from hera.aquilenet.fr ([185.233.100.1]:38594) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1esqtV-000085-Jk for 30637-done@debbugs.gnu.org; Mon, 05 Mar 2018 09:15:15 -0500 Received: from localhost (localhost [127.0.0.1]) by hera.aquilenet.fr (Postfix) with ESMTP id 9F628112CA; Mon, 5 Mar 2018 15:15:12 +0100 (CET) X-Virus-Scanned: Debian amavisd-new at aquilenet.fr Received: from hera.aquilenet.fr ([127.0.0.1]) by localhost (hera.aquilenet.fr [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id mti1HnQmq72V; Mon, 5 Mar 2018 15:15:10 +0100 (CET) Received: from ribbon (unknown [193.50.110.134]) by hera.aquilenet.fr (Postfix) with ESMTPSA id 85D97CA23; Mon, 5 Mar 2018 15:15:10 +0100 (CET) From: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) To: Carlo Zancanaro Subject: Re: [bug#30637] [WIP] shepherd: Poll every 0.5s to find dead forked services References: <878tbe9jvx.fsf@zancanaro.id.au> <87y3jcu5v5.fsf@gnu.org> <87d10nwhfl.fsf@zancanaro.id.au> <87r2p2izgz.fsf@gnu.org> <87371ihjj2.fsf@zancanaro.id.au> <87po4mhcn2.fsf@gnu.org> <87inadr3np.fsf@zancanaro.id.au> <87h8px9od8.fsf@gnu.org> <87woys9961.fsf@zancanaro.id.au> <87371f4hkf.fsf@gnu.org> <871sgzpiy9.fsf@zancanaro.id.au> <87o9k33199.fsf@gnu.org> <87y3j7o2wd.fsf@zancanaro.id.au> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 15 =?utf-8?Q?Vent=C3=B4se?= an 226 de la =?utf-8?Q?R?= =?utf-8?Q?=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Mon, 05 Mar 2018 15:15:10 +0100 In-Reply-To: <87y3j7o2wd.fsf@zancanaro.id.au> (Carlo Zancanaro's message of "Mon, 05 Mar 2018 10:08:02 +1100") Message-ID: <87d10ia9sh.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (gnu/linux) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: 1.0 (+) X-Debbugs-Envelope-To: 30637-done Cc: 30637-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Carlo Zancanaro skribis: > From be442ea64e4fd8e235378a5f04d38296c0af9cf3 Mon Sep 17 00:00:00 2001 > From: Carlo Zancanaro > Date: Wed, 21 Feb 2018 22:57:59 +1100 > Subject: [PATCH] Poll every 0.5s to find dead forked services if prctl fa= ils. > > * modules/shepherd.scm (open-server-socket): Set socket to be > non-blocking. > (main): If we are unable to use prctl/PR_SET_CHILD_SUBREAPER, then poll= for > service processes between client connections, or every 0.5 seconds. > * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHL= D as > signal handler. > (respawn-service): Separate logic for respawning services from handling > SIGCHLD. > (handle-SIGCHLD, check-for-dead-services): New exported procedures. > * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with > symbols. > * doc/shepherd.texi (Slots of services): Add note about service running s= lot > being a process id. Awesome. Applied with minor cosmetic changes (see below). Thank you! Ludo=E2=80=99. --=-=-= Content-Type: text/x-patch Content-Disposition: inline diff --git a/modules/shepherd.scm b/modules/shepherd.scm index 9d94881..2b4a7b5 100644 --- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -52,7 +52,8 @@ ;; Main program. (define (main . args) (define poll-services? - (and (not (= 1 (getpid))) ;; if we're pid 1 we don't need to do anything + ;; Do we need polling to find out whether services died? + (and (not (= 1 (getpid))) ;if we're pid 1, we don't (catch 'system-error (lambda () ;; Register for orphaned processes to be reparented onto us when @@ -60,14 +61,13 @@ ;; daemon processes that would otherwise have been reparented ;; under pid 1. Obviously this is unnecessary when we are pid 1. (prctl PR_SET_CHILD_SUBREAPER 1) - #f) ;; don't poll + #f) ;don't poll (lambda args ;; We fall back to polling for services on systems that don't - ;; support prctl/PR_SET_CHILD_SUBREAPER + ;; support prctl/PR_SET_CHILD_SUBREAPER. (let ((errno (system-error-errno args))) - (if (or (= ENOSYS errno) ;; prctl not available - (= EINVAL errno)) ;; PR_SET_CHILD_SUBREAPER not available - #t ;; poll + (or (= ENOSYS errno) ;prctl unavailable + (= EINVAL errno) ;PR_SET_CHILD_SUBREAPER unavailable (apply throw args))))))) (initialize-cli) --=-=-=-- ------------=_1520259362-570-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 27 Feb 2018 21:57:11 +0000 Received: from localhost ([127.0.0.1]:36067 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eqnFA-0001eQ-7n for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:57:11 -0500 Received: from eggs.gnu.org ([208.118.235.92]:48619) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eqnF4-0001dt-Bc for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:57:03 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eqnEw-0004a8-Se for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:56:53 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:54148) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1eqnEw-0004Zs-NQ for submit@debbugs.gnu.org; Tue, 27 Feb 2018 16:56:50 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:39578) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eqnEu-0006eU-5d for guix-patches@gnu.org; Tue, 27 Feb 2018 16:56:50 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eqnEs-0004Y4-2i for guix-patches@gnu.org; Tue, 27 Feb 2018 16:56:48 -0500 Received: from mail-wr0-x22a.google.com ([2a00:1450:400c:c0c::22a]:37299) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1eqnEr-0004WN-Mm for guix-patches@gnu.org; Tue, 27 Feb 2018 16:56:46 -0500 Received: by mail-wr0-x22a.google.com with SMTP id z12so331776wrg.4 for ; Tue, 27 Feb 2018 13:56:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:user-agent:from:to:subject:date:message-id:mime-version; bh=9py7Z0DpH6wyAuDMs0n7X1ExAtz8dmoBmitkvrBiU2M=; b=A/ZlqJrJG006k/35y9qHW3KEfNICyUoBjPL2lJT2Uwx6vHEyJuzwe8ShRVwNX3/TsT sysB/8o6LV9girkRw4h1ETrEzFPXvGI9h+Gxp3c710QnNLgFwuA1ynB5tk0RPc0m0A5y NDVddaxXogVwLHMoH0O+fwLzK7bqkOYSJ2VhOvZLhni8krKjzjibDg3JWz/U509euCCz yB9ZP2pb5TkwrQIsow4QDTs3UHxcdEsABMZohK4IQZ7zPS5QXgmTNXfT7cybuYNglR4W xnpBsIgtJenNtSzFRRXvvw8o83MxmuaDZHc9dUHmcQxb8oNqxk5ygcY8VGceuHwiBC1j bG0A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:user-agent:from:to:subject:date :message-id:mime-version; bh=9py7Z0DpH6wyAuDMs0n7X1ExAtz8dmoBmitkvrBiU2M=; b=XNoHAz33FpqV2xcyfvBx7knzTNUeYgdBpN+Iq1I3jbUlQdBaOrR0GIfg5a5qMTxu/H EH033wuHZz8eDQLohF2lmGJPpXb+H4F+yHsAS528XYkucN59VF0V3BlQuWhHv7w6vxyL WBFZP/bKy0z4j2qN6p0pJCny5pkDqs5ofI36CfsFX9EDDQsR05EQDb72hLA0SZBZEju5 NqRXHfT5oPJSER4mfn58Uq4mXmw00LcqNKFvHw5UExHMRLQb5tBU5etG/tblT6GPMmhm KU6yvy2pLvleMc+wBlsb2IHjv/Wdr24T6o6NGYoP1G7cK+TZnyjx6u0UrRMpqoFsmHVt sA2w== X-Gm-Message-State: APf1xPA9YPv1qNxpq5DiLztbJv41ypEGrSA5FNs1OAWwQyjpV3zqCgL8 idcYyZ3BBu0aTFKqhAWRF9Z2W66N X-Google-Smtp-Source: AH8x227IJBfYS3+HsrBbvfdSpU+f9EIRCAM71EavblfoROWzdE78qLjmXqODy60oybEt0qCAlbSnlg== X-Received: by 10.223.190.134 with SMTP id i6mr13899355wrh.157.1519768603861; Tue, 27 Feb 2018 13:56:43 -0800 (PST) Received: from pidgey ([2a00:23c0:4e80:d501:8f8:e119:5f6d:87e3]) by smtp.gmail.com with ESMTPSA id p104sm161020wrb.47.2018.02.27.13.56.41 for (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 27 Feb 2018 13:56:43 -0800 (PST) User-agent: mu4e 1.0; emacs 25.3.1 From: Carlo Zancanaro To: guix-patches@gnu.org Subject: [WIP] shepherd: Poll every 0.5s to find dead forked services Date: Wed, 28 Feb 2018 08:56:34 +1100 Message-ID: <878tbe9jvx.fsf@zancanaro.id.au> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -3.5 (---) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Another patch for shepherd! This one I'm much less happy about, but I'm sending it for feedback. I'll explain the problem it solves, then the (hacky) way that it solves it for now. I'm sure there's a better way to solve this, but it was annoying me enough for me to do this now. [...] Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.2 HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (carlozancanaro[at]gmail.com) 1.0 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.0 T_DKIM_INVALID DKIM-Signature header exists but is not valid 0.2 FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; format=flowed Another patch for shepherd! This one I'm much less happy about, but I'm sending it for feedback. I'll explain the problem it solves, then the (hacky) way that it solves it for now. I'm sure there's a better way to solve this, but it was annoying me enough for me to do this now. The problem is that shepherd, when run as a user process, can "lose" services which fork away. Shepherd can still kill them, but a SIGCHLD won't be delivered if they die, so shepherd can't restart/disable them. My prime example is emacs, which I run with --daemon. If I then kill emacs, shepherd will still think that it is running. To solve this problem, I have modified shepherd to poll each process that it thinks is running. I think this is approximately every 0.5s, but I don't quite understand guile's behaviour when it comes to socket timeouts. This involves a subtle change in the meaning of the running field in objects. My patch treats any `integer?` as a pid, and if a corresponding process is not found it will attempt to restart the service. I considered creating a new "pid" datatype to make it clear when a number represents a pid, but didn't want to go overboard without further feedback. So, thoughts? Can anyone suggest a better way to solve this problem? I'm also confused by my new test. If I run it myself then it passes fine, but in a guix build container there is something different which means that the processes don't get reaped properly, so the test doesn't terminate. I'll keep trying to work it out, but if anyone else has an idea I'd love to hear it. Carlo --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline; filename=0001-shepherd-Poll-every-0.5s-to-find-dead-forked-service.patch Content-Transfer-Encoding: quoted-printable From=20ac87535ebdcc7490d9bbd552c99337917e42a6e8 Mon Sep 17 00:00:00 2001 From: Carlo Zancanaro Date: Wed, 21 Feb 2018 22:57:59 +1100 Subject: [PATCH] shepherd: Poll every 0.5s to find dead forked services * modules/shepherd.scm (open-server-socket): Set socket to be non-blocking. (main): Use select with a timeout, and call check-for-dead-services betwe= en connections/timeouts. * modules/shepherd/service.scm (fork+exec-command): Install handle-SIGCHLD = as signal handler. (respawn-service): Separate logic for respawning services from handling SIGCHLD. (handle-SIGCHLD, check-for-dead-services): New exported procedures. * tests/basic.sh, tests/status-sexp.sh: Replace constant integers with symbols. * tests/forking-service.sh: New file. * Makefile.am: Add tests/forking-service.sh. =2D-- Makefile.am | 3 +- modules/shepherd.scm | 19 ++++++-- modules/shepherd/service.scm | 78 ++++++++++++++++++------------ tests/basic.sh | 4 +- tests/forking-service.sh | 111 +++++++++++++++++++++++++++++++++++++++= ++++ tests/status-sexp.sh | 4 +- 6 files changed, 179 insertions(+), 40 deletions(-) create mode 100644 tests/forking-service.sh diff --git a/Makefile.am b/Makefile.am index 1c394e1..5eb058f 100644 =2D-- a/Makefile.am +++ b/Makefile.am @@ -189,7 +189,8 @@ TESTS =3D \ tests/no-home.sh \ tests/pid-file.sh \ tests/status-sexp.sh \ =2D tests/signals.sh + tests/signals.sh \ + tests/forking-service.sh =20 TEST_EXTENSIONS =3D .sh EXTRA_DIST +=3D $(TESTS) diff --git a/modules/shepherd.scm b/modules/shepherd.scm index 650ba63..cdcd328 100644 =2D-- a/modules/shepherd.scm +++ b/modules/shepherd.scm @@ -42,6 +42,8 @@ (with-fluids ((%default-port-encoding "UTF-8")) (let ((sock (socket PF_UNIX SOCK_STREAM 0)) (address (make-socket-address AF_UNIX file-name))) + (fcntl sock F_SETFL (logior O_NONBLOCK + (fcntl sock F_GETFL))) (bind sock address) (listen sock 10) sock))) @@ -218,11 +220,18 @@ (_ #t)) =20 (let next-command () =2D (match (accept sock) =2D ((command-source . client-address) =2D (setvbuf command-source _IOFBF 1024) =2D (process-connection command-source)) =2D (_ #f)) + (define (read-from sock) + (match (accept sock) + ((command-source . client-address) + (setvbuf command-source _IOFBF 1024) + (process-connection command-source)) + (_ #f))) + (match (select (list sock) (list) (list) 0.5) + (((sock) _ _) + (read-from sock)) + (_ + #f)) + (check-for-dead-services) (next-command)))))) =20 (define (process-connection sock) diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm index 83600e4..22972c5 100644 =2D-- a/modules/shepherd/service.scm +++ b/modules/shepherd/service.scm @@ -3,6 +3,7 @@ ;; Copyright (C) 2002, 2003 Wolfgang J=C3=A4rling ;; Copyright (C) 2014 Alex Sassmannshausen ;; Copyright (C) 2016 Alex Kost +;; Copyright (C) 2018 Carlo Zancanaro ;; ;; This file is part of the GNU Shepherd. ;; @@ -64,6 +65,7 @@ for-each-service lookup-services respawn-service + handle-SIGCHLD register-services provided-by required-by @@ -77,6 +79,7 @@ make-system-destructor make-init.d-service =20 + check-for-dead-services root-service make-actions =20 @@ -800,7 +803,7 @@ false." its PID." ;; Install the SIGCHLD handler if this is the first fork+exec-command ca= ll (unless %sigchld-handler-installed? =2D (sigaction SIGCHLD respawn-service SA_NOCLDSTOP) + (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP) (set! %sigchld-handler-installed? #t)) (let ((pid (primitive-fork))) (if (zero? pid) @@ -991,7 +994,7 @@ child left." what (strerror errno)) '(0 . #f))))))) =20 =2D(define (respawn-service signum) +(define (handle-SIGCHLD signum) "Handle SIGCHLD, possibly by respawning the service that just died, or otherwise by updating its state." (let loop () @@ -1009,39 +1012,42 @@ otherwise by updating its state." =20 ;; SERV can be #f for instance when this code runs just after a ;; service's 'stop' method killed its process and completed. =2D (when serv =2D (slot-set! serv 'running #f) =2D (if (and (respawn? serv) =2D (not (respawn-limit-hit? (slot-ref serv 'last-respaw= ns) =2D (car respawn-limit) =2D (cdr respawn-limit)))) =2D (if (not (slot-ref serv 'waiting-for-termination?)) =2D (begin =2D ;; Everything is okay, start it. =2D (local-output "Respawning ~a." =2D (canonical-name serv)) =2D (slot-set! serv 'last-respawns =2D (cons (current-time) =2D (slot-ref serv 'last-respawns))) =2D (start serv)) =2D ;; We have just been waiting for the =2D ;; termination. The `running' slot has already =2D ;; been set to `#f' by `stop'. =2D (begin =2D (local-output "Service ~a terminated." =2D (canonical-name serv)) =2D (slot-set! serv 'waiting-for-termination? #f))) =2D (begin =2D (local-output "Service ~a has been disabled." =2D (canonical-name serv)) =2D (when (respawn? serv) =2D (local-output " (Respawning too fast.)")) =2D (slot-set! serv 'enabled? #f)))) + (respawn-service serv) =20 ;; As noted in libc's manual (info "(libc) Process Completion"), ;; loop so we don't miss any terminated child process. (loop)))))) =20 +(define (respawn-service serv) + (when serv + (slot-set! serv 'running #f) + (if (and (respawn? serv) + (not (respawn-limit-hit? (slot-ref serv 'last-respawns) + (car respawn-limit) + (cdr respawn-limit)))) + (if (not (slot-ref serv 'waiting-for-termination?)) + (begin + ;; Everything is okay, start it. + (local-output "Respawning ~a." + (canonical-name serv)) + (slot-set! serv 'last-respawns + (cons (current-time) + (slot-ref serv 'last-respawns))) + (start serv)) + ;; We have just been waiting for the + ;; termination. The `running' slot has already + ;; been set to `#f' by `stop'. + (begin + (local-output "Service ~a terminated." + (canonical-name serv)) + (slot-set! serv 'waiting-for-termination? #f))) + (begin + (local-output "Service ~a has been disabled." + (canonical-name serv)) + (when (respawn? serv) + (local-output " (Respawning too fast.)")) + (slot-set! serv 'enabled? #f))))) + ;; Add NEW-SERVICES to the list of known services. (define (register-services . new-services) (define (register-single-service new) @@ -1171,6 +1177,18 @@ file when persistence is enabled." (lambda (p) (format p "~{~a ~}~%" running-services)))))) =20 +(define (check-for-dead-services) + (define (process-exists? pid) + (catch #t + (lambda () (kill pid 0) #t) + (lambda _ #f))) + (for-each-service (lambda (service) + (let ((running (slot-ref service 'running))) + (when (and (integer? running) + (not (process-exists? running))) + (local-output "PID ~a (~a) is dead!" running (= canonical-name service)) + (respawn-service service)))))) + (define root-service (make #:docstring "The root service is used to operate on shepherd itself." diff --git a/tests/basic.sh b/tests/basic.sh index 1ddb334..2ecd8fb 100644 =2D-- a/tests/basic.sh +++ b/tests/basic.sh @@ -150,7 +150,7 @@ cat > "$confdir/some-conf.scm" < #:provides '(test-loaded) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f))) EOF =20 @@ -166,7 +166,7 @@ $herd status test-loaded $herd status test-loaded | grep stopped =20 $herd start test-loaded =2D$herd status test-loaded | grep -i 'running.*42' +$herd status test-loaded | grep -i 'running.*abc' $herd stop test-loaded $herd unload root test-loaded =20 diff --git a/tests/forking-service.sh b/tests/forking-service.sh new file mode 100644 index 0000000..90c684a =2D-- /dev/null +++ b/tests/forking-service.sh @@ -0,0 +1,111 @@ +# GNU Shepherd --- Test detecting a forked process' termination +# Copyright =C2=A9 2016 Ludovic Court=C3=A8s +# Copyright =C2=A9 2018 Carlo Zancanaro +# +# This file is part of the GNU Shepherd. +# +# The GNU Shepherd is free software; you can redistribute it and/or modify= it +# under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3 of the License, or (at +# your option) any later version. +# +# The GNU Shepherd is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with the GNU Shepherd. If not, see . + +shepherd --version +herd --version + +socket=3D"t-socket-$$" +conf=3D"t-conf-$$" +log=3D"t-log-$$" +pid=3D"t-pid-$$" +service_pid=3D"t-service-pid-$$" +service2_pid=3D"t-service2-pid-$$" +service2_started=3D"t-service2-starts-$$" + +herd=3D"herd -s $socket" + +function cleanup() { + cat $log || true + rm -f $socket $conf $log $service2_started + test -f $pid && kill "$(cat $pid)" || true + rm -f $pid + test -f $service_pid && kill "$(cat $service_pid)" || true + rm -f $service_pid + test -f $service2_pid && kill "$(cat $service2_pid)" || true + rm -f $service2_pid +} + +trap cleanup EXIT + +cat > "$conf"< $PWD/$service_pid")) + +(register-services + (make + ;; A service that forks into a different process. + #:provides '(test) + #:start (make-forkexec-constructor %command + #:pid-file "$PWD/$service_pid") + #:stop (make-kill-destructor) + #:respawn? #f)) + +(define %command2 + '("$SHELL" "-c" "echo started >> $PWD/$service2_started; sleep 600 & ech= o \$! > $PWD/$service2_pid")) + +(register-services + (make + ;; A service that forks into a different process. + #:provides '(test2) + #:start (make-forkexec-constructor %command2 + #:pid-file "$PWD/$service2_pid") + #:stop (make-kill-destructor) + #:respawn? #t)) +EOF +cat $conf + +rm -f "$pid" +shepherd -I -s "$socket" -c "$conf" -l "$log" --pid=3D"$pid" & + +# Wait till it's ready. +while ! test -f "$pid" ; do sleep 0.3 ; done + +shepherd_pid=3D"$(cat $pid)" + +# start both of the services +$herd start test +$herd start test2 + +# make sure "test" is started +until $herd status test | grep started; do sleep 0.3; done +test -f "$service_pid" +service_pid_value=3D"$(cat $service_pid)" +# now kill it +kill "$service_pid_value" +while kill -0 "$service_pid_value"; do sleep 0.3; done +# shepherd should notice that the service has stopped within one second +sleep 1 +$herd status test | grep stopped + + + +# make sure "test2" has started +until $herd status test2 | grep started; do sleep 0.3; done +test -f "$service2_pid" +service2_pid_value=3D"$(cat $service2_pid)" +test "$(cat $PWD/$service2_started)" =3D "started" +# now kill it +rm -f "$service2_pid" +kill $service2_pid_value +while kill -0 "$service2_pid_value"; do sleep 0.3; done +# shepherd should notice that the service has stopped, and restart it, wit= hin one second +sleep 1; +$herd status test2 | grep started +test "$(cat $PWD/$service2_started)" =3D "started +started" diff --git a/tests/status-sexp.sh b/tests/status-sexp.sh index b7c8cb4..11b967e 100644 =2D-- a/tests/status-sexp.sh +++ b/tests/status-sexp.sh @@ -33,7 +33,7 @@ cat > "$conf"< #:provides '(foo) =2D #:start (const 42) + #:start (const 'abc) #:stop (const #f) #:docstring "Foo!" #:respawn? #t) @@ -85,7 +85,7 @@ root_service_sexp=3D" (service (version 0) (provides (foo)) (requires ()) (respawn? #t) (docstring \"Foo!\") =2D (enabled? #t) (running 42) (conflicts ()) + (enabled? #t) (running abc) (conflicts ()) (last-respawns ())) (service (version 0) (provides (bar)) (requires (foo)) =2D-=20 2.16.1 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQIzBAEBCAAdFiEE1lpncq7JnOkt+LaeqdyPv9awIbwFAlqV1BIACgkQqdyPv9aw Ibz/Nw/5AdjbhliDDcHtRm/WW9Mf/cMZdbNQU2Dl2+/Ig/yQ4KgtZcIEtGxQV49E 0TYIXnbD0lJ7KWR/+4bttlzBcXe9xZSNObMC5OcM68Kvuikmrp9DsjjttNJ6kLcx 056Meze8bCoY9V9C+3MyESXhkipVA6AdvqSfR+z+0bsbl2xaebZyufMdbooEjTAg Ift3U8A073NyGKnMtEn5JAcVbyn8djFx37tZkBbB+nYRGhzy5sodsMRb0EOREo/8 Wxfrpg8fIxzNE+vppEBPbk5E2F823j+cIVgQmztiT3Ia1epYlKR6ooj6sIzoFI2T U9LT43swEFcsk0VazvjBx1ADGjtdT3cNhVXfdnCrHvBOhHsDg9mA9rJ4/ILpEosP Wbc2W3Yym8N5DKc3yqDgSsJ1enwUhW6XP/RN6p9kdxOBbykhStp0s1xsILmXNzsW 3fDi9h14WVhq+ZbyWbaoHRQZ/hNKBsOxoOAjimI5/NjfGzIMWJWUcwyTSe1HN/MK v3a4Uo04UEMva/CX+B6+1oi6O7X068elROfPmn1BhZkhRrUV9eDugaGqgasGRpgV QEWc5rLidIwuSnXWeJqfMbi8wldcrWh5joqIchW0dKC8kuE4t2iXL8RYzSkEepkA 1Lde+ema8r1VA971yE4xNuftH38AGBhBtWdQ1PseMXaGS7uoCBk= =yPdk -----END PGP SIGNATURE----- --==-=-=-- ------------=_1520259362-570-1--