From unknown Thu Aug 14 17:27:05 2025
From: bug#76790 <76790@debbugs.gnu.org>
To: bug#76790 <76790@debbugs.gnu.org>
Subject: Status: [Shepherd] Handling process termination before service is running
Reply-To: bug#76790 <76790@debbugs.gnu.org>
Date: Fri, 15 Aug 2025 00:27:05 +0000

retitle 76790 [Shepherd] Handling process termination before service is running
reassign 76790 guix
submitter 76790 Ludovic Courtès
severity 76790 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Thu Mar 06 15:46:26 2025
From: Ludovic Courtès
To: bug-guix@gnu.org
Subject: [Shepherd] Handling process termination before service is running
Date: Thu, 06 Mar 2025 21:45:56 +0100
Message-ID: <87ikolx4zf.fsf@inria.fr>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8

While on a quest for flaky tests in the Shepherd, I found a genuine bug
that would manifest with this ‘tests/basic.sh’ failure:

--8<---------------cut here---------------start------------->8---
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'
+ sleep 0.5
+ herd -s t-socket-21679 status test-run-from-nonexistent-directory
+ grep 'exited with code 127'

[…]

2025-03-06 14:06:36 Service test-run-from-nonexistent-directory started.
2025-03-06 14:06:36 Failed to run "/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd": In procedure chdir: No such file or directory
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory running with value #< id: 22431 command: ("/gnu/store/3bg5qfsmjw6p7bh1xadarbaq246zis0d-coreutils-9.1/bin/pwd")>.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been started.
2025-03-06 14:06:36 Service test-run-from-nonexistent-directory has been disabled.
2025-03-06 14:11:51 Stopping service root...
--8<---------------cut here---------------end--------------->8---

What happens is that the service is not marked as “exited with code
127”; instead, it is marked as having exited with code 0:

--8<---------------cut here---------------start------------->8---
● Status of test-run-from-nonexistent-directory:
  It is stopped since 14:06:36 (37 seconds ago).
  Process exited successfully.
  It is disabled.
  Provides: test-run-from-nonexistent-directory
  Will not be respawned.
--8<---------------cut here---------------end--------------->8---

This is due to a race condition: the process terminates before its
service goes from ‘starting’ to ‘running’.

By the time the service controller calls ‘monitor-service-process’, the
process has already terminated, so the process monitor replies 0 to the
'await request because that process no longer exists.

Attached is a test that reproduces the problem.

Ludo’.

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename=terminate-before-running.sh

# GNU Shepherd --- Handling termination of a process before 'start' completes.
# Copyright © 2025 Ludovic Courtès
#
# This file is part of the GNU Shepherd.
#
# The GNU Shepherd is free software; you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or (at
# your option) any later version.
#
# The GNU Shepherd is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with the GNU Shepherd. If not, see .

shepherd --version
herd --version

socket="t-socket-$$"
conf="t-conf-$$"
log="t-log-$$"
pid="t-pid-$$"

herd="herd -s $socket"

trap "cat $log || true; rm -f $socket $conf $log; test -f $pid && kill \`cat $pid\` || true; rm -f $pid" EXIT

cat > "$conf" <

From: Ludovic Courtès
To: 76790@debbugs.gnu.org
Subject: Re: bug#76790: [Shepherd] Handling process termination before service is running
In-Reply-To: <87ikolx4zf.fsf@inria.fr> ("Ludovic Courtès"'s message of "Thu, 06 Mar 2025 21:45:56 +0100")
References: <87ikolx4zf.fsf@inria.fr>
Date: Sun, 09 Mar 2025 17:15:05 +0100
Message-ID: <87ldteqiye.fsf@gnu.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="

--=-=-=
Content-Type: text/plain; charset=utf-8

Ludovic Courtès wrote:

> This is due to a race condition: the process terminates before its
> service goes from ‘starting’ to ‘running’.
>
> By the time the service controller calls ‘monitor-service-process’, the
> process has already terminated, so the process monitor replies 0 to the
> 'await request because that process no longer exists.

Fixed in 88cc5a43bf04c13b00b15a8d93cb635d9b64713c by introducing an
atomic fork + monitor primitive.

For the record, I also considered a hacky but less intrusive solution:
adding support for “zombies” in the process monitor, whereby it would
record the status of terminated processes and give that (instead of 0)
when somebody awaits them (patch attached).

It’s fragile though because unlike real zombies, we cannot guarantee
that the PID won’t be reused in the meantime; it’s just unlikely. (PID
FDs would help, but that’s not portable so best if we can avoid it.)

Ludo’.
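
To illustrate why tying fork and monitoring together avoids the race,
here is a small shell sketch. It is only an analogy, not Shepherd code,
and the names spawn_then_probe, late_probe, and real_status are made up
for this example. The shell records a background child when it forks
it, so 'wait' can still report the real exit status after the child is
gone, whereas a late 'kill -0' existence probe, similar in spirit to
the (kill pid 0) check in the 'await handler, can at best tell that the
PID no longer exists.

```shell
#!/bin/sh
# Sketch only: an analogy for the race described above, not Shepherd code.
# spawn_then_probe, late_probe and real_status are hypothetical names.

spawn_then_probe ()
{
    # Start the command in the background; the shell registers the child
    # at fork time, much like an atomic "fork + monitor" step would.
    "$@" &
    child=$!

    # Give the child time to terminate before anyone asks about it.
    sleep 1

    # Late existence probe, analogous to (kill pid 0) in the 'await
    # handler: once the child is gone, it cannot tell how it exited.
    if kill -0 "$child" 2>/dev/null
    then late_probe=alive
    else late_probe=gone
    fi

    # The shell tracked the child from the start, so 'wait' still
    # returns its real exit status.
    real_status=0
    wait "$child" || real_status=$?
}

spawn_then_probe sh -c 'exit 127'
echo "late probe: $late_probe; status from wait: $real_status"
```

With "sh -c 'exit 127'" as the command, 'wait' should report 127 even
though the probe ran after the child terminated. The status comes from
the shell's own record of the child rather than from a system-wide PID
lookup, which is why PID reuse is not a concern in this sketch.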
--=-=-=
Content-Type: text/x-patch
Content-Disposition: inline

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index 221998b..fec773c 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -2345,25 +2345,35 @@ may never terminate, even after sending it SIGKILL---e.g., kthreadd on Linux."
                      (const #f)))
       (const #f)))
 
+(define %max-zombie-processes
+  ;; Number of processes recorded as "zombies" by the process
+  ;; monitor--processes that had no waiters when they terminated.
+  20)
+
 (define (process-monitor channel)
   "Run a process monitor that handles requests received over @var{channel}."
-  (let loop ((waiters vlist-null))
+  (let loop ((waiters vlist-null)
+             (zombies (ring-buffer %max-zombie-processes)))
     (match (get-message channel)
       (('handle-process-termination pid status)
-       ;; Notify any waiters.
-       (vhash-foldv* (lambda (waiter _)
-                       (put-message waiter status)
-                       #t)
-                     #t pid waiters)
-
-       ;; XXX: The call below is linear in the size of WAITERS, but WAITERS is
-       ;; usually empty or small.
-       (loop (vhash-fold (lambda (key value result)
-                           (if (= key pid)
-                               result
-                               (vhash-consv key value result)))
-                         vlist-null
-                         waiters)))
+       ;; Notify any waiters.  When there are none, record PID and STATUS in
+       ;; the ZOMBIES buffer in case somebody asks for it later--
+       (let ((zombie? (vhash-foldv* (lambda (waiter _)
+                                      (put-message waiter status)
+                                      #f)
+                                    #t pid waiters)))
+
+         ;; XXX: The call below is linear in the size of WAITERS, but WAITERS
+         ;; is usually empty or small.
+         (loop (vhash-fold (lambda (key value result)
+                             (if (= key pid)
+                                 result
+                                 (vhash-consv key value result)))
+                           vlist-null
+                           waiters)
+               (if zombie?
+                   (ring-buffer-insert (cons pid status) zombies)
+                   zombies))))
 
       (('spawn arguments service reply)
        ;; Spawn the command as specified by ARGUMENTS; send the spawn result
@@ -2378,22 +2388,31 @@ may never terminate, even after sending it SIGKILL---e.g., kthreadd on Linux."
          (put-message reply result)
          (match result
            (('exception . _)
-            (loop waiters))
+            (loop waiters zombies))
            (('success (pid))
-            (loop (vhash-consv pid reply waiters))))))
+            (loop (vhash-consv pid reply waiters) zombies)))))
 
       (('await pid reply)
        ;; Await the termination of PID and send its status on REPLY.
        (if (and (catch-system-error (kill pid 0))
                 (not (pseudo-process? pid)))
-           (loop (vhash-consv pid reply waiters))
-           (begin                           ;PID is gone or a pseudo-process
-             ;; This might be a race condition (the caller thinks PID is up
-             ;; when it's already dead) so log it.
-             (local-output (l10n "Awaiting PID ~a, which is already gone.")
-                           pid)
-             (put-message reply 0)
-             (loop waiters)))))))
+           (loop (vhash-consv pid reply waiters) zombies)
+           ;; PID is gone or a pseudo-process; in the former case, check if we
+           ;; have info about it in ZOMBIES (this is linear in the length of
+           ;; ZOMBIES, which is small.)
+           (let* ((zombies* (ring-buffer->list zombies))
+                  (status (assoc-ref zombies* pid)))
+             (unless status
+               ;; This might be a race condition (the caller thinks PID is up
+               ;; when it's already dead) so log it.
+               (local-output (l10n "Awaiting PID ~a, which is already gone.")
+                             pid))
+             (put-message reply (or status 0))
+             (loop waiters
+                   (if status
+                       (list->ring-buffer (alist-delete pid zombies*)
+                                          %max-zombie-processes)
+                       zombies))))))))
 
 (define spawn-process-monitor
   (essential-task-launcher 'process-monitor process-monitor))

--=-=-=--

From debbugs-submit-bounces@debbugs.gnu.org Sun Mar 09 12:15:52 2025
Date: Sun, 09 Mar 2025 17:15:15 +0100
Message-Id: <87jz8yqiy4.fsf@gnu.org>
To: control@debbugs.gnu.org
From: Ludovic Courtès
Subject: control message for bug #76790
MIME-version: 1.0
Content-type: text/plain; charset=utf-8

close 76790
quit

From unknown Thu Aug 14 17:27:05 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Mon, 07 Apr 2025 11:24:06 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator