From: Maxim Cournoyer
To: bug-guix
Subject: Shepherd doesn't seem to correctly handle waitpid itself
Date: Mon, 19 Sep 2022 00:29:44 -0400
Message-ID: <874jx4q953.fsf@gmail.com>

Hi,

I've tried to determine why a workaround in the jami-service-type is
required in the 'stop' slot to avoid failures in 'herd restart jami',
and haven't quite found the culprit, but it appears to me that:

1. waitpid is only called in one place in Shepherd, which is in the
handle-SIGCHLD procedure in (shepherd service), which does not
specifically wait for an exact PID but rather does:

(waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
in the case a system-error exception is thrown with an ECHILD or EINTR
error number.

This doesn't strike me as a strong guarantee that waitpid occurs when
stop is called, because:

1. It requires to be installed in the signal handlers for each
process, with something like:

--8<---------------cut here---------------start------------->8---
(unless %sigchld-handler-installed?
  (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
  (set! %sigchld-handler-installed? #t))
--8<---------------cut here---------------end--------------->8---

Done for fork+exec-command and make-inetd-forkexec-constructor, but not
for make-forkexec-constructor/container, AFAICT;

2. it has the WNOHANG flag, which means the stop simply does a kill then
the signal handling weakly (because of WNOHANG) waits on it, which means
the start may begin before the process was actually completely
terminated.

Here's a small reproducer to apply on our code base:

--8<---------------cut here---------------start------------->8---
modified   gnu/services/telephony.scm
@@ -685,13 +685,7 @@ (define (archive-name->username archive)

                    ;; Finally, return the PID of the daemon process.
                    daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))

 (define jami-service-type
   (service-type
--8<---------------cut here---------------end--------------->8---

Then run 'make check-system TESTS=jami-provisioning' to see new
failures, or if you want to investigate manually the system:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix system vm --no-grafts --no-offload --no-graphic \
  -e '(@@ (gnu tests telephony) %jami-os-provisioning)'
$ /gnu/store/rxi7c14hga62qslb0sr6nac9qnkxr0nn-run-vm.sh -m 1G -smp 4 \
  -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22

# Connect to the QEMU VM:
$ ssh root@localhost -p10022

root@jami ~# herd restart jami
Service jami has been stopped.
herd: exception caught while executing 'start' on service 'jami':
dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")
root@jami ~# herd status jami
Status of jami:
  It is stopped.
  It is enabled.
  Provides (jami).
  Requires (jami-dbus-session).
  Conflicts with ().
  Will be respawned.
root@jami ~# pgrep jami
--8<---------------cut here---------------end--------------->8---

Thanks,

Maxim

From: Josselin Poiret
To: Maxim Cournoyer, 57922@debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself
Date: Tue, 20 Sep 2022 09:31:57 +0200
In-Reply-To: <874jx4q953.fsf@gmail.com>
Message-ID: <87o7va33iq.fsf@jpoiret.xyz>

Hi Maxim,

Maxim Cournoyer writes:

> Hi,
>
> I've tried to determine why a workaround in the jami-service-type is
> required in the 'stop' slot to avoid failures in 'herd restart jami',
> and haven't quite found the culprit, but it appears to me that:
>
> 1. waitpid is only called in one place in Shepherd, which is in the
> handle-SIGCHLD procedure in (shepherd service), which does not
> specifically wait for an exact PID but rather does:
>
> (waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
> in the case a system-error exception is thrown with an ECHILD or EINTR
> error number.
>
> This doesn't strike me as a strong guarantee that waitpid occurs when
> stop is called, because:
>
> 1. It requires to be installed in the signal handlers for each
> process, with something like:
>
> --8<---------------cut here---------------start------------->8---
> (unless %sigchld-handler-installed?
>   (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>   (set! %sigchld-handler-installed? #t))
> --8<---------------cut here---------------end--------------->8---
>
> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
> for make-forkexec-constructor/container, AFAICT;

The signal handler is only installed once in PID 1 (in fact, you haven't
forked yet here), since it's the one that receives the SIGCHLD.  What I
don't understand that well is that this signal handler could be
installed only once when shepherd starts, right?
That way, it wouldn't need to depend on specific start actions being
chosen.

> 2. it has the WNOHANG flag, which means the stop simply does a kill then
> the signal handling weakly (because of WNOHANG) waits on it, which means
> the start may begin before the process was actually completely
> terminated.
>
> Here's a small reproducer to apply on our code base:
>
> --8<---------------cut here---------------start------------->8---
> modified   gnu/services/telephony.scm
> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>
>                    ;; Finally, return the PID of the daemon process.
>                    daemon-pid))
> -               (stop
> -                #~(lambda (pid . args)
> -                    (kill pid SIGKILL)
> -                    ;; Wait for the process to exit; this prevents overlapping
> -                    ;; processes when issuing 'herd restart'.
> -                    (waitpid pid)
> -                    #f))))))))
> +               (stop #~(make-kill-destructor))))))))
>
>  (define jami-service-type
>    (service-type
> --8<---------------cut here---------------end--------------->8---

The real problem here is not really the WNOHANG flag (you could remove
that and still get issues) but rather that the waitpid is run inside a
signal handler, which in Guile means that it's run through asyncs.  You
have no guarantees wrt. when asyncs run, so they could run after or in
the middle of the next action.

I also think make-kill-destructor should waitpid the processes it's
killing, as you're implying, and leave the signal handler only for
unexpected service crashes.
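
To make the timing concern concrete, here is a minimal Guile sketch,
independent of Shepherd and assuming only Guile's POSIX bindings
(primitive-fork, waitpid, WNOHANG).  The helper name spawn-sleeper is
made up for illustration.  It shows that a non-blocking waitpid returns
right away with PID 0 while the child is still running, whereas a
blocking waitpid returns only once the child has exited:

```scheme
;; Sketch only: plain Guile POSIX calls, not Shepherd code.
;; `spawn-sleeper' is a hypothetical helper used for illustration.
(define (spawn-sleeper seconds)
  "Fork a child that sleeps SECONDS and then exits with status 0.
Return the child's PID in the parent."
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin
          (sleep seconds)
          (primitive-exit 0))
        pid)))

(define pid (spawn-sleeper 1))

;; Non-blocking: with WNOHANG, waitpid does not wait.  While the child
;; is still running, the car of the returned pair is 0.
(define early (waitpid pid WNOHANG))

;; Blocking: returns only after the child has exited and been reaped.
(define late (waitpid pid))

(format #t "WNOHANG returned PID ~a; blocking call returned PID ~a, exit value ~a~%"
        (car early) (car late) (status:exit-val (cdr late)))
```

Anything that decides "the service is stopped" based on the first kind
of call alone has no guarantee that the process is gone yet.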

Best,
--
Josselin Poiret

From: Ludovic Courtès
To: Josselin Poiret
Cc: 57922@debbugs.gnu.org, Maxim Cournoyer
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself
Date: Fri, 23 Sep 2022 08:33:28 +0200
In-Reply-To: <87o7va33iq.fsf@jpoiret.xyz>
Message-ID: <87bkr6fvlz.fsf@gnu.org>

Hi,

Josselin Poiret writes:

> Maxim Cournoyer writes:

[...]

>> 1. It requires to be installed in the signal handlers for each
>> process, with something like:
>>
>> --8<---------------cut here---------------start------------->8---
>> (unless %sigchld-handler-installed?
>>   (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
>>   (set! %sigchld-handler-installed? #t))
>> --8<---------------cut here---------------end--------------->8---
>>
>> Done for fork+exec-command and make-inetd-forkexec-constructor, but not
>> for make-forkexec-constructor/container, AFAICT;
>
> The signal handler is only installed once in PID 1 (in fact, you haven't
> forked yet here), since it's the one that receives the SIGCHLD.

Right.

> What I don't understand that well is that this signal handler could be
> installed only once when shepherd starts, right?  That way, it wouldn't
> need to depend on specific start actions being chosen.

The SIGCHLD handler is installed lazily since
f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
here:

  https://issues.guix.gnu.org/27553

That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
on signalfd(2).  It’s from the main event loop in shepherd.scm that the
signal handler is called.

>> Here's a small reproducer to apply on our code base:
>>
>> --8<---------------cut here---------------start------------->8---
>> modified   gnu/services/telephony.scm
>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>
>>                    ;; Finally, return the PID of the daemon process.
>>                    daemon-pid))
>> -               (stop
>> -                #~(lambda (pid . args)
>> -                    (kill pid SIGKILL)
>> -                    ;; Wait for the process to exit; this prevents overlapping
>> -                    ;; processes when issuing 'herd restart'.
>> -                    (waitpid pid)
>> -                    #f))))))))
>> +               (stop #~(make-kill-destructor))))))))

I think the main difference between these two is that the first one uses
SIGKILL while the second one uses SIGTERM.

You could try #~(make-kill-destructor SIGKILL) to get the same effect.

(Another difference is that ‘make-kill-destructor’ kills the process
group, not just the process itself.)

Anyway, the key point is that shepherd takes care of calling ‘waitpid’
for its child processes (services).  If you call it yourself as in the
snippet above, you’re racing with shepherd; in the case above it
probably doesn’t make any difference though because it will consider
that the service is stopped in any case.

HTH!

Ludo’.
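
The race mentioned above can be sketched in plain Guile, outside of
Shepherd.  This is an illustration under stated assumptions, not
Shepherd's code: reap is a hypothetical helper, and the first call
merely stands in for whichever party reaps the child first.  Once a
child has been reaped, a second waitpid on the same PID fails with
ECHILD:

```scheme
;; Sketch only: `reap' is a hypothetical helper, not a Shepherd procedure.
(define (reap pid)
  "Wait for PID.  Return its exit value, or the symbol 'already-reaped
when the child was already collected by someone else (ECHILD)."
  (catch 'system-error
    (lambda ()
      (status:exit-val (cdr (waitpid pid))))
    (lambda args
      (if (= ECHILD (system-error-errno args))
          'already-reaped
          (apply throw args)))))

;; Fork a child that exits immediately with status 7.
(define pid
  (let ((child (primitive-fork)))
    (if (zero? child)
        (primitive-exit 7)
        child)))

;; The first waiter collects the exit status ...
(define winner (reap pid))
;; ... and the second one finds nothing left to wait for.
(define loser (reap pid))

(format #t "first waitpid: ~a, second waitpid: ~a~%" winner loser)
```

Which side ends up as the "loser" depends on scheduling, which is why
code that calls waitpid on a service PID has to tolerate ECHILD.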

From: Maxim Cournoyer
To: Ludovic Courtès
Cc: Josselin Poiret, 57922-done@debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself
Date: Fri, 23 Sep 2022 13:49:26 -0400
In-Reply-To: <87bkr6fvlz.fsf@gnu.org>
Message-ID: <875yhe9e1l.fsf@gmail.com>

tags 57922 +notabug
thanks

Hi Ludo!

Ludovic Courtès writes:

[...]

>> What I don't understand that well is that this signal handler could be
>> installed only once when shepherd starts, right?  That way, it wouldn't
>> need to depend on specific start actions being chosen.
>
> The SIGCHLD handler is installed lazily since
> f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
> here:
>
>   https://issues.guix.gnu.org/27553
>
> That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
> on signalfd(2).  It’s from the main event loop in shepherd.scm that the
> signal handler is called.

I had missed that, thanks for explaining.

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                    ;; Finally, return the PID of the daemon process.
>>>                    daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

You are right, the important difference was SIGTERM vs SIGKILL.  I
thought I had tried that.  The problem only shows itself in the
'jami-provisioning' system test, not the 'jami' one.

Marking this one as notabug and closing.

Thanks again!

Maxim

From: Maxim Cournoyer
To: Ludovic Courtès
Cc: Josselin Poiret, 57922@debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself
Date: Fri, 23 Sep 2022 23:32:52 -0400
In-Reply-To: <87bkr6fvlz.fsf@gnu.org>
Message-ID: <878rm98n17.fsf@gmail.com>

reopen 57922
tags 57922 -notabug
thanks

Hi again,

[...]

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                    ;; Finally, return the PID of the daemon process.
>>>                    daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

> You are right, the important difference was SIGTERM vs SIGKILL.  I
> thought I had tried that.  The problem only shows itself in the
> 'jami-provisioning' system test, not the 'jami' one.

> Marking this one as notabug and closing.

I think I spoke too soon.  SIGKILL does fix the problem when *not* using
waitpid explicitly, but when using waitpid explicitly, SIGTERM can be
used just fine.  In other words, this works:

--8<---------------cut here---------------start------------->8---
@@ -687,7 +687,7 @@ (define (archive-name->username archive)
                    daemon-pid))
                (stop
                 #~(lambda (pid . args)
-                    (kill pid SIGKILL)
+                    (kill pid SIGTERM)
                     ;; Wait for the process to exit; this prevents overlapping
                     ;; processes when issuing 'herd restart'.
                     (waitpid pid)
--8<---------------cut here---------------end--------------->8---

but this doesn't:

--8<---------------cut here---------------start------------->8---
@@ -685,13 +685,7 @@ (define (archive-name->username archive)

                    ;; Finally, return the PID of the daemon process.
                    daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))

 (define jami-service-type
--8<---------------cut here---------------end--------------->8---

when exercised with 'make check-system TESTS=jami-provisioning':

--8<---------------cut here---------------start------------->8---
This is the GNU system.  Welcome.
jami login: Jami Daemon 13.4.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

23:29:05.375         os_core_unix.c !pjlib 2.12.1 for POSIX initialized
shepherd: Service jami has been stopped.
Caught signal Terminated, terminating...
Some deprecated features have been used.  Set the environment
variable GUILE_WARN_DEPRECATED to "detailed" and rerun the
program to get more information.  Set it to "no" to suppress
this message.
Jami Daemon 13.4.0, by Savoir-faire Linux 2004-2019
https://jami.net/
[Video support enabled]
[Plugins support enabled]

One does not simply initialize the client: Another daemon is detected
/gnu/store/2vcv1fyqfyym2zcyf5bvbj1pcgbcc515-shepherd-marionette.scm:1:1718: ERROR:
  1. &action-exception-error:
      service: jami
      action: start
      key: misc-error
      args: (#f "~A ~S ~S ~S" (dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")) #f)
--8<---------------cut here---------------end--------------->8---

or manually through the test VM:

--8<---------------cut here---------------start------------->8---
$(./pre-inst-env guix system vm --no-graphic --no-grafts --no-offload \
  -e '(@@ (gnu tests telephony) %jami-os-provisioning)') \
  -m 1G -smp $(nproc) "-nic" user,model=virtio-net-pci,hostfwd=tcp::10022-:22
--8<---------------cut here---------------end--------------->8---

This leads me to believe that Shepherd does not block until the process
is actually dead to mark the process as stopped (it just waitpid on the
group pid with WNOHANG), which means it won't block if the child process
hasn't exited yet, if I'm correct.

When we are in the stop slot, we know for sure that the process should
terminate completely, hence it'd make sense to call 'waitpid' *without*
WNOHANG there, to avoid 'herd restart' from starting the service while
its stopped process is not done terminating.
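
One way to picture such a stop is the sketch below.  It is only an
illustration in plain Guile: terminate-and-wait and its #:attempts
parameter are hypothetical names, not part of Shepherd, and inside
shepherd itself an explicit waitpid would compete with shepherd's own
reaping, as noted earlier in this thread.  It sends SIGTERM, polls with
WNOHANG for a bounded time, and falls back to SIGKILL followed by a
blocking waitpid:

```scheme
;; Sketch only: a hypothetical helper, not a Shepherd procedure.
(use-modules (ice-9 match))

(define* (terminate-and-wait pid #:key (attempts 50))
  "Send SIGTERM to PID and poll every 100 ms, at most ATTEMPTS times,
for it to exit.  If it is still alive after that, send SIGKILL and block
until it is gone.  Return the wait status."
  (kill pid SIGTERM)
  (let loop ((left attempts))
    (match (waitpid pid WNOHANG)
      ((0 . _)
       ;; PID 0 means the child has not exited yet.
       (cond ((> left 0)
              (usleep 100000)
              (loop (- left 1)))
             (else
              (kill pid SIGKILL)
              (cdr (waitpid pid)))))
      ((_ . status)
       status))))

;; Usage: a child that would otherwise sleep for 30 seconds.
(define child
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin (sleep 30) (primitive-exit 0))
        pid)))

(define status (terminate-and-wait child))
(format #t "child terminated by signal ~a~%" (status:term-sig status))
```

Bounding the wait keeps a slow or stuck daemon from blocking the caller
forever, at the cost of a forced kill once the grace period runs out.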
jamid can take quite some time to terminate cleanly because of the networking threads in the opendht library that needs to be finalized, which is probably the reason this problem can be observed here. Thoughts? Maxim From debbugs-submit-bounces@debbugs.gnu.org Sat Sep 24 04:09:07 2022 Received: (at 57922) by debbugs.gnu.org; 24 Sep 2022 08:09:07 +0000 Received: from localhost ([127.0.0.1]:42243 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oc0Di-0007xM-Op for submit@debbugs.gnu.org; Sat, 24 Sep 2022 04:09:06 -0400 Received: from jpoiret.xyz ([206.189.101.64]:40644) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oc0Dg-0007xD-Ul for 57922@debbugs.gnu.org; Sat, 24 Sep 2022 04:09:05 -0400 Received: from authenticated-user (jpoiret.xyz [206.189.101.64]) by jpoiret.xyz (Postfix) with ESMTPA id 30AA1185310; Sat, 24 Sep 2022 08:09:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jpoiret.xyz; s=dkim; t=1664006941; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=talIB2iV9WeuRSS20bV8CkxBo8H3SOAfTcO7cdcc5qU=; b=SHlRxd6OVp7KWNNqtGToTvlCYm2Y1KWNqL0XKaOc0h26AFW3EGEvX+ygG55h4pdioAq4SX Hrnykmu+v+D3y6mqzfWU4OKzeG63yp10F9DacxSeN7Ja1AoRSCzaRcgjhpGji3OK5gzplL 0rnoVZpOUWHTsWKHdUfvGSswrFC5JdAjnBMFAF0S/6UBuxOD5sszEwK4+/T3jbbwaEtoP1 b01M3n36ze+pUyOk+gcuRcSeARs0kLGJEsfqKyehwUM6EZ8w+rCINUQzk5rgNE29aLXSaA /xpla6e5U4HZvBW5V/Cci87384kEey1TKhsjjjOytjW0dMI43tOADbA75GL01A== From: Josselin Poiret To: Maxim Cournoyer , Ludovic =?utf-8?Q?Court?= =?utf-8?Q?=C3=A8s?= Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself In-Reply-To: <878rm98n17.fsf@gmail.com> References: <874jx4q953.fsf@gmail.com> <87o7va33iq.fsf@jpoiret.xyz> <87bkr6fvlz.fsf@gnu.org> <878rm98n17.fsf@gmail.com> Date: Sat, 24 Sep 2022 10:09:00 +0200 Message-ID: <87sfkh8a8z.fsf@jpoiret.xyz> 
MIME-Version: 1.0 Content-Type: text/plain Authentication-Results: jpoiret.xyz; auth=pass smtp.auth=jpoiret@jpoiret.xyz smtp.mailfrom=dev@jpoiret.xyz X-Spamd-Bar: / X-Spam-Score: 2.0 (++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Hi everyone, Maxim Cournoyer writes: > This leads me to believe that Shepherd does not block until the process > is actually dead to mark the process as stopped (it just waitpid on the > group pid with WNOHANG), which means it won't bloc [...] Content analysis details: (2.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_PASS SPF: sender matches SPF record 2.0 PDS_OTHER_BAD_TLD Untrustworthy TLDs [URI: jpoiret.xyz (xyz)] -0.0 SPF_HELO_PASS SPF: HELO matches SPF record 0.0 FROM_SUSPICIOUS_NTLD From abused NTLD X-Debbugs-Envelope-To: 57922 Cc: 57922@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit"
Hi everyone, Maxim Cournoyer writes: > This leads me to believe that Shepherd does not block until the process > is actually dead before marking the process as stopped (it just calls waitpid on the > group pid with WNOHANG), which means it won't block if the child process > hasn't exited yet, if I'm correct. > > When we are in the stop slot, we know for sure that the process should > terminate completely, hence it'd make sense to call 'waitpid' *without* > WNOHANG there, to prevent 'herd restart' from starting the service while > its stopped process is not done terminating. > > jamid can take quite some time to terminate cleanly because of the > networking threads in the opendht library that needs to be finalized, > which is probably the reason this problem can be observed here. > > Thoughts? I agree with you, make-kill-destructor should waitpid the processes it's killing. There shouldn't be any issues waitpid'ing before the shepherd's signal handler, since stop actions are run with asyncs disabled. The signal handler will run once but won't get anything because all the processes were already waitpid'd and it uses WNOHANG.
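As a rough sketch of that idea, and assuming nothing about how Shepherd implements make-kill-destructor internally, a destructor that waits could look like the following. The names 'make-waiting-kill-destructor' and 'demo-destructor' are hypothetical, and the code is plain Guile rather than a patch against Shepherd.

```scheme
;; Hypothetical variant of a kill destructor that reaps the process before
;; returning; this is NOT Shepherd's actual 'make-kill-destructor'.
(define* (make-waiting-kill-destructor #:optional (signal SIGTERM))
  (lambda (pid . args)
    (kill pid signal)
    ;; Block until PID has exited, so that a subsequent 'start' cannot
    ;; overlap with the old process.
    (waitpid pid)
    ;; Return #f, as the hand-written 'stop' procedure in the diff does.
    #f))

;; Try it on a throwaway child process that would otherwise sleep a minute.
(define (demo-destructor)
  (let ((pid (primitive-fork)))
    (if (zero? pid)
        (begin
          (sleep 60)
          (primitive-_exit 0))
        ((make-waiting-kill-destructor) pid))))
```

Whether this is safe inside Shepherd itself depends on the point made above: the blocking 'waitpid' has to run before the SIGCHLD handler gets a chance to reap the process.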
Best, -- Josselin Poiret From debbugs-submit-bounces@debbugs.gnu.org Sat Sep 24 12:30:20 2022 Received: (at 57922) by debbugs.gnu.org; 24 Sep 2022 16:30:20 +0000 Received: from localhost ([127.0.0.1]:45112 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oc82m-0004ZQ-Cm for submit@debbugs.gnu.org; Sat, 24 Sep 2022 12:30:20 -0400 Received: from eggs.gnu.org ([209.51.188.92]:51750) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1oc82l-0004EQ-1p for 57922@debbugs.gnu.org; Sat, 24 Sep 2022 12:30:19 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:42760) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oc82f-0002vl-O1; Sat, 24 Sep 2022 12:30:13 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:In-Reply-To:Date:References:Subject:To: From; bh=got81TcrqCy98JpJ+p0OHfPAiY4xP2grH0geQAy4Xz4=; b=PQ5IgUkyodJVm2GBRwTn RXQTMHT/vhZp9nr8zTEb7F8PAhJTJVPGJ4xmeQeYmu7rEjJf4Cn/Gm8Ax3hi2fonL81rYRyA7QJsZ BbrwAlonOetQvGFn6v9TKxfEj4mQol4b7CRgcCREA1BARdyUXZoV7At7ZMrjYaC2Jaw1Nw8RR79t2 hCuuOOQkzH9CMJ553rWZhoON/n2tOfzPwqDAfmfzYobV8k38t1Mqo86jJ7W8lm5Oflr9qGb3tmnJe SIIRJFtZ9Am/Wn0l7VJ/JQRRf7LHyGR6j58Dg+y/7ju9b0lKPO000wz+bVX4eZ6BGKDFXMctrEFMH A5pTvKvjxxlvQQ==; Received: from 91-160-117-201.subs.proxad.net ([91.160.117.201]:63645 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1oc82c-0001gd-2G; Sat, 24 Sep 2022 12:30:13 -0400 From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: Josselin Poiret Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself References: <874jx4q953.fsf@gmail.com> <87o7va33iq.fsf@jpoiret.xyz> <87bkr6fvlz.fsf@gnu.org> <878rm98n17.fsf@gmail.com> <87sfkh8a8z.fsf@jpoiret.xyz> X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: Tridi 3 =?utf-8?Q?Vend=C3=A9miaire?= an 231 de la 
=?utf-8?Q?R=C3=A9volution=2C?= jour de la =?utf-8?Q?Ch=C3=A2taigne?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Sat, 24 Sep 2022 18:30:07 +0200 In-Reply-To: <87sfkh8a8z.fsf@jpoiret.xyz> (Josselin Poiret's message of "Sat, 24 Sep 2022 10:09:00 +0200") Message-ID: <87zgeo68hc.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -0.3 (/) X-Debbugs-Envelope-To: 57922 Cc: 57922@debbugs.gnu.org, Maxim Cournoyer X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.3 (-) Hi, Josselin Poiret skribis: > Maxim Cournoyer writes: > >> This leads me to believe that Shepherd does not block until the process >> is actually dead before marking the process as stopped (it just calls waitpid on the >> group pid with WNOHANG), which means it won't block if the child process >> hasn't exited yet, if I'm correct. Correct: the service is marked as stopped as soon as ‘stop’ returns. >> When we are in the stop slot, we know for sure that the process should >> terminate completely, hence it'd make sense to call 'waitpid' *without* >> WNOHANG there, to prevent 'herd restart' from starting the service while >> its stopped process is not done terminating. >> >> jamid can take quite some time to terminate cleanly because of the >> networking threads in the opendht library that needs to be finalized, >> which is probably the reason this problem can be observed here. >> >> Thoughts? > > I agree with you, make-kill-destructor should waitpid the processes it's > killing.
There shouldn't be any issues waitpid'ing before the > shepherd's signal handler, since stop actions are run with asyncs > disabled. The signal handler will run once but won't get anything > because all the processes were already waitpid'd and it uses WNOHANG. I think we need an extra “stopping” state for services. In general, we’ll want to send SIGTERM, wait for some grace period or dead process notification, then send SIGKILL, and finally change state to “stopped”. This is not possible in 0.9 but is something I’d like to have in 0.10¹. Ludo’. ¹ https://lists.gnu.org/archive/html/guix-devel/2022-06/msg00350.html From debbugs-submit-bounces@debbugs.gnu.org Sun Sep 25 20:12:23 2022 Received: (at 57922) by debbugs.gnu.org; 26 Sep 2022 00:12:23 +0000 Received: from localhost ([127.0.0.1]:48724 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ocbjT-0001CA-FR for submit@debbugs.gnu.org; Sun, 25 Sep 2022 20:12:23 -0400 Received: from mail-qv1-f53.google.com ([209.85.219.53]:39494) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ocbjN-0001Bt-G8 for 57922@debbugs.gnu.org; Sun, 25 Sep 2022 20:12:21 -0400 Received: by mail-qv1-f53.google.com with SMTP id c6so3453601qvn.6 for <57922@debbugs.gnu.org>; Sun, 25 Sep 2022 17:12:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:mime-version:user-agent:message-id :in-reply-to:date:references:subject:cc:to:from:from:to:cc:subject :date; bh=/lD9T+YU39Xm2ZAkqduxh4bIOb/HECvfCUpzrUYo9IE=; b=Ypzv9pQ1qf0oJX1b/J54ZsZ+F8ggFEf/RdvTnFYAB7SCcD41BayU0v2MHF+BmV+4fr Z4SWubsjBX6Wid8r7hg1UayBJJArna7ehuM6PpE3+7aXgqNhgR5jXCLvPVpteiXklo52 KMoynGalWN4rXdh+WwG7xkcL4XjnfI6BvjuwNUq0NvKEALkgBC0urEDLGj+fe1XgeJtA cRRQ8cq34N6Tu0vo3I1rwaF07hdLpdU3xpqDwhYy7YwQtDAv2b4M3Wh7RpBaR7PhGXYb njmWEzOKPwQ4RJG2m+QGwl+ljf3RuRdrLrJ5N/Ny1W3Yoz+OMX0vYIZ1Fs7AfYOg3wEt AecA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:user-agent:message-id :in-reply-to:date:references:subject:cc:to:from:x-gm-message-state :from:to:cc:subject:date; bh=/lD9T+YU39Xm2ZAkqduxh4bIOb/HECvfCUpzrUYo9IE=; b=v91G5iNxz3Kde0pJCTM7vVO03MvsehCDfrHBvV4Spy3Vrn84lpJ/vDOIGRyR1VhCuM WgKgYHdbtUeZCOiA1UhZDgiuKpwuZf1JofFTGKqqOply4Q9Gx6cXLfK4q2e8O2xHemJu nAZn7DmXf/qouMCadj9S0kubGatasdUdgoC6c5Ft3B2JB14lOvbSF46AkoGC1K2+pf0B kc32yLR1tsCpeTaXEJinYi4j9ndNME3BmwN/mJN3d16QBxNIc0KkAB+a2gn3jpuqVVhI x5Oks5jw1FxqMFv268DUnCOzfCZNOEhdRE3FhlybyEu+3y+QAgxmaVRFWhq+eY/Ya6He QkaQ== X-Gm-Message-State: ACrzQf3TzGfFvEnCbXrq+wi3oEnNpbUedpJrLimCfkKkTy21kYztZ3hV hxbitKwJde1k1dZTxViHJP7lvfim4C8= X-Google-Smtp-Source: AMsMyM4uQUvGzqzUJYYAVtL2g/bmqkc06QU/vIU4r0cQgoNYYB0hf1ri0ErcwqgrUSMbCrPVaZs/ag== X-Received: by 2002:a0c:ec46:0:b0:4a7:509:386e with SMTP id n6-20020a0cec46000000b004a70509386emr15332891qvq.61.1664151131702; Sun, 25 Sep 2022 17:12:11 -0700 (PDT) Received: from hurd (dsl-10-132-99.b2b2c.ca. 
[72.10.132.99]) by smtp.gmail.com with ESMTPSA id o17-20020a05622a045100b003447c4f5aa5sm10741698qtx.24.2022.09.25.17.12.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 25 Sep 2022 17:12:10 -0700 (PDT) From: Maxim Cournoyer To: Ludovic =?utf-8?Q?Court=C3=A8s?= Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid itself References: <874jx4q953.fsf@gmail.com> <87o7va33iq.fsf@jpoiret.xyz> <87bkr6fvlz.fsf@gnu.org> <878rm98n17.fsf@gmail.com> <87sfkh8a8z.fsf@jpoiret.xyz> <87zgeo68hc.fsf@gnu.org> Date: Sun, 25 Sep 2022 20:12:09 -0400 In-Reply-To: <87zgeo68hc.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Sat, 24 Sep 2022 18:30:07 +0200") Message-ID: <87leq76lk6.fsf@gmail.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 2.0 (++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Hi, Ludovic Courtès writes: > Hi, > > Josselin Poiret skribis: > >> Maxim Cournoyer writes: >> >>> This leads me to believe that Shepherd does not block until the process >>> is actually dead to mark the process as stopped ( [...] 
Content analysis details: (2.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_PASS SPF: sender matches SPF record 2.0 PDS_OTHER_BAD_TLD Untrustworthy TLDs [URI: jpoiret.xyz (xyz)] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (maxim.cournoyer[at]gmail.com) 0.0 SPF_HELO_NONE SPF: HELO does not publish an SPF Record -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at https://www.dnswl.org/, no trust [209.85.219.53 listed in list.dnswl.org] -0.0 RCVD_IN_MSPIKE_H2 RBL: Average reputation (+2) [209.85.219.53 listed in wl.mailspike.net] X-Debbugs-Envelope-To: 57922 Cc: Josselin Poiret , 57922@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+) Hi, Ludovic Courtès writes: > Hi, > > Josselin Poiret skribis: > >> Maxim Cournoyer writes: >> >>> This leads me to believe that Shepherd does not block until the process >>> is actually dead before marking the process as stopped (it just calls waitpid on the >>> group pid with WNOHANG), which means it won't block if the child process >>> hasn't exited yet, if I'm correct. > > Correct: the service is marked as stopped as soon as ‘stop’ returns. > >>> When we are in the stop slot, we know for sure that the process should >>> terminate completely, hence it'd make sense to call 'waitpid' *without* >>> WNOHANG there, to prevent 'herd restart' from starting the service while >>> its stopped process is not done terminating. >>> >>> jamid can take quite some time to terminate cleanly because of the >>> networking threads in the opendht library that needs to be finalized, >>> which is probably the reason this problem can be observed here. >>> >>> Thoughts?
>> >> I agree with you, make-kill-destructor should waitpid the processes it's >> killing. There shouldn't be any issues waitpid'ing before the >> shepherd's signal handler, since stop actions are run with asyncs >> disabled. The signal handler will run once but won't get anything >> because all the processes were already waitpid'd and it uses WNOHANG. > > I think we need an extra “stopping” state for services. In general, > we’ll want to send SIGTERM, wait for some grace period or dead process > notification, then send SIGKILL, and finally change state to “stopped”. > > This is not possible in 0.9 but is something I’d like to have in 0.10¹. This sounds good. Let's keep this ticket open until this goodness lands, as a reminder. Thank you! Maxim From unknown Tue Jun 17 22:00:33 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Mon, 24 Oct 2022 11:24:05 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator