From debbugs-submit-bounces@debbugs.gnu.org Fri May 26 08:14:33 2023
From: Ludovic Courtès
To: bug-guix@gnu.org
Subject: [Shepherd] Loss of SIGCHLD notifications
Date: Fri, 26 May 2023 14:14:09 +0200
Message-ID: <87353jb88e.fsf@inria.fr>

I experienced, with the Shepherd 0.10.0, a situation where ‘herd
restart’ would get stuck after the end-of-grace-period expiration
(restarting nscd in this case):

--8<---------------cut here---------------start------------->8---
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Status of nscd:
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is running since 08:43:57 (36 seconds ago).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Running value is 23753.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] It is enabled.
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Provides (nscd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Requires (user-processes syslogd).
May 26 08:44:33 localhost shepherd[1]: [NetworkManager] Will be respawned.
May 26 08:44:33 localhost shepherd[1]: Stopping service nscd...
[...]
May 26 08:44:38 localhost shepherd[1]: Grace period of 5 seconds is over; sending -23753 SIGKILL.
--8<---------------cut here---------------end--------------->8---

The ‘herd restart’ process is indeed waiting, and the nscd process is
still around as a zombie, meaning that shepherd didn’t call waitpid(2):

--8<---------------cut here---------------start------------->8---
$ sudo herd status nscd
Status of nscd:
  It is being stopped.
  It is enabled.
  Provides (nscd).
  Requires (user-processes syslogd).
  Will be respawned.
$ ps aux|grep nscd
root     23753  0.0  0.0      0     0 ?        Zs   08:43   0:00 [nscd]
root     23870  0.0  0.1  49968 18088 ?        Sl   08:44   0:00 /gnu/store/4gvgcfdiz67wv04ihqfa8pqwzsb0qpv5-guile-3.0.9/bin/guile --no-auto-compile /run/current-system/profile/bin/herd restart nscd
ludo     25280 33.3  0.0   6112  2304 pts/0    S+   09:32   0:00 grep --color=auto nscd
--8<---------------cut here---------------end--------------->8---

At that point, if I pick another service process and kill it, shepherd
again fails to react to SIGCHLD (or doesn’t receive it):

--8<---------------cut here---------------start------------->8---
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
$ sudo kill 23814
$ sudo herd status ntpd
Status of ntpd:
  It is running since 08:44:31 AM (52 minutes ago).
  Running value is 23814.
  It is enabled.
  Provides (ntpd).
  Requires (user-processes networking).
  Will be respawned.
ludo@ribbon ~/src/guix$ ps 23814
  PID TTY      STAT   TIME COMMAND
23814 ?        Zs     0:00 [ntpd]
--8<---------------cut here---------------end--------------->8---

If I repeat the same experiment while stracing shepherd, here’s what I
see:

--8<---------------cut here---------------start------------->8---
$ sudo herd status guix-publish
Status of guix-publish:
  It is running since 08:44:32 AM (55 minutes ago).
  Running value is 23822.
  It is enabled.
  Provides (guix-publish).
  Requires (user-processes guix-daemon avahi-daemon).
  Will be respawned.
$ sudo kill 23822
$ ps 23822
  PID TTY      STAT   TIME COMMAND
23822 ?        Zs     0:02 [guix publish]
--8<---------------cut here---------------end--------------->8---

… and the strace:

--8<---------------cut here---------------start------------->8---
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLIN, data={u32=14, u64=14}}], 8, -1) = 1
1 accept4(14, {sa_family=AF_UNIX}, [112 => 2], SOCK_CLOEXEC|SOCK_NONBLOCK) = 25
1 accept4(14, 0x7ffe6f8f0be0, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable)
1 epoll_ctl(13, EPOLL_CTL_MOD, 14, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=14, u64=14}}) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 read(25, "(shepherd-command (version 0) (action status) (service guix-publish) (arguments ()) (directory \"/home/ludo/src/guix\"))", 1024) = 118
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 epoll_wait(13, [], 8, 0) = 0
1 write(25, "(reply (version 0) (result ((service (version 0) (provides (guix-publish)) (requires (user-processes guix-daemon avahi-daemon)) (respawn? #t) (docstring \"[No documentation.]\") (enabled? #t) (running 23822) (conflicts ()) (last-respawns ()) (status-changes ((running . 1685083472) (starting . 1685083472) (stopped . 1685083469) (stopping . 1685083469) (running . 1684924319) (starting . 1684924319) (stopped . 1684924319) (stopping . 1684924319) (running . 1684853459) (starting . 1684853459) (stopped"..., 686) = 686
1 close(25) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13, [{events=EPOLLHUP, data={u32=30, u64=30}}], 8, -1) = 1
1 read(30, "", 4096) = 0
1 close(30) = 0
1 close(36) = 0
1 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
1 epoll_wait(13,
--8<---------------cut here---------------end--------------->8---

The signal FD still exists but we didn’t see any activity on it:

--8<---------------cut here---------------start------------->8---
$ sudo ls -lrt /proc/1/fd |grep signalfd
lrwx------ 1 root root 64 May 26 09:41 10 -> anon_inode:[signalfd]
--8<---------------cut here---------------end--------------->8---

Here’s the signal handling status (notice two signals queued):

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/1/status|grep -i Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: 0000000000014003
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

With:

  (define (signal-mask->list mask)
    (let loop ((sig 32)
               (lst '()))
      (if (zero? sig)
          lst
          (loop (- sig 1)
                (if (bit-set?
                     (- sig 1) mask)
                    (cons sig lst)
                    lst)))))

… we see that the block mask is equal to ‘%precious-signals’ (good):

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000014003)
$33 = (1 2 15 17)
--8<---------------cut here---------------end--------------->8---

… while the ignored mask lists SIGPIPE:

--8<---------------cut here---------------start------------->8---
scheme@(shepherd system)> (signal-mask->list #x0000000000001000)
$32 = (13)
--8<---------------cut here---------------end--------------->8---

However, the process also has 3 GC marker threads and 1 finalization
thread:

--8<---------------cut here---------------start------------->8---
(gdb) info threads
  Id   Target Id                                      Frame
* 1    Thread 0x7fdbd530f380 (LWP 1) "shepherd"       0x00007fdbd5415626 in epoll_wait (epfd=13,
    events=0x7fdbd203a1a0, maxevents=8, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
  2    Thread 0x7fdbd4cd9640 (LWP 118) "GC-marker-0"  __futex_abstimed_wait_common64 (private=0,
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 ) at futex-internal.c:57
  3    Thread 0x7fdbd44d8640 (LWP 119) "GC-marker-1"  __futex_abstimed_wait_common64 (private=0,
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 ) at futex-internal.c:57
  4    Thread 0x7fdbd3cd7640 (LWP 120) "GC-marker-2"  __futex_abstimed_wait_common64 (private=0,
    cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x7fdbd5857be8 ) at futex-internal.c:57
  5    Thread 0x7fdbd3389640 (LWP 123) "shepherd"     __GI___libc_read (nbytes=1,
    buf=0x7fdbd3388620, fd=6) at ../sysdeps/unix/sysv/linux/read.c:26
--8<---------------cut here---------------end--------------->8---

These threads seem to be blocking every signal, which is good:

--8<---------------cut here---------------start------------->8---
$ sudo cat /proc/118/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/119/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/120/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5f7bfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
$ sudo cat /proc/123/status |grep ^Sig
SigQ:   2/63467
SigPnd: 0000000000000000
SigBlk: fffffffe5ffbfeff
SigIgn: 0000000000001000
SigCgt: 0000000120814423
--8<---------------cut here---------------end--------------->8---

Then all of a sudden, as I’m conducting this experiment, I see:

--8<---------------cut here---------------start------------->8---
May 26 09:36:37 localhost ntpd[23814]: ntpd exiting on signal 15 (Terminated)
[…]
May 26 10:52:10 localhost shepherd[1]: Respawning nscd.
May 26 10:52:10 localhost shepherd[1]: Service guix-publish (PID 23822) terminated with signal 15.
May 26 10:52:10 localhost shepherd[1]: Respawning ntpd.
May 26 10:52:10 localhost shepherd[1]: Respawning guix-publish.
May 26 10:52:10 localhost shepherd[1]: Starting service nscd...
May 26 10:52:10 localhost shepherd[1]: Starting service ntpd...
May 26 10:52:10 localhost shepherd[1]: Service ntpd has been started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd started.
May 26 10:52:10 localhost shepherd[1]: Service ntpd running with value 29148.
--8<---------------cut here---------------end--------------->8---

10:52 is about the time I attached gdb to shepherd…

Interestingly, we get a “terminated with signal 15” message from
shepherd for guix-publish, but not for ntpd.

Long story short: there seems to be a problem with signal delivery.
Most likely, the initial grace period expiration above when stopping
nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
rather than the cause.
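
A side note on the per-thread masks listed earlier in this message:
‘signal-mask->list’ only looks at the first 32 signals, whereas the
‘SigBlk’ values of the helper threads are 64 bits wide.  The following
is a minimal sketch of the complementary check, listing the signals a
thread does not block.  ‘unblocked-signals’ is a hypothetical helper
written for this note; it is not part of the Shepherd.

```scheme
;; Hypothetical helper (not from the Shepherd): list the signal numbers
;; in 1..64 whose bit is clear in MASK, i.e. the signals a thread does
;; NOT block.  'logbit?' is a Guile core procedure.
(define (unblocked-signals mask)
  (let loop ((sig 64)
             (lst '()))
    (if (zero? sig)
        lst
        (loop (- sig 1)
              (if (logbit? (- sig 1) mask)
                  lst
                  (cons sig lst))))))

;; SigBlk of the three GC marker threads (LWP 118 to 120).
(unblocked-signals #xfffffffe5f7bfeff)   ; => (9 19 24 30 32 33)

;; SigBlk of the finalization thread (LWP 123).
(unblocked-signals #xfffffffe5ffbfeff)   ; => (9 19 30 32 33)
```

Signals 9 (SIGKILL) and 19 (SIGSTOP) can never be blocked on Linux, so
they are expected in both lists.  SIGCHLD (17) is blocked in all four
threads, which is consistent with the observation that these threads
should not be stealing it.  What the remaining unblocked signals are
used for is not established in this report.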
To be continued…

Ludo’.

From debbugs-submit-bounces@debbugs.gnu.org Sat May 27 13:01:20 2023
From: Ludovic Courtès
To: 63736@debbugs.gnu.org
Subject: Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications
References: <87353jb88e.fsf@inria.fr>
In-Reply-To: <87353jb88e.fsf@inria.fr>
Date: Sat, 27 May 2023 19:00:55 +0200
Message-ID: <87wn0thfp4.fsf@gnu.org>

Ludovic Courtès wrote:

> Long story short: there seems to be a problem with signal delivery.
> Most likely, the initial grace period expiration above when stopping
> nscd is a symptom of shepherd no longer receiving/processing SIGCHLD
> rather than the cause.

Another possibility is lockup: one of the relevant fibers is either gone
or stuck in ‘put-message’ or ‘get-message’.

I did two things:

  b9a37f3 shepherd: Make signal handling fiber an essential task.
  8ae2780 service: Do not attempt to restart transient services.

Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
attempting to restart a transient service, which would lock up the
calling fiber because the service’s controlling fiber would first
receive the 'terminate message, so it would return and nobody would be
reading further messages sent on its channel.

Commit b9a37f3 will allow us to ensure that the signal-handling fiber
never exits (and we’ll get a trace in the log if it tries to).

Ludo’.
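
The lockup described for commit 8ae2780 can be illustrated outside the
Shepherd.  The following is only a sketch, assuming Guile Fibers is
installed.  The channel, the one-shot receiving fiber, and the ‘try-put’
helper are all made up for illustration and do not come from the
Shepherd’s code.  The receiving fiber returns after its first message,
much like a controlling fiber that returns after 'terminate; a second
‘put-message’ on the same channel would then wait forever, so the sketch
guards that send with a timeout instead.

```scheme
(use-modules (fibers)
             (fibers channels)
             (fibers operations)
             (fibers timers))

;; Hypothetical helper: try to send MESSAGE on CHANNEL, giving up after
;; SECONDS.  Return 'delivered or 'timeout.
(define (try-put channel message seconds)
  (perform-operation
   (choice-operation
    (wrap-operation (put-operation channel message)
                    (lambda _ 'delivered))
    (wrap-operation (sleep-operation seconds)
                    (lambda _ 'timeout)))))

(define result
  (run-fibers
   (lambda ()
     (let ((channel (make-channel)))
       ;; Stand-in for a controlling fiber that handles a single
       ;; message and then returns.
       (spawn-fiber (lambda () (get-message channel)))
       ;; First message: the receiver is still there, so this returns.
       (put-message channel 'terminate)
       ;; Second message: nobody reads CHANNEL anymore.
       (try-put channel 'start 1)))))
```

With a plain ‘put-message’ in place of ‘try-put’, the calling fiber
would never resume, which is the kind of hang that would leave ‘herd
restart’ waiting indefinitely.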
From debbugs-submit-bounces@debbugs.gnu.org Sun May 28 05:25:33 2023
Date: Sun, 28 May 2023 11:25:03 +0200
Message-Id: <87o7m4hkpc.fsf@gnu.org>
To: control@debbugs.gnu.org
From: Ludovic Courtès
Subject: control message for bug #63736

severity 63736 important
quit

From debbugs-submit-bounces@debbugs.gnu.org Thu Nov 23 15:47:03 2023
From: Ludovic Courtès
To: 63736-done@debbugs.gnu.org
Subject: Re: bug#63736: [Shepherd] Loss of SIGCHLD notifications
In-Reply-To: <87wn0thfp4.fsf@gnu.org>
References: <87353jb88e.fsf@inria.fr> <87wn0thfp4.fsf@gnu.org>
Date: Thu, 23 Nov 2023 21:46:49 +0100
Message-ID: <87o7fk8afa.fsf@gnu.org>

Ludovic Courtès wrote:

> Another possibility is lockup: one of the relevant fibers is either gone
> or stuck in ‘put-message’ or ‘get-message’.
>
> I did two things:
>
>   b9a37f3 shepherd: Make signal handling fiber an essential task.
>   8ae2780 service: Do not attempt to restart transient services.
>
> Commit 8ae2780 fixes a bug whereby ‘herd restart’ could end up
> attempting to restart a transient service, which would lock up the
> calling fiber because the service’s controlling fiber would first
> receive the 'terminate message, so it would return and nobody would be
> reading further messages sent on its channel.
>
> Commit b9a37f3 will allow us to ensure that the signal-handling fiber
> never exits (and we’ll get a trace in the log if it tries to).

Apparently these commits, which made it into 0.10.1 months ago, fixed
this particular bug.

Closing!

Ludo’.

From unknown Mon Aug 18 17:59:11 2025
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Fri, 22 Dec 2023 12:24:08 +0000

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator