Package: guix;
Reported by: Andreas Enge <andreas <at> enge.fr>
Date: Mon, 31 Mar 2025 16:04:03 UTC
Severity: important
Done: Ludovic Courtès <ludo <at> gnu.org>
Bug is archived. No further changes may be made.
View this message in rfc822 format
From: Ludovic Courtès <ludo <at> gnu.org> To: Andreas Enge <andreas <at> enge.fr> Cc: 77407 <at> debbugs.gnu.org, mail <at> cbaines.net Subject: bug#77407: shepherd hangs at "herd restart nginx" Date: Wed, 09 Apr 2025 18:57:08 +0200
Hi! I just realized: this may well be fixed by Shepherd commit a5d9a41aa0bfadea9617df4411b86c6ea0c525bd (see <https://issues.guix.gnu.org/77373>.) Would be great to give it a try somehow. Anyway, some more data. Yesterday I ran ‘herd stop nginx’ on bayfront (which was running 1.0.2), again causing ‘herd status’ to hang afterwards. This time I captured an strace log of PID 1. The /var/log/messages showing what happens after ‘herd stop nginx’ --8<---------------cut here---------------start------------->8--- 2025-04-08 11:28:27 localhost shepherd[1]: Stopping service certbot-certificate-renewal... 2025-04-08 11:28:27 localhost shepherd[1]: Terminating timer 'certbot-certificate-renewal' with 0 processes running. 2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal stopped. 2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal is now stopped. 2025-04-08 11:28:27 localhost shepherd[1]: Stopping service nginx... 2025-04-08 11:28:27 localhost shepherd[1]: Registering new logger for nginx. 2025-04-08 11:28:28 localhost shepherd[1]: [nginx] nginx: [alert] could not open error log file: open() "/gnu/store/w2pi23rrk1dy95rzpb9l87gcj3c4jb38-nginx-1.27.3/logs/error.log" failed (2: No such file or directory) 2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type "text/html" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043 2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type "text/html" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1107 2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type "text/html" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:4100 2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: conflicting server name "bayfront.guix.gnu.org" on 0.0.0.0:80, ignored 2025-04-08 11:28:32 localhost shepherd[1]: Service nginx might have failed to stop. 2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: conflicting server name "logs.guix.gnu.org" on 0.0.0.0:80, ignored 2025-04-08 11:28:32 localhost shepherd[1]: Service nginx is now stopped. 2025-04-08 11:28:32 localhost shepherd[1]: Accepted connection on 0.0.0.0:22 from 101.35.215.239:37996. 2025-04-08 11:29:40 localhost shepherd[1]: 2 connections still in use after sshd-25913 termination. 2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 (PID 12257) exited with 3. 2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 has been disabled. 2025-04-08 11:32:37 localhost shepherd[1]: Accepted connection on [::]:22 from [2a0c:e300::58]:40428. --8<---------------cut here---------------end--------------->8--- (The “might have failed to stop” bit is harmless and fixed in 3f4438741615aa1a1198f9415be8bd110138d83e.) On berlin, which does not exhibit the bug, we get something similar (for ‘herd restart nginx’ here): --8<---------------cut here---------------start------------->8--- 2025-04-07 16:11:25 localhost shepherd[1]: Stopping service certbot-renewal... 2025-04-07 16:11:25 localhost shepherd[1]: Terminating timer 'certbot-renewal' with 0 processes running. 2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal stopped. 2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal is now stopped. 2025-04-07 16:11:26 localhost shepherd[1]: Stopping service nginx... 2025-04-07 16:11:26 localhost shepherd[1]: Registering new logger for nginx. 2025-04-07 16:11:26 localhost shepherd[1]: [nginx] nginx: [alert] could not open error log file: open() "/gnu/store/40mb76jc9qbm0d36qyncnpgrgsfab0db-nginx-1.27.3/logs/error.log" failed (2: No such file or directory) 2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [warn] 27359#0: duplicate MIME type "text/html" in /gnu/store/s2yrd68wr9cd2iirc900491qgigyfg0z-nginx.conf:3593 2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [notice] 27359#0: signal process started 2025-04-07 16:11:26 localhost shepherd[1]: Service nginx stopped. 2025-04-07 16:11:26 localhost shepherd[1]: Service nginx is now stopped. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/bootstrappable.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/issues-guix-gnu-org.https.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/qualif.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/disarchive.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/bootstrappable.https.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/dump-guix-gnu-org.https.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/workflows-guix-info.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/workflows-guix-info.https.access.log. 2025-04-07 16:11:26 localhost shepherd[1]: Starting service user-homes... 2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes has been started. 2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes started. 2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes running with value #t. 2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/https.access.log. --8<---------------cut here---------------end--------------->8--- Unfortunately the strace log isn’t very helpful: --8<---------------cut here---------------start------------->8--- 11:28:27 read(54, "(shepherd-command (version 0) (action stop) (service nginx) (arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) = 122 <0.000162> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=326773810}) = 0 <0.000043> 11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000040> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327037318}) = 0 <0.000040> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327207449}) = 0 <0.000040> 11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000037> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327433802}) = 0 <0.000038> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327698776}) = 0 <0.000038> 11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000036> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327923662}) = 0 <0.000038> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328114326}) = 0 <0.000038> 11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000038> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328338724}) = 0 <0.000038> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=355846206}) = 0 <0.000069> 11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=38, u64=38}}], 16, 0) = 1 <0.000045> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=356163002}) = 0 <0.000041> 11:28:27 read(38, "G", 1) = 1 <0.000049> 11:28:27 read(38, "E", 1) = 1 <0.000038> 11:28:27 read(38, "T", 1) = 1 <0.000038> […] 11:28:27 write(7, "Stopping service certbot-certificate-renewal...\n", 48) = 48 <0.000052> […] 11:28:27 write(25, "2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal stopped.\n", 88) = 88 <0.000049> […] 11:28:27 write(24, "shepherd[1]: Stopping service nginx...\n", 39) = 39 <0.000050> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560464210}) = 0 <0.000045> 11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=17, u64=17}}], 16, 0) = 1 <0.000040> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560715007}) = 0 <0.000042> 11:28:27 getuid() = 0 <0.000052> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=687926724}) = 0 <0.000040> 11:28:27 epoll_wait(11, [], 16, 0) = 0 <0.000038> 11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=688178499}) = 0 <0.000037> 11:28:27 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0 <0.000055> 11:28:27 pipe2([56, 59], O_CLOEXEC) = 0 <0.000060> 11:28:27 fcntl(56, F_GETFL) = 0 (flags O_RDONLY) <0.000038> 11:28:27 lseek(56, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) <0.000037> 11:28:27 fcntl(59, F_GETFL) = 0x1 (flags O_WRONLY) <0.000036> 11:28:27 lseek(59, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek) <0.000036> 11:28:27 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035> 11:28:27 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f01e6e99850) = 12958 <0.008960> 11:28:27 fcntl(56, F_GETFL) = 0 (flags O_RDONLY) <0.000040> 11:28:27 fcntl(56, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000039> 11:28:27 close(59) = 0 <0.000039> 11:28:27 rt_sigprocmask(SIG_SETMASK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0 <0.000038> 11:28:27 recvfrom(17, "shepherd[1]: Stopping service nginx...\n", 1024, 0, 0x7ffc10c01010, [112 => 0]) = 39 <0.000060> […] 11:28:27 read(56, 0x7f01d9a634d0, 1) = -1 EAGAIN (Resource temporarily unavailable) <0.000040> 11:28:27 epoll_ctl(11, EPOLL_CTL_MOD, 56, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = -1 ENOENT (No such file or directory) <0.000041> 11:28:27 epoll_ctl(11, EPOLL_CTL_ADD, 56, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = 0 <0.000043> 11:28:27 write(7, "Registering new logger for nginx.\n", 34) = 34 <0.000048> […] 11:28:28 read(56, "n", 1) = 1 <0.000049> 11:28:28 read(56, "g", 1) = 1 <0.000046> 11:28:28 read(56, "i", 1) = 1 <0.000042> 11:28:28 read(56, "n", 1) = 1 <0.000040> 11:28:28 read(56, "x", 1) = 1 <0.000039> 11:28:28 read(56, ":", 1) = 1 <0.000039> 11:28:28 read(56, " ", 1) = 1 <0.000062> 11:28:28 read(56, "[", 1) = 1 <0.000039> 11:28:28 read(56, "a", 1) = 1 <0.000042> 11:28:28 read(56, "l", 1) = 1 <0.000042> 11:28:28 read(56, "e", 1) = 1 <0.000041> 11:28:28 read(56, "r", 1) = 1 <0.000044> 11:28:28 read(56, "t", 1) = 1 <0.000064> 11:28:28 read(56, "]", 1) = 1 <0.000043> […] 11:28:29 write(7, "[", 1) = 1 <0.000048> 11:28:29 write(7, "nginx", 5) = 5 <0.000035> 11:28:29 write(7, "] ", 2) = 2 <0.000034> 11:28:29 write(7, "2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type \"text/html\" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043", 130) = 130 <0.000036> 11:28:29 write(7, "\n", 1) = 1 <0.000036> […] 11:28:32 accept4(57, {sa_family=AF_INET, sin_port=htons(37996), sin_addr=inet_addr("101.35.215.239")}, [112 => 16], SOCK_CLOEXEC) = 54 <0.000058> 11:28:32 write(7, "Accepted connection on 0.0.0.0:22 from 101.35.215.239:37996.\n", 61) = 61 <0.000050> 11:28:32 fcntl(54, F_GETFL) = 0x2 (flags O_RDWR) <0.000037> 11:28:32 fcntl(54, F_SETFL, O_RDWR) = 0 <0.000037> 11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512507676}) = 0 <0.000037> 11:28:32 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035> 11:28:32 epoll_wait(11, [{events=EPOLLIN, data={u32=6, u64=6}}], 16, 273) = 1 <0.000039> 11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512933004}) = 0 <0.000037> […] 11:28:52 accept4(16, 0x7ffc10c01010, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) <0.000057> 11:28:52 epoll_ctl(11, EPOLL_CTL_MOD, 16, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=16, u64=16}}) = 0 <0.000042> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448197354}) = 0 <0.000037> 11:28:52 epoll_wait(11, [], 16, 0) = 0 <0.000038> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448434462}) = 0 <0.000036> 11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000046> 11:28:52 write(41, "2025-04-08 11:28:52 39:4 17 (call-with-sigint _ _)\n", 56) = 56 <0.000105> 11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000045> 11:28:52 write(41, "2025-04-08 11:28:52 In fibers.scm:\n", 35) = 35 <0.000058> 11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000045> 11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo: ludo : TTY=pts/9 ; PWD=/home/ludo/src/maintenance/hydra ; USER=root ; COMMAND=/run/current-system/profile/bin/herd status nginx\n", 168) = 168 <0.000062> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450211551}) = 0 <0.000039> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450387060}) = 0 <0.000039> 11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000045> 11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=30012)\n", 111) = 111 <0.000051> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450934120}) = 0 <0.000038> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=451064163}) = 0 <0.000059> 11:28:52 read(59, "(shepherd-command (version 0) (action status) (service nginx) (arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) = 124 <0.000084> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452217926}) = 0 <0.000040> 11:28:52 epoll_wait(11, [], 16, 0) = 0 <0.000044> 11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452460901}) = 0 <0.000039> --8<---------------cut here---------------end--------------->8--- At that point it was still accepting connections on /var/run/shepherd/socket and reading them, reading process output and logging it, but not much else. For example, the “Accepted connection on …:22” messages do not have a corresponding ‘clone’ call and not even “Starting service sshd-1234...” as we’d normally see right after. Ludo’.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.