GNU bug report logs - #77407
shepherd hangs at "herd restart nginx"

Previous Next

Package: guix;

Reported by: Andreas Enge <andreas <at> enge.fr>

Date: Mon, 31 Mar 2025 16:04:03 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


Message #16 received at 77407 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Andreas Enge <andreas <at> enge.fr>
Cc: 77407 <at> debbugs.gnu.org, mail <at> cbaines.net
Subject: Re: bug#77407: shepherd hangs at "herd restart nginx"
Date: Wed, 09 Apr 2025 18:57:08 +0200
Hi!

I just realized: this may well be fixed by Shepherd commit
a5d9a41aa0bfadea9617df4411b86c6ea0c525bd (see
<https://issues.guix.gnu.org/77373>.)  Would be great to give it a try
somehow.

Anyway, some more data.

Yesterday I ran ‘herd stop nginx’ on bayfront (which was running 1.0.2),
again causing ‘herd status’ to hang afterwards.  This time I captured an
strace log of PID 1.

The /var/log/messages showing what happens after ‘herd stop nginx’

--8<---------------cut here---------------start------------->8---
2025-04-08 11:28:27 localhost shepherd[1]: Stopping service certbot-certificate-renewal...
2025-04-08 11:28:27 localhost shepherd[1]: Terminating timer 'certbot-certificate-renewal' with 0 processes running.
2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal stopped.
2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal is now stopped.
2025-04-08 11:28:27 localhost shepherd[1]: Stopping service nginx...
2025-04-08 11:28:27 localhost shepherd[1]: Registering new logger for nginx.
2025-04-08 11:28:28 localhost shepherd[1]: [nginx] nginx: [alert] could not open error log file: open() "/gnu/store/w2pi23rrk1dy95rzpb9l87gcj3c4jb38-nginx-1.27.3/logs/error.log" failed (2: No such file or directory)
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type "text/html" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type "text/html" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1107
2025-04-08 11:28:29 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type "text/html" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:4100
2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: conflicting server name "bayfront.guix.gnu.org" on 0.0.0.0:80, ignored
2025-04-08 11:28:32 localhost shepherd[1]: Service nginx might have failed to stop.
2025-04-08 11:28:32 localhost shepherd[1]: [nginx] 2025/04/08 11:28:28 [warn] 12958#0: conflicting server name "logs.guix.gnu.org" on 0.0.0.0:80, ignored
2025-04-08 11:28:32 localhost shepherd[1]: Service nginx is now stopped.
2025-04-08 11:28:32 localhost shepherd[1]: Accepted connection on 0.0.0.0:22 from 101.35.215.239:37996.
2025-04-08 11:29:40 localhost shepherd[1]: 2 connections still in use after sshd-25913 termination.
2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 (PID 12257) exited with 3.
2025-04-08 11:29:40 localhost shepherd[1]: Service sshd-25913 has been disabled.
2025-04-08 11:32:37 localhost shepherd[1]: Accepted connection on [::]:22 from [2a0c:e300::58]:40428.
--8<---------------cut here---------------end--------------->8---

(The “might have failed to stop” bit is harmless and fixed in
3f4438741615aa1a1198f9415be8bd110138d83e.)

On berlin, which does not exhibit the bug, we get something similar (for
‘herd restart nginx’ here):

--8<---------------cut here---------------start------------->8---
2025-04-07 16:11:25 localhost shepherd[1]: Stopping service certbot-renewal...
2025-04-07 16:11:25 localhost shepherd[1]: Terminating timer 'certbot-renewal' with 0 processes running.
2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal stopped.
2025-04-07 16:11:25 localhost shepherd[1]: Service certbot-renewal is now stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Stopping service nginx...
2025-04-07 16:11:26 localhost shepherd[1]: Registering new logger for nginx.
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] nginx: [alert] could not open error log file: open() "/gnu/store/40mb76jc9qbm0d36qyncnpgrgsfab0db-nginx-1.27.3/logs/error.log" failed (2: No such file or directory)
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [warn] 27359#0: duplicate MIME type "text/html" in /gnu/store/s2yrd68wr9cd2iirc900491qgigyfg0z-nginx.conf:3593
2025-04-07 16:11:26 localhost shepherd[1]: [nginx] 2025/04/07 16:11:26 [notice] 27359#0: signal process started
2025-04-07 16:11:26 localhost shepherd[1]: Service nginx stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Service nginx is now stopped.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/bootstrappable.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/issues-guix-gnu-org.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/qualif.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/disarchive.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/bootstrappable.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/dump-guix-gnu-org.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/workflows-guix-info.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/workflows-guix-info.https.access.log.
2025-04-07 16:11:26 localhost shepherd[1]: Starting service user-homes...
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes has been started.
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes started.
2025-04-07 16:11:26 localhost shepherd[1]: Service user-homes running with value #t.
2025-04-07 16:11:26 localhost shepherd[1]: Respawning anonip-/var/log/anonip/https.access.log.
--8<---------------cut here---------------end--------------->8---

Unfortunately the strace log isn’t very helpful:

--8<---------------cut here---------------start------------->8---
11:28:27 read(54, "(shepherd-command (version 0) (action stop) (service nginx) (arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) = 122 <0.000162>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=326773810}) = 0 <0.000043>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327037318}) = 0 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327207449}) = 0 <0.000040>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000037>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327433802}) = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327698776}) = 0 <0.000038>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000036>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=327923662}) = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328114326}) = 0 <0.000038>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=328338724}) = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=355846206}) = 0 <0.000069>
11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=38, u64=38}}], 16, 0) = 1 <0.000045>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=356163002}) = 0 <0.000041>
11:28:27 read(38, "G", 1)               = 1 <0.000049>
11:28:27 read(38, "E", 1)               = 1 <0.000038>
11:28:27 read(38, "T", 1)               = 1 <0.000038>
[…]
11:28:27 write(7, "Stopping service certbot-certificate-renewal...\n", 48) = 48 <0.000052>
[…]
11:28:27 write(25, "2025-04-08 11:28:27 localhost shepherd[1]: Service certbot-certificate-renewal stopped.\n", 88) = 88 <0.000049>
[…]
11:28:27 write(24, "shepherd[1]: Stopping service nginx...\n", 39) = 39 <0.000050>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560464210}) = 0 <0.000045>
11:28:27 epoll_wait(11, [{events=EPOLLIN, data={u32=17, u64=17}}], 16, 0) = 1 <0.000040>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=560715007}) = 0 <0.000042>
11:28:27 getuid()                       = 0 <0.000052>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=687926724}) = 0 <0.000040>
11:28:27 epoll_wait(11, [], 16, 0)      = 0 <0.000038>
11:28:27 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104507, tv_nsec=688178499}) = 0 <0.000037>
11:28:27 rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0 <0.000055>
11:28:27 pipe2([56, 59], O_CLOEXEC)     = 0 <0.000060>
11:28:27 fcntl(56, F_GETFL)             = 0 (flags O_RDONLY) <0.000038>
11:28:27 lseek(56, 0, SEEK_CUR)         = -1 ESPIPE (Illegal seek) <0.000037>
11:28:27 fcntl(59, F_GETFL)             = 0x1 (flags O_WRONLY) <0.000036>
11:28:27 lseek(59, 0, SEEK_CUR)         = -1 ESPIPE (Illegal seek) <0.000036>
11:28:27 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035>
11:28:27 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f01e6e99850) = 12958 <0.008960>
11:28:27 fcntl(56, F_GETFL)             = 0 (flags O_RDONLY) <0.000040>
11:28:27 fcntl(56, F_SETFL, O_RDONLY|O_NONBLOCK) = 0 <0.000039>
11:28:27 close(59)                      = 0 <0.000039>
11:28:27 rt_sigprocmask(SIG_SETMASK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0 <0.000038>
11:28:27 recvfrom(17, "shepherd[1]: Stopping service nginx...\n", 1024, 0, 0x7ffc10c01010, [112 => 0]) = 39 <0.000060>
[…]
11:28:27 read(56, 0x7f01d9a634d0, 1)    = -1 EAGAIN (Resource temporarily unavailable) <0.000040>
11:28:27 epoll_ctl(11, EPOLL_CTL_MOD, 56, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = -1 ENOENT (No such file or directory) <0.000041>
11:28:27 epoll_ctl(11, EPOLL_CTL_ADD, 56, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=56, u64=56}}) = 0 <0.000043>
11:28:27 write(7, "Registering new logger for nginx.\n", 34) = 34 <0.000048>
[…]
11:28:28 read(56, "n", 1)               = 1 <0.000049>
11:28:28 read(56, "g", 1)               = 1 <0.000046>
11:28:28 read(56, "i", 1)               = 1 <0.000042>
11:28:28 read(56, "n", 1)               = 1 <0.000040>
11:28:28 read(56, "x", 1)               = 1 <0.000039>
11:28:28 read(56, ":", 1)               = 1 <0.000039>
11:28:28 read(56, " ", 1)               = 1 <0.000062>
11:28:28 read(56, "[", 1)               = 1 <0.000039>
11:28:28 read(56, "a", 1)               = 1 <0.000042>
11:28:28 read(56, "l", 1)               = 1 <0.000042>
11:28:28 read(56, "e", 1)               = 1 <0.000041>
11:28:28 read(56, "r", 1)               = 1 <0.000044>
11:28:28 read(56, "t", 1)               = 1 <0.000064>
11:28:28 read(56, "]", 1)               = 1 <0.000043>
[…]
11:28:29 write(7, "[", 1)               = 1 <0.000048>
11:28:29 write(7, "nginx", 5)           = 5 <0.000035>
11:28:29 write(7, "] ", 2)              = 2 <0.000034>
11:28:29 write(7, "2025/04/08 11:28:28 [warn] 12958#0: duplicate MIME type \"text/html\" in /gnu/store/plpbgkr7wxl0qm4v0zf24vkvznvlc28y-nginx.conf:1043", 130) = 130 <0.000036>
11:28:29 write(7, "\n", 1)              = 1 <0.000036>
[…]
11:28:32 accept4(57, {sa_family=AF_INET, sin_port=htons(37996), sin_addr=inet_addr("101.35.215.239")}, [112 => 16], SOCK_CLOEXEC) = 54 <0.000058>
11:28:32 write(7, "Accepted connection on 0.0.0.0:22 from 101.35.215.239:37996.\n", 61) = 61 <0.000050>
11:28:32 fcntl(54, F_GETFL)             = 0x2 (flags O_RDWR) <0.000037>
11:28:32 fcntl(54, F_SETFL, O_RDWR)     = 0 <0.000037>
11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512507676}) = 0 <0.000037>
11:28:32 rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0 <0.000035>
11:28:32 epoll_wait(11, [{events=EPOLLIN, data={u32=6, u64=6}}], 16, 273) = 1 <0.000039>
11:28:32 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104512, tv_nsec=512933004}) = 0 <0.000037>
[…]
11:28:52 accept4(16, 0x7ffc10c01010, [112], SOCK_CLOEXEC|SOCK_NONBLOCK) = -1 EAGAIN (Resource temporarily unavailable) <0.000057>
11:28:52 epoll_ctl(11, EPOLL_CTL_MOD, 16, {events=EPOLLIN|EPOLLRDHUP|EPOLLONESHOT, data={u32=16, u64=16}}) = 0 <0.000042>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448197354}) = 0 <0.000037>
11:28:52 epoll_wait(11, [], 16, 0)      = 0 <0.000038>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=448434462}) = 0 <0.000036>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000046>
11:28:52 write(41, "2025-04-08 11:28:52      39:4 17 (call-with-sigint _ _)\n", 56) = 56 <0.000105>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(41, "2025-04-08 11:28:52 In fibers.scm:\n", 35) = 35 <0.000058>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo:     ludo : TTY=pts/9 ; PWD=/home/ludo/src/maintenance/hydra ; USER=root ; COMMAND=/run/current-system/profile/bin/herd status nginx\n", 168) = 168 <0.000062>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450211551}) = 0 <0.000039>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450387060}) = 0 <0.000039>
11:28:52 newfstatat(AT_FDCWD, "/etc/localtime", {st_mode=S_IFREG|0444, st_size=2962, ...}, 0) = 0 <0.000045>
11:28:52 write(19, "2025-04-08 11:28:52 localhost sudo: pam_unix(sudo:session): session opened for user root(uid=0) by (uid=30012)\n", 111) = 111 <0.000051>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=450934120}) = 0 <0.000038>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=451064163}) = 0 <0.000059>
11:28:52 read(59, "(shepherd-command (version 0) (action status) (service nginx) (arguments ()) (directory \"/home/ludo/src/maintenance/hydra\"))", 1024) = 124 <0.000084>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452217926}) = 0 <0.000040>
11:28:52 epoll_wait(11, [], 16, 0)      = 0 <0.000044>
11:28:52 clock_gettime(CLOCK_REALTIME, {tv_sec=1744104532, tv_nsec=452460901}) = 0 <0.000039>
--8<---------------cut here---------------end--------------->8---

At that point it was still accepting connections on
/var/run/shepherd/socket and reading them, reading process output and
logging it, but not much else.

For example, the “Accepted connection on …:22” messages do not have a
corresponding ‘clone’ call and not even “Starting service sshd-1234...”
as we’d normally see right after.

Ludo’.




This bug report was last modified 37 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.