GNU bug report logs -
#41948
Shepherd deadlocks
Previous Next
Reported by: Mathieu Othacehe <othacehe <at> gnu.org>
Date: Fri, 19 Jun 2020 08:42:01 UTC
Severity: important
Done: Ludovic Courtès <ludo <at> gnu.org>
Bug is archived. No further changes may be made.
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
Hi!
Mathieu Othacehe <othacehe <at> gnu.org> skribis:
> Those two finalizer threads share the same pipe. When we try to
> stop the finalizer thread in Shepherd, right before forking a new
> process, we send a '\1' byte to the finalizer pipe.
>
> 1 write(6, "\1", 1 <unfinished ...>
>
>
> which is received by (line 183597):
>
> 253 <... read resumed>"\1", 1) = 1
>
> the marionette finalizer thread. Then, we pthread_join the Shepherd
> finalizer thread, which never stops! Quite unfortunate.
While working on a fix for this issue (finalizer pipe shared between
parent and child process), I found the ‘sleep_pipe’ of the main thread
is also shared between the parent and its child.
Attached is a reproducer. It prints something like this before hanging:
--8<---------------cut here---------------start------------->8---
$ guile ~/src/guile-debugging/signal-pipe.scm
parent: 25947
child: 25953
alarm in parent!
alarm in parent!
alarm in parent!
[...]
alarm in parent!
alarm in parent!
child woken up!
--8<---------------cut here---------------end--------------->8---
“child woken up” means that it’s the child process that won the race
reading on the sleep pipe (from ‘scm_std_select’).
The parent process then hangs because, in ‘scm_std_select’, it did:
1. select(1), which returned due to available data on ‘wakeup_fd’;
2. ‘full_read (wakeup_fd, &dummy, 1)’ gets stuck forever in read(2)
because the child process read that byte in the meantime so
there’s nothing left to read.
Here’s the sequence:
--8<---------------cut here---------------start------------->8---
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000}) = 0 (Timeout)
25947 getpid() = 25947
25947 kill(25947, SIGALRM) = 0
25947 --- SIGALRM {si_signo=SIGALRM, si_code=SI_USER, si_pid=25947, si_uid=1000} ---
25947 write(8, "\16", 1) = 1
25947 rt_sigreturn({mask=[]}) = 0
25952 <... read resumed>"\16", 1) = 1
25947 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
25952 write(4, "\0", 1 <unfinished ...>
25947 <... rt_sigprocmask resumed>[], 8) = 0
25953 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=346370})
25952 <... write resumed>) = 1
25947 select(4, [3], NULL, NULL, {tv_sec=0, tv_usec=100000} <unfinished ...>
25953 read(3, <unfinished ...>
25952 rt_sigprocmask(SIG_BLOCK, NULL, <unfinished ...>
25947 <... select resumed>) = 1 (in [3], left {tv_sec=0, tv_usec=99999})
25953 <... read resumed>"\0", 1) = 1
25947 read(3, <unfinished ...>
25952 <... rt_sigprocmask resumed>~[KILL STOP PWR RTMIN RT_1], 8) = 0
25953 write(1, "child woken up!\n", 16 <unfinished ...>
25952 read(7, <unfinished ...>
--8<---------------cut here---------------end--------------->8---
Notice how “\16” (= SIGALRM) is written by the parent’s main thread and
read by the child’s main thread.
Thoughts?
Ludo’.
[signal-pipe.scm (text/plain, inline)]
;; https://issues.guix.gnu.org/41948
(use-modules (ice-9 match))
(setvbuf (current-output-port) 'line)
(sigaction SIGCHLD pk) ;start signal thread
(match (primitive-fork)
(0
(format #t "child: ~a~%" (getpid))
(let loop ()
(unless (zero? (usleep 500000))
;; If this happens, it means the select(2) call in 'scm_std_select'
;; returned because one of our file descriptors had input data
;; available (which shouldn't happen).
(format #t "child woken up!~%"))
(loop)))
(pid
(format #t "parent: ~a~%" (getpid))
(sigaction SIGALRM (lambda _
(format #t "alarm in parent!~%")))
(let loop ()
(kill (getpid) SIGALRM)
(usleep 100000)
(loop))))
This bug report was last modified 4 years and 77 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.