GNU bug report logs - #41948
Shepherd deadlocks

Previous Next

Package: guix;

Reported by: Mathieu Othacehe <othacehe <at> gnu.org>

Date: Fri, 19 Jun 2020 08:42:01 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


Message #14 received at 41948 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Mathieu Othacehe <othacehe <at> gnu.org>
Cc: 41948 <at> debbugs.gnu.org
Subject: Re: bug#41948: Shepherd deadlocks
Date: Sat, 20 Jun 2020 12:31:40 +0200
Hi,

Mathieu Othacehe <othacehe <at> gnu.org> skribis:

> When running "gui-installed-desktop-os-encrypted" test, Shepherd seems
> to deadlock when restarting "guix-daemon". This can happen at different
> stages:
>
> * In "umount-cow-store" procedure, just before finishing the install.
>
> * During "set-http-proxy" tests inside the marionette.
>   
> This is not always reproducible. In order to gather some information, I
> created a Shepherd "strace" service that logs what's happening in
> Shepherd itself (patch attached).

We should be able to reproduce it with much simpler tests then, right?
Like maybe “while : ; do herd restart guix-daemon ; done” or similar?

> It seems that, just after blocking signals, in "fork+exec-command", I
> guess, Shepherd is taking a lock:
>
> 183553:1     chdir("/")                        = 0
> 183554:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=4387222}) = 0
> 183555:1     rt_sigprocmask(SIG_BLOCK, NULL, [HUP INT TERM CHLD], 8) = 0
> 183560:1     madvise(0x7f179782d000, 12288, MADV_DONTNEED) = 0
> 183561:1     clock_gettime(CLOCK_PROCESS_CPUTIME_ID, {tv_sec=1, tv_nsec=13592357}) = 0
> 183562:1     clone(child_stack=0x7f17981e6fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[271], tls=0x7f17981e7700, child_tidptr=0x7f17981e79d0) = 271
> 183563:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183564:1     write(15, "shepherd[1]: changing HTTP/HTTPS"..., 86 <unfinished ...>
> 183566:1     <... write resumed>)              = 86
> 183575:1     getpgid(169)                      = 169
> 183576:1     getpgid(0)                        = 0
> 183577:1     kill(-169, SIGTERM <unfinished ...>
> 183579:1     <... kill resumed>)               = 0
> 183582:1     stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=1920, ...}) = 0
> 183583:1     write(15, "shepherd[1]: Service guix-daemon"..., 51 <unfinished ...>
> 183585:1     <... write resumed>)              = 51
> 183594:1     getuid()                          = 0
> 183595:1     rt_sigprocmask(SIG_BLOCK, [HUP INT TERM CHLD], [HUP INT TERM CHLD], 8) = 0
> 183596:1     write(6, "\1", 1 <unfinished ...>
> 183598:1     <... write resumed>)              = 1
> 183605:1     futex(0x7f17981e79d0, FUTEX_WAIT, 271, NULL <unfinished ...>
>
> and is then blocking forever.

When that happens, we should check how many threads exist in PID 1.
There should be the finalization thread and the main thread, plus the
signal thread (because there are still ‘sigaction’ calls in the ‘main’
procedure), plus the GC marker threads.

In <https://issues.guix.gnu.org/31925#6>, Andy suggests that the signal
thread is not properly handled; indeed it takes locks and we don’t try
to shut it down upon fork.  However, when using signalfd, the signal
thread must be stuck in its ‘read’ call in ‘read_signal_pipe_data’, so I
don’t see how it could cause problems.

The GC threads are presumably taken care of by the atfork handler in
libgc.

Thoughts?

Ludo’.




This bug report was last modified 4 years and 75 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.