GNU bug report logs - #57922
Shepherd doesn't seem to correctly handle waitpid itself

Previous Next

Package: guix;

Reported by: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Date: Mon, 19 Sep 2022 04:30:03 UTC

Severity: normal

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Subject: bug#57922: closed (Re: bug#57922: Shepherd doesn't seem to
 correctly handle waitpid itself)
Date: Fri, 23 Sep 2022 17:50:02 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#57922: Shepherd doesn't seem to correctly handle waitpid itself

which was filed against the guix package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 57922 <at> debbugs.gnu.org.

-- 
57922: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=57922
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Josselin Poiret <dev <at> jpoiret.xyz>, 57922-done <at> debbugs.gnu.org
Subject: Re: bug#57922: Shepherd doesn't seem to correctly handle waitpid
 itself
Date: Fri, 23 Sep 2022 13:49:26 -0400
tags 57922 +notabug
thanks

Hi Ludo!

Ludovic Courtès <ludo <at> gnu.org> writes:

[...]

>> What I don't understand that well is that this signal handler could be
>> installed only once when shepherd starts, right?  That way, it wouldn't
>> need to depend on specific start actions being chosen.
>
> The SIGCHLD handler is installed lazily since
> f776de04e6702e18d95152072e78c43441d3ccc3.  The rationale was discussed
> here:
>
>   https://issues.guix.gnu.org/27553
>
> That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
> on signalfd(2).  It’s from the main even loop in shepherd.scm that the
> signal handler is called.

I had missed that, thanks for explaining.

>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified   gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>>                      ;; Finally, return the PID of the daemon process.
>>>                      daemon-pid))
>>> -               (stop
>>> -                #~(lambda (pid . args)
>>> -                    (kill pid SIGKILL)
>>> -                    ;; Wait for the process to exit; this prevents overlapping
>>> -                    ;; processes when issuing 'herd restart'.
>>> -                    (waitpid pid)
>>> -                    #f))))))))
>>> +               (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.

You are right, the important difference was SIGTERM vs SIGKILL.  I
thought I had tried that.  The problem only shows itself in the
'jami-provisioning' system test, not the 'jami' one.

Marking this one as notabug and closing.

Thanks again!

Maxim

[Message part 3 (message/rfc822, inline)]
From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: bug-guix <bug-guix <at> gnu.org>
Subject: Shepherd doesn't seem to correctly handle waitpid itself
Date: Mon, 19 Sep 2022 00:29:44 -0400
Hi,

I've tried to determine why a workaround in the jami-service-type is
required in the 'stop' slot to avoid failures in 'herd restart jami',
and haven't quite found the culprit, but it appears to me that:

1. waipid is only called in one place in Shepherd, which is in the
handle-SIGCHLD procedure in (shepherd service), which does not
specifically wait for an exact PID but rather does:

(waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
in the case a system-error exception is thrown with an ECHILD or EINTR
error number.

This doesn't strike me as a strong guarantee that waitpid occurs when
stop is called, because:

1. It requires to be installed in the signal handlers for each
processes, with something like:

--8<---------------cut here---------------start------------->8---
  (unless %sigchld-handler-installed?
    (sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
    (set! %sigchld-handler-installed? #t))
--8<---------------cut here---------------end--------------->8---

Done for fork+exec-command and make-inetd-forkexec-constructor, but not
for make-forkexec-constructor/container, AFAICT;

2. it has the WNOHANG flag, which means the stop simply does a kill the
the signal handling weakly (because of WNOHANG) waits on it, which means
the start may begin before the process was actually completely
terminated.

Here's a small reproducer to apply on our code base:

--8<---------------cut here---------------start------------->8---
modified   gnu/services/telephony.scm
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
 
                     ;; Finally, return the PID of the daemon process.
                     daemon-pid))
-               (stop
-                #~(lambda (pid . args)
-                    (kill pid SIGKILL)
-                    ;; Wait for the process to exit; this prevents overlapping
-                    ;; processes when issuing 'herd restart'.
-                    (waitpid pid)
-                    #f))))))))
+               (stop #~(make-kill-destructor))))))))
 
 (define jami-service-type
   (service-type
--8<---------------cut here---------------end--------------->8---

Then run 'make check-system TESTS=jami-provisioning' to see new
failures, or if you want to investigate manually the system:

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix system vm --no-grafts --no-offload --no-graphic \
   -e '(@@ (gnu tests telephony) %jami-os-provisioning)'

$ /gnu/store/rxi7c14hga62qslb0sr6nac9qnkxr0nn-run-vm.sh -m 1G -smp 4 \
  -nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22

# Connect to the QEMU VM:
$ ssh root <at> localhost -p10022

root <at> jami ~# herd restart jami
Service jami has been stopped.
herd: exception caught while executing 'start' on service 'jami':
dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")
root <at> jami ~# herd status jami
Status of jami:
  It is stopped.
  It is enabled.
  Provides (jami).
  Requires (jami-dbus-session).
  Conflicts with ().
  Will be respawned.
root <at> jami ~# pgrep jami
--8<---------------cut here---------------end--------------->8---

Thanks,

Maxim



This bug report was last modified 2 years and 237 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.