GNU bug report logs -
#57922
Shepherd doesn't seem to correctly handle waitpid itself
Previous Next
Full log
View this message in rfc822 format
[Message part 1 (text/plain, inline)]
Your bug report
#57922: Shepherd doesn't seem to correctly handle waitpid itself
which was filed against the guix package, has been closed.
The explanation is attached below, along with your original report.
If you require more details, please reply to 57922 <at> debbugs.gnu.org.
--
57922: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=57922
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
tags 57922 +notabug
thanks
Hi Ludo!
Ludovic Courtès <ludo <at> gnu.org> writes:
[...]
>> What I don't understand that well is that this signal handler could be
>> installed only once when shepherd starts, right? That way, it wouldn't
>> need to depend on specific start actions being chosen.
>
> The SIGCHLD handler is installed lazily since
> f776de04e6702e18d95152072e78c43441d3ccc3. The rationale was discussed
> here:
>
> https://issues.guix.gnu.org/27553
>
> That said, on GNU/Linux, SIGCHLD is actually blocked and instead we rely
> on signalfd(2). It’s from the main even loop in shepherd.scm that the
> signal handler is called.
I had missed that, thanks for explaining.
>>> Here's a small reproducer to apply on our code base:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> modified gnu/services/telephony.scm
>>> @@ -685,13 +685,7 @@ (define (archive-name->username archive)
>>>
>>> ;; Finally, return the PID of the daemon process.
>>> daemon-pid))
>>> - (stop
>>> - #~(lambda (pid . args)
>>> - (kill pid SIGKILL)
>>> - ;; Wait for the process to exit; this prevents overlapping
>>> - ;; processes when issuing 'herd restart'.
>>> - (waitpid pid)
>>> - #f))))))))
>>> + (stop #~(make-kill-destructor))))))))
>
> I think the main difference between these two is that the first one uses
> SIGKILL while the second one uses SIGTERM.
>
> You could try #~(make-kill-destructor SIGKILL) to get the same effect.
You are right, the important difference was SIGTERM vs SIGKILL. I
thought I had tried that. The problem only shows itself in the
'jami-provisioning' system test, not the 'jami' one.
Marking this one as notabug and closing.
Thanks again!
Maxim
[Message part 3 (message/rfc822, inline)]
Hi,
I've tried to determine why a workaround in the jami-service-type is
required in the 'stop' slot to avoid failures in 'herd restart jami',
and haven't quite found the culprit, but it appears to me that:
1. waipid is only called in one place in Shepherd, which is in the
handle-SIGCHLD procedure in (shepherd service), which does not
specifically wait for an exact PID but rather does:
(waitpid* WAIT_ANY WNOHANG), which is waitpid with some special handling
in the case a system-error exception is thrown with an ECHILD or EINTR
error number.
This doesn't strike me as a strong guarantee that waitpid occurs when
stop is called, because:
1. It requires to be installed in the signal handlers for each
processes, with something like:
--8<---------------cut here---------------start------------->8---
(unless %sigchld-handler-installed?
(sigaction SIGCHLD handle-SIGCHLD SA_NOCLDSTOP)
(set! %sigchld-handler-installed? #t))
--8<---------------cut here---------------end--------------->8---
Done for fork+exec-command and make-inetd-forkexec-constructor, but not
for make-forkexec-constructor/container, AFAICT;
2. it has the WNOHANG flag, which means the stop simply does a kill the
the signal handling weakly (because of WNOHANG) waits on it, which means
the start may begin before the process was actually completely
terminated.
Here's a small reproducer to apply on our code base:
--8<---------------cut here---------------start------------->8---
modified gnu/services/telephony.scm
@@ -685,13 +685,7 @@ (define (archive-name->username archive)
;; Finally, return the PID of the daemon process.
daemon-pid))
- (stop
- #~(lambda (pid . args)
- (kill pid SIGKILL)
- ;; Wait for the process to exit; this prevents overlapping
- ;; processes when issuing 'herd restart'.
- (waitpid pid)
- #f))))))))
+ (stop #~(make-kill-destructor))))))))
(define jami-service-type
(service-type
--8<---------------cut here---------------end--------------->8---
Then run 'make check-system TESTS=jami-provisioning' to see new
failures, or if you want to investigate manually the system:
--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix system vm --no-grafts --no-offload --no-graphic \
-e '(@@ (gnu tests telephony) %jami-os-provisioning)'
$ /gnu/store/rxi7c14hga62qslb0sr6nac9qnkxr0nn-run-vm.sh -m 1G -smp 4 \
-nic user,model=virtio-net-pci,hostfwd=tcp::10022-:22
# Connect to the QEMU VM:
$ ssh root <at> localhost -p10022
root <at> jami ~# herd restart jami
Service jami has been stopped.
herd: exception caught while executing 'start' on service 'jami':
dbus "method failed with error" "org.freedesktop.DBus.Error.NoReply" ("Message recipient disconnected from message bus without replying")
root <at> jami ~# herd status jami
Status of jami:
It is stopped.
It is enabled.
Provides (jami).
Requires (jami-dbus-session).
Conflicts with ().
Will be respawned.
root <at> jami ~# pgrep jami
--8<---------------cut here---------------end--------------->8---
Thanks,
Maxim
This bug report was last modified 2 years and 237 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.