GNU bug report logs - #56674
[Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks

Package: guix;

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Wed, 20 Jul 2022 21:40:01 UTC

Severity: important

Merged with 58926

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.


Message #10 received at 56674 <at> debbugs.gnu.org:

From: Maxime Devos <maximedevos <at> telenet.be>
To: Ludovic Courtès <ludo <at> gnu.org>, 56674 <at> debbugs.gnu.org
Subject: Re: bug#56674: [Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Date: Thu, 21 Jul 2022 01:48:02 +0200
On 20-07-2022 23:39, Ludovic Courtès wrote:
> Hi!
>
> We’ve just had a bad experience with the nginx service on berlin, where
> ‘herd restart nginx’ would cause shepherd to get stuck forever in
> ‘waitpid’ on the process that was supposed to start nginx.
>
> The details are unclear, but one thing is clear: using ‘waitpid’
> (either directly or indirectly with ‘system*’, which is what
> ‘nginx-service-type’ does) is not great:
>
>    1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
>       is in ‘waitpid’ waiting for child process completion (“stuck” as
>       in: doesn’t do anything, not even answering ‘herd’ requests or
>       inetd connections.)
>
>    2. I don’t think that can happen with ‘system*’ (because it’s in C),
>       but generally speaking, there’s a possibility that shepherd’s event
>       loop will handle child process termination before some other
>       user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
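
To make the second hazard concrete: at the POSIX level a child can be
reaped only once, so whichever waitpid call loses the race fails with
ECHILD. A minimal sketch in Python (used purely for illustration; this is
not Shepherd or Guile code, and `second_wait_fails` is a hypothetical
name):

```python
import os


def second_wait_fails():
    """Return True if waiting on an already-reaped child raises ECHILD."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Stand-in for the event loop reaping the child first. A real loop
    # would more likely call waitpid(-1, WNOHANG) from a SIGCHLD handler.
    os.waitpid(pid, 0)

    # Stand-in for the user-made waitpid that arrives too late.
    try:
        os.waitpid(pid, 0)
    except ChildProcessError:  # errno ECHILD: no such child any more
        return True
    return False
```

How the late caller reacts to that error (or fails to) depends on the
surrounding code.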
>
> So I can think of several ways to address it:
>
>    1. Change the nginx service ‘stop’ method to just
>       (make-kill-destructor), which should work just as well as invoking
>       “nginx -s stop”.
>
>    2. Have Shepherd provide a replacement for ‘system*’.
Why Shepherd and not Guile Fibers? Is this a Shepherd-specific problem?
>
> Thoughts?

3. Make waitpid (or a variant that does what we need) interact well with
Guile Fibers, just as 'accept' doesn't inhibit switching to another
fiber. There is some Linux API involving signal handlers or pid fds or
such that might be useful here, though I don't recall the name.
Presumably something similar can be done for the Hurd, though some C glue
may be needed to access the right Hurd APIs if the signal handler API
isn't portable.
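
As a rough sketch of what option 3 could look like at the OS level, the
following Python snippet waits for a child through a pid file descriptor,
so the loop keeps running while the child is alive. It assumes that
pidfd_open(2) (Linux 5.3 or later) is one of the APIs meant above, which
is a guess. `run_responsively` and `on_tick` are hypothetical names, and
the fallback to short sleeps is an assumption for systems without pid fds.
None of this is Shepherd or Guile Fibers code.

```python
import os
import selectors
import subprocess
import sys
import time


def run_responsively(argv, on_tick, tick=0.05):
    """Run argv to completion, calling on_tick() while it is still running."""
    child = subprocess.Popen(argv)
    try:
        pidfd = os.pidfd_open(child.pid)  # Linux >= 5.3, Python >= 3.9
    except (AttributeError, OSError):
        pidfd = None  # assumption: no pid fds here, fall back to sleeping

    sel = selectors.DefaultSelector()
    if pidfd is not None:
        # A pid fd becomes readable once the process has terminated.
        sel.register(pidfd, selectors.EVENT_READ)
    try:
        # poll() performs a non-blocking waitpid (WNOHANG) on the child.
        while child.poll() is None:
            if pidfd is not None:
                sel.select(timeout=tick)  # wakes early when the child exits
            else:
                time.sleep(tick)
            on_tick()  # stand-in for serving other requests
        return child.returncode
    finally:
        sel.close()
        if pidfd is not None:
            os.close(pidfd)
```

For example, `run_responsively([sys.executable, "-c", "import time;
time.sleep(1)"], handle_requests)` would keep calling the (hypothetical)
`handle_requests` roughly every 50 ms until the child exits. In a
fiber-based design, the pid fd would presumably be handed to the
scheduler's own poller instead of a private loop like this one.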

Alternatively:

4. Do the waitpid in a separate thread. This needs a workaround for the
multi-threaded fork problem, probably in C, or else modifying Guile and
maybe glibc to avoid async-unsafe things or make more things async-safe
(or whatever the appropriate ...-safe is here).
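
A minimal sketch of option 4, again in Python for illustration only: a
helper thread blocks in waitpid and posts the exit code to a queue, so the
main loop never blocks. The child is started with posix_spawn rather than
a bare fork, which is one way to sidestep running arbitrary code between
fork and exec in a multi-threaded process. Whether an equivalent is
practical from Guile is exactly the open question above.
`spawn_and_wait_in_thread` is a hypothetical name.

```python
import os
import queue
import sys
import threading


def spawn_and_wait_in_thread(argv, results):
    """Start argv and report (pid, exit_code) on `results` when it exits."""
    # argv[0] must be a full path; posix_spawn does not search PATH.
    pid = os.posix_spawn(argv[0], argv, os.environ)

    def waiter():
        # Blocks only this helper thread, not the caller's loop.
        _, status = os.waitpid(pid, 0)
        results.put((pid, os.waitstatus_to_exitcode(status)))

    threading.Thread(target=waiter, daemon=True).start()
    return pid
```

The caller would poll the queue from its own loop, for instance with
`results.get_nowait()`, and treat `queue.Empty` as "child still running".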

If this is not a Guile Fibers interaction problem, the asynchronous
signal handler API might still be useful.

Greetings,
Maxime


This bug report was last modified 2 years and 182 days ago.

