GNU bug report logs - #56674
[Shepherd] Use of ‘waitpid’, ‘system*’, etc. in service code can cause deadlocks
Reported by: Ludovic Courtès <ludo <at> gnu.org>
Date: Wed, 20 Jul 2022 21:40:01 UTC
Severity: important
Merged with 58926
Done: Ludovic Courtès <ludo <at> gnu.org>
Bug is archived. No further changes may be made.
Message #10 received at 56674 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
On 20-07-2022 23:39, Ludovic Courtès wrote:
> Hi!
>
> We’ve just had a bad experience with the nginx service on berlin, where
> ‘herd restart nginx’ would cause shepherd to get stuck forever in
> ‘waitpid’ on the process that was supposed to start nginx.
>
> The details are unclear, but one thing that is clear is that using ‘waitpid’
> (either directly or indirectly with ‘system*’, which is what
> ‘nginx-service-type’ does) is not great:
>
> 1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
> is in ‘waitpid’ waiting for child process completion (“stuck” as
> in: doesn’t do anything, not even answering ‘herd’ requests or
> inetd connections).
>
> 2. I don’t think that can happen with ‘system*’ (because it’s in C),
> but generally speaking, there’s a possibility that shepherd’s event
> loop will handle child process termination before some other
> user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
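Point 2 above can be made concrete with a minimal C sketch. It is an illustration only, not Shepherd code (Shepherd is written in Guile Scheme, but it relies on the same POSIX calls). The hypothetical `reap_all_children` stands in for an event loop that reaps every terminated child with `waitpid(-1, ..., WNOHANG)`; if it runs before the code that started the child calls `waitpid` on that PID, the latter fails with `ECHILD`.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for an event loop reacting to SIGCHLD: reap
   every terminated child without blocking, whoever started it.
   Returns the number of children reaped. */
static int reap_all_children(void)
{
  int reaped = 0;
  while (waitpid(-1, NULL, WNOHANG) > 0)
    reaped++;
  return reaped;
}

/* Start a child, let the "event loop" reap it first, then call waitpid
   the way user code (or system*) would.  Returns 0 if that waitpid
   succeeded, the errno value it failed with otherwise, or -1 if the
   setup itself failed. */
static int demonstrate_reaping_race(void)
{
  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0)
    _exit(0);                         /* child: exit right away */

  /* Wait until the child has terminated but leave it waitable
     (WNOWAIT), so that the ordering below is deterministic. */
  siginfo_t info;
  if (waitid(P_PID, (id_t) pid, &info, WEXITED | WNOWAIT) != 0)
    return -1;

  /* The event loop gets there first and collects the child... */
  if (reap_all_children() < 1)
    return -1;

  /* ...so the code that actually started it can no longer wait for it. */
  int status;
  if (waitpid(pid, &status, 0) == pid)
    return 0;
  return errno;
}
```

Calling `demonstrate_reaping_race()` should return `ECHILD`: the child's exit status has already been consumed by the loop, so whoever started the child never gets to see it.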
>
> So I can think of several ways to address it:
>
> 1. Change the nginx service ‘stop’ method to just
> (make-kill-destructor), which should work just as well as invoking
> “nginx -s stop”.
>
> 2. Have Shepherd provide a replacement for ‘system*’.
Why Shepherd and not Guile Fibers? Is this a Shepherd-specific problem?
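Options 1 and 2 can be sketched in C as well, as an illustration rather than Shepherd's actual API. The names `spawn_command`, `poll_child` and `stop_child` are hypothetical. `spawn_command` starts a program and returns its PID at once instead of waiting like `system*`; `poll_child` checks for termination with `WNOHANG`, so the caller never blocks; `stop_child` sends `SIGTERM`, which is roughly what `make-kill-destructor` does (assuming its default signal is `SIGTERM`) instead of running "nginx -s stop".

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical non-blocking replacement for system*: start ARGV[0] with
   the arguments in ARGV (NULL-terminated) and return its PID at once,
   or -1 if fork failed.  The caller is responsible for reaping it. */
static pid_t spawn_command(char *const argv[])
{
  pid_t pid = fork();
  if (pid == 0)
    {
      execvp(argv[0], argv);
      _exit(127);                     /* exec failed, as with system */
    }
  return pid;
}

/* Check whether PID has terminated, without blocking.  Returns 1 and
   stores its wait status in *STATUS if it has, 0 if it is still
   running, and -1 on error. */
static int poll_child(pid_t pid, int *status)
{
  pid_t r = waitpid(pid, status, WNOHANG);
  if (r == pid)
    return 1;
  return r == 0 ? 0 : -1;
}

/* Stop PID by sending it SIGTERM, as a kill destructor would, then poll
   until it has terminated.  Returns 1 on success, -1 on error. */
static int stop_child(pid_t pid, int *status)
{
  if (kill(pid, SIGTERM) != 0)
    return -1;
  for (;;)
    {
      int done = poll_child(pid, status);
      if (done != 0)
        return done;
      /* A real event loop would serve other requests here rather than
         sleeping for 10 ms. */
      struct timespec delay = { 0, 10 * 1000 * 1000 };
      nanosleep(&delay, NULL);
    }
}
```

The polling loop in `stop_child` is a simplification; an event loop would rather be woken up by SIGCHLD and only then call `poll_child`.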
>
> Thoughts?
3. Make waitpid (or a variant that does what we need) interact well with
Guile Fibers, just as 'accept' doesn't prevent switching to another
fiber. There is some Linux API based on signal handlers or pid fds or
such that might be useful here, though I don't recall its name.
Presumably something similar can be done for the Hurd, though some C glue
may be needed to access the right Hurd APIs if the signal handler API
isn't portable.
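On Linux, option 3 could look roughly like the following sketch. The API in question may be `signalfd(2)`, which turns SIGCHLD into a readable file descriptor, or `pidfd_open(2)` (Linux 5.3 and later), which yields a pollable descriptor for a single process; this sketch assumes `signalfd`. A fiber scheduler could then suspend the waiting fiber until the descriptor is readable, much as Fibers does for non-blocking sockets. `make_sigchld_fd` and `wait_with_event_loop` are hypothetical names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return a descriptor that becomes readable when SIGCHLD arrives, or
   -1 on error.  SIGCHLD is blocked so that it is queued on the
   descriptor instead of being delivered asynchronously; call this
   before forking so that no SIGCHLD is missed.  (Real code would also
   unblock SIGCHLD in the child before exec: the mask survives exec.) */
static int make_sigchld_fd(void)
{
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, SIGCHLD);
  if (sigprocmask(SIG_BLOCK, &mask, NULL) != 0)
    return -1;
  return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Minimal event loop: sleep in poll until SIGCHLD is reported on SFD,
   then reap terminated children until CHILD is among them.  Returns
   CHILD's wait status, or -1 on error.  A real loop would watch its
   other descriptors (sockets, etc.) in the same poll call. */
static int wait_with_event_loop(int sfd, pid_t child)
{
  struct pollfd fds[1] = { { .fd = sfd, .events = POLLIN } };
  for (;;)
    {
      if (poll(fds, 1, -1) < 0)
        {
          if (errno == EINTR)
            continue;
          return -1;
        }
      if (!(fds[0].revents & POLLIN))
        return -1;                    /* POLLERR or similar */

      struct signalfd_siginfo si;
      if (read(sfd, &si, sizeof si) != (ssize_t) sizeof si)
        return -1;

      /* Several SIGCHLDs can be merged into one, so reap every
         terminated child instead of trusting si.ssi_pid alone.  A real
         loop would hand other children's statuses to their owners. */
      bool found = false;
      int status, child_status = 0;
      pid_t pid;
      while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
        if (pid == child)
          {
            found = true;
            child_status = status;
          }
      if (found)
        return child_status;
    }
}
```

Because the loop here is the only place that reaps children, the race described in point 2 cannot occur; user code would instead ask the loop for the status of its child.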
Alternatively:
4. Do the waitpid in a separate thread (this needs a work-around for the
multi-threaded fork problem, probably in C? Or modifying Guile and
maybe glibc to avoid async-unsafe things, or make more things async-safe,
or whatever the appropriate ...-safe is here.)
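Option 4 could be sketched as follows, assuming a hypothetical `watch_child` helper: a detached thread blocks in `waitpid` and writes the wait status to a pipe, whose read end the event loop (or a fiber) can watch like any other descriptor. The caveat raised above still applies: once the process has a second thread, POSIX only allows a child created by `fork` to call async-signal-safe functions until it calls `exec`, which is where Guile and glibc would need care.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical: what the waiter thread needs to know. */
struct waiter
{
  pid_t pid;
  int write_fd;
};

/* Thread body: block in waitpid (only this thread is stuck) and send
   the wait status, or -1 on error, down the pipe. */
static void *waiter_thread(void *arg)
{
  struct waiter *w = arg;
  int status;
  pid_t r;
  do
    r = waitpid(w->pid, &status, 0);
  while (r < 0 && errno == EINTR);
  if (r != w->pid)
    status = -1;
  /* Writes of at most PIPE_BUF bytes to a pipe are atomic. */
  ssize_t n = write(w->write_fd, &status, sizeof status);
  (void) n;
  close(w->write_fd);
  free(w);
  return NULL;
}

/* Start a detached thread that waits for PID.  Returns the read end of
   a pipe that becomes readable, yielding one int wait status, once PID
   has terminated; returns -1 on error. */
static int watch_child(pid_t pid)
{
  int fds[2];
  if (pipe(fds) != 0)
    return -1;

  struct waiter *w = malloc(sizeof *w);
  pthread_t tid;
  if (w != NULL)
    {
      w->pid = pid;
      w->write_fd = fds[1];
      if (pthread_create(&tid, NULL, waiter_thread, w) == 0)
        {
          pthread_detach(tid);
          return fds[0];
        }
      free(w);
    }
  close(fds[0]);
  close(fds[1]);
  return -1;
}
```

In this design the main thread never blocks in `waitpid`; it only reads from the pipe once its event loop reports the descriptor as readable.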
If not a Guile Fibers interaction problem, then the asynchronous signal
handler API might still be useful.
Greetings,
Maxime
[OpenPGP_0x49E3EE22191725EE.asc (application/pgp-keys, attachment)]
[OpenPGP_signature (application/pgp-signature, attachment)]
This bug report was last modified 2 years and 182 days ago.