On 20-07-2022 23:39, Ludovic Courtès wrote:
> Hi!
>
> We’ve just had a bad experience with the nginx service on berlin, where
> ‘herd restart nginx’ would cause shepherd to get stuck forever in
> ‘waitpid’ on the process that was supposed to start nginx.
>
> The details are unclear, but one thing is clear is that using ‘waitpid’
> (either directly or indirectly with ‘system*’, which is what
> ‘nginx-service-type’ does) is not great:
>
>    1. In the best case, shepherd (as of 0.9.1) is stuck while ‘system*’
>       is in ‘waitpid’ waiting for child process completion (“stuck” as
>       in: doesn’t do anything, not even answering ‘herd’ requests or
>       inetd connections.)
>
>    2. I don’t think that can happen with ‘system*’ (because it’s in C),
>       but generally speaking, there’s a possibility that shepherd’s event
>       loop will handle child process termination before some other
>       user-made ‘waitpid’ call does.
>
> Anyway, that’s a bad situation.
>
> So I can think of several ways to address it:
>
>    1. Change the nginx service ‘stop’ method to just
>       (make-kill-destructor), which should work just as well as invoking
>       “nginx -s stop”.
>
>    2. Have Shepherd provide a replacement for ‘system*’.
Why Shepherd and not Guile Fibers? Is this a Shepherd-specific problem?
>
> Thoughts?

3. Make waitpid (or a variant that does what we need) interact well with 
Guile Fibers, like how 'accept' doesn't inhibit switching to another 
fiber. There is some Linux API with signal handlers or pid fds or such 
that might be useful here, though I don't recall the name. Presumably 
something similar can be done for the Hurd, though some C glue may be 
needed to access the right Hurd APIs if the signal handler API isn't 
portable.

Alternatively:

4. Do the waitpid in a separate thread (needs work-around for the 
multi-threaded fork problem, probably C things? Or modifying Guile and 
maybe glibc to avoid async-unsafe things or make more things async-safe 
or whatever the appropriate ...-safe is here.)
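
A minimal sketch of (4), again in Python only for illustration: the 
blocking wait happens in a helper thread and the result is handed back 
through a callback. spawn_with_waiter, on_exit and demo are 
hypothetical names. Python's subprocess module does its fork and exec 
in C, which sidesteps the multi-threaded fork problem here; the same 
would not automatically hold for Guile.

```python
import queue
import subprocess
import sys
import threading


def spawn_with_waiter(argv, on_exit):
    """Start argv and call on_exit(status) from a helper thread when it ends."""
    child = subprocess.Popen(argv)

    def waiter():
        # Only this helper thread blocks waiting for the child;
        # the caller's thread stays free to handle other requests.
        on_exit(child.wait())

    thread = threading.Thread(target=waiter, daemon=True)
    thread.start()
    return thread


def demo():
    """Run a child that exits with status 5 and collect that status."""
    results = queue.Queue()
    script = "raise SystemExit(5)"
    spawn_with_waiter([sys.executable, "-c", script], results.put)
    # The main thread is free until it chooses to look at the result.
    return results.get(timeout=30)
```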

If this turns out not to be a Guile Fibers interaction problem, then the 
asynchronous signal handler API might still be useful.

Greetings,
Maxime