On Mon, Sep 1, 2025, 3:25 PM Eli Zaretskii <eliz@gnu.org> wrote:

> > From: Spencer Baugh <sbaugh@janestreet.com>
> > Cc: 79334@debbugs.gnu.org,  eggert@cs.ucla.edu,  dmitry@gutov.dev
> > Date: Mon, 01 Sep 2025 12:18:37 -0400
> >
> > > You assume that the waiting_thread field will be cleared when a thread
> > > dies, but that is not guaranteed.  Why isn't it better to be defensive
> > > and make sure the thread is still alive?  If you are right, that test
> > > will always be true, but what if you are wrong in some rare cases?
> >
> > Fair enough; the patch has changed a lot so this
> > other_thread_is_waiting_for function doesn't exist anymore, but if I
> > re-add it or similar checks, I will check that the thread is alive.
>
> Thanks.
>
> > > What I had in mind was neither delete it nor deactivate it.  IOW, the
> > > loop which goes through the list of processes should only pay
> > > attention to processes locked to this thread and those that are not
> > > locked to any live thread.  It should leave the other processes to
> > > when their threads call status_notify.
> >
> > There are too many ways in which we delete or deactivate processes to do
> > that.
>
> We should audit all of them and prevent that from happening when the
> process is locked to another thread.
>
> > In particular, Lisp programs should be able to call
> > delete-process on a process without getting an error just because some
> > thread is currently calling select and happens to be monitoring that
> > process.  To emit such an error would make programming with processes
> > and threads extremely difficult.
>
> This is a problem two orders of magnitude smaller than what we are
> trying to fix here.  So my suggestion is to leave it for later.
> There's no special reason why Lisp programs "should be able" to delete
> a process locked to another thread.  If we don't want to signal an
> error (though I don't see why not), we could perhaps do nothing
> silently, until we get to dealing with that later.
>

Silently doing nothing is also an unacceptably confusing result, in my
opinion.

But it's okay: I believe it's possible to solve the problem without
introducing such bad behavior.

For now, sabotaging threads that wait for some sub-process is a much
> more grave problem, and we must fix it if we want threads and
> processes to live in peace in Emacs.
>

Of course.

> > Hmm... I don't understand: why do we need to do this?  The calls to
> > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
> > > set bits in the Available mask passed top pselect already avoid
> > > setting bits for descriptors on which some other thread waits.  So
> > > pselect cannot possibly indicate any output ready on any descriptors
> > > belonging to other threads, and this loop you suggest to add:
> > >
> > >> +      /* Save the fds we're currently waiting_thread for.  */
> > >> +      fd_set waiting_fds;
> > >> +      FD_ZERO (&waiting_fds);
> > >> +      int max_waiting_fds = 0;
> > >> +      for (int fd = 0; fd <= max_desc; ++fd)
> > >> +        {
> > >> +    if (fd_callback_info[fd].waiting_thread == current_thread)
> > >> +      {
> > >> +        FD_SET (fd, &waiting_fds);
> > >> +        max_waiting_fds = fd;
> > >> +      }
> > >> +  }
> > >
> > > should be unnecessary.  Or what am I missing?
> >
> > That's exactly the problem.  A file descriptor for which we set
> > waiting_thread can have delete_read_fd or delete_write_fd called on it
> > by another thread, or by a signal handler, while we're waiting in
> > pselect.
>
> We need to prevent that from happening, not to go along with it.  But
> anyway, I don't see why you need another fd_set for this.  Why not
> just look at the descriptors in the Available set?
>

The Available set is cleared by the select call, and afterwards only
contains descriptors which has events.  In that case we could just look at
the descriptors in Available and check waiting_thread for each of them.

But if there was an EBADF, then Available is left in an undefined state
(see pselect(1)) and we no longer know what descriptors we originally set
waiting_thread on.  So we can't check to see if one of our descriptors was
"stolen" (had waiting_thread set to a different value), which would mean
the EBADF is spurious and we should continue running.

Though, I guess we could alternatively just assume that EBADF is always
spurious.

> BTW, the fact that this problematic case is also possible via signal
> > handlers (namely the SIGCHLD handler) is a separate bug introduced by
> > the refactoring which added fd_callback_info.  It means that with a
> > certain interleaving, I think it should be possible to get buggy
> > behavior by just using processes without threads.  It's quite rare
> > though and I haven't been able to figure out what exact path that would
> > cause a problem.
>
> Let's not attempt to solve too many problems at the same time.  How to
> make SIGCHLD safer when threads are present is an outstanding issue,
> and the solution will IMO depend in whether signals are delivered to
> the main thread and whether or not the build uses the child signal
> pipe.
>

You misread me: I said SIGCHLD is unsafe even without threads.

>