GNU bug report logs - #79334
[PATCH] Don't release thread select lock unnecessarily


Package: emacs;

Reported by: Spencer Baugh <sbaugh <at> janestreet.com>

Date: Thu, 28 Aug 2025 21:10:02 UTC

Severity: normal

Tags: patch

To reply to this bug, email your comments to 79334 AT debbugs.gnu.org.




Report forwarded to dmitry <at> gutov.dev, bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Thu, 28 Aug 2025 21:10:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Spencer Baugh <sbaugh <at> janestreet.com>:
New bug report received and forwarded. Copy sent to dmitry <at> gutov.dev, bug-gnu-emacs <at> gnu.org. (Thu, 28 Aug 2025 21:10:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] Don't release thread select lock unnecessarily
Date: Thu, 28 Aug 2025 17:08:43 -0400
[Message part 1 (text/plain, inline)]
Tags: patch


Previously, we were using thread_select to release the thread
lock around two different calls to select in
wait_reading_process_output.  The main call to select was fine,
but we also used thread_select for a pselect call which returns
immediately (with a timeout of 0) and is supposed to just check
if any file descriptors are active.

If we actually thread switch at that pselect, it's likely to
break internal state for Emacs: namely, the call to
status_notify immediately after can close file descriptors which
another thread is selecting on, causing that thread to get EBADF
from select and then call emacs_abort.

We don't need to thread switch here, so don't.

* src/process.c (wait_reading_process_output): Remove
unnecessary thread_select wrapper.
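
(For illustration only -- the actual change is in the attached patch; this
is just a minimal standalone sketch of the zero-timeout pselect poll the
commit message describes, showing that such a call returns immediately and
merely reports which descriptors are ready:)

#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

int
main (void)
{
  /* Watch stdin for readability, analogous to the readiness check in
     wait_reading_process_output.  */
  fd_set rfds;
  FD_ZERO (&rfds);
  FD_SET (STDIN_FILENO, &rfds);

  /* A timeout of 0 turns pselect into a non-blocking poll.  */
  struct timespec zero = { 0, 0 };
  int n = pselect (STDIN_FILENO + 1, &rfds, NULL, NULL, &zero, NULL);

  if (n > 0)
    printf ("stdin is ready\n");
  else if (n == 0)
    printf ("nothing ready; pselect returned immediately\n");
  else
    perror ("pselect");
  return 0;
}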


In GNU Emacs 30.1.90 (build 29, x86_64-pc-linux-gnu, X toolkit, cairo
 version 1.15.12, Xaw scroll bars) of 2025-08-26 built on
 igm-qws-u22796a
Repository revision: 54857fe2fb0ed033afcba231f920f8f7fa185333
Repository branch: emacs-30
Windowing system distributor 'The X.Org Foundation', version 11.0.12011000
System Description: Rocky Linux 8.10 (Green Obsidian)

Configured using:
 'configure --with-x-toolkit=lucid --without-gpm --without-gconf
 --without-selinux --without-imagemagick --with-modules --with-gif=no
 --with-cairo --with-rsvg --without-compress-install --with-tree-sitter
 --with-native-compilation=aot
 PKG_CONFIG_PATH=/usr/local/home/garnish/libtree-sitter/0.22.6-1/lib/pkgconfig/'

[0001-Don-t-release-thread-select-lock-unnecessarily.patch (text/patch, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 12:26:02 GMT) Full text and rfc822 format available.

Message #8 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 79334 <at> debbugs.gnu.org
Subject: Re: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 15:24:59 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Date: Thu, 28 Aug 2025 17:08:43 -0400
> 
> Previously, we were using thread_select to release the thread
> lock around two different calls to select in
> wait_reading_process_output.  The main call to select was fine,
> but we also used thread_select for a pselect call which returns
> immediately (with a timeout of 0) and is supposed to just check
> if any file descriptors are active.
> 
> If we actually thread switch at that pselect, it's likely to
> break internal state for Emacs: namely, the call to
> status_notify immediately after can close file descriptors which
> another thread is selecting on, causing that thread to get EBADF
> from select and then call emacs_abort.
> 
> We don't need to thread switch here, so don't.
> 
> * src/process.c (wait_reading_process_output): Remove
> unnecessary thread_select wrapper.

Thanks.

However, I don't think I follow the logic behind this change.  I agree
that status_notify does things that can be dangerous (more about that
below), but the call to status_notify will happen whether we call
thread_select or pselect directly, so if status_notify can do some
damage, it will do that regardless of whether we call pselect directly
or not.  I think you assume that the problematic call to status_notify
will happen in another thread, if we allow switching threads at this
point, is that right?  IOW, I think you assume that, if we switch to
another thread, that other thread could call status_notify, and thus
close some descriptors on which the current thread, the one which
calls thread_select at this point, is waiting.  If that is your
assumption, then the actual situation could also be the opposite: the
current thread, the one where you want to call pselect, will call
status_notify after pselect returns, and that will close some
descriptors used by other threads.  In that case, calling pselect
directly will make things worse, not better, because it makes this
closing of descriptors imminent.

So, while I agree we have a problem here, I think the place and the
way we should solve it are different.

Which brings us to status_notify.  It indeed can close the descriptors
of processes which terminated, so we should probably make it safer.
For example, we could leave alone descriptors marked with non-NULL
waiting_thread if that thread is still alive and is different from the
current thread.  Would that fix the problem?  If yes, I think we
should install such a change.

We should perhaps also understand better what happens with the current
code if and when descriptors are closed by some other thread in this
scenario.  A thread using such a descriptor could be either stuck in
the pselect call, or it could be before or after the call to pselect.
Paul, what happens if a descriptor passed to pselect becomes closed
while pselect waits? does pselect return with EBADF, or does pselect
handle that gracefully, like by considering such descriptors to be
never "ready"?

If the thread is before or after pselect, then either calling pselect
or attempting to read from a closed descriptor after pselect returns
will cause EBADF.  Is our code prepared to deal with such a situation
(e.g., by ignoring the failed read), or will that cause some Lisp API
to signal an error or produce unexpected results?  If the latter, we
should probably handle EBADF specially, because I don't believe we
could completely prevent this from happening.
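
(For reference, a minimal standalone sketch of the "before the call" case:
an fd that is already closed makes pselect fail with EBADF.  What happens
when the close races with a pselect that is already waiting is the
OS-dependent part of the question above and is not covered by this demo:)

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/select.h>
#include <unistd.h>

int
main (void)
{
  int fds[2];
  if (pipe (fds) != 0)
    return 1;

  /* Simulate another thread having closed the descriptor we are about
     to wait on.  */
  close (fds[0]);

  fd_set rfds;
  FD_ZERO (&rfds);
  FD_SET (fds[0], &rfds);

  struct timespec tmo = { 1, 0 };
  int n = pselect (fds[0] + 1, &rfds, NULL, NULL, &tmo, NULL);
  if (n < 0)
    /* Expect "Bad file descriptor" (EBADF) here.  */
    printf ("pselect failed: %s\n", strerror (errno));
  else
    printf ("pselect returned %d\n", n);
  return 0;
}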




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 13:08:02 GMT) Full text and rfc822 format available.

Message #11 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>,
 Dmitry Gutov <dmitry <at> gutov.dev>
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 09:07:50 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> Date: Thu, 28 Aug 2025 17:08:43 -0400
>> 
>> Previously, we were using thread_select to release the thread
>> lock around two different calls to select in
>> wait_reading_process_output.  The main call to select was fine,
>> but we also used thread_select for a pselect call which returns
>> immediately (with a timeout of 0) and is supposed to just check
>> if any file descriptors are active.
>> 
>> If we actually thread switch at that pselect, it's likely to
>> break internal state for Emacs: namely, the call to
>> status_notify immediately after can close file descriptors which
>> another thread is selecting on, causing that thread to get EBADF
>> from select and then call emacs_abort.
>> 
>> We don't need to thread switch here, so don't.
>> 
>> * src/process.c (wait_reading_process_output): Remove
>> unnecessary thread_select wrapper.
>
> Thanks.
>
> However, I don't think I follow the logic behind this change.  I agree
> that status_notify does things that can be dangerous (more about that
> below), but the call to status_notify will happen whether we call
> thread_select or pselect directly, so if status_notify can do some
> damage, it will do that regardless of whether we call pselect directly
> or not.  I think you assume that the problematic call to status_notify
> will happen in another thread, if we allow switching threads at this
> point, is that right?  IOW, I think you assume that, if we switch to
> another thread, that other thread could call status_notify, and thus
> close some descriptors on which the current thread, the one which
> calls thread_select at this point, is waiting.  If that is your
> assumption, then the actual situation could also be the opposite: the
> current thread, the one where you want to call pselect, will call
> status_notify after pselect returns, and that will close some
> descriptors used by other threads.  In that case, calling pselect
> directly will make things worse, not better, because it makes this
> closing of descriptors imminent.

Yes.  I think there's multiple problems in this code and I haven't
tracked them all down yet.

That being said, I'm confident that it's incorrect and unnecessary to be
using thread_select instead of pselect here, so we should stop doing
that in addition to any other fix.

> So, while I agree we have a problem here, I think the place and the
> way we should solve it are different.
>
> Which brings us to status_notify.  It indeed can close the descriptors
> of processes which terminated, so we should probably make it safer.
> For example, we could leave alone descriptors marked with non-NULL
> waiting_thread if that thread is still alive and is different from the
> current thread.  Would that fix the problem?  If yes, I think we
> should install such a change.

That was my initial idea for a fix as well.  But I'm not sure about the
specific details.  Namely, perhaps we need to be checking waiting_thread
everywhere that we're doing FOR_EACH_PROCESS, not just in status_notify.
And maybe some new functions/macros should be added to simplify this;
maybe a version of FOR_EACH_PROCESS which only iterates over processes
for which waiting_thread == current_thread.
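
(A hedged sketch of the kind of predicate such a variant could be built
on -- the function name is hypothetical and not from any patch here; the
fields are the ones src/process.c already uses:)

/* Hypothetical helper inside src/process.c: true if P's input
   descriptor is currently being waited on by some other thread, in
   which case FOR_EACH_PROCESS-style iteration should leave it alone.  */
static bool
process_waited_on_by_other_thread_p (struct Lisp_Process *p)
{
  if (p->infd < 0)
    return false;
  struct thread_state *waiter = fd_callback_info[p->infd].waiting_thread;
  return waiter != NULL && waiter != current_thread;
}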

> We should perhaps also understand better what happens with the current
> code if and when descriptors are closed by some other thread in this
> scenario.  A thread using such a descriptor could be either stuck in
> the pselect call, or it could be before or after the call to pselect.
> Paul, what happens if a descriptor passed to pselect becomes closed
> while pselect waits? does pselect return with EBADF, or does pselect
> handle that gracefully, like by considering such descriptors to be
> never "ready"?
>
> If the thread is before or after pselect, then either calling pselect
> or attempting to read from a closed descriptor after pselect returns
> will cause EBADF.  Is our code prepared to deal with such a situation
> (e.g., by ignoring the failed read), or will that cause some Lisp API
> to signal an error or produce unexpected results?  If the latter, we
> should probably handle EBADF specially, because I don't believe we
> could completely prevent this from happening.

(I'm more hopeful: I think we can completely prevent this from
happening, we just need to be more careful about our locking.)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 13:38:03 GMT) Full text and rfc822 format available.

Message #14 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 16:36:51 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: Paul Eggert <eggert <at> cs.ucla.edu>,  79334 <at> debbugs.gnu.org, Dmitry Gutov
>   <dmitry <at> gutov.dev>
> Date: Fri, 29 Aug 2025 09:07:50 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > However, I don't think I follow the logic behind this change.  I agree
> > that status_notify does things that can be dangerous (more about that
> > below), but the call to status_notify will happen whether we call
> > thread_select or pselect directly, so if status_notify can do some
> > damage, it will do that regardless of whether we call pselect directly
> > or not.  I think you assume that the problematic call to status_notify
> > will happen in another thread, if we allow switching threads at this
> > point, is that right?  IOW, I think you assume that, if we switch to
> > another thread, that other thread could call status_notify, and thus
> > close some descriptors on which the current thread, the one which
> > calls thread_select at this point, is waiting.  If that is your
> > assumption, then the actual situation could also be the opposite: the
> > current thread, the one where you want to call pselect, will call
> > status_notify after pselect returns, and that will close some
> > descriptors used by other threads.  In that case, calling pselect
> > directly will make things worse, not better, because it makes this
> > closing of descriptors imminent.
> 
> Yes.  I think there's multiple problems in this code and I haven't
> tracked them all down yet.
> 
> That being said, I'm confident that it's incorrect and unnecessary to be
> using thread_select instead of pselect here, so we should stop doing
> that in addition to any other fix.

If we fix status_notify, what other reasons remain for not calling
thread_select there?

> > So, while I agree we have a problem here, I think the place and the
> > way we should solve it are different.
> >
> > Which brings us to status_notify.  It indeed can close the descriptors
> > of processes which terminated, so we should probably make it safer.
> > For example, we could leave alone descriptors marked with non-NULL
> > waiting_thread if that thread is still alive and is different from the
> > current thread.  Would that fix the problem?  If yes, I think we
> > should install such a change.
> 
> That was my initial idea for a fix as well.  But I'm not sure about the
> specific details.  Namely, perhaps we need to be checking waiting_thread
> everywhere that we're doing FOR_EACH_PROCESS, not just in status_notify.
> And maybe some new functions/macros should be added to simplify this;
> maybe a version of FOR_EACH_PROCESS which only iterates over processes
> for which waiting_thread == current_thread.

You may be right; we should audit all the users of FOR_EACH_PROCESS.
At least one of them (get-buffer-process) is probably okay as it is,
but others might need to avoid looking at processes locked to other
threads, indeed.

> > We should perhaps also understand better what happens with the current
> > code if and when descriptors are closed by some other thread in this
> > scenario.  A thread using such a descriptor could be either stuck in
> > the pselect call, or it could be before or after the call to pselect.
> > Paul, what happens if a descriptor passed to pselect becomes closed
> > while pselect waits? does pselect return with EBADF, or does pselect
> > handle that gracefully, like by considering such descriptors to be
> > never "ready"?
> >
> > If the thread is before or after pselect, then either calling pselect
> > or attempting to read from a closed descriptor after pselect returns
> > will cause EBADF.  Is our code prepared to deal with such a situation
> > (e.g., by ignoring the failed read), or will that cause some Lisp API
> > to signal an error or produce unexpected results?  If the latter, we
> > should probably handle EBADF specially, because I don't believe we
> > could completely prevent this from happening.
> 
> (I'm more hopeful: I think we can completely prevent this from
> happening, we just need to be more careful about our locking.)

If we can prevent that completely, it's even better.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 13:46:01 GMT) Full text and rfc822 format available.

Message #17 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 09:45:07 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> Cc: Paul Eggert <eggert <at> cs.ucla.edu>,  79334 <at> debbugs.gnu.org, Dmitry Gutov
>>   <dmitry <at> gutov.dev>
>> Date: Fri, 29 Aug 2025 09:07:50 -0400
>> 
>> Eli Zaretskii <eliz <at> gnu.org> writes:
>> 
>> > However, I don't think I follow the logic behind this change.  I agree
>> > that status_notify does things that can be dangerous (more about that
>> > below), but the call to status_notify will happen whether we call
>> > thread_select or pselect directly, so if status_notify can do some
>> > damage, it will do that regardless of whether we call pselect directly
>> > or not.  I think you assume that the problematic call to status_notify
>> > will happen in another thread, if we allow switching threads at this
>> > point, is that right?  IOW, I think you assume that, if we switch to
>> > another thread, that other thread could call status_notify, and thus
>> > close some descriptors on which the current thread, the one which
>> > calls thread_select at this point, is waiting.  If that is your
>> > assumption, then the actual situation could also be the opposite: the
>> > current thread, the one where you want to call pselect, will call
>> > status_notify after pselect returns, and that will close some
>> > descriptors used by other threads.  In that case, calling pselect
>> > directly will make things worse, not better, because it makes this
>> > closing of descriptors imminent.
>> 
>> Yes.  I think there's multiple problems in this code and I haven't
>> tracked them all down yet.
>> 
>> That being said, I'm confident that it's incorrect and unnecessary to be
>> using thread_select instead of pselect here, so we should stop doing
>> that in addition to any other fix.
>
> If we fix status_notify, what other reasons remain for not calling
> thread_select there?

Primarily: Calling thread_select here complicates reasoning about the
code.  We should not release the lock when not necessary, because it
makes it harder to think about possible thread interleavings.  And it's
not necessary here because this pselect call returns immediately.

(The only reason the pselect call wouldn't return immediately is if we
released the lock and then switched to another thread)

Also, avoiding thread_select here will slightly improve performance by
avoiding lock operations.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 14:02:02 GMT) Full text and rfc822 format available.

Message #20 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 17:00:56 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Fri, 29 Aug 2025 09:45:07 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> That being said, I'm confident that it's incorrect and unnecessary to be
> >> using thread_select instead of pselect here, so we should stop doing
> >> that in addition to any other fix.
> >
> > If we fix status_notify, what other reasons remain for not calling
> > thread_select there?
> 
> Primarily: Calling thread_select here complicates reasoning about the
> code.  We should not release the lock when not necessary, because it
> makes it harder to think about possible thread interleavings.  And it's
> not necessary here because this pselect call returns immediately.

How do we define "necessary" in this context?

> (The only reason the pselect call wouldn't return immediately is if we
> released the lock and then switched to another thread)

You seem to assume that a call to thread_select is only necessary when
some potentially prolonged operation or wait is expected?  But I'm not
sure this is correct, because that could prevent other threads from
running for too long, and force programmers to inject sleep-for
etc. in their programs.  Allowing thread switches as frequently as
possible gives other threads more opportunities to run (to some extent
simulating preemptive scheduling), which I think is a Good Thing
overall, no?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 14:50:03 GMT) Full text and rfc822 format available.

Message #23 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 10:49:34 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
>> Date: Fri, 29 Aug 2025 09:45:07 -0400
>> 
>> Eli Zaretskii <eliz <at> gnu.org> writes:
>> 
>> >> That being said, I'm confident that it's incorrect and unnecessary to be
>> >> using thread_select instead of pselect here, so we should stop doing
>> >> that in addition to any other fix.
>> >
>> > If we fix status_notify, what other reasons remain for not calling
>> > thread_select there?
>> 
>> Primarily: Calling thread_select here complicates reasoning about the
>> code.  We should not release the lock when not necessary, because it
>> makes it harder to think about possible thread interleavings.  And it's
>> not necessary here because this pselect call returns immediately.
>
> How do we define "necessary" in this context?

It's necessary to release the lock around long pselect calls because
otherwise other threads will not run while Emacs is waiting for input,
which means they probably won't run at all.
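
(For context, a conceptual sketch of what releasing the lock around the
wait looks like -- this is not the actual src/thread.c code, and the lock
helpers are illustrative names, not real functions:)

/* Sketch of the shape of thread_select: drop the global lock so other
   Lisp threads can run while this one blocks in pselect, then take the
   lock back, which is where a thread switch can happen.  Assumes the
   usual <sys/select.h>/<signal.h> declarations.  */
static int
thread_select_sketch (int max_fds, fd_set *rfds, fd_set *wfds, fd_set *efds,
		      struct timespec *timeout, sigset_t *sigmask)
{
  release_the_global_lock ();	/* illustrative name */
  int nfds = pselect (max_fds, rfds, wfds, efds, timeout, sigmask);
  reacquire_the_global_lock ();	/* illustrative name; may switch threads */
  return nfds;
}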

>> (The only reason the pselect call wouldn't return immediately is if we
>> released the lock and then switched to another thread)
>
> You seem to assume that a call to thread_select is only necessary when
> some potentially prolonged operation or wait is expected?  But I'm not
> sure this is correct, because that could prevent other threads from
> running for too long, and force programmers to inject sleep-for
> etc. in their programs.  Allowing thread switches as frequently as
> possible gives other threads more opportunities to run (to some extent
> simulating preemptive scheduling), which I think is a Good Thing
> overall, no?

Three points:

- Most importantly: We can only allow thread switches at known safe
  points.  It's not clear that this is a safe point.  (Actually, we know
  it currently isn't, because of the status_notify issue.  But I suspect
  there may be other issues which are harder to analyze)

- Adding more thread switches is not necessarily a good thing.  It can
  degrade whole-system performance, because context switching is slow.

- wait_reading_process_output already does a thread switch around the
  pselect, so it's not important for one to happen here.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 15:42:02 GMT) Full text and rfc822 format available.

Message #26 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 18:41:00 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Fri, 29 Aug 2025 10:49:34 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> Primarily: Calling thread_select here complicates reasoning about the
> >> code.  We should not release the lock when not necessary, because it
> >> makes it harder to think about possible thread interleavings.  And it's
> >> not necessary here because this pselect call returns immediately.
> >
> > How do we define "necessary" in this context?
> 
> It's necessary to release the lock around long pselect calls because
> otherwise other threads will not run while Emacs is waiting for input,
> which means they probably won't run at all.
> 
> >> (The only reason the pselect call wouldn't return immediately is if we
> >> released the lock and then switched to another thread)
> >
> > You seem to assume that a call to thread_select is only necessary when
> > some potentially prolonged operation or wait is expected?  But I'm not
> > sure this is correct, because that could prevent other threads from
> > running for too long, and force programmers to inject sleep-for
> > etc. in their programs.  Allowing thread switches as frequently as
> > possible gives other threads more opportunities to run (to some extent
> > simulating preemptive scheduling), which I think is a Good Thing
> > overall, no?
> 
> Three points:
> 
> - Most importantly: We can only allow thread switches at known safe
>   points.  It's not clear that this is a safe point.  (Actually, we know
>   it currently isn't, because of the status_notify issue.  But I suspect
>   there may be other issues which are harder to analyze)

This reason will be removed once we fix status_notify.

> - Adding more thread switches is not necessarily a good thing.  It can
>   degrade whole-system performance, because context switching is slow.
> 
> - wait_reading_process_output already does a thread switch around the
>   pselect, so it's not important for one to happen here.

You basically give the same single reason in several different
wordings.

I see your point, but my experience with threads in Emacs is the
opposite: there seem to be too few thread switch opportunities, so
many programs don't work as expected unless one sprinkles sleep-for in
various places.  For example, try to run something in a non-main
thread and then attempt to interact with Emacs in the main thread.

IOW, it is quite easy for a Lisp program to hog the global lock and
starve the other threads, so each additional thread-switch opportunity
is IME a blessing, not a curse.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 15:44:01 GMT) Full text and rfc822 format available.

Message #29 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev>
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 11:43:21 -0400
[Message part 1 (text/plain, inline)]
On Fri, Aug 29, 2025, 11:41 AM Eli Zaretskii <eliz <at> gnu.org> wrote:

> > From: Spencer Baugh <sbaugh <at> janestreet.com>
> > Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> > Date: Fri, 29 Aug 2025 10:49:34 -0400
> >
> > Eli Zaretskii <eliz <at> gnu.org> writes:
> >
> > >> Primarily: Calling thread_select here complicates reasoning about the
> > >> code.  We should not release the lock when not necessary, because it
> > >> makes it harder to think about possible thread interleavings.  And
> it's
> > >> not necessary here because this pselect call returns immediately.
> > >
> > > How do we define "necessary" in this context?
> >
> > It's necessary to release the lock around long pselect calls because
> > otherwise other threads will not run while Emacs is waiting for input,
> > which means they probably won't run at all.
> >
> > >> (The only reason the pselect call wouldn't return immediately is if we
> > >> released the lock and then switched to another thread)
> > >
> > > You seem to assume that a call to thread_select is only necessary when
> > > some potentially prolonged operation or wait is expected?  But I'm not
> > > sure this is correct, because that could prevent other threads from
> > > running for too long, and force programmers to inject sleep-for
> > > etc. in their programs.  Allowing thread switches as frequently as
> > > possible gives other threads more opportunities to run (to some extent
> > > simulating preemptive scheduling), which I think is a Good Thing
> > > overall, no?
> >
> > Three points:
> >
> > - Most importantly: We can only allow thread switches at known safe
> >   points.  It's not clear that this is a safe point.  (Actually, we know
> >   it currently isn't, because of the status_notify issue.  But I suspect
> >   there may be other issues which are harder to analyze)
>
> This reason will be removed once we fix status_notify.
>
> > - Adding more thread switches is not necessarily a good thing.  It can
> >   degrade whole-system performance, because context switching is slow.
> >
> > - wait_reading_process_output already does a thread switch around the
> >   pselect, so it's not important for one to happen here.
>
> You basically give the same single reason in several different
> wordings.
>
> I see your point, but my experience with threads in Emacs is the
> opposite: there seem to be too few thread switch opportunities, so
> many programs don't work as expected unless one sprinkles sleep-for in
> various places.  For example, try to run something in a non-main
> thread and then attempt to interact with Emacs in the main thread.
>
> IOW, it is quite easy for a Lisp program to hog the global lock and
> starve the other threads, so each additional thread-switch opportunity
> is IME a blessing, not a curse.
>

Okay, so then how about we delete this unsafe thread_select and put some
thread-yield calls around wait_reading_process_output?

>
[Message part 2 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 15:58:02 GMT) Full text and rfc822 format available.

Message #32 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 18:57:33 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Date: Fri, 29 Aug 2025 11:43:21 -0400
> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev>
> 
> Okay, so then how about we delete this unsafe thread_select and put some thread-yield calls around
> wait_reading_process_output?

I don't object to not calling thread_select there, but can we have
tests that will show us the gains?

Also, I think we should fix the problems with status_notify and other
users of FOR_EACH_PROCESS before we make the change with
thread_select, because current code in status_notify is definitely
wrong, and affects the related behavior.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 16:12:01 GMT) Full text and rfc822 format available.

Message #35 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev>
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 12:11:02 -0400
[Message part 1 (text/plain, inline)]
On Fri, Aug 29, 2025, 11:57 AM Eli Zaretskii <eliz <at> gnu.org> wrote:

> > From: Spencer Baugh <sbaugh <at> janestreet.com>
> > Date: Fri, 29 Aug 2025 11:43:21 -0400
> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <
> dmitry <at> gutov.dev>
> >
> > Okay, so then how about we delete this unsafe thread_select and put some
> thread-yield calls around
> > wait_reading_process_output?
>
> I don't object to not calling thread_select there, but can we have
> tests that will show us the gains?
>

The primary gain is that it becomes easier to reason about locking.  I'm
not sure how to write a test for that :)

> Also, I think we should fix the problems with status_notify and other
> users of FOR_EACH_PROCESS before we make the change with
> thread_select, because current code in status_notify is definitely
> wrong, and affects the related behavior.
>

Yes, I'll work on a patch which does both.  (I'm not going to bother making
a change which doesn't rely on the change to remove the thread_select,
because this code is hard enough to reason about already)

>
[Message part 2 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 17:33:03 GMT) Full text and rfc822 format available.

Message #38 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 13:31:52 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> Cc: Paul Eggert <eggert <at> cs.ucla.edu>,  79334 <at> debbugs.gnu.org, Dmitry Gutov
>>   <dmitry <at> gutov.dev>
>> Date: Fri, 29 Aug 2025 09:07:50 -0400
>> 
>> Eli Zaretskii <eliz <at> gnu.org> writes:
>> > Which brings us to status_notify.  It indeed can close the descriptors
>> > of processes which terminated, so we should probably make it safer.
>> > For example, we could leave alone descriptors marked with non-NULL
>> > waiting_thread if that thread is still alive and is different from the
>> > current thread.  Would that fix the problem?  If yes, I think we
>> > should install such a change.
>> 
>> That was my initial idea for a fix as well.  But I'm not sure about the
>> specific details.  Namely, perhaps we need to be checking waiting_thread
>> everywhere that we're doing FOR_EACH_PROCESS, not just in status_notify.
>> And maybe some new functions/macros should be added to simplify this;
>> maybe a version of FOR_EACH_PROCESS which only iterates over processes
>> for which waiting_thread == current_thread.
>
> You may be right, we should audit all the users of FOR_EACH_PROCESS.
> At least one of them (get-buffer-process) is probably okay as it is,
> but others might need to avoid looking at processes locked to other
> threads, indeed.

While doing this, I had another thought...

Do you know if we have any code which handles the fact that
delete-process might be called on a process while another thread is
calling pselect on its fds?

I don't think we do, which suggests that might also be problematic.
I'll work on a fix which encompasses that as well.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 19:51:02 GMT) Full text and rfc822 format available.

Message #41 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Fri, 29 Aug 2025 15:50:00 -0400
[Message part 1 (text/plain, inline)]
Spencer Baugh <sbaugh <at> janestreet.com> writes:
> While doing this, I had another thought...
>
> Do you know if we have any code which handles the fact that
> delete-process might be called on a process while another thread is
> calling pselect on its fds?
>
> I don't think we do, which suggests that might also be problematic.
> I'll work on a fix which encompasses that as well.

OK, I think this patch fixes the issue.

Many different pieces of code in Emacs can close file descriptors.  We
need to be generically careful to not close file descriptors which some
other thread is currently waiting on.  In my patch, we do this by
letting the waiting_thread itself close those file descriptors, after it
returns from select.

Unfortunately I don't have a reliable test which produces the failure
case, but in this patch I added a fprintf when we close a deferred fd.
This prints whenever some thread would have closed an fd that another
thread was waiting on.  If I run the following program (with emacs -Q
--batch):

(defun my-break-thread ()
  (let ((proc (make-process
               :name "foo"
               :command '("sleep" "1"))))
    (while (process-live-p proc)
      (accept-process-output proc 0.05))))
(while t
  (make-thread #'my-break-thread "thread1")
  (thread-join (make-thread #'my-break-thread "thread2")))

I get about one print every 10 seconds.  Each print is a problem case
which has been averted, so I do think this patch fixes the bug.

(Obviously, we should remove the fprintf before applying the change)

[0001-Defer-closing-file-descriptors-used-by-other-threads.patch (text/x-patch, inline)]
From 4dda5e4fff3067cac53d6273ed3bd327437d00c8 Mon Sep 17 00:00:00 2001
From: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Fri, 29 Aug 2025 15:39:00 -0400
Subject: [PATCH] Defer closing file descriptors used by other threads

* src/process.c (other_thread_is_waiting_for): Add.
(clear_fd_callback_data): Don't clear WAITING_FOR_CLOSE_FD.
(close_process_fd): Set WAITING_FOR_CLOSE_FD for fds with a
waiting_thread.
(wait_reading_process_output): Close fds with
WAITING_FOR_CLOSE_FD. (bug#79334)
---
 src/process.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/src/process.c b/src/process.c
index b33bebb9a35..8273ba78095 100644
--- a/src/process.c
+++ b/src/process.c
@@ -448,7 +448,9 @@ make_lisp_proc (struct Lisp_Process *p)
   /* This descriptor refers to a process.  */
   PROCESS_FD = 8,
   /* A non-blocking connect.  Only valid if FOR_WRITE is set.  */
-  NON_BLOCKING_CONNECT_FD = 16
+  NON_BLOCKING_CONNECT_FD = 16,
+  /* This fd is waiting to be closed. */
+  WAITING_FOR_CLOSE_FD = 32,
 };
 
 static struct fd_callback_data
@@ -466,14 +468,30 @@ make_lisp_proc (struct Lisp_Process *p)
   struct thread_state *waiting_thread;
 } fd_callback_info[FD_SETSIZE];
 
+static bool
+other_thread_is_waiting_for (struct fd_callback_data* elem)
+{
+  return elem->waiting_thread != NULL && elem->waiting_thread != current_thread;
+}
+
 static void
 clear_fd_callback_data(struct fd_callback_data* elem)
 {
   elem->func = NULL;
   elem->data = NULL;
-  elem->flags = 0;
   elem->thread = NULL;
-  elem->waiting_thread = NULL;
+  if (other_thread_is_waiting_for (elem))
+    {
+      /* Preserve WAITING_FOR_CLOSE_FD; see process_close_fd.  Leaving
+	 flags non-zero means recompute_max_desc won't reduce max_desc
+	 past this element. */
+      elem->flags &= WAITING_FOR_CLOSE_FD;
+    }
+  else
+    {
+      elem->flags = 0;
+      elem->waiting_thread = NULL;
+    }
 }
 
 
@@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr)
   if (0 <= fd)
     {
       *fd_addr = -1;
-      emacs_close (fd);
+      if (other_thread_is_waiting_for (&fd_callback_info[fd]))
+	/* Let waiting_thread handle closing the fd. */
+	fd_callback_info[fd].flags = WAITING_FOR_CLOSE_FD;
+      else
+	emacs_close (fd);
     }
 }
 
@@ -5816,6 +5838,22 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
       /* Make C-g and alarm signals set flags again.  */
       clear_waiting_for_input ();
 
+      /* Close fds which other threads wanted to close.  */
+      for (int fd = 0; fd <= max_desc; ++fd)
+	{
+	  if (fd_callback_info[fd].waiting_thread == current_thread
+	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
+	    {
+	      fprintf (stderr, "closing deferred fd %d\n", fd);
+	      emacs_close (fd);
+	      /* Ignore any events which happened on this fd. */
+	      FD_CLR (fd, &Available);
+	      FD_CLR (fd, &Writeok);
+	      fd_callback_info[fd].flags = 0;
+	    }
+	}
+      recompute_max_desc ();
+
       /*  If we woke up due to SIGWINCH, actually change size now.  */
       do_pending_window_change (0);
 
-- 
2.43.7


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Fri, 29 Aug 2025 20:26:01 GMT) Full text and rfc822 format available.

Message #44 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Fri, 29 Aug 2025 16:25:39 -0400
[Message part 1 (text/plain, inline)]
Spencer Baugh <sbaugh <at> janestreet.com> writes:
> Spencer Baugh <sbaugh <at> janestreet.com> writes:
>> While doing this, I had another thought...
>>
>> Do you know if we have any code which handles the fact that
>> delete-process might be called on a process while another thread is
>> calling pselect on its fds?
>>
>> I don't think we do, which suggests that might also be problematic.
>> I'll work on a fix which encompasses that as well.
>
> OK, I think this patch fixes the issue.
>
> Many different pieces of code in Emacs can close file descriptors.  We
> need to be generically careful to not close file descriptors which some
> other thread is currently waiting on.  In my patch, we do this by
> letting the waiting_thread itself close those file descriptors, after it
> returns from select.
>
> Unfortunately I don't have a reliable test which produces the failure
> case, but in this patch I added a fprintf when we close a deferred fd.
> This prints whenever some thread would have closed an fd that another
> thread was waiting on.  If I run the following program (with emacs -Q
> --batch):
>
> (defun my-break-thread ()
>   (let ((proc (make-process
>                :name "foo"
>                :command '("sleep" "1"))))
>     (while (process-live-p proc)
>       (accept-process-output proc 0.05))))
> (while t
>   (make-thread #'my-break-thread "thread1")
>   (thread-join (make-thread #'my-break-thread "thread2")))
>
> I get about one print every 10 seconds.  Each print is a problem case
> which has been averted, so I do think this patch fixes the bug.
>
> (Obviously, we should remove the fprintf before applying the change)

Of course, I immediately found a bug in my patch (I wasn't clearing
waiting_thread after closing a deferred fd).  Fixed in the attached.

In related news: running process-tests.el in a loop also triggers my
"Closing deferred fd" message.  Which suggests that src/process-tests.el
was probably also suffering from bugs in this area, which could have
been causing hangs or flaky failures.  Which should now hopefully be
fixed.

[0001-Defer-closing-file-descriptors-used-by-other-threads.patch (text/x-patch, inline)]
From 9e46aa50f5de7d719a6e427fe44bb6b8b8e226fc Mon Sep 17 00:00:00 2001
From: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Fri, 29 Aug 2025 15:39:00 -0400
Subject: [PATCH] Defer closing file descriptors used by other threads

* src/process.c (other_thread_is_waiting_for): Add.
(clear_fd_callback_data): Don't clear WAITING_FOR_CLOSE_FD.
(close_process_fd): Set WAITING_FOR_CLOSE_FD for fds with a
waiting_thread.
(wait_reading_process_output): Close fds with
WAITING_FOR_CLOSE_FD. (bug#79334)
---
 src/process.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/src/process.c b/src/process.c
index b33bebb9a35..9180b92c61f 100644
--- a/src/process.c
+++ b/src/process.c
@@ -448,7 +448,9 @@ make_lisp_proc (struct Lisp_Process *p)
   /* This descriptor refers to a process.  */
   PROCESS_FD = 8,
   /* A non-blocking connect.  Only valid if FOR_WRITE is set.  */
-  NON_BLOCKING_CONNECT_FD = 16
+  NON_BLOCKING_CONNECT_FD = 16,
+  /* This fd is waiting to be closed. */
+  WAITING_FOR_CLOSE_FD = 32,
 };
 
 static struct fd_callback_data
@@ -466,14 +468,30 @@ make_lisp_proc (struct Lisp_Process *p)
   struct thread_state *waiting_thread;
 } fd_callback_info[FD_SETSIZE];
 
+static bool
+other_thread_is_waiting_for (struct fd_callback_data* elem)
+{
+  return elem->waiting_thread != NULL && elem->waiting_thread != current_thread;
+}
+
 static void
 clear_fd_callback_data(struct fd_callback_data* elem)
 {
   elem->func = NULL;
   elem->data = NULL;
-  elem->flags = 0;
   elem->thread = NULL;
-  elem->waiting_thread = NULL;
+  if (other_thread_is_waiting_for (elem))
+    {
+      /* Preserve WAITING_FOR_CLOSE_FD; see process_close_fd.  Leaving
+	 flags non-zero means recompute_max_desc won't reduce max_desc
+	 past this element. */
+      elem->flags &= WAITING_FOR_CLOSE_FD;
+    }
+  else
+    {
+      elem->flags = 0;
+      elem->waiting_thread = NULL;
+    }
 }
 
 
@@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr)
   if (0 <= fd)
     {
       *fd_addr = -1;
-      emacs_close (fd);
+      if (other_thread_is_waiting_for (&fd_callback_info[fd]))
+	/* Let waiting_thread handle closing the fd. */
+	fd_callback_info[fd].flags = WAITING_FOR_CLOSE_FD;
+      else
+	emacs_close (fd);
     }
 }
 
@@ -5816,6 +5838,22 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
       /* Make C-g and alarm signals set flags again.  */
       clear_waiting_for_input ();
 
+      /* Close fds which other threads wanted to close.  */
+      for (int fd = 0; fd <= max_desc; ++fd)
+	{
+	  if (fd_callback_info[fd].waiting_thread == current_thread
+	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
+	    {
+	      fprintf (stderr, "closing deferred fd %d\n", fd);
+	      emacs_close (fd);
+	      /* Ignore any events which happened on this fd. */
+	      FD_CLR (fd, &Available);
+	      FD_CLR (fd, &Writeok);
+	      clear_fd_callback_data (&fd_callback_info[fd]);
+	    }
+	}
+      recompute_max_desc ();
+
       /*  If we woke up due to SIGWINCH, actually change size now.  */
       do_pending_window_change (0);
 
-- 
2.43.7


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Sat, 30 Aug 2025 06:19:01 GMT) Full text and rfc822 format available.

Message #47 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Sat, 30 Aug 2025 09:18:12 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Fri, 29 Aug 2025 15:50:00 -0400
> 
> Many different pieces of code in Emacs can close file descriptors.  We
> need to be generically careful to not close file descriptors which some
> other thread is currently waiting on.  In my patch, we do this by
> letting the waiting_thread itself close those file descriptors, after it
> returns from select.

Not literally "when returning from pselect", I hope, but when the
process is being deleted _and_ the waiting thread returns from
pselect, right?  IOW, we don't close the descriptor each time pselect
returns in the right thread, then reopen it before the next call to
pselect, because that won't work.

> +static bool
> +other_thread_is_waiting_for (struct fd_callback_data* elem)
> +{
> +  return elem->waiting_thread != NULL && elem->waiting_thread != current_thread;
> +}

This should also check that the waiting thread is still alive.  If it
isn't, it's okay to close the descriptor.  Otherwise, descriptors
belonging to threads that exited might never be closed.

> @@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr)
>    if (0 <= fd)
>      {
>        *fd_addr = -1;
> -      emacs_close (fd);
> +      if (other_thread_is_waiting_for (&fd_callback_info[fd]))
> +	/* Let waiting_thread handle closing the fd. */
                                                   ^^
Style: 2 spaces between the period and "*/".

> +      /* Close fds which other threads wanted to close.  */
> +      for (int fd = 0; fd <= max_desc; ++fd)
> +	{
> +	  if (fd_callback_info[fd].waiting_thread == current_thread
> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
> +	    {
> +	      fprintf (stderr, "closing deferred fd %d\n", fd);
> +	      emacs_close (fd);
> +	      /* Ignore any events which happened on this fd. */
> +	      FD_CLR (fd, &Available);
> +	      FD_CLR (fd, &Writeok);
> +	      fd_callback_info[fd].flags = 0;

Why do it here and not when we deactivate processes?  Until now we
never closed any descriptors inside wait_reading_process_output.  More
generally, why this design and not a simpler change which only
deactivates processes whose I/O is waited by the current thread, as
discussed previously?  I think this alternative would be easier to
understand and reason about, because (as established in another
discussion) wait_reading_process_output frequently loops more than we
expect it to.

Looking at all the callers of close_process_fd, it sounds like it's
too low-level to do this.  For example, it is called from
create_process (including in the forked child) to close the unneeded
ends of the pipe, where we probably don't want this.  And
clear_fd_callback_data is also used for the keyboard input descriptor,
which is probably also not relevant.  So I wonder whether it's clean
to have this logic at such a low level, where the purpose of the
descriptor is barely known, and not at a higher level, where we know
what each descriptor is used for, and therefore the purpose and the
intent of the logic is much more clear.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Sat, 30 Aug 2025 06:37:02 GMT) Full text and rfc822 format available.

Message #50 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Sat, 30 Aug 2025 09:35:42 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Fri, 29 Aug 2025 16:25:39 -0400
> 
> In related news: running process-tests.el in a loop also triggers my
> "Closing deferred fd" message.  Which suggests that src/process-tests.el
> was probably also suffering from bugs in this area, which could have
> been causing hangs or flaky failures.  Which should now hopefully be
> fixed.

process-tests uses a questionable setup, apart from being based on
non-portable assumptions.  At the time, I was unable to convince the
author of the tests to spell out the assumptions and expectations about
what should happen in those situations, without which it was
impossible to decide whether some failures on MS-Windows are bugs or
just non-portable code that cannot work on Windows.  Some of those
tests always fail for me on Windows.

> +      for (int fd = 0; fd <= max_desc; ++fd)
> +	{
> +	  if (fd_callback_info[fd].waiting_thread == current_thread
> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
                                             ^^^^^^^^^^^^^^^^^^^^^^^
Shouldn't this test that the WAITING_FOR_CLOSE_FD bit is set, not that
it is the _only_ bit set?  Even if the code is correct today, it might
not be future-proof if we add some additional flags at some point.

> +	      /* Ignore any events which happened on this fd. */
> +	      FD_CLR (fd, &Available);
> +	      FD_CLR (fd, &Writeok);

Is this wise?  How can we be sure that these events were already
handled (presumably, in another thread)?  If they were not, we will
lose events.  What will happen if we don't ignore them here?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Sat, 30 Aug 2025 15:56:02 GMT) Full text and rfc822 format available.

Message #53 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Sat, 30 Aug 2025 11:55:27 -0400
[Message part 1 (text/plain, inline)]
(I'm responding to both your emails with review feedback here)

Eli Zaretskii <eliz <at> gnu.org> writes:
>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
>> Date: Fri, 29 Aug 2025 16:25:39 -0400
>> +      for (int fd = 0; fd <= max_desc; ++fd)
>> +	{
>> +	  if (fd_callback_info[fd].waiting_thread == current_thread
>> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
>                                              ^^^^^^^^^^^^^^^^^^^^^^^
> Shouldn't this test that the WAITING_FOR_CLOSE_FD bit is set, not that
> it is the _only_ bit set?  Even if the code is correct today, it might
> not be future-proof if we add some additional flags at some point.

True.  Fixed.

>> +	      /* Ignore any events which happened on this fd. */
>> +	      FD_CLR (fd, &Available);
>> +	      FD_CLR (fd, &Writeok);
>
> Is this wise?  How can we be sure that these events were already
> handled (presumably, in another thread)?  If they were not, we will
> lose events.  What will happen if we don't ignore them here?

Any events which might appear at this point were already events we were
going to lose by closing this file descriptor.  For example, if a
process produces output just as we do delete-process on it, we never
read that output.  The only change is that we now have to explicitly
ignore those events; before, we silently dropped them at the time we
closed the file descriptor.

>> Many different pieces of code in Emacs can close file descriptors.  We
>> need to be generically careful to not close file descriptors which some
>> other thread is currently waiting on.  In my patch, we do this by
>> letting the waiting_thread itself close those file descriptors, after it
>> returns from select.
>
> Not literally "when returning from pselect", I hope, but when the
> process is being deleted _and_ the waiting thread returns from
> pselect, right?  IOW, we don't close the descriptor each time pselecft
> returns in the right thread, then reopen it before the next call to
> pselect, because that won't work.

Right.

>> +static bool
>> +other_thread_is_waiting_for (struct fd_callback_data* elem)
>> +{
>> +  return elem->waiting_thread != NULL && elem->waiting_thread != current_thread;
>> +}
>
> This should also check that the waiting thread is still alive.  If it
> isn't, it's okay to close the descriptor.  Otherwise, descriptors
> belonging to threads that exited might never be closed.

waiting_thread always points to a living thread, because it's only set
while a thread is inside wait_reading_process_output.

(In fact, when a thread sees waiting_thread set to a value which is
neither NULL nor current_thread, it can only ever point to a thread
which is currently inside thread_select)

>> @@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr)
>>    if (0 <= fd)
>>      {
>>        *fd_addr = -1;
>> -      emacs_close (fd);
>> +      if (other_thread_is_waiting_for (&fd_callback_info[fd]))
>> +	/* Let waiting_thread handle closing the fd. */
>                                                    ^^
> Style: 2 spaces between the period and "*/".

Fixed.

>> +      /* Close fds which other threads wanted to close.  */
>> +      for (int fd = 0; fd <= max_desc; ++fd)
>> +	{
>> +	  if (fd_callback_info[fd].waiting_thread == current_thread
>> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
>> +	    {
>> +	      fprintf (stderr, "closing deferred fd %d\n", fd);
>> +	      emacs_close (fd);
>> +	      /* Ignore any events which happened on this fd. */
>> +	      FD_CLR (fd, &Available);
>> +	      FD_CLR (fd, &Writeok);
>> +	      fd_callback_info[fd].flags = 0;
>
> Why do it here and not when we deactivate processes?

Because that would close file descriptors that another thread is
currently selecting on, causing EBADF.

> Until now we never closed any descriptors inside
> wait_reading_process_output.

Just to be clear, we can call deactivate_process from
wait_reading_process_output, through status_notify, through process
filters, or other similar paths.  That closes descriptors.

> More generally, why this design and not a simpler change which only
> deactivates processes whose I/O is waited by the current thread, as
> discussed previously?

I looked into that, but I determined that it's too complicated.  It
would require us to keep track of a new kind of process state: a process
which has been deleted, but hasn't been deactivated yet.  This would
need to be supported for each kind of process, and we'd need to record
new kinds of state to finish the deactivation later.

It's simpler to keep the ability to delete any process at any time, and
have that fully deactivate the process as it does today.  Doing that
only causes one issue: if another thread is selecting on the process's
fds, that thread will break if it tries to handle events on the fds, or
the thread will simply get EBADF from select.  So we detect that by
checking waiting_thread, and we mark the fds so that the other thread
knows not to touch them and instead just closes them to finish
deleting the process.

> I think this alternative would be easier to understand and reason
> about, because (as established in another discussion)
> wait_reading_process_output frequently loops more than we expect it
> to.

I'm not sure what you mean by that; why would wait_reading_process_output
looping more than we expect cause issues?  I don't think it would.

> Looking at all the callers of close_process_fd, it sounds like it's
> too low-level to do this.  For example, it is called from
> create_process (including in the forked child) to close the unneeded
> ends of the pipe, where we probably don't want this.  And
> clear_fd_callback_data is also used for the keyboard input descriptor,
> which is probably also not relevant.

True.

I looked into fixing these, and that suggested to me an altogether
simpler approach.  We can keep the logic entirely contained in
wait_reading_process_output, just adding a bit of code around the select
call.

And we don't need to defer deleting the fds at all: just recognize when
it's happened, and be careful to not touch those fds afterwards.

Attached.

[0001-Ignore-descriptors-deleted-during-select.patch (text/x-patch, inline)]
From b37e71f47913e1011c4a7857e166e78eddc2a184 Mon Sep 17 00:00:00 2001
From: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Sat, 30 Aug 2025 11:42:19 -0400
Subject: [PATCH] Ignore descriptors deleted during select

* src/process.c (wait_reading_process_output): Save the fds
we're waiting on and check that we're still waiting after
select. (bug#79334)
---
 src/process.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/src/process.c b/src/process.c
index 1612248a11d..11af2f55f63 100644
--- a/src/process.c
+++ b/src/process.c
@@ -5643,6 +5643,7 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
 	  nfds = read_kbd ? 0 : 1;
 	  no_avail = 1;
 	  FD_ZERO (&Available);
+	  xerrno = 0;
 	}
       else
 	{
@@ -5756,6 +5757,18 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
 	      && timespec_cmp (short_timeout, timeout) < 0)
 	    timeout = short_timeout;
 #endif
+      /* Save the fds we're currently waiting_thread for.  */
+      fd_set waiting_fds;
+      FD_ZERO (&waiting_fds);
+      int max_waiting_fds = 0;
+      for (int fd = 0; fd <= max_desc; ++fd)
+        {
+	  if (fd_callback_info[fd].waiting_thread == current_thread)
+	    {
+	      FD_SET (fd, &waiting_fds);
+	      max_waiting_fds = fd;
+	    }
+	}
 
 	  /* Android requires using a replacement for pselect in
 	     android.c to poll for events.  */
@@ -5809,10 +5822,30 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd,
 		    FD_SET (channel, &Available);
 	    }
 #endif
+	  xerrno = errno;
+	  /* Check if any of waiting_fds was closed by another thread or
+	     a signal handler. */
+	  for (int fd = 0; fd <= max_waiting_fds; ++fd)
+	    {
+	      if (FD_ISSET (fd, &waiting_fds)
+		  && fd_callback_info[fd].waiting_thread != current_thread)
+		{
+		  /* Ignore any events which we happened to see on this
+		     fd before it was closed; we can't do anything for
+		     it now, since it's closed.  */
+		  FD_CLR (fd, &Available);
+		  FD_CLR (fd, &Writeok);
+		  /* If we got an EBADF and no events, this is probably
+		     why; clear it so we can try again.  */
+		  if (nfds < 0 && xerrno == EBADF)
+		    {
+		      nfds = 0;
+		      xerrno = 0;
+		    }
+		}
+	    }
 	}
 
-      xerrno = errno;
-
       /* Make C-g and alarm signals set flags again.  */
       clear_waiting_for_input ();
 
-- 
2.43.7


Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Sat, 30 Aug 2025 16:37:02 GMT) Full text and rfc822 format available.

Message #56 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Sat, 30 Aug 2025 19:35:39 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Sat, 30 Aug 2025 11:55:27 -0400
> 
> >> +	      /* Ignore any events which happened on this fd. */
> >> +	      FD_CLR (fd, &Available);
> >> +	      FD_CLR (fd, &Writeok);
> >
> > Is this wise?  How can we be sure that these events were already
> > handled (presumably, in another thread)?  If they were not, we will
> > lose events.  What will happen if we don't ignore them here?
> 
> Any events which might appear at this point were already events we were
> going to lose by closing this file descriptor.  For example, if a
> process produces output just as we do delete-process on it, we never
> read that output.  The only change is that we now have to explicitly
> ignore those events; before, we silently dropped them at the time we
> closed the file descriptor.

Where did we close it?  There's this loop _after_ the place where you
add your loop which clears the bits in Available:

      for (channel = 0; channel <= max_desc; ++channel)
        {
          struct fd_callback_data *d = &fd_callback_info[channel];
          if (d->func
	      && ((d->flags & FOR_READ
		   && FD_ISSET (channel, &Available))
		  || ((d->flags & FOR_WRITE)
		      && FD_ISSET (channel, &Writeok))))
            d->func (channel, d->data);
	}

AFAIU, it would notice the set bits and act upon them.  Now with this
new addition we are clearing them ahead of that loop.  That doesn't
necessarily sound right to me.

> >> +static bool
> >> +other_thread_is_waiting_for (struct fd_callback_data* elem)
> >> +{
> >> +  return elem->waiting_thread != NULL && elem->waiting_thread != current_thread;
> >> +}
> >
> > This should also check that the waiting thread is still alive.  If it
> > isn't, it's okay to close the descriptor.  Otherwise, descriptors
> > belonging to threads that exited might never be closed.
> 
> waiting_thread always points to a living thread, because it's only set
> while a thread is inside wait_reading_process_output.
> 
> (In fact, when a thread sees waiting_thread set to a value which is
> neither NULL nor current_thread, it can only ever point to a thread
> which is currently inside thread_select)

You assume that the waiting_thread field will be cleared when a thread
dies, but that is not guaranteed.  Why isn't it better to be defensive
and make sure the thread is still alive?  If you are right, that test
will always be true, but what if you are wrong in some rare cases?

> >> +	  if (fd_callback_info[fd].waiting_thread == current_thread
> >> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
> >> +	    {
> >> +	      fprintf (stderr, "closing deferred fd %d\n", fd);
> >> +	      emacs_close (fd);
> >> +	      /* Ignore any events which happened on this fd. */
> >> +	      FD_CLR (fd, &Available);
> >> +	      FD_CLR (fd, &Writeok);
> >> +	      fd_callback_info[fd].flags = 0;
> >
> > Why do it here and not when we deactivate processes?
> 
> Because that would close file descriptors that another thread is
> currently selecting on, causing EBADF.

No, I mean to deactivate only the processes whose I/O channels are
locked to the current thread.

> > Until now we never closed any descriptors inside
> > wait_reading_process_output.
> 
> Just to be clear, we can call deactivate_process from
> wait_reading_process_output, through status_notify, through process
> filters, or other similar.  That closes descriptors.

Yes, and so I suggest doing this logic of deactivating only the
processes that are locked to the current thread there.

> > More generally, why this design and not a simpler change which only
> > deactivates processes whose I/O is waited by the current thread, as
> > discussed previously?
> 
> I looked into that, but I determined that it's too complicated.  It
> would require us to keep track of a new kind of process state: a process
> which has been deleted, but hasn't been deactivated yet.

What I had in mind was neither delete it nor deactivate it.  IOW, the
loop which goes through the list of processes should only pay
attention to processes locked to this thread and those that are not
locked to any live thread.  It should leave the other processes to
when their threads call status_notify.

> > Looking at all the callers of close_process_fd, it sounds like it's
> > too low-level to do this.  For example, it is called from
> > create_process (including in the forked child) to close the unneeded
> > ends of the pipe, where we probably don't want this.  And
> > clear_fd_callback_data is also used for the keyboard input descriptor,
> > which is probably also not relevant.
> 
> True.
> 
> I looked into fixing these, and that suggested to me an altogether
> simpler approach.  We can keep the logic entirely contained in
> wait_reading_process_output, just adding a bit of code around the select
> call.
> 
> And we don't need to defer deleting the fds at all: just recognize when
> it's happened, and be careful to not touch those fds afterwards.

Hmm... I don't understand: why do we need to do this?  The calls to
compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
set bits in the Available mask passed to pselect already avoid
setting bits for descriptors on which some other thread waits.  So
pselect cannot possibly indicate any output ready on any descriptors
belonging to other threads, and this loop you suggest to add:

> +      /* Save the fds we're currently waiting_thread for.  */
> +      fd_set waiting_fds;
> +      FD_ZERO (&waiting_fds);
> +      int max_waiting_fds = 0;
> +      for (int fd = 0; fd <= max_desc; ++fd)
> +        {
> +	  if (fd_callback_info[fd].waiting_thread == current_thread)
> +	    {
> +	      FD_SET (fd, &waiting_fds);
> +	      max_waiting_fds = fd;
> +	    }
> +	}

should be unnecessary.  Or what am I missing?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Sat, 30 Aug 2025 18:40:02 GMT) Full text and rfc822 format available.

Message #59 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: sbaugh <at> janestreet.com
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, 
 [PATCH] Defer closing file descriptors used by other threads
Date: Sat, 30 Aug 2025 21:38:52 +0300
> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
> Date: Sat, 30 Aug 2025 19:35:39 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> 
> > >> +	  if (fd_callback_info[fd].waiting_thread == current_thread
> > >> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
> > >> +	    {
> > >> +	      fprintf (stderr, "closing deferred fd %d\n", fd);
> > >> +	      emacs_close (fd);
> > >> +	      /* Ignore any events which happened on this fd. */
> > >> +	      FD_CLR (fd, &Available);
> > >> +	      FD_CLR (fd, &Writeok);
> > >> +	      fd_callback_info[fd].flags = 0;
> > >
> > > Why do it here and not when we deactivate processes?
> > 
> > Because that would close file descriptors that another thread is
> > currently selecting on, causing EBADF.
> 
> No, I mean to deactivate only the processes whose I/O channels are
> locked to the current thread.
> 
> > > Until now we never closed any descriptors inside
> > > wait_reading_process_output.
> > 
> > Just to be clear, we can call deactivate_process from
> > wait_reading_process_output, through status_notify, through process
> > filters, or other similar.  That closes descriptors.
> 
> Yes, and so I suggest doing this logic of deactivating only the
> processes that are locked to the current thread there.

Btw, as a nice bonus of that approach, the process sentinel will run in
the context of the thread to which the process is locked.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Sun, 31 Aug 2025 03:21:02 GMT) Full text and rfc822 format available.

Message #62 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dmitry <at> gutov.dev>
To: Spencer Baugh <sbaugh <at> janestreet.com>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Sun, 31 Aug 2025 06:20:24 +0300
On 29/08/2025 19:11, Spencer Baugh wrote:
> On Fri, Aug 29, 2025, 11:57 AM Eli Zaretskii <eliz <at> gnu.org 
> <mailto:eliz <at> gnu.org>> wrote:
> 
>      > From: Spencer Baugh <sbaugh <at> janestreet.com
>     <mailto:sbaugh <at> janestreet.com>>
>      > Date: Fri, 29 Aug 2025 11:43:21 -0400
>      > Cc: 79334 <at> debbugs.gnu.org <mailto:79334 <at> debbugs.gnu.org>,
>     eggert <at> cs.ucla.edu <mailto:eggert <at> cs.ucla.edu>, Dmitry Gutov
>     <dmitry <at> gutov.dev <mailto:dmitry <at> gutov.dev>>
>      >
>      > Okay, so then how about we delete this unsafe thread_select and
>     put some thread-yield calls around
>      > wait_reading_process_output?
> 
>     I'm not objected to not calling thread_select there, but can we have
>     tests that will show us the gains?
> 
> 
> The primary gain is that it becomes easier to reason about locking.  I'm 
> not sure how to write a test for that :)

I think what could help in this area is some benchmarking scenario, where
the goal would be to show that performance doesn't regress (or regresses
only within the margin of error) when 'pselect' is used directly.

A script which would create a bunch of threads and some number of
processes spawned by them, with characteristics chosen depending on what
workload the switching in that thread_select could help optimize.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Mon, 01 Sep 2025 00:41:02 GMT) Full text and rfc822 format available.

Message #65 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dmitry <at> gutov.dev>
To: Eli Zaretskii <eliz <at> gnu.org>, Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu
Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily
Date: Mon, 1 Sep 2025 03:39:50 +0300
On 29/08/2025 18:41, Eli Zaretskii wrote:
> For example, try to run something in a non-main
> thread and then attempt to interact with Emacs in the main thread.

Could that be code that does straight computation and does not switch 
threads at all, unless sleep-for is added?

Perhaps relatedly, diff-hl in "threaded" mode has had much better
interactivity since switching to 'make-process' from 'call-process' for
its external program interaction.  But you're probably thinking of a
different kind of example.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Mon, 01 Sep 2025 16:19:02 GMT) Full text and rfc822 format available.

Message #68 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Mon, 01 Sep 2025 12:18:37 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:
>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> >> +static bool
>> >> +other_thread_is_waiting_for (struct fd_callback_data* elem)
>> >> +{
>> >> +  return elem->waiting_thread != NULL && elem->waiting_thread != current_thread;
>> >> +}
>> >
>> > This should also check that the waiting thread is still alive.  If it
>> > isn't, it's okay to close the descriptor.  Otherwise, descriptors
>> > belonging to threads that exited might never be closed.
>> 
>> waiting_thread always points to a living thread, because it's only set
>> while a thread is inside wait_reading_process_output.
>> 
>> (In fact, when a thread sees waiting_thread set to a value which is
>> neither NULL nor current_thread, it can only ever point to a thread
>> which is currently inside thread_select)
>
> You assume that the waiting_thread field will be cleared when a thread
> dies, but that is not guaranteed.  Why isn't it better to be defensive
> and make sure the thread is still alive?  If you are right, that test
> will always be true, but what if you are wrong in some rare cases?

Fair enough; the patch has changed a lot so this
other_thread_is_waiting_for function doesn't exist anymore, but if I
re-add it or similar checks, I will check that the thread is alive.

>> >> +	  if (fd_callback_info[fd].waiting_thread == current_thread
>> >> +	      && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD))
>> >> +	    {
>> >> +	      fprintf (stderr, "closing deferred fd %d\n", fd);
>> >> +	      emacs_close (fd);
>> >> +	      /* Ignore any events which happened on this fd. */
>> >> +	      FD_CLR (fd, &Available);
>> >> +	      FD_CLR (fd, &Writeok);
>> >> +	      fd_callback_info[fd].flags = 0;
>> >
>> > Why do it here and not when we deactivate processes?
>> 
>> Because that would close file descriptors that another thread is
>> currently selecting on, causing EBADF.
>
> No, I mean to deactivate only the processes whose I/O channels are
> locked to the current thread.
>
>> > Until now we never closed any descriptors inside
>> > wait_reading_process_output.
>> 
>> Just to be clear, we can call deactivate_process from
>> wait_reading_process_output, through status_notify, through process
>> filters, or other similar.  That closes descriptors.
>
> Yes, and so I suggest doing this logic of deactivating only the
> processes that are locked to the current thread there.
>
>> > More generally, why this design and not a simpler change which only
>> > deactivates processes whose I/O is waited by the current thread, as
>> > discussed previously?
>> 
>> I looked into that, but I determined that it's too complicated.  It
>> would require us to keep track of a new kind of process state: a process
>> which has been deleted, but hasn't been deactivated yet.
>
> What I had in mind was neither delete it nor deactivate it.  IOW, the
> loop which goes through the list of processes should only pay
> attention to processes locked to this thread and those that are not
> locked to any live thread.  It should leave the other processes to
> when their threads call status_notify.

There are too many ways in which we delete or deactivate processes to do
that.  In particular, Lisp programs should be able to call
delete-process on a process without getting an error just because some
thread is currently calling select and happens to be monitoring that
process.  To emit such an error would make programming with processes
and threads extremely difficult.

>> > Looking at all the callers of close_process_fd, it sounds like it's
>> > too low-level to do this.  For example, it is called from
>> > create_process (including in the forked child) to close the unneeded
>> > ends of the pipe, where we probably don't want this.  And
>> > clear_fd_callback_data is also used for the keyboard input descriptor,
>> > which is probably also not relevant.
>> 
>> True.
>> 
>> I looked into fixing these, and that suggested to me an altogether
>> simpler approach.  We can keep the logic entirely contained in
>> wait_reading_process_output, just adding a bit of code around the select
>> call.
>> 
>> And we don't need to defer deleting the fds at all: just recognize when
>> it's happened, and be careful to not touch those fds afterwards.
>
> Hmm... I don't understand: why do we need to do this?  The calls to
> compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
> set bits in the Available mask passed top pselect already avoid
> setting bits for descriptors on which some other thread waits.  So
> pselect cannot possibly indicate any output ready on any descriptors
> belonging to other threads, and this loop you suggest to add:
>
>> +      /* Save the fds we're currently waiting_thread for.  */
>> +      fd_set waiting_fds;
>> +      FD_ZERO (&waiting_fds);
>> +      int max_waiting_fds = 0;
>> +      for (int fd = 0; fd <= max_desc; ++fd)
>> +        {
>> +	  if (fd_callback_info[fd].waiting_thread == current_thread)
>> +	    {
>> +	      FD_SET (fd, &waiting_fds);
>> +	      max_waiting_fds = fd;
>> +	    }
>> +	}
>
> should be unnecessary.  Or what am I missing?

That's exactly the problem.  A file descriptor for which we set
waiting_thread can have delete_read_fd or delete_write_fd called on it
by another thread, or by a signal handler, while we're waiting in
pselect.

We need to catch the fact that the file descriptor has been removed from
the set of file descriptors we're supposed to monitor, and avoid
reacting to events on it.  We can catch this easily by seeing that
waiting_thread has been cleared or changed while we were blocked in
select.

BTW, the fact that this problematic case is also possible via signal
handlers (namely the SIGCHLD handler) is a separate bug introduced by
the refactoring which added fd_callback_info.  It means that with a
certain interleaving, I think it should be possible to get buggy
behavior by just using processes without threads.  It's quite rare,
though, and I haven't been able to figure out the exact path that would
cause a problem.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Mon, 01 Sep 2025 19:26:02 GMT) Full text and rfc822 format available.

Message #71 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Mon, 01 Sep 2025 22:25:26 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Mon, 01 Sep 2025 12:18:37 -0400
> 
> > You assume that the waiting_thread field will be cleared when a thread
> > dies, but that is not guaranteed.  Why isn't it better to be defensive
> > and make sure the thread is still alive?  If you are right, that test
> > will always be true, but what if you are wrong in some rare cases?
> 
> Fair enough; the patch has changed a lot so this
> other_thread_is_waiting_for function doesn't exist anymore, but if I
> re-add it or similar checks, I will check that the thread is alive.

Thanks.

> > What I had in mind was neither delete it nor deactivate it.  IOW, the
> > loop which goes through the list of processes should only pay
> > attention to processes locked to this thread and those that are not
> > locked to any live thread.  It should leave the other processes to
> > when their threads call status_notify.
> 
> There are too many ways in which we delete or deactivate processes to do
> that.

We should audit all of them and prevent that from happening when the
process is locked to another thread.

> In particular, Lisp programs should be able to call
> delete-process on a process without getting an error just because some
> thread is currently calling select and happens to be monitoring that
> process.  To emit such an error would make programming with processes
> and threads extremely difficult.

This is a problem two orders of magnitude smaller than what we are
trying to fix here.  So my suggestion is to leave it for later.
There's no special reason why Lisp programs "should be able" to delete
a process locked to another thread.  If we don't want to signal an
error (though I don't see why not), we could perhaps do nothing
silently, until we get to dealing with that later.

For now, sabotaging threads that wait for some sub-process is a much
more grave problem, and we must fix it if we want threads and
processes to live in peace in Emacs.

> > Hmm... I don't understand: why do we need to do this?  The calls to
> > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
> > set bits in the Available mask passed top pselect already avoid
> > setting bits for descriptors on which some other thread waits.  So
> > pselect cannot possibly indicate any output ready on any descriptors
> > belonging to other threads, and this loop you suggest to add:
> >
> >> +      /* Save the fds we're currently waiting_thread for.  */
> >> +      fd_set waiting_fds;
> >> +      FD_ZERO (&waiting_fds);
> >> +      int max_waiting_fds = 0;
> >> +      for (int fd = 0; fd <= max_desc; ++fd)
> >> +        {
> >> +	  if (fd_callback_info[fd].waiting_thread == current_thread)
> >> +	    {
> >> +	      FD_SET (fd, &waiting_fds);
> >> +	      max_waiting_fds = fd;
> >> +	    }
> >> +	}
> >
> > should be unnecessary.  Or what am I missing?
> 
> That's exactly the problem.  A file descriptor for which we set
> waiting_thread can have delete_read_fd or delete_write_fd called on it
> by another thread, or by a signal handler, while we're waiting in
> pselect.

We need to prevent that from happening, not to go along with it.  But
anyway, I don't see why you need another fd_set for this.  Why not
just look at the descriptors in the Available set?

> BTW, the fact that this problematic case is also possible via signal
> handlers (namely the SIGCHLD handler) is a separate bug introduced by
> the refactoring which added fd_callback_info.  It means that with a
> certain interleaving, I think it should be possible to get buggy
> behavior by just using processes without threads.  It's quite rare
> though and I haven't been able to figure out what exact path that would
> cause a problem.

Let's not attempt to solve too many problems at the same time.  How to
make SIGCHLD safer when threads are present is an outstanding issue,
and the solution will IMO depend on whether signals are delivered to
the main thread and whether or not the build uses the child signal
pipe.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Mon, 01 Sep 2025 19:41:01 GMT) Full text and rfc822 format available.

Message #74 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev>
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other threads
Date: Mon, 1 Sep 2025 15:40:37 -0400
[Message part 1 (text/plain, inline)]
On Mon, Sep 1, 2025, 3:25 PM Eli Zaretskii <eliz <at> gnu.org> wrote:

> > From: Spencer Baugh <sbaugh <at> janestreet.com>
> > Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> > Date: Mon, 01 Sep 2025 12:18:37 -0400
> >
> > > You assume that the waiting_thread field will be cleared when a thread
> > > dies, but that is not guaranteed.  Why isn't it better to be defensive
> > > and make sure the thread is still alive?  If you are right, that test
> > > will always be true, but what if you are wrong in some rare cases?
> >
> > Fair enough; the patch has changed a lot so this
> > other_thread_is_waiting_for function doesn't exist anymore, but if I
> > re-add it or similar checks, I will check that the thread is alive.
>
> Thanks.
>
> > > What I had in mind was neither delete it nor deactivate it.  IOW, the
> > > loop which goes through the list of processes should only pay
> > > attention to processes locked to this thread and those that are not
> > > locked to any live thread.  It should leave the other processes to
> > > when their threads call status_notify.
> >
> > There are too many ways in which we delete or deactivate processes to do
> > that.
>
> We should audit all of them and prevent that from happening when the
> process is locked to another thread.
>
> > In particular, Lisp programs should be able to call
> > delete-process on a process without getting an error just because some
> > thread is currently calling select and happens to be monitoring that
> > process.  To emit such an error would make programming with processes
> > and threads extremely difficult.
>
> This is a problem two orders of magnitude smaller than what we are
> trying to fix here.  So my suggestion is to leave it for later.
> There's no special reason why Lisp programs "should be able" to delete
> a process locked to another thread.  If we don't want to signal an
> error (though I don't see why not), we could perhaps do nothing
> silently, until we get to dealing with that later.
>

Silently doing nothing is also an unacceptably confusing result, in my
opinion.

But it's okay: I believe it's possible to solve the problem without
introducing such bad behavior.

For now, sabotaging threads that wait for some sub-process is a much
> more grave problem, and we must fix it if we want threads and
> processes to live in peace in Emacs.
>

Of course.

> > Hmm... I don't understand: why do we need to do this?  The calls to
> > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
> > > set bits in the Available mask passed top pselect already avoid
> > > setting bits for descriptors on which some other thread waits.  So
> > > pselect cannot possibly indicate any output ready on any descriptors
> > > belonging to other threads, and this loop you suggest to add:
> > >
> > >> +      /* Save the fds we're currently waiting_thread for.  */
> > >> +      fd_set waiting_fds;
> > >> +      FD_ZERO (&waiting_fds);
> > >> +      int max_waiting_fds = 0;
> > >> +      for (int fd = 0; fd <= max_desc; ++fd)
> > >> +        {
> > >> +    if (fd_callback_info[fd].waiting_thread == current_thread)
> > >> +      {
> > >> +        FD_SET (fd, &waiting_fds);
> > >> +        max_waiting_fds = fd;
> > >> +      }
> > >> +  }
> > >
> > > should be unnecessary.  Or what am I missing?
> >
> > That's exactly the problem.  A file descriptor for which we set
> > waiting_thread can have delete_read_fd or delete_write_fd called on it
> > by another thread, or by a signal handler, while we're waiting in
> > pselect.
>
> We need to prevent that from happening, not to go along with it.  But
> anyway, I don't see why you need another fd_set for this.  Why not
> just look at the descriptors in the Available set?
>

The Available set is cleared by the select call, and afterwards only
contains descriptors which have events.  In that case we could just look at
the descriptors in Available and check waiting_thread for each of them.

But if there was an EBADF, then Available is left in an undefined state
(see pselect(2)) and we no longer know what descriptors we originally set
waiting_thread on.  So we can't check to see if one of our descriptors was
"stolen" (had waiting_thread set to a different value), which would mean
the EBADF is spurious and we should continue running.

Though, I guess we could alternatively just assume that EBADF is always
spurious.

> BTW, the fact that this problematic case is also possible via signal
> > handlers (namely the SIGCHLD handler) is a separate bug introduced by
> > the refactoring which added fd_callback_info.  It means that with a
> > certain interleaving, I think it should be possible to get buggy
> > behavior by just using processes without threads.  It's quite rare
> > though and I haven't been able to figure out what exact path that would
> > cause a problem.
>
> Let's not attempt to solve too many problems at the same time.  How to
> make SIGCHLD safer when threads are present is an outstanding issue,
> and the solution will IMO depend in whether signals are delivered to
> the main thread and whether or not the build uses the child signal
> pipe.
>

You misread me: I said SIGCHLD is unsafe even without threads.

>
[Message part 2 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Tue, 02 Sep 2025 11:19:02 GMT) Full text and rfc822 format available.

Message #77 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other threads
Date: Tue, 02 Sep 2025 14:18:31 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Date: Mon, 1 Sep 2025 15:40:37 -0400
> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev>
> 
>  > In particular, Lisp programs should be able to call
>  > delete-process on a process without getting an error just because some
>  > thread is currently calling select and happens to be monitoring that
>  > process.  To emit such an error would make programming with processes
>  > and threads extremely difficult.
> 
>  This is a problem two orders of magnitude smaller than what we are
>  trying to fix here.  So my suggestion is to leave it for later.
>  There's no special reason why Lisp programs "should be able" to delete
>  a process locked to another thread.  If we don't want to signal an
>  error (though I don't see why not), we could perhaps do nothing
>  silently, until we get to dealing with that later.
> 
> Silently doing nothing is also an unacceptably confusing result, in my opinion.

Why?  Ignoring invalid actions is not something new to Emacs.

>  > > Hmm... I don't understand: why do we need to do this?  The calls to
>  > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
>  > > set bits in the Available mask passed top pselect already avoid
>  > > setting bits for descriptors on which some other thread waits.  So
>  > > pselect cannot possibly indicate any output ready on any descriptors
>  > > belonging to other threads, and this loop you suggest to add:
>  > >
>  > >> +      /* Save the fds we're currently waiting_thread for.  */
>  > >> +      fd_set waiting_fds;
>  > >> +      FD_ZERO (&waiting_fds);
>  > >> +      int max_waiting_fds = 0;
>  > >> +      for (int fd = 0; fd <= max_desc; ++fd)
>  > >> +        {
>  > >> +    if (fd_callback_info[fd].waiting_thread == current_thread)
>  > >> +      {
>  > >> +        FD_SET (fd, &waiting_fds);
>  > >> +        max_waiting_fds = fd;
>  > >> +      }
>  > >> +  }
>  > >
>  > > should be unnecessary.  Or what am I missing?
>  > 
>  > That's exactly the problem.  A file descriptor for which we set
>  > waiting_thread can have delete_read_fd or delete_write_fd called on it
>  > by another thread, or by a signal handler, while we're waiting in
>  > pselect.
> 
>  We need to prevent that from happening, not to go along with it.  But
>  anyway, I don't see why you need another fd_set for this.  Why not
>  just look at the descriptors in the Available set?
> 
> The Available set is cleared by the select call, and afterwards only contains descriptors which has events.  In
> that case we could just look at the descriptors in Available and check waiting_thread for each of them.

Yes, and we don't care about descriptors that are not ready to be
read.

> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer
> know what descriptors we originally set waiting_thread on.

In that case pselect returns a negative result, no?  And if that
happens, we don't examine the descriptors at all, right?  So why is
this situation a problem we need to handle with this additional fd_set?

> Though, I guess we could alternatively just assume that EBADF is always spurious.

Assume and do what?

>  > BTW, the fact that this problematic case is also possible via signal
>  > handlers (namely the SIGCHLD handler) is a separate bug introduced by
>  > the refactoring which added fd_callback_info.  It means that with a
>  > certain interleaving, I think it should be possible to get buggy
>  > behavior by just using processes without threads.  It's quite rare
>  > though and I haven't been able to figure out what exact path that would
>  > cause a problem.
> 
>  Let's not attempt to solve too many problems at the same time.  How to
>  make SIGCHLD safer when threads are present is an outstanding issue,
>  and the solution will IMO depend in whether signals are delivered to
>  the main thread and whether or not the build uses the child signal
>  pipe.
> 
> You misread me: I said SIGCHLD is unsafe even without threads.

Please describe how.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Tue, 02 Sep 2025 13:39:02 GMT) Full text and rfc822 format available.

Message #80 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Tue, 02 Sep 2025 09:38:15 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:
>>  > > Hmm... I don't understand: why do we need to do this?  The calls to
>>  > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which
>>  > > set bits in the Available mask passed top pselect already avoid
>>  > > setting bits for descriptors on which some other thread waits.  So
>>  > > pselect cannot possibly indicate any output ready on any descriptors
>>  > > belonging to other threads, and this loop you suggest to add:
>>  > >
>>  > >> +      /* Save the fds we're currently waiting_thread for.  */
>>  > >> +      fd_set waiting_fds;
>>  > >> +      FD_ZERO (&waiting_fds);
>>  > >> +      int max_waiting_fds = 0;
>>  > >> +      for (int fd = 0; fd <= max_desc; ++fd)
>>  > >> +        {
>>  > >> +    if (fd_callback_info[fd].waiting_thread == current_thread)
>>  > >> +      {
>>  > >> +        FD_SET (fd, &waiting_fds);
>>  > >> +        max_waiting_fds = fd;
>>  > >> +      }
>>  > >> +  }
>>  > >
>>  > > should be unnecessary.  Or what am I missing?
>>  > 
>>  > That's exactly the problem.  A file descriptor for which we set
>>  > waiting_thread can have delete_read_fd or delete_write_fd called on it
>>  > by another thread, or by a signal handler, while we're waiting in
>>  > pselect.
>> 
>>  We need to prevent that from happening, not to go along with it.  But
>>  anyway, I don't see why you need another fd_set for this.  Why not
>>  just look at the descriptors in the Available set?
>> 
>> The Available set is cleared by the select call, and afterwards only contains descriptors which has events.  In
>> that case we could just look at the descriptors in Available and check waiting_thread for each of them.
>
> Yes, and we don't care about descriptors that are not ready to be
> read.

Right.

>> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer
>> know what descriptors we originally set waiting_thread on.
>
> In that case pselect returns a negative result, no?  And if that
> happens, we don't examine the descriptors at all, right?  So why is
> this situation a problem we need to handle with this additional fd_set?

Because we currently call emacs_abort when we get EBADF:

if (nfds < 0)
  {
    if (xerrno == EINTR)
      no_avail = 1;
    else if (xerrno == EBADF)
      emacs_abort ();
    else
      report_file_errno ("Failed select", Qnil, xerrno);
  }

>> Though, I guess we could alternatively just assume that EBADF is always spurious.
>
> Assume and do what?

Follow one of the other two branches in the above snippet of code:

- Treat EBADF the same way we do EINTR: there were no events on any fds, so
  we continue running wait_reading_process_output.

- Treat EBADF the same way we treat other errno values: signal an error.
  That would at least be better than calling emacs_abort.
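
For concreteness, the first option would look roughly like this (a sketch
against the snippet above, not a tested patch):

if (nfds < 0)
  {
    if (xerrno == EINTR)
      no_avail = 1;
    else if (xerrno == EBADF)
      /* Assume another thread or a signal handler closed one of the
         descriptors we were waiting on: treat it like EINTR, i.e. as
         if no fds were ready, and let the outer loop retry with
         freshly computed masks.  */
      no_avail = 1;
    else
      report_file_errno ("Failed select", Qnil, xerrno);
  }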

>>  > BTW, the fact that this problematic case is also possible via signal
>>  > handlers (namely the SIGCHLD handler) is a separate bug introduced by
>>  > the refactoring which added fd_callback_info.  It means that with a
>>  > certain interleaving, I think it should be possible to get buggy
>>  > behavior by just using processes without threads.  It's quite rare
>>  > though and I haven't been able to figure out what exact path that would
>>  > cause a problem.
>> 
>>  Let's not attempt to solve too many problems at the same time.  How to
>>  make SIGCHLD safer when threads are present is an outstanding issue,
>>  and the solution will IMO depend in whether signals are delivered to
>>  the main thread and whether or not the build uses the child signal
>>  pipe.
>> 
>> You misread me: I said SIGCHLD is unsafe even without threads.
>
> Please describe how.

1. Call wait_reading_process_output
2. wait_reading_process_output calls compute_input_wait_mask, including
   SOME-FD in the mask.
3. Receive SIGCHLD, calling handle_child_signal
4. handle_child_signal calls delete_read_fd for SOME-FD
5. Signal handler finishes
6. wait_reading_process_output calls select and returns events for
   SOME-FD, even though SOME-FD has been deleted from fd_callback_info
   and the process associated with it has died.  It seems possible that
   this can cause an issue in some code path.

This couldn't happen before the refactoring which added
fd_callback_info, because handle_child_signal directly mutated the
fd_set which is passed to pselect.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Tue, 02 Sep 2025 14:41:02 GMT) Full text and rfc822 format available.

Message #83 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Tue, 02 Sep 2025 17:39:39 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Tue, 02 Sep 2025 09:38:15 -0400
> 
> >> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer
> >> know what descriptors we originally set waiting_thread on.
> >
> > In that case pselect returns a negative result, no?  And if that
> > happens, we don't examine the descriptors at all, right?  So why is
> > this situation a problem we need to handle with this additional fd_set?
> 
> Because we currently call emacs_abort when we get EBADF:

So, if the danger of getting EBADF due to threads-and-processes
mishaps is real (I'm saying "if" because I don't think we had bug
reports about that), we should change how we treat EBADF in that case.
Which should probably be a separate discussion anyway.  In any case,
the distinct return value of pselect in that case means that none of
the code which handles other return values should be bothered about
this possibility.

> >> Though, I guess we could alternatively just assume that EBADF is always spurious.
> >
> > Assume and do what?
> 
> Follow one of the other two branches in the above snippet of code:
> 
> - Treat EBADF the same way we do EINTR: there were no events on any fds, so
>   we continue running wait_reading_process_output.
> 
> - Treat EBADF the same way we treat other errno values: signal an error.
>   That would at least be better than calling emacs_abort.

Let's discuss this issue separately, okay?  We are already trying to
cover too many different scenarios, which makes this discussion hard
to follow and the probability of losing focus high.

> >> You misread me: I said SIGCHLD is unsafe even without threads.
> >
> > Please describe how.
> 
> 1. Call wait_reading_process_output
> 2. wait_reading_process_output calls compute_input_wait_mask, including
>    SOME-FD in the mask.
> 3. Receive SIGCHLD, calling handle_child_signal
> 4. handle_child_signal calls delete_read_fd for SOME-FD
> 5. Signal handler finishes
> 6. wait_reading_process_output calls select and returns events for
>    SOME-FD, even though SOME-FD has been deleted from fd_callback_info
>    and the process associated with it has died.  It seems possible that
>    this can cause an issue in some code path.
> 
> This couldn't happen before the refactoring which added
> fd_callback_info, because handle_child_signal directly mutated the
> fd_set which is passed to pselect.

Are you talking about systems that don't use the "child signal pipe"?
Because if that pipe is in use, then the SIGCHLD handler writes to it,
and that will cause pselect to return with that descriptor in
Available, and we then handle the death of the process.  In addition,
delete_read_fd just clears the data from the fd_callback_info of the
descriptor, and the loop that examines the bits in Available is AFAICT
careful enough to test these fields of the callback info before doing
anything because the descriptor was returned by pselect as ready.  So
what kind of problems could have happened in step 6 above?

And once again, this is a separate issue, so let's discuss it
separately.  In this bug, let's focus on what happens in status_notify
and in general when we deactivate/delete a process which exited.  Can
we return to that now?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Tue, 02 Sep 2025 15:00:02 GMT) Full text and rfc822 format available.

Message #86 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Spencer Baugh <sbaugh <at> janestreet.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Tue, 02 Sep 2025 10:59:39 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Spencer Baugh <sbaugh <at> janestreet.com>
>> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
>> Date: Tue, 02 Sep 2025 09:38:15 -0400
>> 
>> >> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer
>> >> know what descriptors we originally set waiting_thread on.
>> >
>> > In that case pselect returns a negative result, no?  And if that
>> > happens, we don't examine the descriptors at all, right?  So why is
>> > this situation a problem we need to handle with this additional fd_set?
>> 
>> Because we currently call emacs_abort when we get EBADF:
>
> So, if the danger of getting EBADF due to threads-and-processes
> mishaps is real (I'm saying "if" because I don't think we had bug
> reports about that),

It is real.  I have seen it plenty of times at my site.  I described this issue
in my initial email.  It is the issue I care about solving.  This bug is
my report of the issue.  Solving it is my highest priority.

> we should change how we treat EBADF in that case.  Which should
> probably be a separate discussion anyway.  In any case, the distinct
> return value of pselect in that case means that none of the code which
> handles other return values should be bothered about this possibility.
>
>> >> Though, I guess we could alternatively just assume that EBADF is always spurious.
>> >
>> > Assume and do what?
>> 
>> Follow one of the other two branches in the above snippet of code:
>> 
>> - Treat EBADF the same way we do EINTR: there were no events on any fds, so
>>   we continue running wait_reading_process_output.
>> 
>> - Treat EBADF the same way we treat other errno values: signal an error.
>>   That would at least be better than calling emacs_abort.
>
> Let's discuss this issue separately, okay?  We are already trying to
> cover too many different scenarios, which makes this discussion hard
> to follow and the probability of losing focus high.

It's not a separate issue; it's the entire point of this bug.  So yes,
let's stay focused on this issue and not other issues.

Which of these two options do you think we should go with?

Or there are other options.  For example, we could detect the cause of
the EBADF by remembering the original Available set before calling
pselect, and checking if the fd_callback_info entry for any of those
file descriptors was cleared.
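
Sketched out, that would look something like the fragment below
(illustrative only: I'm reusing the waiting_thread comparison from my
earlier patch as the "was this entry cleared" test, and orig_available is
a made-up name):

fd_set orig_available = Available;   /* snapshot just before pselect */

/* ... the call to pselect happens here ... */

if (nfds < 0 && xerrno == EBADF)
  {
    bool stolen = false;
    for (int fd = 0; fd <= max_desc; ++fd)
      if (FD_ISSET (fd, &orig_available)
          && fd_callback_info[fd].waiting_thread != current_thread)
        stolen = true;
    if (stolen)
      {
        /* One of our descriptors was deleted under us, so the EBADF
           is spurious; pretend nothing was ready and retry.  */
        nfds = 0;
        xerrno = 0;
        no_avail = 1;
      }
  }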

>> >> You misread me: I said SIGCHLD is unsafe even without threads.
>> >
>> > Please describe how.
>> 
>> 1. Call wait_reading_process_output
>> 2. wait_reading_process_output calls compute_input_wait_mask, including
>>    SOME-FD in the mask.
>> 3. Receive SIGCHLD, calling handle_child_signal
>> 4. handle_child_signal calls delete_read_fd for SOME-FD
>> 5. Signal handler finishes
>> 6. wait_reading_process_output calls select and returns events for
>>    SOME-FD, even though SOME-FD has been deleted from fd_callback_info
>>    and the process associated with it has died.  It seems possible that
>>    this can cause an issue in some code path.
>> 
>> This couldn't happen before the refactoring which added
>> fd_callback_info, because handle_child_signal directly mutated the
>> fd_set which is passed to pselect.
>
> Are you talking about systems that don't use the "child signal pipe"?

No.  This is on GNU/Linux.

> Because if that pipe is in use, then the SIGCHLD handler writes to it,
> and that will cause pselect to return with that descriptor in
> Available, and we then handle the death of the process.  In addition,
> delete_read_fd just clears the data from the fd_callback_info of the
> descriptor, and the loop that examines the bits in Available is AFAICT
> careful enough to test these fields of the callback info before doing
> anything because the descriptor was returned by pselect as ready.  So
> what kind of problems could have happened in step 6 above?

I'm not sure.

> And once again, this is a separate issue, so let's discuss it
> separately.

Okay.  Let me just say again that I suspect there are bugs here which
can be triggered by both threads and signal handlers.  We can leave it
at that.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#79334; Package emacs. (Tue, 02 Sep 2025 16:00:02 GMT) Full text and rfc822 format available.

Message #89 received at 79334 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Spencer Baugh <sbaugh <at> janestreet.com>
Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev
Subject: Re: bug#79334: [PATCH] Don't release thread select lock
 unnecessarily, [PATCH] Defer closing file descriptors used by other
 threads
Date: Tue, 02 Sep 2025 18:59:28 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com>
> Cc: 79334 <at> debbugs.gnu.org,  eggert <at> cs.ucla.edu,  dmitry <at> gutov.dev
> Date: Tue, 02 Sep 2025 10:59:39 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> Because we currently call emacs_abort when we get EBADF:
> >
> > So, if the danger of getting EBADF due to threads-and-processes
> > mishaps is real (I'm saying "if" because I don't think we had bug
> > reports about that),
> 
> It is real.  I have seen it plenty at my site.  I described this issue
> in my initial email.  It is the issue I care about solving.  This bug is
> my report of the issue.  Solving it is my highest priority.

OK, so let's see if fixing deletion of processes fixes that problem.
If you see that in your usage, it should be relatively easy to test.
Then we can take it from there.

> > Let's discuss this issue separately, okay?  We are already trying to
> > cover too many different scenarios, which makes this discussion hard
> > to follow and the probability of losing focus high.
> 
> It's not a separate issue, it's the entire point of this bug.  So yes,
> let's stay focused on this issue and not other issues.
> 
> Which of these two options do you think we should go with?

I would like to fix the process deletion and closing of the descriptors
first, because that can definitely be one reason for EBADF.  We can
then discuss additional measures.  From where I stand, it is best to
completely eliminate EBADF due to these problems; if we succeed, the
code which deals with EBADF could maybe be left as it is, since this
should then be some spurious problem with no known cause.  If it
turns out that we cannot eliminate EBADF completely by fixing the
process-deactivation code, then yes, we will need to do something when
we get EBADF as result.

So what happened to the changes in the code that handles deactivation
of a process?  Can we return to that and arrive at some agreed-upon
solution?  AFAIR, last time I asked you why not fix these problems
inside status_notify (and the few other places which use
FOR_EACH_PROCESS).

> > Because if that pipe is in use, then the SIGCHLD handler writes to it,
> > and that will cause pselect to return with that descriptor in
> > Available, and we then handle the death of the process.  In addition,
> > delete_read_fd just clears the data from the fd_callback_info of the
> > descriptor, and the loop that examines the bits in Available is AFAICT
> > careful enough to test these fields of the callback info before doing
> > anything because the descriptor was returned by pselect as ready.  So
> > what kind of problems could have happened in step 6 above?
> 
> I'm not sure.

That's my reading of the code.  Of course, I could be wrong, but then
we need a description of the scenario details which causes such
problems, including the control and data flow which make it possible.

> > And once again, this is a separate issue, so let's discuss it
> > separately.
> 
> Okay.  Let me just say again that I suspect there are bugs here which
> can be triggered by both threads and signal handlers.  We can leave it
> at that.

Yes, I agree that signals add a non-trivial complexity to this stuff,
due to their asynchronous nature and their ability to run code on the
"wrong" thread.




This bug report was last modified 8 days ago.


