Package: emacs;
Reported by: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Thu, 28 Aug 2025 21:10:02 UTC
Severity: normal
Tags: patch
To reply to this bug, email your comments to 79334 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
dmitry <at> gutov.dev, bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Thu, 28 Aug 2025 21:10:02 GMT) Full text and rfc822 format available.Spencer Baugh <sbaugh <at> janestreet.com>
:dmitry <at> gutov.dev, bug-gnu-emacs <at> gnu.org
.
(Thu, 28 Aug 2025 21:10:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: bug-gnu-emacs <at> gnu.org Subject: [PATCH] Don't release thread select lock unnecessarily Date: Thu, 28 Aug 2025 17:08:43 -0400
[Message part 1 (text/plain, inline)]
Tags: patch Previously, we were using thread_select to release the thread lock around two different calls to select in wait_reading_process_output. The main call to select was fine, but we also used thread_select for a pselect call which returns immediately (with a timeout of 0) and is supposed to just check if any file descriptors are active. If we actually thread switch at that pselect, it's likely to break internal state for Emacs: namely, the call to status_notify immediately after can close file descriptors which another thread is selecting on, causing that thread to get EBADF from select and then call emacs_abort. We don't need to thread switch here, so don't. * src/process.c (wait_reading_process_output): Remove unnecessary thread_select wrapper. In GNU Emacs 30.1.90 (build 29, x86_64-pc-linux-gnu, X toolkit, cairo version 1.15.12, Xaw scroll bars) of 2025-08-26 built on igm-qws-u22796a Repository revision: 54857fe2fb0ed033afcba231f920f8f7fa185333 Repository branch: emacs-30 Windowing system distributor 'The X.Org Foundation', version 11.0.12011000 System Description: Rocky Linux 8.10 (Green Obsidian) Configured using: 'configure --with-x-toolkit=lucid --without-gpm --without-gconf --without-selinux --without-imagemagick --with-modules --with-gif=no --with-cairo --with-rsvg --without-compress-install --with-tree-sitter --with-native-compilation=aot PKG_CONFIG_PATH=/usr/local/home/garnish/libtree-sitter/0.22.6-1/lib/pkgconfig/'
[0001-Don-t-release-thread-select-lock-unnecessarily.patch (text/patch, attachment)]
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 12:26:02 GMT) Full text and rfc822 format available.Message #8 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com>, Paul Eggert <eggert <at> cs.ucla.edu> Cc: 79334 <at> debbugs.gnu.org Subject: Re: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 15:24:59 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Date: Thu, 28 Aug 2025 17:08:43 -0400 > > Previously, we were using thread_select to release the thread > lock around two different calls to select in > wait_reading_process_output. The main call to select was fine, > but we also used thread_select for a pselect call which returns > immediately (with a timeout of 0) and is supposed to just check > if any file descriptors are active. > > If we actually thread switch at that pselect, it's likely to > break internal state for Emacs: namely, the call to > status_notify immediately after can close file descriptors which > another thread is selecting on, causing that thread to get EBADF > from select and then call emacs_abort. > > We don't need to thread switch here, so don't. > > * src/process.c (wait_reading_process_output): Remove > unnecessary thread_select wrapper. Thanks. However, I don't think I follow the logic behind this change. I agree that status_notify does things that can be dangerous (more about that below), but the call to status_notify will happen whether we call thread_select or pselect directly, so if status_notify can do some damage, it will do that regardless of whether we call pselect directly or not. I think you assume that the problematic call to status_notify will happen in another thread, if we allow switching threads at this point, is that right? IOW, I think you assume that, if we switch to another thread, that other thread could call status_notify, and thus close some descriptors on which the current thread, the one which calls thread_select at this point, is waiting. If that is your assumption, then the actual situation could also be the opposite: the current thread, the one where you want to call pselect, will call status_notify after pselect returns, and that will close some descriptors used by other threads. In that case, calling pselect directly will make things worse, not better, because it makes this closing of descriptors imminent. So, while I agree we have a problem here, I think the place and the way we should solve it are different. Which brings us to status_notify. It indeed can close the descriptors of processes which terminated, so we should probably make it safer. For example, we could leave alone descriptors marked with non-NULL waiting_thread if that thread is still alive and is different from the current thread. Would that fix the problem? If yes, I think we should install such a change. We should perhaps also understand better what happens with the current code if and when descriptors are closed by some other thread in this scenario. A thread using such a descriptor could be either stuck in the pselect call, or it could be before or after the call to pselect. Paul, what happens if a descriptor passed to pselect becomes closed while pselect waits? does pselect return with EBADF, or does pselect handle that gracefully, like by considering such descriptors to be never "ready"? If the thread is before or after pselect, then either calling pselect or attempting to read from a closed descriptor after pselect returns will cause EBADF. Is our code prepared to deal with such a situation (e.g., by ignoring the failed read), or will that cause some Lisp API to signal an error or produce unexpected results? If the latter, we should probably handle EBADF specially, because I don't believe we could completely prevent this from happening.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 13:08:02 GMT) Full text and rfc822 format available.Message #11 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>, Dmitry Gutov <dmitry <at> gutov.dev> Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 09:07:50 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Date: Thu, 28 Aug 2025 17:08:43 -0400 >> >> Previously, we were using thread_select to release the thread >> lock around two different calls to select in >> wait_reading_process_output. The main call to select was fine, >> but we also used thread_select for a pselect call which returns >> immediately (with a timeout of 0) and is supposed to just check >> if any file descriptors are active. >> >> If we actually thread switch at that pselect, it's likely to >> break internal state for Emacs: namely, the call to >> status_notify immediately after can close file descriptors which >> another thread is selecting on, causing that thread to get EBADF >> from select and then call emacs_abort. >> >> We don't need to thread switch here, so don't. >> >> * src/process.c (wait_reading_process_output): Remove >> unnecessary thread_select wrapper. > > Thanks. > > However, I don't think I follow the logic behind this change. I agree > that status_notify does things that can be dangerous (more about that > below), but the call to status_notify will happen whether we call > thread_select or pselect directly, so if status_notify can do some > damage, it will do that regardless of whether we call pselect directly > or not. I think you assume that the problematic call to status_notify > will happen in another thread, if we allow switching threads at this > point, is that right? IOW, I think you assume that, if we switch to > another thread, that other thread could call status_notify, and thus > close some descriptors on which the current thread, the one which > calls thread_select at this point, is waiting. If that is your > assumption, then the actual situation could also be the opposite: the > current thread, the one where you want to call pselect, will call > status_notify after pselect returns, and that will close some > descriptors used by other threads. In that case, calling pselect > directly will make things worse, not better, because it makes this > closing of descriptors imminent. Yes. I think there's multiple problems in this code and I haven't tracked them all down yet. That being said, I'm confident that it's incorrect and unnecessary to be using thread_select instead of pselect here, so we should stop doing that in addition to any other fix. > So, while I agree we have a problem here, I think the place and the > way we should solve it are different. > > Which brings us to status_notify. It indeed can close the descriptors > of processes which terminated, so we should probably make it safer. > For example, we could leave alone descriptors marked with non-NULL > waiting_thread if that thread is still alive and is different from the > current thread. Would that fix the problem? If yes, I think we > should install such a change. That was my initial idea for a fix as well. But I'm not sure about the specific details. Namely, perhaps we need to be checking waiting_thread everywhere that we're doing FOR_EACH_PROCESS, not just in status_notify. And maybe some new functions/macros should be added to simplify this; maybe a version of FOR_EACH_PROCESS which only iterates over processes for which waiting_thread == current_thread. > We should perhaps also understand better what happens with the current > code if and when descriptors are closed by some other thread in this > scenario. A thread using such a descriptor could be either stuck in > the pselect call, or it could be before or after the call to pselect. > Paul, what happens if a descriptor passed to pselect becomes closed > while pselect waits? does pselect return with EBADF, or does pselect > handle that gracefully, like by considering such descriptors to be > never "ready"? > > If the thread is before or after pselect, then either calling pselect > or attempting to read from a closed descriptor after pselect returns > will cause EBADF. Is our code prepared to deal with such a situation > (e.g., by ignoring the failed read), or will that cause some Lisp API > to signal an error or produce unexpected results? If the latter, we > should probably handle EBADF specially, because I don't believe we > could completely prevent this from happening. (I'm more hopeful: I think we can completely prevent this from happening, we just need to be more careful about our locking.)
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 13:38:03 GMT) Full text and rfc822 format available.Message #14 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 16:36:51 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 79334 <at> debbugs.gnu.org, Dmitry Gutov > <dmitry <at> gutov.dev> > Date: Fri, 29 Aug 2025 09:07:50 -0400 > > Eli Zaretskii <eliz <at> gnu.org> writes: > > > However, I don't think I follow the logic behind this change. I agree > > that status_notify does things that can be dangerous (more about that > > below), but the call to status_notify will happen whether we call > > thread_select or pselect directly, so if status_notify can do some > > damage, it will do that regardless of whether we call pselect directly > > or not. I think you assume that the problematic call to status_notify > > will happen in another thread, if we allow switching threads at this > > point, is that right? IOW, I think you assume that, if we switch to > > another thread, that other thread could call status_notify, and thus > > close some descriptors on which the current thread, the one which > > calls thread_select at this point, is waiting. If that is your > > assumption, then the actual situation could also be the opposite: the > > current thread, the one where you want to call pselect, will call > > status_notify after pselect returns, and that will close some > > descriptors used by other threads. In that case, calling pselect > > directly will make things worse, not better, because it makes this > > closing of descriptors imminent. > > Yes. I think there's multiple problems in this code and I haven't > tracked them all down yet. > > That being said, I'm confident that it's incorrect and unnecessary to be > using thread_select instead of pselect here, so we should stop doing > that in addition to any other fix. If we fix status_notify, what other reasons remain for not calling thread_select there? > > So, while I agree we have a problem here, I think the place and the > > way we should solve it are different. > > > > Which brings us to status_notify. It indeed can close the descriptors > > of processes which terminated, so we should probably make it safer. > > For example, we could leave alone descriptors marked with non-NULL > > waiting_thread if that thread is still alive and is different from the > > current thread. Would that fix the problem? If yes, I think we > > should install such a change. > > That was my initial idea for a fix as well. But I'm not sure about the > specific details. Namely, perhaps we need to be checking waiting_thread > everywhere that we're doing FOR_EACH_PROCESS, not just in status_notify. > And maybe some new functions/macros should be added to simplify this; > maybe a version of FOR_EACH_PROCESS which only iterates over processes > for which waiting_thread == current_thread. You may be right, we should audit all the users of FOR_EACH_PROCESS. At least one of them (get-buffer-process) is probably okay as it is, but others might need to avoid looking at processes locked to other threads, indeed. > > We should perhaps also understand better what happens with the current > > code if and when descriptors are closed by some other thread in this > > scenario. A thread using such a descriptor could be either stuck in > > the pselect call, or it could be before or after the call to pselect. > > Paul, what happens if a descriptor passed to pselect becomes closed > > while pselect waits? does pselect return with EBADF, or does pselect > > handle that gracefully, like by considering such descriptors to be > > never "ready"? > > > > If the thread is before or after pselect, then either calling pselect > > or attempting to read from a closed descriptor after pselect returns > > will cause EBADF. Is our code prepared to deal with such a situation > > (e.g., by ignoring the failed read), or will that cause some Lisp API > > to signal an error or produce unexpected results? If the latter, we > > should probably handle EBADF specially, because I don't believe we > > could completely prevent this from happening. > > (I'm more hopeful: I think we can completely prevent this from > happening, we just need to be more careful about our locking.) If we can prevent that completely, it's even better.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 13:46:01 GMT) Full text and rfc822 format available.Message #17 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 09:45:07 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 79334 <at> debbugs.gnu.org, Dmitry Gutov >> <dmitry <at> gutov.dev> >> Date: Fri, 29 Aug 2025 09:07:50 -0400 >> >> Eli Zaretskii <eliz <at> gnu.org> writes: >> >> > However, I don't think I follow the logic behind this change. I agree >> > that status_notify does things that can be dangerous (more about that >> > below), but the call to status_notify will happen whether we call >> > thread_select or pselect directly, so if status_notify can do some >> > damage, it will do that regardless of whether we call pselect directly >> > or not. I think you assume that the problematic call to status_notify >> > will happen in another thread, if we allow switching threads at this >> > point, is that right? IOW, I think you assume that, if we switch to >> > another thread, that other thread could call status_notify, and thus >> > close some descriptors on which the current thread, the one which >> > calls thread_select at this point, is waiting. If that is your >> > assumption, then the actual situation could also be the opposite: the >> > current thread, the one where you want to call pselect, will call >> > status_notify after pselect returns, and that will close some >> > descriptors used by other threads. In that case, calling pselect >> > directly will make things worse, not better, because it makes this >> > closing of descriptors imminent. >> >> Yes. I think there's multiple problems in this code and I haven't >> tracked them all down yet. >> >> That being said, I'm confident that it's incorrect and unnecessary to be >> using thread_select instead of pselect here, so we should stop doing >> that in addition to any other fix. > > If we fix status_notify, what other reasons remain for not calling > thread_select there? Primarily: Calling thread_select here complicates reasoning about the code. We should not release the lock when not necessary, because it makes it harder to think about possible thread interleavings. And it's not necessary here because this pselect call returns immediately. (The only reason the pselect call wouldn't return immediately is if we released the lock and then switched to another thread) Also, avoiding thread_select here will slightly improve performance by avoiding lock operations.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 14:02:02 GMT) Full text and rfc822 format available.Message #20 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 17:00:56 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Fri, 29 Aug 2025 09:45:07 -0400 > > Eli Zaretskii <eliz <at> gnu.org> writes: > > >> That being said, I'm confident that it's incorrect and unnecessary to be > >> using thread_select instead of pselect here, so we should stop doing > >> that in addition to any other fix. > > > > If we fix status_notify, what other reasons remain for not calling > > thread_select there? > > Primarily: Calling thread_select here complicates reasoning about the > code. We should not release the lock when not necessary, because it > makes it harder to think about possible thread interleavings. And it's > not necessary here because this pselect call returns immediately. How do we define "necessary" in this context? > (The only reason the pselect call wouldn't return immediately is if we > released the lock and then switched to another thread) You seem to assume that a call to thread_select is only necessary when some potentially prolonged operation or wait is expected? But I'm not sure this is correct, because that could prevent other threads from running for too long, and force programmers to inject sleep-for etc. in their programs. By allowing thread switches as frequently as possible lets other threads more opportunities to run (to some extent simulating preemptive scheduling), which I think is a Good Thing overall, no?
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 14:50:03 GMT) Full text and rfc822 format available.Message #23 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 10:49:34 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev >> Date: Fri, 29 Aug 2025 09:45:07 -0400 >> >> Eli Zaretskii <eliz <at> gnu.org> writes: >> >> >> That being said, I'm confident that it's incorrect and unnecessary to be >> >> using thread_select instead of pselect here, so we should stop doing >> >> that in addition to any other fix. >> > >> > If we fix status_notify, what other reasons remain for not calling >> > thread_select there? >> >> Primarily: Calling thread_select here complicates reasoning about the >> code. We should not release the lock when not necessary, because it >> makes it harder to think about possible thread interleavings. And it's >> not necessary here because this pselect call returns immediately. > > How do we define "necessary" in this context? It's necessary to release the lock around long pselect calls because otherwise other threads will not run while Emacs is waiting for input, which means they probably won't run at all. >> (The only reason the pselect call wouldn't return immediately is if we >> released the lock and then switched to another thread) > > You seem to assume that a call to thread_select is only necessary when > some potentially prolonged operation or wait is expected? But I'm not > sure this is correct, because that could prevent other threads from > running for too long, and force programmers to inject sleep-for > etc. in their programs. By allowing thread switches as frequently as > possible lets other threads more opportunities to run (to some extent > simulating preemptive scheduling), which I think is a Good Thing > overall, no? Three points: - Most importantly: We can only allow thread switches at known safe points. It's not clear that this is a safe point. (Actually, we know it currently isn't, because of the status_notify issue. But I suspect there may be other issues which are harder to analyze) - Adding more thread switches is not necessarily a good thing. It can degrade whole-system performance, because context switching is slow. - wait_reading_process_output already does a thread switch around the pselect, so it's not important for one to happen here.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 15:42:02 GMT) Full text and rfc822 format available.Message #26 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 18:41:00 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Fri, 29 Aug 2025 10:49:34 -0400 > > Eli Zaretskii <eliz <at> gnu.org> writes: > > >> Primarily: Calling thread_select here complicates reasoning about the > >> code. We should not release the lock when not necessary, because it > >> makes it harder to think about possible thread interleavings. And it's > >> not necessary here because this pselect call returns immediately. > > > > How do we define "necessary" in this context? > > It's necessary to release the lock around long pselect calls because > otherwise other threads will not run while Emacs is waiting for input, > which means they probably won't run at all. > > >> (The only reason the pselect call wouldn't return immediately is if we > >> released the lock and then switched to another thread) > > > > You seem to assume that a call to thread_select is only necessary when > > some potentially prolonged operation or wait is expected? But I'm not > > sure this is correct, because that could prevent other threads from > > running for too long, and force programmers to inject sleep-for > > etc. in their programs. By allowing thread switches as frequently as > > possible lets other threads more opportunities to run (to some extent > > simulating preemptive scheduling), which I think is a Good Thing > > overall, no? > > Three points: > > - Most importantly: We can only allow thread switches at known safe > points. It's not clear that this is a safe point. (Actually, we know > it currently isn't, because of the status_notify issue. But I suspect > there may be other issues which are harder to analyze) This reason will be removed once we fix status_notify. > - Adding more thread switches is not necessarily a good thing. It can > degrade whole-system performance, because context switching is slow. > > - wait_reading_process_output already does a thread switch around the > pselect, so it's not important for one to happen here. You basically give the same single reason in several different wordings. I see your point, but my experience with threads in Emacs is the opposite: there seem to be too few thread switch opportunities, so many programs don't work as expected unless one sprinkles sleep-for in various places. For example, try to run something in a non-main thread and then attempt to interact with Emacs in the main thread. IOW, it is quite easy for a Lisp program to hog the global lock and starve the other threads, so each additional thread-switch opportunity is IME a blessing, not a curse.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 15:44:01 GMT) Full text and rfc822 format available.Message #29 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev> Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 11:43:21 -0400
[Message part 1 (text/plain, inline)]
On Fri, Aug 29, 2025, 11:41 AM Eli Zaretskii <eliz <at> gnu.org> wrote: > > From: Spencer Baugh <sbaugh <at> janestreet.com> > > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > > Date: Fri, 29 Aug 2025 10:49:34 -0400 > > > > Eli Zaretskii <eliz <at> gnu.org> writes: > > > > >> Primarily: Calling thread_select here complicates reasoning about the > > >> code. We should not release the lock when not necessary, because it > > >> makes it harder to think about possible thread interleavings. And > it's > > >> not necessary here because this pselect call returns immediately. > > > > > > How do we define "necessary" in this context? > > > > It's necessary to release the lock around long pselect calls because > > otherwise other threads will not run while Emacs is waiting for input, > > which means they probably won't run at all. > > > > >> (The only reason the pselect call wouldn't return immediately is if we > > >> released the lock and then switched to another thread) > > > > > > You seem to assume that a call to thread_select is only necessary when > > > some potentially prolonged operation or wait is expected? But I'm not > > > sure this is correct, because that could prevent other threads from > > > running for too long, and force programmers to inject sleep-for > > > etc. in their programs. By allowing thread switches as frequently as > > > possible lets other threads more opportunities to run (to some extent > > > simulating preemptive scheduling), which I think is a Good Thing > > > overall, no? > > > > Three points: > > > > - Most importantly: We can only allow thread switches at known safe > > points. It's not clear that this is a safe point. (Actually, we know > > it currently isn't, because of the status_notify issue. But I suspect > > there may be other issues which are harder to analyze) > > This reason will be removed once we fix status_notify. > > > - Adding more thread switches is not necessarily a good thing. It can > > degrade whole-system performance, because context switching is slow. > > > > - wait_reading_process_output already does a thread switch around the > > pselect, so it's not important for one to happen here. > > You basically give the same single reason in several different > wordings. > > I see your point, but my experience with threads in Emacs is the > opposite: there seem to be too few thread switch opportunities, so > many programs don't work as expected unless one sprinkles sleep-for in > various places. For example, try to run something in a non-main > thread and then attempt to interact with Emacs in the main thread. > > IOW, it is quite easy for a Lisp program to hog the global lock and > starve the other threads, so each additional thread-switch opportunity > is IME a blessing, not a curse. > Okay, so then how about we delete this unsafe thread_select and put some thread-yield calls around wait_reading_process_output? >
[Message part 2 (text/html, inline)]
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 15:58:02 GMT) Full text and rfc822 format available.Message #32 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 18:57:33 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Date: Fri, 29 Aug 2025 11:43:21 -0400 > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev> > > Okay, so then how about we delete this unsafe thread_select and put some thread-yield calls around > wait_reading_process_output? I'm not objected to not calling thread_select there, but can we have tests that will show us the gains? Also, I think we should fix the problems with status_notify and other users of FOR_EACH_PROCESS before we make the change with thread_select, because current code in status_notify is definitely wrong, and affects the related behavior.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 16:12:01 GMT) Full text and rfc822 format available.Message #35 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev> Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 12:11:02 -0400
[Message part 1 (text/plain, inline)]
On Fri, Aug 29, 2025, 11:57 AM Eli Zaretskii <eliz <at> gnu.org> wrote: > > From: Spencer Baugh <sbaugh <at> janestreet.com> > > Date: Fri, 29 Aug 2025 11:43:21 -0400 > > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov < > dmitry <at> gutov.dev> > > > > Okay, so then how about we delete this unsafe thread_select and put some > thread-yield calls around > > wait_reading_process_output? > > I'm not objected to not calling thread_select there, but can we have > tests that will show us the gains? > The primary gain is that it becomes easier to reason about locking. I'm not sure how to write a test for that :) Also, I think we should fix the problems with status_notify and other > users of FOR_EACH_PROCESS before we make the change with > thread_select, because current code in status_notify is definitely > wrong, and affects the related behavior. > Yes, I'll work on a patch which does both. (I'm not going to bother making a change which doesn't rely on the change to remove the thread_select, because this code is hard enough to reason about already) >
[Message part 2 (text/html, inline)]
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 17:33:03 GMT) Full text and rfc822 format available.Message #38 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 13:31:52 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 79334 <at> debbugs.gnu.org, Dmitry Gutov >> <dmitry <at> gutov.dev> >> Date: Fri, 29 Aug 2025 09:07:50 -0400 >> >> Eli Zaretskii <eliz <at> gnu.org> writes: >> > Which brings us to status_notify. It indeed can close the descriptors >> > of processes which terminated, so we should probably make it safer. >> > For example, we could leave alone descriptors marked with non-NULL >> > waiting_thread if that thread is still alive and is different from the >> > current thread. Would that fix the problem? If yes, I think we >> > should install such a change. >> >> That was my initial idea for a fix as well. But I'm not sure about the >> specific details. Namely, perhaps we need to be checking waiting_thread >> everywhere that we're doing FOR_EACH_PROCESS, not just in status_notify. >> And maybe some new functions/macros should be added to simplify this; >> maybe a version of FOR_EACH_PROCESS which only iterates over processes >> for which waiting_thread == current_thread. > > You may be right, we should audit all the users of FOR_EACH_PROCESS. > At least one of them (get-buffer-process) is probably okay as it is, > but others might need to avoid looking at processes locked to other > threads, indeed. While doing this, I had another thought... Do you know if we have any code which handles the fact that delete-process might be called on a process while another thread is calling pselect on its fds? I don't think we do, which suggests that might also be problematic. I'll work on a fix which encompasses that as well.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 19:51:02 GMT) Full text and rfc822 format available.Message #41 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Fri, 29 Aug 2025 15:50:00 -0400
[Message part 1 (text/plain, inline)]
Spencer Baugh <sbaugh <at> janestreet.com> writes: > While doing this, I had another thought... > > Do you know if we have any code which handles the fact that > delete-process might be called on a process while another thread is > calling pselect on its fds? > > I don't think we do, which suggests that might also be problematic. > I'll work on a fix which encompasses that as well. OK, I think this patch fixes the issue. Many different pieces of code in Emacs can close file descriptors. We need to be generically careful to not close file descriptors which some other thread is currently waiting on. In my patch, we do this by letting the waiting_thread itself close those file descriptors, after it returns from select. Unfortunately I don't have a reliable test which produces the failure case, but in this patch I added a fprintf when we close a deferred fd. This prints whenever some thread would have closed an fd that another thread was waiting on. If I run the following program (with emacs -Q --batch): (defun my-break-thread () (let ((proc (make-process :name "foo" :command '("sleep" "1")))) (while (process-live-p proc) (accept-process-output proc 0.05)))) (while t (make-thread #'my-break-thread "thread1") (thread-join (make-thread #'my-break-thread "thread2"))) I get about one print every 10 seconds. Each print is a problem case which has been averted, so I do think this patch fixes the bug. (Obviously, we should remove the fprintf before applying the change)
[0001-Defer-closing-file-descriptors-used-by-other-threads.patch (text/x-patch, inline)]
From 4dda5e4fff3067cac53d6273ed3bd327437d00c8 Mon Sep 17 00:00:00 2001 From: Spencer Baugh <sbaugh <at> janestreet.com> Date: Fri, 29 Aug 2025 15:39:00 -0400 Subject: [PATCH] Defer closing file descriptors used by other threads * src/process.c (other_thread_is_waiting_for): Add. (clear_fd_callback_data): Don't clear WAITING_FOR_CLOSE_FD. (close_process_fd): Set WAITING_FOR_CLOSE_FD for fds with a waiting_thread. (wait_reading_process_output): Close fds with WAITING_FOR_CLOSE_FD. (bug#79334) --- src/process.c | 46 ++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/src/process.c b/src/process.c index b33bebb9a35..8273ba78095 100644 --- a/src/process.c +++ b/src/process.c @@ -448,7 +448,9 @@ make_lisp_proc (struct Lisp_Process *p) /* This descriptor refers to a process. */ PROCESS_FD = 8, /* A non-blocking connect. Only valid if FOR_WRITE is set. */ - NON_BLOCKING_CONNECT_FD = 16 + NON_BLOCKING_CONNECT_FD = 16, + /* This fd is waiting to be closed. */ + WAITING_FOR_CLOSE_FD = 32, }; static struct fd_callback_data @@ -466,14 +468,30 @@ make_lisp_proc (struct Lisp_Process *p) struct thread_state *waiting_thread; } fd_callback_info[FD_SETSIZE]; +static bool +other_thread_is_waiting_for (struct fd_callback_data* elem) +{ + return elem->waiting_thread != NULL && elem->waiting_thread != current_thread; +} + static void clear_fd_callback_data(struct fd_callback_data* elem) { elem->func = NULL; elem->data = NULL; - elem->flags = 0; elem->thread = NULL; - elem->waiting_thread = NULL; + if (other_thread_is_waiting_for (elem)) + { + /* Preserve WAITING_FOR_CLOSE_FD; see process_close_fd. Leaving + flags non-zero means recompute_max_desc won't reduce max_desc + past this element. */ + elem->flags &= WAITING_FOR_CLOSE_FD; + } + else + { + elem->flags = 0; + elem->waiting_thread = NULL; + } } @@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr) if (0 <= fd) { *fd_addr = -1; - emacs_close (fd); + if (other_thread_is_waiting_for (&fd_callback_info[fd])) + /* Let waiting_thread handle closing the fd. */ + fd_callback_info[fd].flags = WAITING_FOR_CLOSE_FD; + else + emacs_close (fd); } } @@ -5816,6 +5838,22 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd, /* Make C-g and alarm signals set flags again. */ clear_waiting_for_input (); + /* Close fds which other threads wanted to close. */ + for (int fd = 0; fd <= max_desc; ++fd) + { + if (fd_callback_info[fd].waiting_thread == current_thread + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) + { + fprintf (stderr, "closing deferred fd %d\n", fd); + emacs_close (fd); + /* Ignore any events which happened on this fd. */ + FD_CLR (fd, &Available); + FD_CLR (fd, &Writeok); + fd_callback_info[fd].flags = 0; + } + } + recompute_max_desc (); + /* If we woke up due to SIGWINCH, actually change size now. */ do_pending_window_change (0); -- 2.43.7
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Fri, 29 Aug 2025 20:26:01 GMT) Full text and rfc822 format available.Message #44 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Fri, 29 Aug 2025 16:25:39 -0400
[Message part 1 (text/plain, inline)]
Spencer Baugh <sbaugh <at> janestreet.com> writes: > Spencer Baugh <sbaugh <at> janestreet.com> writes: >> While doing this, I had another thought... >> >> Do you know if we have any code which handles the fact that >> delete-process might be called on a process while another thread is >> calling pselect on its fds? >> >> I don't think we do, which suggests that might also be problematic. >> I'll work on a fix which encompasses that as well. > > OK, I think this patch fixes the issue. > > Many different pieces of code in Emacs can close file descriptors. We > need to be generically careful to not close file descriptors which some > other thread is currently waiting on. In my patch, we do this by > letting the waiting_thread itself close those file descriptors, after it > returns from select. > > Unfortunately I don't have a reliable test which produces the failure > case, but in this patch I added a fprintf when we close a deferred fd. > This prints whenever some thread would have closed an fd that another > thread was waiting on. If I run the following program (with emacs -Q > --batch): > > (defun my-break-thread () > (let ((proc (make-process > :name "foo" > :command '("sleep" "1")))) > (while (process-live-p proc) > (accept-process-output proc 0.05)))) > (while t > (make-thread #'my-break-thread "thread1") > (thread-join (make-thread #'my-break-thread "thread2"))) > > I get about one print every 10 seconds. Each print is a problem case > which has been averted, so I do think this patch fixes the bug. > > (Obviously, we should remove the fprintf before applying the change) Of course, I immediately found a bug in my patch (I wasn't clearing waiting_thread after closing a deferred fd). Fixed in the attached. In related news: running process-tests.el in a loop also triggers my "Closing deferred fd" message. Which suggests that src/process-tests.el was probably also suffering from bugs in this area, which could have been causing hangs or flaky failures. Which should now hopefully be fixed.
[0001-Defer-closing-file-descriptors-used-by-other-threads.patch (text/x-patch, inline)]
From 9e46aa50f5de7d719a6e427fe44bb6b8b8e226fc Mon Sep 17 00:00:00 2001 From: Spencer Baugh <sbaugh <at> janestreet.com> Date: Fri, 29 Aug 2025 15:39:00 -0400 Subject: [PATCH] Defer closing file descriptors used by other threads * src/process.c (other_thread_is_waiting_for): Add. (clear_fd_callback_data): Don't clear WAITING_FOR_CLOSE_FD. (close_process_fd): Set WAITING_FOR_CLOSE_FD for fds with a waiting_thread. (wait_reading_process_output): Close fds with WAITING_FOR_CLOSE_FD. (bug#79334) --- src/process.c | 46 ++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 42 insertions(+), 4 deletions(-) diff --git a/src/process.c b/src/process.c index b33bebb9a35..9180b92c61f 100644 --- a/src/process.c +++ b/src/process.c @@ -448,7 +448,9 @@ make_lisp_proc (struct Lisp_Process *p) /* This descriptor refers to a process. */ PROCESS_FD = 8, /* A non-blocking connect. Only valid if FOR_WRITE is set. */ - NON_BLOCKING_CONNECT_FD = 16 + NON_BLOCKING_CONNECT_FD = 16, + /* This fd is waiting to be closed. */ + WAITING_FOR_CLOSE_FD = 32, }; static struct fd_callback_data @@ -466,14 +468,30 @@ make_lisp_proc (struct Lisp_Process *p) struct thread_state *waiting_thread; } fd_callback_info[FD_SETSIZE]; +static bool +other_thread_is_waiting_for (struct fd_callback_data* elem) +{ + return elem->waiting_thread != NULL && elem->waiting_thread != current_thread; +} + static void clear_fd_callback_data(struct fd_callback_data* elem) { elem->func = NULL; elem->data = NULL; - elem->flags = 0; elem->thread = NULL; - elem->waiting_thread = NULL; + if (other_thread_is_waiting_for (elem)) + { + /* Preserve WAITING_FOR_CLOSE_FD; see process_close_fd. Leaving + flags non-zero means recompute_max_desc won't reduce max_desc + past this element. */ + elem->flags &= WAITING_FOR_CLOSE_FD; + } + else + { + elem->flags = 0; + elem->waiting_thread = NULL; + } } @@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr) if (0 <= fd) { *fd_addr = -1; - emacs_close (fd); + if (other_thread_is_waiting_for (&fd_callback_info[fd])) + /* Let waiting_thread handle closing the fd. */ + fd_callback_info[fd].flags = WAITING_FOR_CLOSE_FD; + else + emacs_close (fd); } } @@ -5816,6 +5838,22 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd, /* Make C-g and alarm signals set flags again. */ clear_waiting_for_input (); + /* Close fds which other threads wanted to close. */ + for (int fd = 0; fd <= max_desc; ++fd) + { + if (fd_callback_info[fd].waiting_thread == current_thread + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) + { + fprintf (stderr, "closing deferred fd %d\n", fd); + emacs_close (fd); + /* Ignore any events which happened on this fd. */ + FD_CLR (fd, &Available); + FD_CLR (fd, &Writeok); + clear_fd_callback_data (&fd_callback_info[fd]); + } + } + recompute_max_desc (); + /* If we woke up due to SIGWINCH, actually change size now. */ do_pending_window_change (0); -- 2.43.7
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Sat, 30 Aug 2025 06:19:01 GMT) Full text and rfc822 format available.Message #47 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Sat, 30 Aug 2025 09:18:12 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Fri, 29 Aug 2025 15:50:00 -0400 > > Many different pieces of code in Emacs can close file descriptors. We > need to be generically careful to not close file descriptors which some > other thread is currently waiting on. In my patch, we do this by > letting the waiting_thread itself close those file descriptors, after it > returns from select. Not literally "when returning from pselect", I hope, but when the process is being deleted _and_ the waiting thread returns from pselect, right? IOW, we don't close the descriptor each time pselecft returns in the right thread, then reopen it before the next call to pselect, because that won't work. > +static bool > +other_thread_is_waiting_for (struct fd_callback_data* elem) > +{ > + return elem->waiting_thread != NULL && elem->waiting_thread != current_thread; > +} This should also check that the waiting thread is still alive. If it isn't, it's okay to close the descriptor. Otherwise, descriptors belonging to threads that exited might never be closed. > @@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr) > if (0 <= fd) > { > *fd_addr = -1; > - emacs_close (fd); > + if (other_thread_is_waiting_for (&fd_callback_info[fd])) > + /* Let waiting_thread handle closing the fd. */ ^^ Style: 2 spaces between the period and "*/". > + /* Close fds which other threads wanted to close. */ > + for (int fd = 0; fd <= max_desc; ++fd) > + { > + if (fd_callback_info[fd].waiting_thread == current_thread > + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) > + { > + fprintf (stderr, "closing deferred fd %d\n", fd); > + emacs_close (fd); > + /* Ignore any events which happened on this fd. */ > + FD_CLR (fd, &Available); > + FD_CLR (fd, &Writeok); > + fd_callback_info[fd].flags = 0; Why do it here and not when we deactivate processes? Until now we never closed any descriptors inside wait_reading_process_output. More generally, why this design and not a simpler change which only deactivates processes whose I/O is waited by the current thread, as discussed previously? I think this alternative would be easier to understand and reason about, because (as established in another discussion) wait_reading_process_output frequently loops more than we expect it to. Looking at all the callers of close_process_fd, it sounds like it's too low-level to do this. For example, it is called from create_process (including in the forked child) to close the unneeded ends of the pipe, where we probably don't want this. And clear_fd_callback_data is also used for the keyboard input descriptor, which is probably also not relevant. So I wonder whether it's clean to have this logic at such a low level, where the purpose of the descriptor is barely known, and not at a higher level, where we know what each descriptor is used for, and therefore the purpose and the intent of the logic is much more clear.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Sat, 30 Aug 2025 06:37:02 GMT) Full text and rfc822 format available.Message #50 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Sat, 30 Aug 2025 09:35:42 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Fri, 29 Aug 2025 16:25:39 -0400 > > In related news: running process-tests.el in a loop also triggers my > "Closing deferred fd" message. Which suggests that src/process-tests.el > was probably also suffering from bugs in this area, which could have > been causing hangs or flaky failures. Which should now hopefully be > fixed. process-tests uses questionable setup, apart of being based on non-portable assumptions. At the time, I was unable to convince the author of the tests to spell out the assumptions and expectations from what should happen in those situations, without which it was impossible to decide whether some failures on MS-Windows are bugs or just non-portable code that cannot work on Windows. Some of those tests always fail for me on Windows. > + for (int fd = 0; fd <= max_desc; ++fd) > + { > + if (fd_callback_info[fd].waiting_thread == current_thread > + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) ^^^^^^^^^^^^^^^^^^^^^^^ Shouldn't this test that the WAITING_FOR_CLOSE_FD bit is set, not that it is the _only_ bit set? Even if the code is correct today, it might not be future-proof if we add some additional flags at some point. > + /* Ignore any events which happened on this fd. */ > + FD_CLR (fd, &Available); > + FD_CLR (fd, &Writeok); Is this wise? How can we be sure that these events were already handled (presumably, in another thread)? If they were not, we will lose events. What will happen if we don't ignore them here?
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Sat, 30 Aug 2025 15:56:02 GMT) Full text and rfc822 format available.Message #53 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Sat, 30 Aug 2025 11:55:27 -0400
[Message part 1 (text/plain, inline)]
(I'm responding to both your emails with review feedback here) Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev >> Date: Fri, 29 Aug 2025 16:25:39 -0400 >> + for (int fd = 0; fd <= max_desc; ++fd) >> + { >> + if (fd_callback_info[fd].waiting_thread == current_thread >> + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) > ^^^^^^^^^^^^^^^^^^^^^^^ > Shouldn't this test that the WAITING_FOR_CLOSE_FD bit is set, not that > it is the _only_ bit set? Even if the code is correct today, it might > not be future-proof if we add some additional flags at some point. True. Fixed. >> + /* Ignore any events which happened on this fd. */ >> + FD_CLR (fd, &Available); >> + FD_CLR (fd, &Writeok); > > Is this wise? How can we be sure that these events were already > handled (presumably, in another thread)? If they were not, we will > lose events. What will happen if we don't ignore them here? Any events which might appear at this point were already events we were going to lose by closing this file descriptor. For example, if a process produces output just as we do delete-process on it, we never read that output. The only change is that we now have to explicitly ignore those events; before, we silently dropped them at the time we closed the file descriptor. >> Many different pieces of code in Emacs can close file descriptors. We >> need to be generically careful to not close file descriptors which some >> other thread is currently waiting on. In my patch, we do this by >> letting the waiting_thread itself close those file descriptors, after it >> returns from select. > > Not literally "when returning from pselect", I hope, but when the > process is being deleted _and_ the waiting thread returns from > pselect, right? IOW, we don't close the descriptor each time pselecft > returns in the right thread, then reopen it before the next call to > pselect, because that won't work. Right. >> +static bool >> +other_thread_is_waiting_for (struct fd_callback_data* elem) >> +{ >> + return elem->waiting_thread != NULL && elem->waiting_thread != current_thread; >> +} > > This should also check that the waiting thread is still alive. If it > isn't, it's okay to close the descriptor. Otherwise, descriptors > belonging to threads that exited might never be closed. waiting_thread always points to a living thread, because it's only set while a thread is inside wait_reading_process_output. (In fact, when a thread sees waiting_thread set to a value which is neither NULL nor current_thread, it can only ever point to a thread which is currently inside thread_select) >> @@ -2113,7 +2131,11 @@ close_process_fd (int *fd_addr) >> if (0 <= fd) >> { >> *fd_addr = -1; >> - emacs_close (fd); >> + if (other_thread_is_waiting_for (&fd_callback_info[fd])) >> + /* Let waiting_thread handle closing the fd. */ > ^^ > Style: 2 spaces between the period and "*/". Fixed. >> + /* Close fds which other threads wanted to close. */ >> + for (int fd = 0; fd <= max_desc; ++fd) >> + { >> + if (fd_callback_info[fd].waiting_thread == current_thread >> + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) >> + { >> + fprintf (stderr, "closing deferred fd %d\n", fd); >> + emacs_close (fd); >> + /* Ignore any events which happened on this fd. */ >> + FD_CLR (fd, &Available); >> + FD_CLR (fd, &Writeok); >> + fd_callback_info[fd].flags = 0; > > Why do it here and not when we deactivate processes? Because that would close file descriptors that another thread is currently selecting on, causing EBADF. > Until now we never closed any descriptors inside > wait_reading_process_output. Just to be clear, we can call deactivate_process from wait_reading_process_output, through status_notify, through process filters, or other similar. That closes descriptors. > More generally, why this design and not a simpler change which only > deactivates processes whose I/O is waited by the current thread, as > discussed previously? I looked into that, but I determined that it's too complicated. It would require us to keep track of a new kind of process state: a process which has been deleted, but hasn't been deactivated yet. This would need to be supported for each kind of process, and we'd need to record new kinds of state to finish the deactivation later. It's simpler to keep the ability to delete any process at any time, and have that fully deactivate the process as it does today. Doing that only causes one issue: if another thread is selecting on the process's fds, that thread will break if it tries to handle events on the fds, or the thread will simply get EBADF from select. So, we detect that by checking waiting_thread, and marking the fds so that the other thread knows not to touch them, and to instead just close them to finish deleting the process. > I think this alternative would be easier to understand and reason > about, because (as established in another discussion) > wait_reading_process_output frequently loops more than we expect it > to. I'm not sure what you mean by that, why does wait_reading_process_output looping more than we expect cause issues? I don't think it would. > Looking at all the callers of close_process_fd, it sounds like it's > too low-level to do this. For example, it is called from > create_process (including in the forked child) to close the unneeded > ends of the pipe, where we probably don't want this. And > clear_fd_callback_data is also used for the keyboard input descriptor, > which is probably also not relevant. True. I looked into fixing these, and that suggested to me an altogether simpler approach. We can keep the logic entirely contained in wait_reading_process_output, just adding a bit of code around the select call. And we don't need to defer deleting the fds at all: just recognize when it's happened, and be careful to not touch those fds afterwards. Attached.
[0001-Ignore-descriptors-deleted-during-select.patch (text/x-patch, inline)]
From b37e71f47913e1011c4a7857e166e78eddc2a184 Mon Sep 17 00:00:00 2001 From: Spencer Baugh <sbaugh <at> janestreet.com> Date: Sat, 30 Aug 2025 11:42:19 -0400 Subject: [PATCH] Ignore descriptors deleted during select * src/process.c (wait_reading_process_output): Save the fds we're waiting on and check that we're still waiting after select. (bug#79334) --- src/process.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/src/process.c b/src/process.c index 1612248a11d..11af2f55f63 100644 --- a/src/process.c +++ b/src/process.c @@ -5643,6 +5643,7 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd, nfds = read_kbd ? 0 : 1; no_avail = 1; FD_ZERO (&Available); + xerrno = 0; } else { @@ -5756,6 +5757,18 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd, && timespec_cmp (short_timeout, timeout) < 0) timeout = short_timeout; #endif + /* Save the fds we're currently waiting_thread for. */ + fd_set waiting_fds; + FD_ZERO (&waiting_fds); + int max_waiting_fds = 0; + for (int fd = 0; fd <= max_desc; ++fd) + { + if (fd_callback_info[fd].waiting_thread == current_thread) + { + FD_SET (fd, &waiting_fds); + max_waiting_fds = fd; + } + } /* Android requires using a replacement for pselect in android.c to poll for events. */ @@ -5809,10 +5822,30 @@ wait_reading_process_output (intmax_t time_limit, int nsecs, int read_kbd, FD_SET (channel, &Available); } #endif + xerrno = errno; + /* Check if any of waiting_fds was closed by another thread or + a signal handler. */ + for (int fd = 0; fd <= max_waiting_fds; ++fd) + { + if (FD_ISSET (fd, &waiting_fds) + && fd_callback_info[fd].waiting_thread != current_thread) + { + /* Ignore any events which we happened to see on this + fd before it was closed; we can't do anything for + it now, since it's closed. */ + FD_CLR (fd, &Available); + FD_CLR (fd, &Writeok); + /* If we got an EBADF and no events, this is probably + why; clear it so we can try again. */ + if (nfds < 0 && xerrno == EBADF) + { + nfds = 0; + xerrno = 0; + } + } + } } - xerrno = errno; - /* Make C-g and alarm signals set flags again. */ clear_waiting_for_input (); -- 2.43.7
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Sat, 30 Aug 2025 16:37:02 GMT) Full text and rfc822 format available.Message #56 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Sat, 30 Aug 2025 19:35:39 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Sat, 30 Aug 2025 11:55:27 -0400 > > >> + /* Ignore any events which happened on this fd. */ > >> + FD_CLR (fd, &Available); > >> + FD_CLR (fd, &Writeok); > > > > Is this wise? How can we be sure that these events were already > > handled (presumably, in another thread)? If they were not, we will > > lose events. What will happen if we don't ignore them here? > > Any events which might appear at this point were already events we were > going to lose by closing this file descriptor. For example, if a > process produces output just as we do delete-process on it, we never > read that output. The only change is that we now have to explicitly > ignore those events; before, we silently dropped them at the time we > closed the file descriptor. Where did we close it? There's this loop _after_ the place where you add your loop which clears the bits in Available: for (channel = 0; channel <= max_desc; ++channel) { struct fd_callback_data *d = &fd_callback_info[channel]; if (d->func && ((d->flags & FOR_READ && FD_ISSET (channel, &Available)) || ((d->flags & FOR_WRITE) && FD_ISSET (channel, &Writeok)))) d->func (channel, d->data); } AFAIU, it would notice the set bits and act upon them. Now with this new addition we are clearing them ahead of that loop. That doesn't necessarily sound right to me. > >> +static bool > >> +other_thread_is_waiting_for (struct fd_callback_data* elem) > >> +{ > >> + return elem->waiting_thread != NULL && elem->waiting_thread != current_thread; > >> +} > > > > This should also check that the waiting thread is still alive. If it > > isn't, it's okay to close the descriptor. Otherwise, descriptors > > belonging to threads that exited might never be closed. > > waiting_thread always points to a living thread, because it's only set > while a thread is inside wait_reading_process_output. > > (In fact, when a thread sees waiting_thread set to a value which is > neither NULL nor current_thread, it can only ever point to a thread > which is currently inside thread_select) You assume that the waiting_thread field will be cleared when a thread dies, but that is not guaranteed. Why isn't it better to be defensive and make sure the thread is still alive? If you are right, that test will always be true, but what if you are wrong in some rare cases? > >> + if (fd_callback_info[fd].waiting_thread == current_thread > >> + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) > >> + { > >> + fprintf (stderr, "closing deferred fd %d\n", fd); > >> + emacs_close (fd); > >> + /* Ignore any events which happened on this fd. */ > >> + FD_CLR (fd, &Available); > >> + FD_CLR (fd, &Writeok); > >> + fd_callback_info[fd].flags = 0; > > > > Why do it here and not when we deactivate processes? > > Because that would close file descriptors that another thread is > currently selecting on, causing EBADF. No, I mean to deactivate only the processes whose I/O channels are locked to the current thread. > > Until now we never closed any descriptors inside > > wait_reading_process_output. > > Just to be clear, we can call deactivate_process from > wait_reading_process_output, through status_notify, through process > filters, or other similar. That closes descriptors. Yes, and so I suggest doing this logic of deactivating only the processes that are locked to the current thread there. > > More generally, why this design and not a simpler change which only > > deactivates processes whose I/O is waited by the current thread, as > > discussed previously? > > I looked into that, but I determined that it's too complicated. It > would require us to keep track of a new kind of process state: a process > which has been deleted, but hasn't been deactivated yet. What I had in mind was neither delete it nor deactivate it. IOW, the loop which goes through the list of processes should only pay attention to processes locked to this thread and those that are not locked to any live thread. It should leave the other processes to when their threads call status_notify. > > Looking at all the callers of close_process_fd, it sounds like it's > > too low-level to do this. For example, it is called from > > create_process (including in the forked child) to close the unneeded > > ends of the pipe, where we probably don't want this. And > > clear_fd_callback_data is also used for the keyboard input descriptor, > > which is probably also not relevant. > > True. > > I looked into fixing these, and that suggested to me an altogether > simpler approach. We can keep the logic entirely contained in > wait_reading_process_output, just adding a bit of code around the select > call. > > And we don't need to defer deleting the fds at all: just recognize when > it's happened, and be careful to not touch those fds afterwards. Hmm... I don't understand: why do we need to do this? The calls to compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which set bits in the Available mask passed top pselect already avoid setting bits for descriptors on which some other thread waits. So pselect cannot possibly indicate any output ready on any descriptors belonging to other threads, and this loop you suggest to add: > + /* Save the fds we're currently waiting_thread for. */ > + fd_set waiting_fds; > + FD_ZERO (&waiting_fds); > + int max_waiting_fds = 0; > + for (int fd = 0; fd <= max_desc; ++fd) > + { > + if (fd_callback_info[fd].waiting_thread == current_thread) > + { > + FD_SET (fd, &waiting_fds); > + max_waiting_fds = fd; > + } > + } should be unnecessary. Or what am I missing?
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Sat, 30 Aug 2025 18:40:02 GMT) Full text and rfc822 format available.Message #59 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: sbaugh <at> janestreet.com Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Sat, 30 Aug 2025 21:38:52 +0300
> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Sat, 30 Aug 2025 19:35:39 +0300 > From: Eli Zaretskii <eliz <at> gnu.org> > > > >> + if (fd_callback_info[fd].waiting_thread == current_thread > > >> + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) > > >> + { > > >> + fprintf (stderr, "closing deferred fd %d\n", fd); > > >> + emacs_close (fd); > > >> + /* Ignore any events which happened on this fd. */ > > >> + FD_CLR (fd, &Available); > > >> + FD_CLR (fd, &Writeok); > > >> + fd_callback_info[fd].flags = 0; > > > > > > Why do it here and not when we deactivate processes? > > > > Because that would close file descriptors that another thread is > > currently selecting on, causing EBADF. > > No, I mean to deactivate only the processes whose I/O channels are > locked to the current thread. > > > > Until now we never closed any descriptors inside > > > wait_reading_process_output. > > > > Just to be clear, we can call deactivate_process from > > wait_reading_process_output, through status_notify, through process > > filters, or other similar. That closes descriptors. > > Yes, and so I suggest doing this logic of deactivating only the > processes that are locked to the current thread there. Btw, as a nice bonus of that approach, process sentinel will run in the context of the thread to which the process is locked.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Sun, 31 Aug 2025 03:21:02 GMT) Full text and rfc822 format available.Message #62 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Dmitry Gutov <dmitry <at> gutov.dev> To: Spencer Baugh <sbaugh <at> janestreet.com>, Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Sun, 31 Aug 2025 06:20:24 +0300
On 29/08/2025 19:11, Spencer Baugh wrote: > On Fri, Aug 29, 2025, 11:57 AM Eli Zaretskii <eliz <at> gnu.org > <mailto:eliz <at> gnu.org>> wrote: > > > From: Spencer Baugh <sbaugh <at> janestreet.com > <mailto:sbaugh <at> janestreet.com>> > > Date: Fri, 29 Aug 2025 11:43:21 -0400 > > Cc: 79334 <at> debbugs.gnu.org <mailto:79334 <at> debbugs.gnu.org>, > eggert <at> cs.ucla.edu <mailto:eggert <at> cs.ucla.edu>, Dmitry Gutov > <dmitry <at> gutov.dev <mailto:dmitry <at> gutov.dev>> > > > > Okay, so then how about we delete this unsafe thread_select and > put some thread-yield calls around > > wait_reading_process_output? > > I'm not objected to not calling thread_select there, but can we have > tests that will show us the gains? > > > The primary gain is that it becomes easier to reason about locking. I'm > not sure how to write a test for that :) I think what could help in this area is some benchmarking scenario, and the goal would be to show that it doesn't regress (or within the margin of error) when 'pselect' is used directly. A script which would create a bunch of threads and some amount of processes spawned by them with certain characteristics. Depending on what workload switching in that thread_select could help optimize.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Mon, 01 Sep 2025 00:41:02 GMT) Full text and rfc822 format available.Message #65 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Dmitry Gutov <dmitry <at> gutov.dev> To: Eli Zaretskii <eliz <at> gnu.org>, Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily Date: Mon, 1 Sep 2025 03:39:50 +0300
On 29/08/2025 18:41, Eli Zaretskii wrote: > For example, try to run something in a non-main > thread and then attempt to interact with Emacs in the main thread. Could that be code that does straight computation and does not switch threads at all, unless sleep-for is added? Perhaps relatedly, diff-hl in "threaded" mode has a much better interactivity since switching to 'make-process' from 'call-process' for its external program interaction. But you're probably thinking of different kind of examples.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Mon, 01 Sep 2025 16:19:02 GMT) Full text and rfc822 format available.Message #68 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Mon, 01 Sep 2025 12:18:37 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> >> +static bool >> >> +other_thread_is_waiting_for (struct fd_callback_data* elem) >> >> +{ >> >> + return elem->waiting_thread != NULL && elem->waiting_thread != current_thread; >> >> +} >> > >> > This should also check that the waiting thread is still alive. If it >> > isn't, it's okay to close the descriptor. Otherwise, descriptors >> > belonging to threads that exited might never be closed. >> >> waiting_thread always points to a living thread, because it's only set >> while a thread is inside wait_reading_process_output. >> >> (In fact, when a thread sees waiting_thread set to a value which is >> neither NULL nor current_thread, it can only ever point to a thread >> which is currently inside thread_select) > > You assume that the waiting_thread field will be cleared when a thread > dies, but that is not guaranteed. Why isn't it better to be defensive > and make sure the thread is still alive? If you are right, that test > will always be true, but what if you are wrong in some rare cases? Fair enough; the patch has changed a lot so this other_thread_is_waiting_for function doesn't exist anymore, but if I re-add it or similar checks, I will check that the thread is alive. >> >> + if (fd_callback_info[fd].waiting_thread == current_thread >> >> + && (fd_callback_info[fd].flags == WAITING_FOR_CLOSE_FD)) >> >> + { >> >> + fprintf (stderr, "closing deferred fd %d\n", fd); >> >> + emacs_close (fd); >> >> + /* Ignore any events which happened on this fd. */ >> >> + FD_CLR (fd, &Available); >> >> + FD_CLR (fd, &Writeok); >> >> + fd_callback_info[fd].flags = 0; >> > >> > Why do it here and not when we deactivate processes? >> >> Because that would close file descriptors that another thread is >> currently selecting on, causing EBADF. > > No, I mean to deactivate only the processes whose I/O channels are > locked to the current thread. > >> > Until now we never closed any descriptors inside >> > wait_reading_process_output. >> >> Just to be clear, we can call deactivate_process from >> wait_reading_process_output, through status_notify, through process >> filters, or other similar. That closes descriptors. > > Yes, and so I suggest doing this logic of deactivating only the > processes that are locked to the current thread there. > >> > More generally, why this design and not a simpler change which only >> > deactivates processes whose I/O is waited by the current thread, as >> > discussed previously? >> >> I looked into that, but I determined that it's too complicated. It >> would require us to keep track of a new kind of process state: a process >> which has been deleted, but hasn't been deactivated yet. > > What I had in mind was neither delete it nor deactivate it. IOW, the > loop which goes through the list of processes should only pay > attention to processes locked to this thread and those that are not > locked to any live thread. It should leave the other processes to > when their threads call status_notify. There are too many ways in which we delete or deactivate processes to do that. In particular, Lisp programs should be able to call delete-process on a process without getting an error just because some thread is currently calling select and happens to be monitoring that process. To emit such an error would make programming with processes and threads extremely difficult. >> > Looking at all the callers of close_process_fd, it sounds like it's >> > too low-level to do this. For example, it is called from >> > create_process (including in the forked child) to close the unneeded >> > ends of the pipe, where we probably don't want this. And >> > clear_fd_callback_data is also used for the keyboard input descriptor, >> > which is probably also not relevant. >> >> True. >> >> I looked into fixing these, and that suggested to me an altogether >> simpler approach. We can keep the logic entirely contained in >> wait_reading_process_output, just adding a bit of code around the select >> call. >> >> And we don't need to defer deleting the fds at all: just recognize when >> it's happened, and be careful to not touch those fds afterwards. > > Hmm... I don't understand: why do we need to do this? The calls to > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which > set bits in the Available mask passed top pselect already avoid > setting bits for descriptors on which some other thread waits. So > pselect cannot possibly indicate any output ready on any descriptors > belonging to other threads, and this loop you suggest to add: > >> + /* Save the fds we're currently waiting_thread for. */ >> + fd_set waiting_fds; >> + FD_ZERO (&waiting_fds); >> + int max_waiting_fds = 0; >> + for (int fd = 0; fd <= max_desc; ++fd) >> + { >> + if (fd_callback_info[fd].waiting_thread == current_thread) >> + { >> + FD_SET (fd, &waiting_fds); >> + max_waiting_fds = fd; >> + } >> + } > > should be unnecessary. Or what am I missing? That's exactly the problem. A file descriptor for which we set waiting_thread can have delete_read_fd or delete_write_fd called on it by another thread, or by a signal handler, while we're waiting in pselect. We need to catch the fact that the file descriptor has been removed from the set of file descriptors we're supposed to monitor, and avoid reacting to events on it. We can catch this easily by seeing that waiting_thread has been cleared or changed while we were blocked in select. BTW, the fact that this problematic case is also possible via signal handlers (namely the SIGCHLD handler) is a separate bug introduced by the refactoring which added fd_callback_info. It means that with a certain interleaving, I think it should be possible to get buggy behavior by just using processes without threads. It's quite rare though and I haven't been able to figure out what exact path that would cause a problem.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Mon, 01 Sep 2025 19:26:02 GMT) Full text and rfc822 format available.Message #71 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Mon, 01 Sep 2025 22:25:26 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Mon, 01 Sep 2025 12:18:37 -0400 > > > You assume that the waiting_thread field will be cleared when a thread > > dies, but that is not guaranteed. Why isn't it better to be defensive > > and make sure the thread is still alive? If you are right, that test > > will always be true, but what if you are wrong in some rare cases? > > Fair enough; the patch has changed a lot so this > other_thread_is_waiting_for function doesn't exist anymore, but if I > re-add it or similar checks, I will check that the thread is alive. Thanks. > > What I had in mind was neither delete it nor deactivate it. IOW, the > > loop which goes through the list of processes should only pay > > attention to processes locked to this thread and those that are not > > locked to any live thread. It should leave the other processes to > > when their threads call status_notify. > > There are too many ways in which we delete or deactivate processes to do > that. We should audit all of them and prevent that from happening when the process is locked to another thread. > In particular, Lisp programs should be able to call > delete-process on a process without getting an error just because some > thread is currently calling select and happens to be monitoring that > process. To emit such an error would make programming with processes > and threads extremely difficult. This is a problem two orders of magnitude smaller than what we are trying to fix here. So my suggestion is to leave it for later. There's no special reason why Lisp programs "should be able" to delete a process locked to another thread. If we don't want to signal an error (though I don't see why not), we could perhaps do nothing silently, until we get to dealing with that later. For now, sabotaging threads that wait for some sub-process is a much more grave problem, and we must fix it if we want threads and processes to live in peace in Emacs. > > Hmm... I don't understand: why do we need to do this? The calls to > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which > > set bits in the Available mask passed top pselect already avoid > > setting bits for descriptors on which some other thread waits. So > > pselect cannot possibly indicate any output ready on any descriptors > > belonging to other threads, and this loop you suggest to add: > > > >> + /* Save the fds we're currently waiting_thread for. */ > >> + fd_set waiting_fds; > >> + FD_ZERO (&waiting_fds); > >> + int max_waiting_fds = 0; > >> + for (int fd = 0; fd <= max_desc; ++fd) > >> + { > >> + if (fd_callback_info[fd].waiting_thread == current_thread) > >> + { > >> + FD_SET (fd, &waiting_fds); > >> + max_waiting_fds = fd; > >> + } > >> + } > > > > should be unnecessary. Or what am I missing? > > That's exactly the problem. A file descriptor for which we set > waiting_thread can have delete_read_fd or delete_write_fd called on it > by another thread, or by a signal handler, while we're waiting in > pselect. We need to prevent that from happening, not to go along with it. But anyway, I don't see why you need another fd_set for this. Why not just look at the descriptors in the Available set? > BTW, the fact that this problematic case is also possible via signal > handlers (namely the SIGCHLD handler) is a separate bug introduced by > the refactoring which added fd_callback_info. It means that with a > certain interleaving, I think it should be possible to get buggy > behavior by just using processes without threads. It's quite rare > though and I haven't been able to figure out what exact path that would > cause a problem. Let's not attempt to solve too many problems at the same time. How to make SIGCHLD safer when threads are present is an outstanding issue, and the solution will IMO depend in whether signals are delivered to the main thread and whether or not the build uses the child signal pipe.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Mon, 01 Sep 2025 19:41:01 GMT) Full text and rfc822 format available.Message #74 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev> Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Mon, 1 Sep 2025 15:40:37 -0400
[Message part 1 (text/plain, inline)]
On Mon, Sep 1, 2025, 3:25 PM Eli Zaretskii <eliz <at> gnu.org> wrote: > > From: Spencer Baugh <sbaugh <at> janestreet.com> > > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > > Date: Mon, 01 Sep 2025 12:18:37 -0400 > > > > > You assume that the waiting_thread field will be cleared when a thread > > > dies, but that is not guaranteed. Why isn't it better to be defensive > > > and make sure the thread is still alive? If you are right, that test > > > will always be true, but what if you are wrong in some rare cases? > > > > Fair enough; the patch has changed a lot so this > > other_thread_is_waiting_for function doesn't exist anymore, but if I > > re-add it or similar checks, I will check that the thread is alive. > > Thanks. > > > > What I had in mind was neither delete it nor deactivate it. IOW, the > > > loop which goes through the list of processes should only pay > > > attention to processes locked to this thread and those that are not > > > locked to any live thread. It should leave the other processes to > > > when their threads call status_notify. > > > > There are too many ways in which we delete or deactivate processes to do > > that. > > We should audit all of them and prevent that from happening when the > process is locked to another thread. > > > In particular, Lisp programs should be able to call > > delete-process on a process without getting an error just because some > > thread is currently calling select and happens to be monitoring that > > process. To emit such an error would make programming with processes > > and threads extremely difficult. > > This is a problem two orders of magnitude smaller than what we are > trying to fix here. So my suggestion is to leave it for later. > There's no special reason why Lisp programs "should be able" to delete > a process locked to another thread. If we don't want to signal an > error (though I don't see why not), we could perhaps do nothing > silently, until we get to dealing with that later. > Silently doing nothing is also an unacceptably confusing result, in my opinion. But it's okay: I believe it's possible to solve the problem without introducing such bad behavior. For now, sabotaging threads that wait for some sub-process is a much > more grave problem, and we must fix it if we want threads and > processes to live in peace in Emacs. > Of course. > > Hmm... I don't understand: why do we need to do this? The calls to > > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which > > > set bits in the Available mask passed top pselect already avoid > > > setting bits for descriptors on which some other thread waits. So > > > pselect cannot possibly indicate any output ready on any descriptors > > > belonging to other threads, and this loop you suggest to add: > > > > > >> + /* Save the fds we're currently waiting_thread for. */ > > >> + fd_set waiting_fds; > > >> + FD_ZERO (&waiting_fds); > > >> + int max_waiting_fds = 0; > > >> + for (int fd = 0; fd <= max_desc; ++fd) > > >> + { > > >> + if (fd_callback_info[fd].waiting_thread == current_thread) > > >> + { > > >> + FD_SET (fd, &waiting_fds); > > >> + max_waiting_fds = fd; > > >> + } > > >> + } > > > > > > should be unnecessary. Or what am I missing? > > > > That's exactly the problem. A file descriptor for which we set > > waiting_thread can have delete_read_fd or delete_write_fd called on it > > by another thread, or by a signal handler, while we're waiting in > > pselect. > > We need to prevent that from happening, not to go along with it. But > anyway, I don't see why you need another fd_set for this. Why not > just look at the descriptors in the Available set? > The Available set is cleared by the select call, and afterwards only contains descriptors which has events. In that case we could just look at the descriptors in Available and check waiting_thread for each of them. But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer know what descriptors we originally set waiting_thread on. So we can't check to see if one of our descriptors was "stolen" (had waiting_thread set to a different value), which would mean the EBADF is spurious and we should continue running. Though, I guess we could alternatively just assume that EBADF is always spurious. > BTW, the fact that this problematic case is also possible via signal > > handlers (namely the SIGCHLD handler) is a separate bug introduced by > > the refactoring which added fd_callback_info. It means that with a > > certain interleaving, I think it should be possible to get buggy > > behavior by just using processes without threads. It's quite rare > > though and I haven't been able to figure out what exact path that would > > cause a problem. > > Let's not attempt to solve too many problems at the same time. How to > make SIGCHLD safer when threads are present is an outstanding issue, > and the solution will IMO depend in whether signals are delivered to > the main thread and whether or not the build uses the child signal > pipe. > You misread me: I said SIGCHLD is unsafe even without threads. >
[Message part 2 (text/html, inline)]
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Tue, 02 Sep 2025 11:19:02 GMT) Full text and rfc822 format available.Message #77 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Tue, 02 Sep 2025 14:18:31 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Date: Mon, 1 Sep 2025 15:40:37 -0400 > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, Dmitry Gutov <dmitry <at> gutov.dev> > > > In particular, Lisp programs should be able to call > > delete-process on a process without getting an error just because some > > thread is currently calling select and happens to be monitoring that > > process. To emit such an error would make programming with processes > > and threads extremely difficult. > > This is a problem two orders of magnitude smaller than what we are > trying to fix here. So my suggestion is to leave it for later. > There's no special reason why Lisp programs "should be able" to delete > a process locked to another thread. If we don't want to signal an > error (though I don't see why not), we could perhaps do nothing > silently, until we get to dealing with that later. > > Silently doing nothing is also an unacceptably confusing result, in my opinion. Why? Ignoring invalid actions is not something new to Emacs. > > > Hmm... I don't understand: why do we need to do this? The calls to > > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which > > > set bits in the Available mask passed top pselect already avoid > > > setting bits for descriptors on which some other thread waits. So > > > pselect cannot possibly indicate any output ready on any descriptors > > > belonging to other threads, and this loop you suggest to add: > > > > > >> + /* Save the fds we're currently waiting_thread for. */ > > >> + fd_set waiting_fds; > > >> + FD_ZERO (&waiting_fds); > > >> + int max_waiting_fds = 0; > > >> + for (int fd = 0; fd <= max_desc; ++fd) > > >> + { > > >> + if (fd_callback_info[fd].waiting_thread == current_thread) > > >> + { > > >> + FD_SET (fd, &waiting_fds); > > >> + max_waiting_fds = fd; > > >> + } > > >> + } > > > > > > should be unnecessary. Or what am I missing? > > > > That's exactly the problem. A file descriptor for which we set > > waiting_thread can have delete_read_fd or delete_write_fd called on it > > by another thread, or by a signal handler, while we're waiting in > > pselect. > > We need to prevent that from happening, not to go along with it. But > anyway, I don't see why you need another fd_set for this. Why not > just look at the descriptors in the Available set? > > The Available set is cleared by the select call, and afterwards only contains descriptors which has events. In > that case we could just look at the descriptors in Available and check waiting_thread for each of them. Yes, and we don't care about descriptors that are not ready to be read. > But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer > know what descriptors we originally set waiting_thread on. In that case pselect returns a negative result, no? And if that happens, we don't examine the descriptors at all, right? So why is this situation a problem we need to handle with this additional fd_set? > Though, I guess we could alternatively just assume that EBADF is always spurious. Assume and do what? > > BTW, the fact that this problematic case is also possible via signal > > handlers (namely the SIGCHLD handler) is a separate bug introduced by > > the refactoring which added fd_callback_info. It means that with a > > certain interleaving, I think it should be possible to get buggy > > behavior by just using processes without threads. It's quite rare > > though and I haven't been able to figure out what exact path that would > > cause a problem. > > Let's not attempt to solve too many problems at the same time. How to > make SIGCHLD safer when threads are present is an outstanding issue, > and the solution will IMO depend in whether signals are delivered to > the main thread and whether or not the build uses the child signal > pipe. > > You misread me: I said SIGCHLD is unsafe even without threads. Please describe how.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Tue, 02 Sep 2025 13:39:02 GMT) Full text and rfc822 format available.Message #80 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Tue, 02 Sep 2025 09:38:15 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> > > Hmm... I don't understand: why do we need to do this? The calls to >> > > compute_input_wait_mask, compute_non_keyboard_wait_mask etc., which >> > > set bits in the Available mask passed top pselect already avoid >> > > setting bits for descriptors on which some other thread waits. So >> > > pselect cannot possibly indicate any output ready on any descriptors >> > > belonging to other threads, and this loop you suggest to add: >> > > >> > >> + /* Save the fds we're currently waiting_thread for. */ >> > >> + fd_set waiting_fds; >> > >> + FD_ZERO (&waiting_fds); >> > >> + int max_waiting_fds = 0; >> > >> + for (int fd = 0; fd <= max_desc; ++fd) >> > >> + { >> > >> + if (fd_callback_info[fd].waiting_thread == current_thread) >> > >> + { >> > >> + FD_SET (fd, &waiting_fds); >> > >> + max_waiting_fds = fd; >> > >> + } >> > >> + } >> > > >> > > should be unnecessary. Or what am I missing? >> > >> > That's exactly the problem. A file descriptor for which we set >> > waiting_thread can have delete_read_fd or delete_write_fd called on it >> > by another thread, or by a signal handler, while we're waiting in >> > pselect. >> >> We need to prevent that from happening, not to go along with it. But >> anyway, I don't see why you need another fd_set for this. Why not >> just look at the descriptors in the Available set? >> >> The Available set is cleared by the select call, and afterwards only contains descriptors which has events. In >> that case we could just look at the descriptors in Available and check waiting_thread for each of them. > > Yes, and we don't care about descriptors that are not ready to be > read. Right. >> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer >> know what descriptors we originally set waiting_thread on. > > In that case pselect returns a negative result, no? And if that > happens, we don't examine the descriptors at all, right? So why is > this situation a problem we need to handle with this additional fd_set? Because we currently call emacs_abort when we get EBADF: if (nfds < 0) { if (xerrno == EINTR) no_avail = 1; else if (xerrno == EBADF) emacs_abort (); else report_file_errno ("Failed select", Qnil, xerrno); } >> Though, I guess we could alternatively just assume that EBADF is always spurious. > > Assume and do what? Follow one of the other two branches in the above snippet of code: - Treat EBADF the same way we do EINTR: there were no events on any fds, so we continue running wait_reading_process_output. - Treat EBADF the same way we treat other errno values: signal an error. That would at least be better than calling emacs_abort. >> > BTW, the fact that this problematic case is also possible via signal >> > handlers (namely the SIGCHLD handler) is a separate bug introduced by >> > the refactoring which added fd_callback_info. It means that with a >> > certain interleaving, I think it should be possible to get buggy >> > behavior by just using processes without threads. It's quite rare >> > though and I haven't been able to figure out what exact path that would >> > cause a problem. >> >> Let's not attempt to solve too many problems at the same time. How to >> make SIGCHLD safer when threads are present is an outstanding issue, >> and the solution will IMO depend in whether signals are delivered to >> the main thread and whether or not the build uses the child signal >> pipe. >> >> You misread me: I said SIGCHLD is unsafe even without threads. > > Please describe how. 1. Call wait_reading_process_output 2. wait_reading_process_output calls compute_input_wait_mask, including SOME-FD in the mask. 3. Receive SIGCHLD, calling handle_child_signal 4. handle_child_signal calls delete_read_fd for SOME-FD 5. Signal handler finishes 6. wait_reading_process_output calls select and returns events for SOME-FD, even though SOME-FD has been deleted from fd_callback_info and the process associated with it has died. It seems possible that this can cause an issue in some code path. This couldn't happen before the refactoring which added fd_callback_info, because handle_child_signal directly mutated the fd_set which is passed to pselect.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Tue, 02 Sep 2025 14:41:02 GMT) Full text and rfc822 format available.Message #83 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Tue, 02 Sep 2025 17:39:39 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Tue, 02 Sep 2025 09:38:15 -0400 > > >> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer > >> know what descriptors we originally set waiting_thread on. > > > > In that case pselect returns a negative result, no? And if that > > happens, we don't examine the descriptors at all, right? So why is > > this situation a problem we need to handle with this additional fd_set? > > Because we currently call emacs_abort when we get EBADF: So, if the danger of getting EBADF due to threads-and-processes mishaps is real (I'm saying "if" because I don't think we had bug reports about that), we should change how we treat EBADF in that case. Which should probably be a separate discussion anyway. In any case, the distinct return value of pselect in that case means that none of the code which handles other return values should be bothered about this possibility. > >> Though, I guess we could alternatively just assume that EBADF is always spurious. > > > > Assume and do what? > > Follow one of the other two branches in the above snippet of code: > > - Treat EBADF the same way we do EINTR: there were no events on any fds, so > we continue running wait_reading_process_output. > > - Treat EBADF the same way we treat other errno values: signal an error. > That would at least be better than calling emacs_abort. Let's discuss this issue separately, okay? We are already trying to cover too many different scenarios, which makes this discussion hard to follow and the probability of losing focus high. > >> You misread me: I said SIGCHLD is unsafe even without threads. > > > > Please describe how. > > 1. Call wait_reading_process_output > 2. wait_reading_process_output calls compute_input_wait_mask, including > SOME-FD in the mask. > 3. Receive SIGCHLD, calling handle_child_signal > 4. handle_child_signal calls delete_read_fd for SOME-FD > 5. Signal handler finishes > 6. wait_reading_process_output calls select and returns events for > SOME-FD, even though SOME-FD has been deleted from fd_callback_info > and the process associated with it has died. It seems possible that > this can cause an issue in some code path. > > This couldn't happen before the refactoring which added > fd_callback_info, because handle_child_signal directly mutated the > fd_set which is passed to pselect. Are you talking about systems that don't use the "child signal pipe"? Because if that pipe is in use, then the SIGCHLD handler writes to it, and that will cause pselect to return with that descriptor in Available, and we then handle the death of the process. In addition, delete_read_fd just clears the data from the fd_callback_info of the descriptor, and the loop that examines the bits in Available is AFAICT careful enough to test these fields of the callback info before doing anything because the descriptor was returned by pselect as ready. So what kind of problems could have happened in step 6 above? And once again, this is a separate issue, so let's discuss it separately. In this bug, let's focus on what happens in status_notify and in general when we deactivate/delete a process which exited. Can we return to that now?
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Tue, 02 Sep 2025 15:00:02 GMT) Full text and rfc822 format available.Message #86 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Tue, 02 Sep 2025 10:59:39 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev >> Date: Tue, 02 Sep 2025 09:38:15 -0400 >> >> >> But if there was an EBADF, then Available is left in an undefined state (see pselect(1)) and we no longer >> >> know what descriptors we originally set waiting_thread on. >> > >> > In that case pselect returns a negative result, no? And if that >> > happens, we don't examine the descriptors at all, right? So why is >> > this situation a problem we need to handle with this additional fd_set? >> >> Because we currently call emacs_abort when we get EBADF: > > So, if the danger of getting EBADF due to threads-and-processes > mishaps is real (I'm saying "if" because I don't think we had bug > reports about that), It is real. I have seen it plenty at my site. I described this issue in my initial email. It is the issue I care about solving. This bug is my report of the issue. Solving it is my highest priority. > we should change how we treat EBADF in that case. Which should > probably be a separate discussion anyway. In any case, the distinct > return value of pselect in that case means that none of the code which > handles other return values should be bothered about this possibility. > >> >> Though, I guess we could alternatively just assume that EBADF is always spurious. >> > >> > Assume and do what? >> >> Follow one of the other two branches in the above snippet of code: >> >> - Treat EBADF the same way we do EINTR: there were no events on any fds, so >> we continue running wait_reading_process_output. >> >> - Treat EBADF the same way we treat other errno values: signal an error. >> That would at least be better than calling emacs_abort. > > Let's discuss this issue separately, okay? We are already trying to > cover too many different scenarios, which makes this discussion hard > to follow and the probability of losing focus high. It's not a separate issue, it's the entire point of this bug. So yes, let's stay focused on this issue and not other issues. Which of these two options do you think we should go with? Or there are other options. For example, we could detect the cause of the EBADF by remembering the original Available set before calling pselect, and checking if the fd_callback_info entry for any of those file descriptors was cleared. >> >> You misread me: I said SIGCHLD is unsafe even without threads. >> > >> > Please describe how. >> >> 1. Call wait_reading_process_output >> 2. wait_reading_process_output calls compute_input_wait_mask, including >> SOME-FD in the mask. >> 3. Receive SIGCHLD, calling handle_child_signal >> 4. handle_child_signal calls delete_read_fd for SOME-FD >> 5. Signal handler finishes >> 6. wait_reading_process_output calls select and returns events for >> SOME-FD, even though SOME-FD has been deleted from fd_callback_info >> and the process associated with it has died. It seems possible that >> this can cause an issue in some code path. >> >> This couldn't happen before the refactoring which added >> fd_callback_info, because handle_child_signal directly mutated the >> fd_set which is passed to pselect. > > Are you talking about systems that don't use the "child signal pipe"? No. This is on GNU/Linux. > Because if that pipe is in use, then the SIGCHLD handler writes to it, > and that will cause pselect to return with that descriptor in > Available, and we then handle the death of the process. In addition, > delete_read_fd just clears the data from the fd_callback_info of the > descriptor, and the loop that examines the bits in Available is AFAICT > careful enough to test these fields of the callback info before doing > anything because the descriptor was returned by pselect as ready. So > what kind of problems could have happened in step 6 above? I'm not sure. > And once again, this is a separate issue, so let's discuss it > separately. Okay. Let me just say again that I suspect there are bugs here which can be triggered by both threads and signal handlers. We can leave it at that.
bug-gnu-emacs <at> gnu.org
:bug#79334
; Package emacs
.
(Tue, 02 Sep 2025 16:00:02 GMT) Full text and rfc822 format available.Message #89 received at 79334 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Spencer Baugh <sbaugh <at> janestreet.com> Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev Subject: Re: bug#79334: [PATCH] Don't release thread select lock unnecessarily, [PATCH] Defer closing file descriptors used by other threads Date: Tue, 02 Sep 2025 18:59:28 +0300
> From: Spencer Baugh <sbaugh <at> janestreet.com> > Cc: 79334 <at> debbugs.gnu.org, eggert <at> cs.ucla.edu, dmitry <at> gutov.dev > Date: Tue, 02 Sep 2025 10:59:39 -0400 > > Eli Zaretskii <eliz <at> gnu.org> writes: > > >> Because we currently call emacs_abort when we get EBADF: > > > > So, if the danger of getting EBADF due to threads-and-processes > > mishaps is real (I'm saying "if" because I don't think we had bug > > reports about that), > > It is real. I have seen it plenty at my site. I described this issue > in my initial email. It is the issue I care about solving. This bug is > my report of the issue. Solving it is my highest priority. OK, so let's see if fixing deletion of processes fixes that problem. If you see that in your usage, it should be relatively easy to test. Then we can take it from there. > > Let's discuss this issue separately, okay? We are already trying to > > cover too many different scenarios, which makes this discussion hard > > to follow and the probability of losing focus high. > > It's not a separate issue, it's the entire point of this bug. So yes, > let's stay focused on this issue and not other issues. > > Which of these two options do you think we should go with? I would like fix the process deletion and closing of the descriptors first, because that can definitely be one reason for EBADF. We can then discuss additional measures. From where I stand, it is best to completely eliminate EBADF due to these problems; if we succeed, the code which deals with EBADF could maybe be left as it is, since this should then be some spurious problem with no known reasons. If it turns out that we cannot eliminate EBADF completely by fixing the process-deactivation code, then yes, we will need to do something when we get EBADF as result. So what happened to the changes in the code that handles deactivation of a process? can we return to that and arrive at some agreed-upon solution? AFAIR, last time I asked you why not fix these problems inside status_notify (and the few other places which use FOR_EACH_PROCESS). > > Because if that pipe is in use, then the SIGCHLD handler writes to it, > > and that will cause pselect to return with that descriptor in > > Available, and we then handle the death of the process. In addition, > > delete_read_fd just clears the data from the fd_callback_info of the > > descriptor, and the loop that examines the bits in Available is AFAICT > > careful enough to test these fields of the callback info before doing > > anything because the descriptor was returned by pselect as ready. So > > what kind of problems could have happened in step 6 above? > > I'm not sure. That's my reading of the code. Of course, I could be wrong, but then we need a description of the scenario details which causes such problems, including the control and data flow which make it possible. > > And once again, this is a separate issue, so let's discuss it > > separately. > > Okay. Let me just say again that I suspect there are bugs here which > can be triggered by both threads and signal handlers. We can leave it > at that. Yes, I agree that signals add a non-trivial complexity to this stuff, due to their asynchronous nature and their ability to run code on the "wrong" thread.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.