Package: emacs;
Reported by: Spencer Baugh <sbaugh <at> janestreet.com>
Date: Thu, 28 Aug 2025 19:46:02 UTC
Severity: normal
Found in version 31.0.50
Message #14 received at 79333 <at> debbugs.gnu.org (full text, mbox):
From: Spencer Baugh <sbaugh <at> janestreet.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: dmitry <at> gutov.dev, 79333 <at> debbugs.gnu.org Subject: Re: bug#79333: 31.0.50; Processes (still) aren't actually locked to threads Date: Fri, 29 Aug 2025 08:55:41 -0400
Eli Zaretskii <eliz <at> gnu.org> writes: >> Cc: Eli Zaretskii <eliz <at> gnu.org>, Dmitry Gutov <dmitry <at> gutov.dev> >> From: Spencer Baugh <sbaugh <at> janestreet.com> >> Date: Thu, 28 Aug 2025 15:45:18 -0400 >> >> >> 1. emacs -Q >> 2. Eval this: >> >> (define-advice shell-command-sentinel (:before (process signal)) >> (message "process thread: %s, current thread: %s" >> (process-thread process) >> (current-thread))) >> (make-thread >> (lambda () >> (async-shell-command "sleep 2") >> ;; Do nothing. >> (thread-join (make-thread (lambda () (while t (sit-for 1))))))) >> (sit-for 3) >> >> 3. It hangs due to some thread bug, independent of the main bug I'm >> reporting here. Just hit C-g. (Notably, it doesn't hang when run in >> emacs --batch) >> >> 4. Observe in *Messages* a message like this: >> >> process thread: #<thread 0x3a25bf60>, current thread: #<thread 0x7f87c009e920> >> >> The sentinel ran in a thread which is not process-thread. > > I believe that's because of the way we process SIGCHLD on Posix > systems: we write to a special file descriptor to wake pselect. That > file descriptor is used by all the subprocesses, so it cannot be made > thread-specific. See child_signal_init and child_signal_notify. > > Since this descriptor is shared by all the subprocesses, the sentinel > on Posix systems can run in the context of some random thread that > succeeds to grab the global lock after it returns from pselect. > > Making sentinels run in the context of the thread that started the > process would require to redesign this part of Emacs. Until then, we > will need to document this subtlety. Makes sense. >> So even after c93be71e45, processes aren't actually locked to threads. > > Only the sentinel is not locked. The filter can also be run in a different thread, because it's run for any remaining output at the time the process terminates. This code demonstrates that: (define-advice shell-command-sentinel (:before (process signal)) (message "process thread: %s, current thread: %s" (process-thread process) (current-thread))) (define-advice comint-output-filter (:before (process string)) (message "filter: %s, current thread: %s" (process-thread process) (current-thread))) (make-thread (lambda () (async-shell-command "sleep 2 && echo hi") ;; Do nothing. (thread-join (make-thread (lambda () (while t (sit-for 1))))))) (sit-for 3) So, the following are unlocked: - calls into the filter triggered by process state changes - calls into the sentinel And the following are locked: - other calls into the filter Seems hard to document. >> (Once again, I think we should take the opportunity here to delete the >> code for locking processes to threads, since IMO it is not useful, and >> it is still broken) > > I did explain why it is useful, and you haven't brought up any > arguments to the contrary. And if "broken" means that the sentinel > can run in the context of a random thread, then how come you are > asking to leave it in this broken state? I was working on an example to demonstrate how process locking can cause problems for unrelated Lisp code when I found this bug. Here's a finished example of the problem. Suppose I have the following Lisp program which doesn't use threads: (run-at-time .1 nil #'async-shell-command "sleep 1 && echo foobar && sleep inf") (with-current-buffer (get-buffer-create shell-command-buffer-name-async) (while (string-empty-p (buffer-string)) (message "waiting for some process output") (sit-for 1)) (message "buffer contents: %s" (buffer-string))) This is intended to represent some arbitrary package which calls make-process in a timer or a hook. The command "sleep 1 && echo foobar && sleep inf" is chosen to represent some interactive executable like a shell or REPL. This code runs just fine, with the output appearing in the buffer as expected. Now suppose I have some other unrelated package which runs the following code using threads: (make-thread (lambda () (accept-process-output nil 1) (thread-join (make-thread (lambda () (while t (message "doing work") (sit-for 10))))))) This is intended to represent thread 1 waiting for some other thread 2 to complete some long-running task. Crucially, thread 1 is blocked in thread-join, which doesn't run wait_reading_process_output, so thread 1 won't read output from any locked processes. Running the second piece of code hangs the first piece of code, even though neither of them are buggy, and they aren't visibly interacting, and they can be written by totally different authors. Specifically: The timer can be run in the thread created by the second piece of code, and then the process will be created locked to that thread. Then the process's output will never be read by thread 1. So the first piece of code will hang. Here's a complete example, with some logging. (Note that the problem doesn't occur on every run because it's a race condition.) (define-advice start-process (:filter-return (proc)) (message "start-process: %s" (process-thread proc)) proc) (run-at-time .5 nil #'async-shell-command "sleep 1 && echo foobar && sleep inf") (make-thread (lambda () (accept-process-output nil 1) (thread-join (make-thread (lambda () (while t (message "doing work") (sit-for 10))))))) (with-current-buffer (get-buffer-create shell-command-buffer-name-async) (while (string-empty-p (buffer-string)) (message "waiting for some process output") (sit-for 1)) (message "buffer contents: %s" (buffer-string))) This isn't just an academic problem. Most Lisp programs that start processes will run into this problem if there are any Lisp threads running. One way to fix this would be to make it so that thread-join and condition-wait both call wait_reading_process_output. Then there wouldn't be a way to block a thread without calling wait_reading_process_output. Unfortunately, that's difficult to do in a portable way, because we would need to integrate waiting for a condition variable into the wait_reading_process_output event loop, which is impossible on GNU/Linux. Another way to fix this would be to make timers run only in the thread which started them. However, this is insufficient, because the same problem can occur with hooks. Any hook can be run by an unrelated thread, and processes started in that hook may hang if the thread is doing work which doesn't involve calling wait_reading_process_output. As far as I can tell, the only possible fix for this problem is to not lock processes to threads. This problem seems worse than the problems prevented by locking processes to threads, so I think this is the right fix. (Especially because, as the original bug demonstrates, we aren't fully locking processes to threads, neither sentinels nor filters, so we aren't actually getting the benefits of that locking, only the costs)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.