Package: emacs;
Reported by: Jorgen Schaefer <contact <at> jorgenschaefer.de>
Date: Fri, 23 May 2014 15:54:02 UTC
Severity: normal
Tags: moreinfo
Found in version 24.4.50
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Message #23 received at 17561 <at> debbugs.gnu.org (full text, mbox):
From: Jorgen Schaefer <forcer <at> forcix.cx> To: Paul Eggert <eggert <at> cs.ucla.edu> Cc: 17561 <at> debbugs.gnu.org Subject: Re: Emacs can forget processes Date: Mon, 26 May 2014 20:49:52 +0200
[Message part 1 (text/plain, inline)]
On Mon, 26 May 2014 10:08:38 -0700 Paul Eggert <eggert <at> cs.ucla.edu> wrote: > Jorgen Schaefer wrote: > > The trace starts with a number of these triplets, which seem to be > > "Emacs behaving normally". > > > > 16300 09:41:27.072717 pselect6(20, [3 4 5 6 8 10 14 15 19], [], > > NULL, {0, 176410474}, {NULL, 8}) = 0 (Timeout) <0.176635> 16300 > > 09:41:27.249649 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 > > <0.000011> 16300 09:41:27.249881 rt_sigprocmask(SIG_UNBLOCK, [WINCH > > IO], NULL, 8) = 0 <0.000010> > > I don't observe this behavior when running Emacs on Fedora 20 (this > is the latest emacs-24 version, running your little test). I ran: > > strace -o /tmp/tr -fttT src/bootstrap-emacs -nw -Q > > I did see this: > > 29589 09:44:10.979952 ioctl(4, FIONREAD, [0]) = 0 <0.000023> > 29589 09:44:10.980031 ioctl(4, FIONREAD, [0]) = 0 <0.000024> > 29589 09:44:10.980143 pselect6(6, [4 5], [], NULL, {0, 499810466}, > {NULL, 8}) = 0 (Timeout) <0.500499> > 29589 09:44:11.480745 poll([{fd=5, events=POLLIN}], 1, 0) = 0 > (Timeout) <0.000030> > 29589 09:44:11.480861 read(5, 0x7fff777b3820, 16) = -1 EAGAIN > (Resource temporarily unavailable) <0.000029> > 29589 09:44:11.481214 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) > = 0 <0.000030> > 29589 09:44:11.481505 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, > 8) = 0 <0.000027> > 29589 09:44:11.481594 ioctl(4, FIONREAD, [0]) = 0 <0.000030> > > but that's not really the same. Could you please try running emacs > -nw -Q with your test case, and see whether it behaves like the > pattern on my platform? If so, we might try to investigate why Emacs > changes from this pattern to the pattern that you observe. In "normal mode" (with no processes running), Emacs simply hangs in a pselect6 with a "pretty long timeout". I have attached the full traceback of the emacs -Q session. > > Then there's a large bunch of syscalls related to my command to > > restart the process, with the "\r" now being me sending the M-x > > command: > > Which M-x command was that? M-x list-processes? M-x elpy-rpc-restart https://github.com/jorgenschaefer/elpy/blob/master/elpy.el#L1654-1662 The call to `kill-process' was added only recently as part of debugging this problem, the bug showed up with only the `kill-buffer' call, too. The Elpy RPC processes are where I notice this bug the most, as there are often quite a few of them, and I kill them regularly during development, but I have seen this with e.g. openssl processes, too. > > 16300 09:41:28.298391 read(3, "\r", 1) = 1 <0.000012> > > 16300 09:41:28.298438 ioctl(3, FIONREAD, [0]) = 0 <0.000009> > > 16300 09:41:28.298480 ioctl(3, FIONREAD, [0]) = 0 <0.000009> > > 16300 09:41:28.312476 write(3, "\r", 1) = 1 <0.000021> > > 16300 09:41:28.317392 kill(4294953129, SIGHUP) = 0 <0.002235> > > 16300 09:41:28.321642 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, > > 8) = 0 <0.000017> 16300 09:41:28.321841 write(3, "\33[K\33[H\n\n", > > 8) = 8 <0.000018> 16300 09:41:28.321909 rt_sigprocmask(SIG_UNBLOCK, > > [WINCH IO], NULL, 8) = 0 <0.000012> > > When I try your test, the child process is running in parallel with > the parent, the 'kill' terminates the child, and the parent is > signaled. In contrast your trace shows no child, leading me to guess > that the child has already exited (so the parent is killing a > zombie), which means it's not the test case you sent but some other > process (since the test case you sent waits on a pty so the child > shouldn't exit). That was for the already-existing process that exhibited the problem. I wasn't able to reproduce it with the exact same code I sent this time. Sorry, I should have been more clear about this. > > Killing the process via the process list (hence the "d"): > > You typed "d" in the window generated by M-x list-processes? When I > try that, it says "d is undefined". "d runs the command process-menu-delete-process" This is on a recent trunk version. It seems this was added after the 24.4 branch. (As I can't reliably reproduce this, I have to use whichever Emacs I happen to be running at that point.) I have also attached a file the-bug.txt.gz, which shows the strace of a running Emacs session which starts to exhibit the bug, so we now have a trace that includes the time when the bug actually happens. It shows me starting Python subprocesses (elpy RPC processes) and killing them, in an attempt to reproduce the problem. Eventually, this succeeds, and Emacs goes into the loop reading from fd 15. I snipped a few megabytes of logs there (marked "[...]"). Interestingly, I did not kill anything there, I ran my "reproduction case", and Emacs cleaned up the broken process by itself. If I read this correctly, the "broken" process in question is 27012 (16300 is emacs). It gets killed at some point, right after a RET is read (the elpy-rpc-restart), and goes straight into the loop: 16300 20:01:25.347174 kill(4294940284, SIGKILL <unfinished ...> 27012 20:01:25.349483 +++ killed by SIGKILL +++ 16300 20:01:25.349515 <... kill resumed> ) = 0 <0.002304> 16300 20:01:25.349649 kill(4294940284, SIGHUP) = 0 <0.000019> 16300 20:01:25.354299 rt_sigprocmask(SIG_BLOCK, [WINCH IO], NULL, 8) = 0 <0.0000 16300 20:01:25.354585 write(3, "\33[K\33[16;1H", 10) = 10 <0.000025> 16300 20:01:25.354687 rt_sigprocmask(SIG_UNBLOCK, [WINCH IO], NULL, 8) = 0 <0.00 16300 20:01:25.354753 --- SIGIO (I/O possible) @ 0 (0) --- 16300 20:01:25.354789 rt_sigreturn(0x1d) = 0 <0.000016> 16300 20:01:25.354862 ioctl(3, FIONREAD, [0]) = 0 <0.000017> 16300 20:01:25.355015 ioctl(3, FIONREAD, [0]) = 0 <0.000017> 16300 20:01:25.355098 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 670925 16300 20:01:25.355222 read(15, "", 4096) = 0 <0.000014> 16300 20:01:25.355455 ioctl(3, FIONREAD, [0]) = 0 <0.000017> 16300 20:01:25.355541 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 626503 16300 20:01:25.355642 read(15, "", 4096) = 0 <0.000015> 16300 20:01:25.355710 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 609628 16300 20:01:25.355804 read(15, "", 4096) = 0 <0.000015> 16300 20:01:25.355869 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 593577 16300 20:01:25.355962 read(15, "", 4096) = 0 <0.000014> 16300 20:01:25.356027 pselect6(20, [3 4 5 6 8 10 14 15 19], [], NULL, {0, 577782 16300 20:01:25.356156 read(15, "", 4096) = 0 <0.000015> Comparing this with the "normal" code, I do not see a SIGCHLD. Emacs does a SIGKILL (delete-process) followed by a SIGHUP (from killing the buffer?), but there's no SIGCHLD. The last rt_sigprocmask with SIG_SETMASK before that sets QUIT ALRM PROF, so that does not seem to be the culprit. Strange stuff. (And I see you noticed the discussion emacs-devel about epa :-)) Regards, Jorgen
[emacs-Q-trace.txt (text/plain, attachment)]
[the-bug.txt.gz (application/gzip, attachment)]
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.