Package: emacs;
Reported by: Dima Kogan <dima <at> secretsauce.net>
Date: Wed, 17 Jul 2024 20:58:01 UTC
Severity: normal
Found in version 31.0.50
Done: Dima Kogan <dima <at> secretsauce.net>
Bug is archived. No further changes may be made.
Message #17 received at 72165 <at> debbugs.gnu.org:
From: Dima Kogan <dima <at> secretsauce.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 72165 <at> debbugs.gnu.org
Subject: Re: bug#72165: 31.0.50; Intermittent crashing with recent emacs build
Date: Thu, 18 Jul 2024 00:25:14 -0700

Thank you very much for replying, Eli.

> So when you say that "anecdotally, the 2024/04/30 build has been very
> stable", what exactly do you mean? It sounds like both that build and
> the one from 2024/07/09 crash in the same way, so why do you consider
> the April one "very stable"?

Sorry, I wasn't clear. I've been using the April build for many months,
and haven't seen any crashing at all until today. Today I tried to debug
the mu4e modeline problem, and saw it crash. Then I updated to the
latest build (2024/07/09) hoping it would be fixed, and kept seeing
crashing, as I continued to debug. So whatever the problem is, it
started in April or earlier.

Here're some notes about the mu4e problem that looks correlated with
this crash, maybe. I'm hazy on the details here, so there's no bug
report yet, but I've at least pinpointed the mechanism.

truncate-string-to-width() in international/mule-util.el has

  (condition-case nil
      (while (< column end-column)
        (setq last-column column
              last-idx idx
              ch (aref str idx)
              column (+ column (char-width ch))
              idx (1+ idx)))
    (args-out-of-range (setq idx str-len)))

The intent was that we might have idx >= length(str), so (aref str idx)
would signal args-out-of-range, and the condition-case would catch it.
But this is reliably not happening under some (probably over-specified)
conditions:

- mu4e is running, with multiple mail contexts; it shows the selected
  context in its modeline, which eventually calls
  truncate-string-to-width()

- I have some remote file opened with TRAMP

- I run (shell-command) from the remote buffer

In at least this scenario, args-out-of-range errors from the above
(aref ...) are uncaught (100% of the time with my config), and appear in
the *Messages* buffer.

I was debugging this by tweaking and re-evaluating my local copy of
truncate-string-to-width() and other related functions in the *scratch*
buffer, while looking at the *Messages* buffer in another window. Will
get back to this in a sec.

Here's what I see in the core dump:

  (gdb) p current_thread->m_current_buffer->text->z
  $22 = 32192

  (gdb) p current_thread->m_current_buffer->text->z_byte
  $23 = 32178

  (gdb) p current_thread->m_current_buffer->pt
  $24 = 32192

  (gdb) p current_thread->m_current_buffer->pt_byte
  $25 = 32178

So that tells me that the failing condition isn't the one gdb flagged,
but the one immediately after:

  if (BYTEPOS (opoint) < CHARPOS (opoint))
    emacs_abort ();

The compiler optimizations could be responsible for the discrepancy. Am
I understanding correctly that this check makes sure that
BYTEPOS >= CHARPOS, which must always be true because
sizeof(emacs character) is always >= 1 byte?

The buffer name:

  (gdb) p current_thread->m_current_buffer->name_
  $26 = XIL(0x7fc685b24c1c)

  (gdb) xstring
  $27 = (struct Lisp_String *) 0x7fc685b24c18
  "*Messages*"

I confirm that the text is our own text:

  (gdb) p &current_thread->m_current_buffer->own_text
  $43 = (struct buffer_text *) 0x7fc685a107e0

  (gdb) p current_thread->m_current_buffer->text
  $44 = (struct buffer_text *) 0x7fc685a107e0

The full structure:

  (gdb) p current_thread->m_current_buffer->own_text
  $45 = {
    beg = 0x561d7100f800 ...
    z = 32192,
    z_byte = 32178,
    gpt = 32191,
    gpt_byte = 32177,
    gap_size = 1313,
    modiff = 69879,
    chars_modiff = 69879,
    save_modiff = 1,
    overlay_modiff = 10,
    compact = 53392,
    beg_unchanged = 0,
    end_unchanged = 0,
    unchanged_modified = 69373,
    overlay_unchanged_modified = 6,
    intervals = 0x0,
    markers = 0x561d6da79bc0,
    inhibit_shrinking = false,
    redisplay = true
  }

Looks like gpt and gpt_byte have a similar inconsistency as z and
z_byte.
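
The question above about BYTEPOS >= CHARPOS can be illustrated outside
Emacs. The following is a minimal sketch in C, assuming a UTF-8-style
encoding in which every character occupies at least one byte. The names
count_chars and position_is_consistent are hypothetical helpers written
for this illustration; they are not Emacs functions, and the real check
operates on marker positions rather than plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not an Emacs function: count the characters in
   NBYTES bytes of UTF-8-style text by skipping continuation bytes
   (those of the form 10xxxxxx).  Each character contributes exactly
   one non-continuation byte, so the result never exceeds NBYTES.  */
size_t
count_chars (const unsigned char *s, size_t nbytes)
{
  assert (s != NULL);
  size_t nchars = 0;
  for (size_t i = 0; i < nbytes; i++)
    if ((s[i] & 0xC0) != 0x80)
      nchars++;
  return nchars;
}

/* Hypothetical stand-in for the quoted check: a (charpos, bytepos)
   pair is plausible only if the byte position is not smaller than
   the character position.  */
bool
position_is_consistent (long charpos, long bytepos)
{
  return bytepos >= charpos;
}
```

With the values from the core dump, position_is_consistent (32192,
32178) is false: the buffer claims more characters than bytes, which no
encoding of this kind can produce.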

Looking at the definitions in buffer.h, I guess the above means that the
gap starts at gpt_byte-1 = 32176

Let's look at the last bit of the buffer:

  (gdb) printf "%.2200s\n", &current_thread->m_current_buffer->text->beg[30000]
  share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-mime-parts.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-modeline.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-notification.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-obsolete.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-org.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-pkg.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-query-items.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-search.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-server.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-speedbar.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-thread.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-update.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-vars.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-view.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-window.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e.el
  Checking /usr/share/emacs/site-lisp/elpa/mu4e-1.12.5/mu4e-actions.el
  0 matching files marked
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [3 times]
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range #("<fastmail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range #("<fastmail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  QuitError during redisplay: (eval (mu4e--modeline-string) t)
  signaled (args-out-of-range "" 0)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [2 times]
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range #("<fastmail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]

This particular print ("Error during redisplay") happens (I think) when
I removed the (condition-case ...) stuff above to let the (aref ...)
fail. It wouldn't crash most of the time. Also I'm not at all confident
that this is the only scenario where it crashed, but maybe.

Let's look just at the last little bit, to count the bytes:

  (gdb) printf "%.200s\n", &current_thread->m_current_buffer->text->beg[32000]
  mail>" 1 9 (face mu4e-context-face help-echo "mu4e context: fastmail")) 10)
  Error during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0) [5 times]

I asked for at most 200 bytes (up to byte 32200). I got exactly 176
bytes, so the text ends where the gap supposedly begins. That makes
sense. Let's look a bit past the end, INTO the gap

  (gdb) x /3cb &current_thread->m_current_buffer->text->beg[32176]
  0x561d710175b0: 0 '\000' 0 '\000' 114 'r'

So we have two trailing \0 bytes. Past them:

  (gdb) printf "%.200s\n", &current_thread->m_current_buffer->text->beg[32178]
  rror during redisplay: (eval (mu4e--modeline-string) t) signaled (args-out-of-range "" 0)

Theory: there's a race condition between error handling that ends up
writing to *Messages* and the logic that aggregates duplicated messages
into things like [5 times]. People usually don't have lots of errors
happening, and they usually don't stare at the *Messages* buffer, so
this is easily missed.

Anything more you would suggest? I saw the crashing once every 20 min
maybe, so reproducing it is probably possible, but not very quick and
easy.

Does it make sense to try to fix the (condition-case) problem first,
since that's easily reproducible?

Thank you
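
For readers following the raw beg[...] reads in this message, here is a
minimal sketch in C of how a 1-based byte position maps onto gap-buffer
storage. It is a simplified model under stated assumptions: struct
gap_text and byte_pos_addr are hypothetical names for this illustration
(the field names merely echo the structure printed in the message), and
the real definitions in buffer.h carry more state than this.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical model of gap-buffer text.  This is not the
   Emacs definition; it keeps only the fields needed for addressing.  */
struct gap_text
{
  unsigned char *beg;   /* start of the allocated storage */
  ptrdiff_t gpt_byte;   /* 1-based byte position where the gap starts */
  ptrdiff_t gap_size;   /* size of the gap, in bytes */
  ptrdiff_t z_byte;     /* 1-based byte position just past the text */
};

/* Map a 1-based byte position to an address in storage.  Positions at
   or after the gap are shifted by the gap size, so the gap itself is
   never addressed through this function.  */
unsigned char *
byte_pos_addr (const struct gap_text *t, ptrdiff_t bytepos)
{
  assert (1 <= bytepos && bytepos <= t->z_byte);
  ptrdiff_t offset = bytepos - 1;
  if (bytepos >= t->gpt_byte)
    offset += t->gap_size;
  return t->beg + offset;
}
```

In this model the first gap byte sits at storage offset gpt_byte - 1,
which matches the estimate of 32176 for gpt_byte = 32177. Indexing beg
directly at that offset, as the gdb commands do, therefore reads gap
contents rather than buffer text.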