Package: emacs;
Reported by: Gemini Lasswell <gazally <at> runbox.com>
Date: Thu, 11 Oct 2018 05:32:01 UTC
Severity: normal
Tags: fixed
Found in version 26.1.50
Fixed in version 27.1
Done: Gemini Lasswell <gazally <at> runbox.com>
Bug is archived. No further changes may be made.
From: Gemini Lasswell <gazally <at> runbox.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 33014 <at> debbugs.gnu.org, schwab <at> linux-m68k.org
Subject: bug#33014: 26.1.50; 27.0.50; Fatal error after re-evaluating a thread's function
Date: Fri, 19 Oct 2018 12:32:32 -0700
[Message part 1 (text/plain, inline)]
Gemini Lasswell <gazally <at> runbox.com> writes:

> I set up a single-threaded situation where I could redefine a function
> while exec_byte_code was running it, and got a segfault. I've gained
> some insights from debugging this version of the bug which I will put
> into a separate email.

Here's a gdb transcript going through the single-threaded version of
this bug. In this transcript I use a file 'repro.el' which I've
attached to the end of this message, and is the same as the one in my
last message.

Start gdb with a breakpoint at Fredraw_display:

$ gdb --args ./emacs -Q
...
(gdb) b Fredraw_display
(gdb) r

In Emacs, find the file repro.el and load it with byte-compile-file,
then go back to *scratch* and run my-loop:

C-x C-f repro.el RET
C-u M-x byte-compile-file RET repro.el RET
C-x b RET
M-x my-loop RET

This gets me to the gdb prompt, at a point in execution where the next
function called will be my-loop-1, so I set a breakpoint in
funcall_lambda, where I can see the bytecode object for my-loop-1 (I
edited out the bytestring):

Thread 1 "emacs" hit Breakpoint 3, Fredraw_display () at dispnew.c:3027
3027	{
(gdb) br funcall_lambda
Breakpoint 4 at 0x5cdb00: file eval.c, line 3016.
(gdb) c
Continuing.

Thread 1 "emacs" hit Breakpoint 4, funcall_lambda (fun=XIL(0x31c0235),
    nargs=nargs@entry=0, arg_vector=arg_vector@entry=0x7fffffff01c0)
    at eval.c:3016
3016	{
(gdb) clear
Deleted breakpoint 4
(gdb) p fun
$1 = XIL(0x1630fc5)
(gdb) pr
#[0 "..." [my-var 0 "Now in recursive edit " recursive-edit format "Leaving recursive edit: %s " (a b c d e) message "foo: %s" last 1 "bar: %s" 2 "baz: %s" "bop: %s" mod 3] 6]

Then I skip ahead into exec-byte-code:

(gdb) br exec_byte_code
Breakpoint 5 at 0x611bb0: file bytecode.c, line 342.
(gdb) c
Continuing.

Thread 1 "emacs" hit Breakpoint 5, exec_byte_code (bytestr=XIL(0x3571d24),
    vector=XIL(0x31c0195), maxdepth=make_number(4),
    args_template=args_template@entry=XIL(0), nargs=nargs@entry=0,
    args=args@entry=0x0) at bytecode.c:342
342	{

Here's what's in the register $rbp, and the constants vector:

(gdb) clear
Deleted breakpoint 5
(gdb) p $rbp
$2 = (void *) 0xb0201
(gdb) pr
#<INVALID_LISP_OBJECT 0x000b0201>
(gdb) p vector
$3 = XIL(0x1630f35)
(gdb) pr
[my-var 0 "Now in recursive edit " recursive-edit format "Leaving recursive edit: %s " (a b c d e) message "foo: %s" last 1 "bar: %s" 2 "baz: %s" "bop: %s" mod 3]

Skip ahead, to get to where exec_byte_code has a value for vectorp:

(gdb) n 12
366	  USE_SAFE_ALLOCA;
(gdb) p vectorp
$4 = (Lisp_Object *) 0x1630f38 <bss_sbrk_buffer+9164248>
(gdb) p *vectorp
$5 = XIL(0x2327d80)
(gdb) pr
my-var
(gdb) break mark_vectorlike if ptr->contents == $4
Breakpoint 6 at 0x5ad400: file alloc.c, line 6036.
(gdb) c
Continuing.

The idea is to break when garbage collection finds the constants
vector. (I first tried setting a conditional breakpoint in
mark_object, which made garbage collection either hang or take more
time than I had patience for.)

In Emacs type C-x b RET. This causes a gc and a breakpoint hit:

Thread 1 "emacs" hit Breakpoint 6, mark_vectorlike (ptr=0x31c0190)
    at alloc.c:6036
6036	  eassert (!VECTOR_MARKED_P (ptr));
(gdb) bt 20
#0  mark_vectorlike (ptr=0x1630f30 <bss_sbrk_buffer+9164240>)
    at alloc.c:6036
#1  0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#2  0x00000000005ad45e in mark_vectorlike (
    ptr=0x1611fd0 <bss_sbrk_buffer+9037424>) at alloc.c:6046
#3  0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#4  0x00000000005acdf4 in mark_object (arg=...) at alloc.c:6477
#5  0x00000000005acae4 in mark_object (arg=...) at alloc.c:6434
#6  0x00000000005ad45e in mark_vectorlike (
    ptr=0x15a8e00 <bss_sbrk_buffer+8606880>) at alloc.c:6046
#7  0x00000000005ad45e in mark_vectorlike (
    ptr=0x15a9c30 <bss_sbrk_buffer+8610512>) at alloc.c:6046
#8  0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#9  0x00000000005ad45e in mark_vectorlike (
    ptr=0x15a7c30 <bss_sbrk_buffer+8602320>) at alloc.c:6046
#10 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#11 0x00000000005ad45e in mark_vectorlike (
    ptr=0x15a6e80 <bss_sbrk_buffer+8598816>) at alloc.c:6046
#12 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#13 0x00000000005acdf4 in mark_object (arg=...) at alloc.c:6477
#14 0x00000000005acaa5 in mark_object (arg=...) at alloc.c:6431
#15 0x00000000005ad45e in mark_vectorlike (
    ptr=0x15fbed0 <bss_sbrk_buffer+8947056>) at alloc.c:6046
#16 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#17 0x00000000005ad45e in mark_vectorlike (
    ptr=0x15fbf50 <bss_sbrk_buffer+8947184>) at alloc.c:6046
#18 0x00000000005aca9c in mark_object (arg=...) at alloc.c:6430
#19 0x00000000005ad45e in mark_vectorlike (
    ptr=0x15fcc80 <bss_sbrk_buffer+8950560>) at alloc.c:6046
(More stack frames follow...)

Lisp Backtrace:
"Automatic GC" (0x0)
"eldoc-pre-command-refresh-echo-area" (0xfffefbb0)
"recursive-edit" (0xfffeffd8)
"my-loop-1" (0xffff0250)
"my-loop" (0xffff0650)
"funcall-interactively" (0xffff0648)
"call-interactively" (0xffff07d0)
"command-execute" (0xffff0ab8)
"execute-extended-command" (0xffff0ea0)
"funcall-interactively" (0xffff0e98)
"call-interactively" (0xffff11d0)
"command-execute" (0xffff1488)

There are 279 frames in the backtrace, and mark_stack and mark_memory
aren't there. So I'm guessing the constants vector is getting found
via the function definition of 'my-loop-1'. Keep going:

(gdb) c
Continuing.

Now in Emacs do this:

M-x eval-buffer RET
C-x b RET
M-x my-gc RET

Execution does not stop at the breakpoint. In Emacs type C-M-c.
Result:

Thread 1 "emacs" received signal SIGSEGV, Segmentation fault.
0x00000000005bca1b in styled_format (nargs=2, args=0x7ffffffeffd8,
    message=<optimized out>) at editfns.c:3129
3129	      unsigned char format_char = *format++;

What's happened to the constants vector and its contents?

(gdb) p $3
$6 = XIL(0x1630f35)
(gdb) pr
#<INVALID_LISP_OBJECT 0x01630f35>
(gdb) p *$4
$7 = XIL(0x2327d80)
(gdb) pr
my-var
(gdb) p *($4+5)
$8 = XIL(0x359a6f4)
(gdb) pr
#<INVALID_LISP_OBJECT 0x0359a6f4>
(gdb) p *($4+4)
$9 = XIL(0x6390)
(gdb) pr
format

Looks like the constants vector was freed, and its contents haven't
been overwritten (yet) but the format string has been freed leading to
the crash in styled_format.

While I was developing this method of reproducing this bug, I went
through this exercise without lexical-binding set in repro.el. In
that version, the register $rbp when exec_byte_code is called contains
the bytecode Lisp_Object (instead of the non-Lisp-object value it
contains in the transcript above), and the first thing exec_byte_code
does is save it on the stack (presumably because the System V AMD64 ABI
calling convention says that called functions which use $rbp should
save and restore it). Here's the beginning of the disassembly of
exec_byte_code from "objdump -S bytecode.o":

0000000000000020 <exec_byte_code>:
   executing BYTESTR.  */

Lisp_Object
exec_byte_code (Lisp_Object bytestr, Lisp_Object vector, Lisp_Object maxdepth,
		Lisp_Object args_template, ptrdiff_t nargs, Lisp_Object *args)
{
  20:	55                   	push   %rbp
  21:	48 89 e5             	mov    %rsp,%rbp
  24:	41 57                	push   %r15
  26:	41 56                	push   %r14
  28:	41 55                	push   %r13
  2a:	41 54                	push   %r12
  2c:	49 89 ce             	mov    %rcx,%r14
  2f:	53                   	push   %rbx

So in the non-lexical-binding case the bytecode Lisp_Object is written
to the stack by the first instruction in exec_byte_code, and then
during the execution of 'my-gc' the breakpoint in mark_vectorlike stops
at a point with a much shorter backtrace which includes mark_stack and
mark_memory, and mark_memory's pp is pointing to the location on the
stack where $rbp was written. The bytecode object and constants vector
are consequently not freed, and no segfault happens.

I don't follow everything going on in the disassembly of
funcall_lambda, but I did figure out (by comparison with a debug
session in the multithreaded situation) that the different values in
$rbp when funcall_lambda calls exec_byte_code depend on the different
code paths following the test of whether the first element of the
bytecode object vector (the "args template" as funcall_lambda's comment
calls it) is an integer, which in turn depends on whether my-loop-1 was
compiled with lexical-binding on.

Here is 'repro.el':
[repro.el (text/plain, inline)]
;;; -*- lexical-binding: t -*-

(defvar my-var "ok")

(defun my-loop-1 ()
  (let ((val 0))
    (while t
      (insert "Now in recursive edit\n")
      (recursive-edit)
      (insert (format "Leaving recursive edit: %s\n" my-var))
      (let ((things '(a b c d e)))
        (cond ;
         ((= val 0) (message "foo: %s" (last things)))
         ((= val 1) (message "bar: %s" things))
         ((= val 2) (message "baz: %s" (car things)))
         (t (message "bop: %s" (nth 2 things))))
        (setq val (mod (1+ val) 3))))))

(defun my-loop ()
  (interactive)
  (redraw-display)
  (my-loop-1))

(defun my-gc-1 ()
  (garbage-collect))

(defun my-gc ()
  (interactive)
  (my-gc-1))

(provide 'repro)
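
The failure described in the message above can be pictured with a toy
model in C. This is only a sketch under stated assumptions, not Emacs
code: every name in it (toy_object, toy_gc, toy_collect,
toy_survives_redefinition) is hypothetical, and the "stack" is a plain
array standing in for the words a conservative stack scan would look
at. It shows an object that is reachable only through a function cell
being freed once the function is redefined, unless a copy of the
pointer happens to have been saved on the scanned stack.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a bytecode object or its constants vector. */
struct toy_object {
    bool marked;
    bool freed;
};

/* Hypothetical collector roots: one function cell plus a scanned stack. */
struct toy_gc {
    struct toy_object *function_cell; /* like the symbol's function definition */
    struct toy_object *stack[4];      /* words a conservative scan would see */
    size_t stack_depth;
};

static void toy_mark(struct toy_object *obj)
{
    if (obj)
        obj->marked = true;
}

/* Mark everything reachable from the roots, then "free" the rest. */
void toy_collect(struct toy_gc *gc, struct toy_object **heap, size_t n)
{
    for (size_t i = 0; i < n; i++)
        heap[i]->marked = false;
    toy_mark(gc->function_cell);
    for (size_t i = 0; i < gc->stack_depth; i++)
        toy_mark(gc->stack[i]);
    for (size_t i = 0; i < n; i++)
        if (!heap[i]->marked)
            heap[i]->freed = true;
}

/* Returns true if the definition that is still "running" survives a
   collection that happens after the function has been redefined. */
bool toy_survives_redefinition(bool saved_on_stack)
{
    struct toy_object old_def = { false, false };
    struct toy_object new_def = { false, false };
    struct toy_object *heap[] = { &old_def, &new_def };
    struct toy_gc gc = { .function_cell = &old_def, .stack_depth = 0 };

    if (saved_on_stack)
        gc.stack[gc.stack_depth++] = &old_def; /* e.g. a saved register */

    /* First collection: old_def is found through the function cell. */
    toy_collect(&gc, heap, 2);
    assert(!old_def.freed);

    /* Redefinition replaces the function cell, as eval-buffer does. */
    gc.function_cell = &new_def;

    /* Second collection: old_def survives only via the stack word. */
    toy_collect(&gc, heap, 2);
    return !old_def.freed;
}
```

In this model, toy_survives_redefinition(false) plays the role of the
lexical-binding transcript, where nothing on the stack refers to the
running function's object, and toy_survives_redefinition(true) plays
the role of the non-lexical-binding case, where the saved $rbp kept it
alive. The real collector is far more involved; the model only captures
the reachability argument.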