Package: emacs
Reported by: George P <georgepanagopo <at> gmail.com>
Date: Thu, 15 May 2025 18:46:01 UTC
Severity: normal
Found in version 30.1
From: Pip Cet <pipcet <at> protonmail.com>
To: George P <georgepanagopo <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Andrea Corallo <acorallo <at> gnu.org>, 78444 <at> debbugs.gnu.org
Subject: bug#78444: 30.1; Crash in GC (vector_marked_p)
Date: Tue, 20 May 2025 15:55:46 +0000

"George P" <georgepanagopo <at> gmail.com> writes:

> Here they are:

Thanks again!

The good news is I found *a* nativecomp GC bug, but I'm not convinced
it's related to the one you've been seeing.  It works only with the
right optimization options (-O2 here), because without optimization
conservative GC will save the day, but that matches your build.

Create a file like this:

;;; -*- lexical-binding: t; -*-
(defun f ()
  (fmakunbound 'f)
  (garbage-collect)
  (lambda () nil))

then execute:

(prog1 nil (native-elisp-load (native-compile "file.el")))
(funcall 'f)

What will happen is that while execution is in 'f', there is no
reference to the subr f (the backtrace only keeps alive the symbol f,
not the subr f, and funcall_general is tail recursive so its stack
variables are out of scope by the time we call garbage-collect).

That means the compilation unit may get unloaded while we're still in
'f', and then we crash trying to return to it after garbage collection.

I *think* that if there were a second handle keeping alive the dynlib
(but not the compilation unit), we would return an anonymous lambda
which would refer to the now-invalid (but still open) native compilation
unit.  This would also happen if dlclose failed to unload the library.

I've been able to simulate this by intercepting dynlib_close and
returning from it immediately (without actually closing the shared
object) and executing:

(prog1 nil (native-elisp-load (native-compile "file.el")))
(prog1 nil (setq x (funcall 'f)))

This succeeded, and when I tried to access 'x' I got a crash, as
expected.

At this point, the bug seems a little contrived, because it's unusual
for a function to fmakunbound the symbol it is bound to, but I think
that's pretty much what happens in some recursive load scenarios.

If the function redefines itself using (ultimately) defalias, the old
function value is sometimes, but not always, stored in the
function-history symbol property.  I don't believe this mechanism
reliably keeps alive the old function (and, thus, the compilation unit)
either, and it can be circumvented entirely by using Ffset directly.

Again, this is unlikely to be precisely what happened to George here,
but it indicates that relying on the current specpdl backtrace to keep
alive subrs that are being called doesn't work in all cases.  We should
probably keep the actual subr around, either in a (hidden?) backtrace
slot somewhere or in a local variable in funcall_subr which we ensure
cannot be optimized away (by tail recursion or otherwise).

Note that the second dynlib handle isn't required for a crash, but the
crashes we see without it should be immediate, upon returning from GC,
assuming dynlib_close actually synchronously unmaps the library.  I
think both the symbol being bound to something else and the second
handle might happen "naturally" during some recursive reload scenarios.

> (gdb) p *(char **)0x8f68108
> $1 = 0x2018b608 "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/lib/emacs/30.1/native-lisp/30.1-1ed0c1e8/cl-print-79bf9fb1-14d0e7d5.eln"
> (gdb) x/32gx 0x9e7980

Oops, I messed up there.  I meant x/32gx 0x98e7980, though it's no
longer necessary (we know it's a native compilation unit because it
appears in the right slot in the subr below).

> (gdb) p *(char **)0x8ae5588
>
> p *(char **)0x15554ec0ff78$2 = 0x2018b6b8
> "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/share/emacs/30.1/lisp/emacs-lisp/cl-print.elc"
> (gdb) p *(char **)0x15554ec0ff98
> $3 = 0x15554f56d92e "cl-print"
> (gdb) p *(char **)0x15554ec0ff78
> $4 = 0x15554f56d81f "Print OBJECT on STREAM according to its type.\nOutput is further controlled by the variables\n`cl-print-readably',
> `cl-print-compiled', along with output\nvariables for the standard printing functions.  "...

That's the beginning of the function-history of the cl-prin1 symbol
(while symbols mark the strings containing their names, they do not put
the marked string in last_marked, so we have to deduce this from the
docstring).

> (gdb) x/32gx 0x38294c8
> 0x38294c8:	0xc00000001200a000	0x000015553389a350
> 0x38294d8:	0x0000000000020001	0x00000000088417d0
> 0x38294e8:	0x0000000000000000	0x0000000000000000
> 0x38294f8:	0x0000000000000025	0x00000000098e7985
> 0x3829508:	0x000000000e754ab0	0x0000000000000000
> 0x3829518:	0x000000001f6463e3	0x400000001200a000

that appears to be the cl-prin1 definition (minargs 1, maxargs 2, native
comp unit 0x98e7985).

How reproducible did you say this crash was?  Unless we can come up with
a convincing alternative explanation, it may be worth it to fix
funcall_subr to avoid this particular bug, and keep running Emacs in gdb
until it crashes again...

diff --git a/src/eval.c b/src/eval.c
index caae4cb17e2..274319d1196 100644
--- a/src/eval.c
+++ b/src/eval.c
@@ -3135,6 +3135,9 @@ safe_eval (Lisp_Object sexp)
 Lisp_Object
 funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
 {
+  volatile Lisp_Object keepalive;
+  XSETSUBR (keepalive, subr);
+  Lisp_Object ret;
   eassume (numargs >= 0);
   if (numargs >= subr->min_args)
     {
@@ -3156,32 +3159,46 @@ funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
           switch (maxargs)
             {
             case 0:
-              return subr->function.a0 ();
+              ret = subr->function.a0 ();
+              break;
             case 1:
-              return subr->function.a1 (a[0]);
+              ret = subr->function.a1 (a[0]);
+              break;
             case 2:
-              return subr->function.a2 (a[0], a[1]);
+              ret = subr->function.a2 (a[0], a[1]);
+              break;
             case 3:
-              return subr->function.a3 (a[0], a[1], a[2]);
+              ret = subr->function.a3 (a[0], a[1], a[2]);
+              break;
             case 4:
-              return subr->function.a4 (a[0], a[1], a[2], a[3]);
+              ret = subr->function.a4 (a[0], a[1], a[2], a[3]);
+              break;
             case 5:
-              return subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
+              ret = subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
+              break;
             case 6:
-              return subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
+              ret = subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
+              break;
             case 7:
-              return subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
+              ret = subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
                                         a[6]);
+              break;
             case 8:
-              return subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
-                                        a[6], a[7]);
+              ret = subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
+                                       a[6], a[7]);
+              break;
             }
-          eassume (false);      /* In case the compiler is too stupid.  */
+          keepalive = keepalive;
+          return ret;
         }
 
       /* Call to n-adic subr.  */
       if (maxargs == MANY || maxargs > 8)
-        return subr->function.aMANY (numargs, args);
+        {
+          ret = subr->function.aMANY (numargs, args);
+          keepalive = keepalive;
+          return ret;
+        }
     }
 
   /* Anything else is an error.  */
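
For readers who want to see the failure mode outside of Emacs, here is a
minimal toy model of it in plain C.  It is only a sketch and makes no
claim about how Emacs's collector works internally: every name in it
(struct unit, struct subr, toy_gc, function_cell, roots, call_unrooted,
call_rooted) is hypothetical and invented for illustration.  The
"collector" marks units reachable from the function cell and from an
explicit root stack, then "unloads" anything left unmarked.  The root
stack plays the role of the hidden backtrace slot suggested above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a native compilation unit.  "loaded"
   models the shared object still being mapped.  */
struct unit { bool marked; bool loaded; };

/* Hypothetical stand-in for a subr that lives in a unit.  */
struct subr { struct unit *unit; int (*code) (void); };

static struct unit the_unit;
static struct subr the_subr;
static struct subr *function_cell;   /* models the symbol's function slot */
static struct subr *roots[8];        /* models a hidden backtrace slot */
static size_t nroots;

/* Toy collector: mark the unit if it is reachable from the function
   cell or the root stack, then unload it if it stayed unmarked.  */
static void
toy_gc (void)
{
  the_unit.marked = false;
  if (function_cell)
    function_cell->unit->marked = true;
  for (size_t i = 0; i < nroots; i++)
    roots[i]->unit->marked = true;
  if (!the_unit.marked)
    the_unit.loaded = false;
}

/* Models (defun f () (fmakunbound 'f) (garbage-collect) ...).
   Returns whether the unit is still loaded at the point where real
   code would have to return into it.  */
static int
self_unbinding_body (void)
{
  function_cell = NULL;
  toy_gc ();
  return the_unit.loaded;
}

static void
reset (void)
{
  the_unit.marked = false;
  the_unit.loaded = true;
  the_subr.unit = &the_unit;
  the_subr.code = self_unbinding_body;
  function_cell = &the_subr;
  nroots = 0;
}

/* Call through the function cell with nothing else referencing the
   subr: the unit is gone by the time the body finishes.  */
int
call_unrooted (void)
{
  reset ();
  return function_cell->code ();
}

/* Same call, but the subr is pushed on the root stack for the
   duration of the call, so the collector still sees it.  */
int
call_rooted (void)
{
  reset ();
  struct subr *s = function_cell;
  assert (nroots < sizeof roots / sizeof roots[0]);
  roots[nroots++] = s;
  int still_loaded = s->code ();
  nroots--;
  return still_loaded;
}
```

In this model call_unrooted () reports that the unit was unloaded while
its own code was still running, and call_rooted () reports that it
survived.  The patch above aims for the same effect by a different
route: instead of an explicit root stack, it keeps a volatile local in
funcall_subr so that the conservative stack scan can find the subr.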