> How reproducible did you say this crash was?

I get it around once a month I would say.

My elisp is way too primitive to be able to understand the circumstances
that lead to the bug you are describing, but I did find an instance of
fmakunbound in a macro in Doom emacs:

https://github.com/doomemacs/doomemacs/blob/66f1b25dac30ca97779e8a05e735e14230556492/lisp/doom-lib.el#L442

Can't really say if it's triggering the bug that you are describing,
though.

Thanks!
George

On Tue, May 20, 2025 at 11:55 AM Pip Cet wrote:

> "George P" writes:
>
> > Here they are:
>
> Thanks again!
>
> The good news is I found *a* nativecomp GC bug, but I'm not convinced
> it's related to the one you've been seeing. It works only with the
> right optimization options (-O2 here), because without optimization
> conservative GC will save the day, but that matches your build.
>
> Create a file like this:
>
> ;;; -*- lexical-binding: t; -*-
>
> (defun f ()
>   (fmakunbound 'f)
>   (garbage-collect)
>   (lambda () nil))
>
> then execute:
>
> (prog1 nil (native-elisp-load (native-compile "file.el")))
> (funcall 'f)
>
> What will happen is that while execution is in 'f', there is no
> reference to the subr f (the backtrace only keeps alive the symbol f,
> not the subr f, and funcall_general is tail recursive so its stack
> variables are out of scope by the time we call garbage-collect). That
> means the compilation unit may get unloaded while we're still in 'f',
> and then we crash trying to return to it after garbage collection.
>
> I *think* that if there were a second handle keeping alive the dynlib
> (but not the compilation unit), we would return an anonymous lambda
> which would refer to the now-invalid (but still open) native compilation
> unit. This would also happen if dlclose failed to unload the library.
>
> I've been able to simulate this by intercepting dynlib_close and
> returning from it immediately (without actually closing the shared
> object) and executing:
>
> (prog1 nil (native-elisp-load (native-compile "file.el")))
> (prog1 nil (setq x (funcall 'f)))
>
> This succeeded, and when I tried to access 'x' I got a crash, as
> expected.
>
> At this point, the bug seems a little contrived, because it's unusual
> for a function to fmakunbound the symbol it is bound to, but I think
> that's pretty much what happens in some recursive load scenarios. If
> the function redefines itself using (ultimately) defalias, the old
> function value is sometimes, but not always, stored in the
> function-history symbol property. I don't believe this mechanism
> reliably keeps alive the old function (and, thus, the compilation unit)
> either, and it can be circumvented entirely by using Ffset directly.
>
> Again, this is unlikely to be precisely what happened to George here,
> but it indicates that relying on the current specpdl backtrace to keep
> alive subrs that are being called doesn't work in all cases.
>
> We should probably keep the actual subr around, either in a (hidden?)
> backtrace slot somewhere or in a local variable in funcall_subr which we
> ensure cannot be optimized away (by tail recursion or otherwise).
>
> Note that the second dynlib handle isn't required for a crash, but the
> crashes we see without it should be immediate, upon returning from GC,
> assuming dynlib_close actually synchronously unmaps the library.
>
> I think both the symbol being bound to something else and the second
> handle might happen "naturally" during some recursive reload scenarios.
>
> > (gdb) p *(char **)0x8f68108
> > $1 = 0x2018b608 "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/lib/emacs/30.1/native-lisp/30.1-1ed0c1e8/cl-print-79bf9fb1-14d0e7d5.eln"
> > (gdb) x/32gx 0x9e7980
>
> Oops, I messed up there.
> I meant x/32gx 0x98e7980, though it's no
> longer necessary (we know it's a native compilation unit because it
> appears in the right slot in the subr below).
>
> > (gdb) p *(char **)0x8ae5588
> >
> > p *(char **)0x15554ec0ff78$2 = 0x2018b6b8
> > "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/share/emacs/30.1/lisp/emacs-lisp/cl-print.elc"
> > (gdb) p *(char **)0x15554ec0ff98
> > $3 = 0x15554f56d92e "cl-print"
> > (gdb) p *(char **)0x15554ec0ff78
> > $4 = 0x15554f56d81f "Print OBJECT on STREAM according to its type.\nOutput is further controlled by the variables\n`cl-print-readably', `cl-print-compiled', along with output\nvariables for the standard printing functions. "...
>
> That's the beginning of the function-history of the cl-prin1 symbol
> (while symbols mark the strings containing their names, they do not put
> the marked string in last_marked, so we have to deduce this from the
> docstring).
>
> > (gdb) x/32gx 0x38294c8
> > 0x38294c8: 0xc00000001200a000 0x000015553389a350
> > 0x38294d8: 0x0000000000020001 0x00000000088417d0
> > 0x38294e8: 0x0000000000000000 0x0000000000000000
> > 0x38294f8: 0x0000000000000025 0x00000000098e7985
> > 0x3829508: 0x000000000e754ab0 0x0000000000000000
> > 0x3829518: 0x000000001f6463e3 0x400000001200a000
>
> that appears to be the cl-prin1 definition (minargs 1, maxargs 2, native
> comp unit 0x98e7985).
>
> How reproducible did you say this crash was? Unless we can come up with
> a convincing alternative explanation, it may be worth it to fix
> funcall_subr to avoid this particular bug, and keep running Emacs in gdb
> until it crashes again...
>
> diff --git a/src/eval.c b/src/eval.c
> index caae4cb17e2..274319d1196 100644
> --- a/src/eval.c
> +++ b/src/eval.c
> @@ -3135,6 +3135,9 @@ safe_eval (Lisp_Object sexp)
>  Lisp_Object
>  funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
>  {
> +  volatile Lisp_Object keepalive;
> +  XSETSUBR (keepalive, subr);
> +  Lisp_Object ret;
>    eassume (numargs >= 0);
>    if (numargs >= subr->min_args)
>      {
> @@ -3156,32 +3159,46 @@ funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
>            switch (maxargs)
>              {
>              case 0:
> -              return subr->function.a0 ();
> +              ret = subr->function.a0 ();
> +              break;
>              case 1:
> -              return subr->function.a1 (a[0]);
> +              ret = subr->function.a1 (a[0]);
> +              break;
>              case 2:
> -              return subr->function.a2 (a[0], a[1]);
> +              ret = subr->function.a2 (a[0], a[1]);
> +              break;
>              case 3:
> -              return subr->function.a3 (a[0], a[1], a[2]);
> +              ret = subr->function.a3 (a[0], a[1], a[2]);
> +              break;
>              case 4:
> -              return subr->function.a4 (a[0], a[1], a[2], a[3]);
> +              ret = subr->function.a4 (a[0], a[1], a[2], a[3]);
> +              break;
>              case 5:
> -              return subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
> +              ret = subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
> +              break;
>              case 6:
> -              return subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
> +              ret = subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
> +              break;
>              case 7:
> -              return subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
> +              ret = subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
>                                          a[6]);
> +              break;
>              case 8:
> -              return subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
> -                                        a[6], a[7]);
> +              ret = subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
> +                                       a[6], a[7]);
> +              break;
>              }
> -          eassume (false); /* In case the compiler is too stupid. */
> +          keepalive = keepalive;
> +          return ret;
>          }
>
>        /* Call to n-adic subr. */
>        if (maxargs == MANY || maxargs > 8)
> -        return subr->function.aMANY (numargs, args);
> +        {
> +          ret = subr->function.aMANY (numargs, args);
> +          keepalive = keepalive;
> +          return ret;
> +        }
>      }
>
>    /* Anything else is an error. */