GNU bug report logs - #78444
30.1; Crash in GC (vector_marked_p)


Package: emacs

Reported by: George P <georgepanagopo <at> gmail.com>

Date: Thu, 15 May 2025 18:46:01 UTC

Severity: normal

Found in version 30.1


From: Pip Cet <pipcet <at> protonmail.com>
To: George P <georgepanagopo <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Andrea Corallo <acorallo <at> gnu.org>, 78444 <at> debbugs.gnu.org
Subject: bug#78444: 30.1; Crash in GC (vector_marked_p)
Date: Tue, 20 May 2025 15:55:46 +0000

"George P" <georgepanagopo <at> gmail.com> writes:

> Here they are:

Thanks again!

The good news is I found *a* nativecomp GC bug, but I'm not convinced
it's related to the one you've been seeing.  It reproduces only with the
right optimization options (-O2 here), which matches your build; without
optimization, conservative stack scanning still finds the subr and hides
the bug.

Create a file like this:

;;; -*- lexical-binding: t; -*-

(defun f ()
  (fmakunbound 'f)
  (garbage-collect)
  (lambda () nil))

then execute:

(prog1 nil (native-elisp-load (native-compile "file.el")))
(funcall 'f)

What will happen is that while execution is in 'f', there is no
reference to the subr f (the backtrace only keeps alive the symbol f,
not the subr f, and funcall_general ends in a tail call, so its stack
variables are gone by the time we call garbage-collect).  That
means the compilation unit may get unloaded while we're still in 'f',
and then we crash trying to return to it after garbage collection.

I *think* that if there were a second handle keeping alive the dynlib
(but not the compilation unit), we would return an anonymous lambda
which would refer to the now-invalid (but still open) native compilation
unit.  This would also happen if dlclose failed to unload the library.

I've been able to simulate this by intercepting dynlib_close and
returning from it immediately (without actually closing the shared
object) and executing:

(prog1 nil (native-elisp-load (native-compile "file.el")))
(prog1 nil (setq x (funcall 'f)))

This succeeded, and when I tried to access 'x' I got a crash, as
expected.

At this point, the bug seems a little contrived, because it's unusual
for a function to fmakunbound the symbol it is bound to, but I think
that's pretty much what happens in some recursive load scenarios.  If
the function redefines itself using (ultimately) defalias, the old
function value is sometimes, but not always, stored in the
function-history symbol property.  I don't believe this mechanism
reliably keeps alive the old function (and, thus, the compilation unit)
either, and it can be circumvented entirely by using Ffset directly.

Again, this is unlikely to be precisely what happened to George here,
but it indicates that relying on the current specpdl backtrace to keep
alive subrs that are being called doesn't work in all cases.

We should probably keep the actual subr around, either in a (hidden?)
backtrace slot somewhere or in a local variable in funcall_subr which we
ensure cannot be optimized away (by tail recursion or otherwise).

Note that the second dynlib handle isn't required for a crash, but the
crashes we see without it should be immediate, upon returning from GC,
assuming dynlib_close actually synchronously unmaps the library.

I think both the symbol being bound to something else and the second
handle might happen "naturally" during some recursive reload scenarios.

> (gdb)  p *(char **)0x8f68108
> $1 = 0x2018b608 "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/lib/emacs/30.1/native-lisp/30.1-1ed0c1e8/cl-print-79bf9fb1-14d0e7d5.eln"
> (gdb) x/32gx 0x9e7980

Oops, I messed up there.  I meant x/32gx 0x98e7980, though it's no
longer necessary (we know it's a native compilation unit because it
appears in the right slot in the subr below).

> (gdb) p *(char **)0x8ae5588
> $2 = 0x2018b6b8 "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/share/emacs/30.1/lisp/emacs-lisp/cl-print.elc"
> (gdb)     p *(char **)0x15554ec0ff98
> $3 = 0x15554f56d92e "cl-print"
> (gdb)     p *(char **)0x15554ec0ff78
> $4 = 0x15554f56d81f "Print OBJECT on STREAM according to its type.\nOutput is further controlled by the variables\n`cl-print-readably',
> `cl-print-compiled', along with output\nvariables for the standard printing functions.  "...

That's the beginning of the function-history of the cl-prin1 symbol
(while symbols mark the strings containing their names, they do not put
the marked string in last_marked, so we have to deduce this from the
docstring).

> (gdb) x/32gx 0x38294c8
> 0x38294c8:      0xc00000001200a000      0x000015553389a350
> 0x38294d8:      0x0000000000020001      0x00000000088417d0
> 0x38294e8:      0x0000000000000000      0x0000000000000000
> 0x38294f8:      0x0000000000000025      0x00000000098e7985
> 0x3829508:      0x000000000e754ab0      0x0000000000000000
> 0x3829518:      0x000000001f6463e3      0x400000001200a000

That appears to be the cl-prin1 definition (minargs 1, maxargs 2, native
comp unit 0x98e7985).

How reproducible did you say this crash was?  Unless we can come up with
a convincing alternative explanation, it may be worth it to fix
funcall_subr to avoid this particular bug, and keep running Emacs in gdb
until it crashes again...

diff --git a/src/eval.c b/src/eval.c
index caae4cb17e2..274319d1196 100644
--- a/src/eval.c
+++ b/src/eval.c
@@ -3135,6 +3135,9 @@ safe_eval (Lisp_Object sexp)
 Lisp_Object
 funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
 {
+  volatile Lisp_Object keepalive;
+  XSETSUBR (keepalive, subr);
+  Lisp_Object ret;
   eassume (numargs >= 0);
   if (numargs >= subr->min_args)
     {
@@ -3156,32 +3159,46 @@ funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
 	  switch (maxargs)
 	    {
 	    case 0:
-	      return subr->function.a0 ();
+	      ret = subr->function.a0 ();
+	      break;
 	    case 1:
-	      return subr->function.a1 (a[0]);
+	      ret = subr->function.a1 (a[0]);
+	      break;
 	    case 2:
-	      return subr->function.a2 (a[0], a[1]);
+	      ret = subr->function.a2 (a[0], a[1]);
+	      break;
 	    case 3:
-	      return subr->function.a3 (a[0], a[1], a[2]);
+	      ret = subr->function.a3 (a[0], a[1], a[2]);
+	      break;
 	    case 4:
-	      return subr->function.a4 (a[0], a[1], a[2], a[3]);
+	      ret = subr->function.a4 (a[0], a[1], a[2], a[3]);
+	      break;
 	    case 5:
-	      return subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
+	      ret = subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
+	      break;
 	    case 6:
-	      return subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
+	      ret = subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
+	      break;
 	    case 7:
-	      return subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
+	      ret = subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
 					a[6]);
+	      break;
 	    case 8:
-	      return subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
-					a[6], a[7]);
+	      ret = subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
+				       a[6], a[7]);
+	      break;
 	    }
-	  eassume (false);	/* In case the compiler is too stupid.  */
+	  keepalive = keepalive;
+	  return ret;
 	}
 
       /* Call to n-adic subr.  */
       if (maxargs == MANY || maxargs > 8)
-	return subr->function.aMANY (numargs, args);
+	{
+	  ret = subr->function.aMANY (numargs, args);
+	  keepalive = keepalive;
+	  return ret;
+	}
     }
 
   /* Anything else is an error.  */





This bug report was last modified 3 days ago.


