GNU bug report logs - #78444
30.1; Crash in GC (vector_marked_p)


Package: emacs

Reported by: George P <georgepanagopo <at> gmail.com>

Date: Thu, 15 May 2025 18:46:01 UTC

Severity: normal

Found in version 30.1


From: George P <georgepanagopo <at> gmail.com>
To: Pip Cet <pipcet <at> protonmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, Andrea Corallo <acorallo <at> gnu.org>, 78444 <at> debbugs.gnu.org
Subject: bug#78444: 30.1; Crash in GC (vector_marked_p)
Date: Tue, 20 May 2025 14:39:56 -0400
>
> How reproducible did you say this crash was?
>

I would say I get it about once a month.

My Elisp is far too basic for me to understand the circumstances that
lead to the bug you are describing, but I did find an instance of
fmakunbound in a macro in Doom Emacs:
https://github.com/doomemacs/doomemacs/blob/66f1b25dac30ca97779e8a05e735e14230556492/lisp/doom-lib.el#L442

I can't really say whether it triggers the bug you are describing, though.

Thanks!
George

On Tue, May 20, 2025 at 11:55 AM Pip Cet <pipcet <at> protonmail.com> wrote:

> "George P" <georgepanagopo <at> gmail.com> writes:
>
> > Here they are:
>
> Thanks again!
>
> The good news is I found *a* nativecomp GC bug, but I'm not convinced
> it's related to the one you've been seeing.  It works only with the
> right optimization options (-O2 here), because without optimization
> conservative GC will save the day, but that matches your build.
>
> Create a file like this:
>
> ;;; -*- lexical-binding: t; -*-
>
> (defun f ()
>   (fmakunbound 'f)
>   (garbage-collect)
>   (lambda () nil))
>
> then execute:
>
> (prog1 nil (native-elisp-load (native-compile "file.el")))
> (funcall 'f)
>
> What will happen is that while execution is in 'f', there is no
> reference to the subr f (the backtrace only keeps alive the symbol f,
> not the subr f, and funcall_general is tail recursive so its stack
> variables are out of scope by the time we call garbage-collect).  That
> means the compilation unit may get unloaded while we're still in 'f',
> and then we crash trying to return to it after garbage collection.
>
> I *think* that if there were a second handle keeping alive the dynlib
> (but not the compilation unit), we would return an anonymous lambda
> which would refer to the now-invalid (but still open) native compilation
> unit.  This would also happen if dlclose failed to unload the library.
>
> I've been able to simulate this by intercepting dynlib_close and
> returning from it immediately (without actually closing the shared
> object) and executing:
>
> (prog1 nil (native-elisp-load (native-compile "file.el")))
> (prog1 nil (setq x (funcall 'f)))
>
> This succeeded, and when I tried to access 'x' I got a crash, as
> expected.
>
> At this point, the bug seems a little contrived, because it's unusual
> for a function to fmakunbound the symbol it is bound to, but I think
> that's pretty much what happens in some recursive load scenarios.  If
> the function redefines itself using (ultimately) defalias, the old
> function value is sometimes, but not always, stored in the
> function-history symbol property.  I don't believe this mechanism
> reliably keeps alive the old function (and, thus, the compilation unit)
> either, and it can be circumvented entirely by using Ffset directly.
>
> Again, this is unlikely to be precisely what happened to George here,
> but it indicates that relying on the current specpdl backtrace to keep
> alive subrs that are being called doesn't work in all cases.
>
> We should probably keep the actual subr around, either in a (hidden?)
> backtrace slot somewhere or in a local variable in funcall_subr which we
> ensure cannot be optimized away (by tail recursion or otherwise).
>
> Note that the second dynlib handle isn't required for a crash, but the
> crashes we see without it should be immediate, upon returning from GC,
> assuming dynlib_close actually synchronously unmaps the library.
>
> I think both the symbol being bound to something else and the second
> handle might happen "naturally" during some recursive reload scenarios.
>
> > (gdb)  p *(char **)0x8f68108
> > $1 = 0x2018b608 "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/lib/emacs/30.1/native-lisp/30.1-1ed0c1e8/cl-print-79bf9fb1-14d0e7d5.eln"
> > (gdb) x/32gx 0x9e7980
>
> Oops, I messed up there.  I meant x/32gx 0x98e7980, though it's no
> longer necessary (we know it's a native compilation unit because it
> appears in the right slot in the subr below).
>
> > (gdb) p *(char **)0x8ae5588
> >     p *(char **)0x15554ec0ff78
> > $2 = 0x2018b6b8 "/nix/store/xdxaa55akicvs3jjrr8d7nmzla4gzbyl-emacs-30.1/share/emacs/30.1/lisp/emacs-lisp/cl-print.elc"
> > (gdb)     p *(char **)0x15554ec0ff98
> > $3 = 0x15554f56d92e "cl-print"
> > (gdb)     p *(char **)0x15554ec0ff78
> > $4 = 0x15554f56d81f "Print OBJECT on STREAM according to its type.\nOutput is further controlled by the variables\n`cl-print-readably', `cl-print-compiled', along with output\nvariables for the standard printing functions.  "...
>
> That's the beginning of the function-history of the cl-prin1 symbol
> (while symbols mark the strings containing their names, they do not put
> the marked string in last_marked, so we have to deduce this from the
> docstring).
>
> > (gdb) x/32gx 0x38294c8
> > 0x38294c8:      0xc00000001200a000      0x000015553389a350
> > 0x38294d8:      0x0000000000020001      0x00000000088417d0
> > 0x38294e8:      0x0000000000000000      0x0000000000000000
> > 0x38294f8:      0x0000000000000025      0x00000000098e7985
> > 0x3829508:      0x000000000e754ab0      0x0000000000000000
> > 0x3829518:      0x000000001f6463e3      0x400000001200a000
>
> That appears to be the cl-prin1 definition (minargs 1, maxargs 2, native
> comp unit 0x98e7985).
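
For anyone following along with the dump, here is a rough sketch of how "minargs 1, maxargs 2" can be read out of the word 0x0000000000020001 at 0x38294d8. It assumes a little-endian x86-64 build and that the two arity fields are adjacent 16-bit signed values packed into that word, with the minimum first. The names toy_arity and toy_decode_arity are invented for illustration and are not part of Emacs.

```c
#include <stdint.h>

/* Hypothetical helper type, not an Emacs structure.  */
struct toy_arity
{
  int min_args;
  int max_args;
};

/* Split one 64-bit word from the memory dump into two 16-bit fields.
   Assumption: little-endian layout, so the first field occupies the
   low 16 bits and the second field the next 16 bits.  The casts keep
   the sign, so a negative marker value would survive decoding.  */
struct toy_arity
toy_decode_arity (uint64_t word)
{
  struct toy_arity a;
  a.min_args = (int16_t) (word & 0xffff);
  a.max_args = (int16_t) ((word >> 16) & 0xffff);
  return a;
}
```

Passing 0x0000000000020001 to toy_decode_arity gives min_args 1 and max_args 2, which matches the reading above.
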
>
> How reproducible did you say this crash was?  Unless we can come up with
> a convincing alternative explanation, it may be worth it to fix
> funcall_subr to avoid this particular bug, and keep running Emacs in gdb
> until it crashes again...
>
> diff --git a/src/eval.c b/src/eval.c
> index caae4cb17e2..274319d1196 100644
> --- a/src/eval.c
> +++ b/src/eval.c
> @@ -3135,6 +3135,9 @@ safe_eval (Lisp_Object sexp)
>  Lisp_Object
>  funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
>  {
> +  volatile Lisp_Object keepalive;
> +  XSETSUBR (keepalive, subr);
> +  Lisp_Object ret;
>    eassume (numargs >= 0);
>    if (numargs >= subr->min_args)
>      {
> @@ -3156,32 +3159,46 @@ funcall_subr (struct Lisp_Subr *subr, ptrdiff_t numargs, Lisp_Object *args)
>           switch (maxargs)
>             {
>             case 0:
> -             return subr->function.a0 ();
> +             ret = subr->function.a0 ();
> +             break;
>             case 1:
> -             return subr->function.a1 (a[0]);
> +             ret = subr->function.a1 (a[0]);
> +             break;
>             case 2:
> -             return subr->function.a2 (a[0], a[1]);
> +             ret = subr->function.a2 (a[0], a[1]);
> +             break;
>             case 3:
> -             return subr->function.a3 (a[0], a[1], a[2]);
> +             ret = subr->function.a3 (a[0], a[1], a[2]);
> +             break;
>             case 4:
> -             return subr->function.a4 (a[0], a[1], a[2], a[3]);
> +             ret = subr->function.a4 (a[0], a[1], a[2], a[3]);
> +             break;
>             case 5:
> -             return subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
> +             ret = subr->function.a5 (a[0], a[1], a[2], a[3], a[4]);
> +             break;
>             case 6:
> -             return subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
> +             ret = subr->function.a6 (a[0], a[1], a[2], a[3], a[4], a[5]);
> +             break;
>             case 7:
> -             return subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
> +             ret = subr->function.a7 (a[0], a[1], a[2], a[3], a[4], a[5],
>                                         a[6]);
> +             break;
>             case 8:
> -             return subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
> -                                       a[6], a[7]);
> +             ret = subr->function.a8 (a[0], a[1], a[2], a[3], a[4], a[5],
> +                                      a[6], a[7]);
> +             break;
>             }
> -         eassume (false);      /* In case the compiler is too stupid.  */
> +         keepalive = keepalive;
> +         return ret;
>         }
>
>        /* Call to n-adic subr.  */
>        if (maxargs == MANY || maxargs > 8)
> -       return subr->function.aMANY (numargs, args);
> +       {
> +         ret = subr->function.aMANY (numargs, args);
> +         keepalive = keepalive;
> +         return ret;
> +       }
>      }
>
>    /* Anything else is an error.  */
>
>
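
The shape of the change in the quoted patch can be sketched outside Emacs. What follows is a minimal, hypothetical illustration, not Emacs code: toy_subr, toy_double and toy_funcall_subr are invented names. It shows the two ingredients the patch combines. The result of the call is stored in a local instead of being returned directly, and a volatile local referring to the callee is touched after the call, so the call is no longer in tail position. Whether the reference then actually stays visible to a conservative stack scan depends on the compiler and ABI; that part is an assumption here, not something this sketch demonstrates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a subr: just a function pointer.  */
struct toy_subr
{
  int (*function) (int);
};

/* Example callee used to exercise the wrapper.  */
int
toy_double (int x)
{
  return 2 * x;
}

int
toy_funcall_subr (struct toy_subr *subr, int arg)
{
  assert (subr != NULL && subr->function != NULL);

  /* The pointer itself is volatile, so the compiler must perform
     every read and write of it that the source asks for.  */
  struct toy_subr *volatile keepalive = subr;

  /* Store the result rather than writing "return subr->function (arg)",
     which the compiler could turn into a tail call.  */
  int ret = subr->function (arg);

  /* Volatile read and write after the call: there is still work to
     do once the callee returns, so this frame has to stay around.  */
  keepalive = keepalive;
  return ret;
}
```

With a struct toy_subr whose function member is toy_double, toy_funcall_subr (&s, 21) returns 42; the behaviour is the same as a direct call, only the lifetime of the local reference differs.
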