GNU bug report logs - #75322
SAFE_ALLOCA assumed to root Lisp_Objects/SSDATA(string)

Previous Next

Package: emacs;

Reported by: Pip Cet <pipcet <at> protonmail.com>

Date: Fri, 3 Jan 2025 17:21:02 UTC

Severity: normal

Full log


View this message in rfc822 format

From: Eli Zaretskii <eliz <at> gnu.org>
To: Daniel Colascione <dancol <at> dancol.org>
Cc: gerd.moellmann <at> gmail.com, pipcet <at> protonmail.com, 75322 <at> debbugs.gnu.org
Subject: bug#75322: SAFE_ALLOCA assumed to root Lisp_Objects/SSDATA(string)
Date: Mon, 06 Jan 2025 14:59:07 +0200
> Date: Sun, 05 Jan 2025 16:15:47 -0500
> From: Daniel Colascione <dancol <at> dancol.org>
> CC: pipcet <at> protonmail.com, 75322 <at> debbugs.gnu.org
> 
> >  . an automatic variable
> >  . a static variable that is protected by someone
> >  . a global variable that is protected by someone
> >  . a result of dereferencing a pointer that is somehow protected
> >
> >etc. etc., where "protected by someone" means that it is a descendant
> >of some staticpro, or of some root, or...
> 
> Well, yeah. Every other GC program does this. Emacs can too. There's no great burden: all Lisp objects get traced automatically. Everything on the stack or in a register gets traced automatically, and, because the scanning is conservative, pinned. You only have to take extra steps to tell the GC about something when you're going out of your way to go around the GC.
> 
> It's simply not true that to adopt a modern GC every line of code has to change.  I wrote a moving GC for Emacs myself years ago. Worked fine. No rewrite.

The opinions in this thread are that changes in the code _are_ needed.
Maybe that's not a "rewrite" you had in mind, but if we need to make
such substantial changes in many places, that's a "rewrite" in my
book.

> >And if we cannot prove to ourselves that one of the above happens,
> >then we'd need to force a copy of the variable to be on the stack?
> >
> >Does this sound practical?
> >
> >If this is the price of using MPS, and I'm not missing something
> >obvious, then it sounds like we should run away from MPS, fast.
> >Because we will sooner or later have to rewrite every single line of
> >code we ever wrote.
> 
> No, you do it by adopting a rule that when a function receives a pointer, the caller guarantees the validity of the pointer for the duration of the call. This way, only the level of the stack that spills the array to the heap has to take on the responsibility of keeping the referenced objects alive, and making the spilled array a pointer to the pinned guts of a Lisp vector is an adequate way to do this. 

We are talking about code such as this one:

	    SAFE_NALLOCA (args2, 1, nargs + 1);
	    args2[0] = Qcall_process;
	    for (i = 0; i < nargs; i++) args2[i + 1] = args[i];
	    coding_systems = Ffind_operation_coding_system (nargs + 1, args2);
	    val = CONSP (coding_systems) ? XCDR (coding_systems) : Qnil;

"Look, ma: no pointers!"

The args[] array is fine: it comes from the caller and is valid.  The
problem being discussed here is the args2[] array, in the case where
SAFE_NALLOCA decides args2[] is too large for the stack and instead
malloc's it.  In that case, args2[] stores on the heap copies of the
objects in args[] (still no pointers!), and the issue is that when GC
happens (say, at some point inside Ffind_operation_coding_system), the
objects in args[] are updated by GC, but their copies in args2[] are
not.

So this code needs to be changed.

And if you look around, we have quite a lot of these in many places.

> "Oh, but won't that kill performance?"

That wasn't my primary bother.  The primary bother is the need to
modify many places.  Which is not necessarily a purely mechanical
refactoring, either.  Consider:

  static Lisp_Object last_coding;
  Lisp_Object coding = Vcoding_system_for_write;
  [...]
  last_coding = coding;
  call_some_function ();
  if (NILP (last_coding))
    do_something ();

If call_some_function can trigger GC, the value of 'coding' after the
call is still valid, but the value of 'last_coding' could be garbage
(unless 'last_coding' is staticpro'd).  We have almost 200 static
Lisp_Object variables, probably not all of them staticpro'd (8 of them
inside functions, like the above example, so definitely not
staticpro'd).  So now we need to examine the uses of all of them and
either staticpro them or do something else (like move the assignment
to 'last_coding' to after call_some_function).

And we will probably find other usage patterns which trip the same
wire.

And the amazing part is not the above, it''s the fact that with all
that "unsafe" code people are using the branch for production, and
most of them report a stable Emacs.  What does that tell us? that
most, if not all, of those places are perfectly valid code, and we are
haunted by the proverbial shadow of a dwarf?  Or maybe it just makes
the job harder, because we now need to identify the places where these
bad things can _really_ happen, and fix only them?  Or something else?

This is what I'm trying to establish.  The problem is not theoretical,
it's entirely practical: where to steer our efforts of stabilizing the
branch and preparing it for landing, given the available resources.

> The *existing* code is broken, and you don't see it because we use alloca to allocate on the stack almost all the time. 

If we allocate on the stack almost all the time, we will keep
allocating on the stack almost all the time.  But anyway, if you are
now saying that the code is broken, then a rewrite, and a significant
one at that, _is_ needed, right?




This bug report was last modified 147 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.