GNU bug report logs - #77924
31.0.50; [Feature branch] Change marker implementation

Previous Next

Package: emacs;

Reported by: Gerd Möllmann <gerd.moellmann <at> gmail.com>

Date: Sat, 19 Apr 2025 16:06:02 UTC

Severity: normal

Found in version 31.0.50

Full log


Message #143 received at 77924 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: monnier <at> iro.umontreal.ca, gerd.moellmann <at> gmail.com
Cc: stefankangas <at> gmail.com, 77924 <at> debbugs.gnu.org
Subject: Re: bug#77924: 31.0.50; [Feature branch] Change marker implementation
Date: Thu, 24 Apr 2025 14:54:27 +0300
> Cc: gerd.moellmann <at> gmail.com, stefankangas <at> gmail.com, 77924 <at> debbugs.gnu.org
> Date: Wed, 23 Apr 2025 17:41:18 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> 
> > From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> > Cc: Eli Zaretskii <eliz <at> gnu.org>,  stefankangas <at> gmail.com,
> >   77924 <at> debbugs.gnu.org
> > Date: Wed, 23 Apr 2025 10:34:11 -0400
> > 
> > FWIW, I'm not a great fan of rebasing either.  I did rebase the branch
> > last night not for rebasing's sake but because I felt there was a need
> > for more detailed commit messages.
> > 
> > In any case, any objection to merging the branch?
> 
> I'd like to have a closer review of it first, so please wait for a
> while.  When I skimmed it, I remember having several, hopefully minor,
> aspects that stood out, and I want to take a closer look.

See below:

> -  m->next = BUF_MARKERS (buf);
> -  BUF_MARKERS (buf) = m;
> +  marker_vector_add (buf, m);
> +  marker_vector_set_charpos (m, charpos);

Could we please name functions that manipulate and get marker
information marker_SOMETHING and not marker_vector_something?
marker_vector_add is semi-okay (although marker_add would have been
nicer, IMO), but marker_vector_set_charpos and marker_vector_charpos
are not, because AFAIU they manipulate the marker's position, not
marker-vector's position.

In general, wherever the fact that we have a vector of markers is
merely an implementation detail that is not important to the
programmer, can we just drop the "vector" part from the names of the
functions, for the same reason that the old functions and macros
weren't called marker_list_SOMETHING?

> +  DO_MARKERS (b, m)
> +    {
> +      if (!vectorlike_marked_p (&m->header))
> +	marker_vector_remove (XVECTOR (BUF_MARKERS (b)), m);
> +    }
> +  END_DO_MARKERS;

Would it be possible not to introduce macros that modify the C syntax?
These are problematic for more than one reason (one being that they
are not recognized by C modes we use).  We have macros whose names
start with FOR_EACH_ (like FOR_EACH_FRAME); can we introduce
FOR_EACH_MARKER instead, and not hide opening/closing braces inside
macros?

> +      marker_vector_reset (b);

I'd prefer marker_vector_clear.  Or maybe even marker_delete_all.

> -	  set_marker_both (point_marker, Qnil, PT, PT_BYTE);
> +	  set_marker_both (point_marker, Qnil, PT);

It's confusing to keep calling this set_marker_both, when we no longer
pass the byte position.

> +	  DO_MARKERS (current_buffer, tail)
> +	    {
> +	      if (tail->need_adjustment)
> +		{
> +		  tail->need_adjustment = 0;
> +		  if (tail->insertion_type)
> +		    {
> +		      marker_vector_set_charpos (tail, from);
> +		    }

We don't use braces around blocks of a single statement.

>  INLINE ptrdiff_t
> +gc_vsize (const struct Lisp_Vector *v)
> +{
> +  return v->header.size & ~ARRAY_MARK_FLAG;
> +}
> +
> +INLINE ptrdiff_t
>  gc_asize (Lisp_Object array)
>  {
>    /* Like ASIZE, but also can be used in the garbage collector.  */
> -  return XVECTOR (array)->header.size & ~ARRAY_MARK_FLAG;
> +  return gc_vsize (XVECTOR (array));
>  }

Please add a comment to gc_vsize similar to what we say in gc_asize.

> +/* Push entry ENTRY of V to its free-list. This must set MARKER to not
                                                            ^^^^^^
> +   be a marker, which is done by using the MARKER field of entry
> +   to form the free-list.  */
> +
> +static void
> +push_free (struct Lisp_Vector *v, const ptrdiff_t entry)
> +{
> +  check_is_entry (v, entry);
> +  NEXT_FREE (v, entry) = FREE (v);
> +  FREE (v) = make_fixnum (entry);

There's no argument named MARKER in this function.

> +Lisp_Object
> +alloc_marker_vector (ptrdiff_t len)
> +{
> +#ifdef HAVE_MPS
> +  return igc_alloc_marker_vector (len, FREE_LIST_END);
> +#else
> +  return make_vector (len, FREE_LIST_END);
> +#endif
> +}

We don't have any code on master that uses HAVE_MPS.  This changeset
adds at least 2 of them.  I'd prefer not to have these conditions on
master as long as the igc branch was not landed.

> +void
> +marker_vector_reset (struct buffer *b)
> +{
> +  /* The old GC contains at least one assertion that unchaining markers
> +     in kill-buffer resets the markers' buffers.  IGC does not do this,
> +     can't do this, and does not need it.  */

Likewise here, wrt comments about igc.

> +/* Return marker M's byte position.  */
> +
> +ptrdiff_t
> +marker_vector_bytepos (const struct Lisp_Marker *m)
> +{
> +  const ptrdiff_t charpos = marker_vector_charpos (m);
> +  return buf_charpos_to_bytepos (m->buffer, charpos);

Can m->buffer be NULL or a dead buffer?  If so, what will happen here?

> +void
> +marker_vector_adjust_for_insert (struct buffer *b,
> +				 const ptrdiff_t from_charpos,
> +				 const ptrdiff_t to_charpos,
> +				 const bool before_markers)
> +{
> +  const ptrdiff_t nchars = to_charpos - from_charpos;
> +  struct Lisp_Vector *v = XVECTOR (BUF_MARKERS (b));
> +  DO_MARKERS (b, m)
> +    {
> +      const ptrdiff_t charpos = XFIXNUM (CHARPOS (v, m->entry));
> +      if (charpos == from_charpos)
> +	{
> +	  if (m->insertion_type || before_markers)
> +	    CHARPOS (v, m->entry) = make_fixnum (to_charpos);
> +	}
> +      else if (charpos > from_charpos)
> +	CHARPOS (v, m->entry) = make_fixnum (charpos + nchars);

Why does it use CHARPOS and not marker_vector_set_charpos?

> +void
> +marker_vector_adjust_for_replace (struct buffer *b,
> +				  const ptrdiff_t from_charpos,
> +				  const ptrdiff_t old_nchars,
> +				  const ptrdiff_t new_nchars)
> +{
> +  const ptrdiff_t diff_nchars = new_nchars - old_nchars;
> +  const ptrdiff_t old_to_charpos = from_charpos + old_nchars;
> +  struct Lisp_Vector *v = XVECTOR (BUF_MARKERS (b));
> +  DO_MARKERS (b, m)
> +    {
> +      const ptrdiff_t charpos = XFIXNUM (CHARPOS (v, m->entry));
> +      if (charpos >= old_to_charpos)
> +	CHARPOS (v, m->entry) = make_fixnum (charpos + diff_nchars);
> +      else if (charpos > from_charpos)
> +	CHARPOS (v, m->entry) = make_fixnum (from_charpos);
> +    }
> +  END_DO_MARKERS;

Likewise here.

> +Lisp_Object make_marker_vector (void);
> +Lisp_Object alloc_marker_vector (ptrdiff_t len);
> +void marker_vector_add (struct buffer *b, struct Lisp_Marker *m);
> +void marker_vector_remove (struct Lisp_Vector *v, struct Lisp_Marker *m);
> +void marker_vector_reset (struct buffer *b);
> +void marker_vector_set_charpos (struct Lisp_Marker *m, ptrdiff_t charpos);
> +ptrdiff_t marker_vector_charpos (const struct Lisp_Marker *m);
> +ptrdiff_t marker_vector_bytepos (const struct Lisp_Marker *m);
> +void marker_vector_adjust_for_insert (struct buffer *b, const ptrdiff_t from,
> +				      ptrdiff_t to, bool before_markers);
> +void marker_vector_adjust_for_replace (struct buffer *b, ptrdiff_t from,
> +				       ptrdiff_t old_chars, ptrdiff_t new_chars);

We use the 'extern' qualifier for prototypes of non-static functions
in header files.

> +   Note that the byte positions at interval boundaries can be in the
> +   middle of a multi-byte character.
> +
> +	     character start byte position
> +		|
> +	  ------01234-------- bytes of a character (up to 5 in Emacs'
> +		   |          internal encoding)
> +	       N * interval
> +
> +   To find the character position corresponding to a byte position
> +   BYTEPOS, we look up the character position in the index at BYTEPOS /
> +   interval.  From there, we can scan forward in the text until we reach
> +   BYTEPOS, counting characters, or we scan backward from the interval
> +   end, if that is closer.
> +
> +   To find the byte position BYTEPOS corresponding to a given character
> +   position CHARPOS, we search the index for the last entry ENTRY whose
> +   character position is <= CHARPOS.  That entry corresponds to a byte
> +   position ENTRY * interval size.  From there, we scan the text until
> +   we reach BYTEPOS, counting characters until we reach CHARPOS.  The
> +   byte position reached at the end is BYTEPOS.  We also scan backward
> +   from the interval end, if that looks advantageous.

Doesn't this make charpos->bytepos conversion more expensive than in
the other direction?  If not, why not?  If the charpos->bytepos
conversion is indeed more expensive, shouldn't it be the other way
around, since AFAIR we need a byte position of a given character
position much more frequently than the other way around?  (And now it
should be even more frequently, since markers record only the
character positions.)

> +   Why divide the text into intervals of bytes instead of characters?
> +   Dividing the text into intervals of characters makes scanning
> +   overhead less uniform, since characters can be of different lengths
> +   (1 to 5 bytes).  */

AFAIU, this should mention the fact that the implementation wants to
use binary search to find the entry corresponding to a character
position.

> +// clang-format off

Do we need this,l and if we do, does it have to be a C++ style
comment?

> +
> +struct text_index
> +{
> +  /* Value at index IDX is the character position of byte position IDX *
> +     INTERVAL.  Note that that byte position may be in the middle of a
> +     character.  The value at index 0 is BEG.  */
> +  ptrdiff_t *charpos;

What is IDX here?  And what is INTERVAL?

> +  /* Number of valid entries in the above array.  This is always at least 1
> +     because the first entry is for BEG.  */
> +  size_t nentries;
> +
> +  /* Number of entries allocated.  */
> +  size_t capacity;

Why not ptrdiff_t?  AFAIR, unsigned types should be avoided, to
prevent problems when mixing them with signed types.

> +/* Value is true is known position cache of TI is valid.  */
                    ^^
Typo: "if".

> +static bool
> +is_cache_valid (const struct text_index *ti)
> +{
> +  return ti->cache.bytepos != TEXT_INDEX_INVALID_POSITION;
> +}

Shouldn't we also test against BEG_BYTE and Z_BYTE?

> +/* Return the byte position in index TI corresponding to index entry
> +   ENTRY.  Note that this position cab be in the middle of a multi-byte
                                      ^^^
"can"

> +/* Return the character position in index TI corresponding index entry
                                                ^^^^^^^^^^^^^
"corresponding to"

> +  /* Start at the byte position of the last index entry.  if TO_BYTEPOS
                                                             ^^
Capitalize "If".

> +  /* Loop over bytes, starting one after the index entry we start from
> +     because we are only interested in yet unknown entries, and the
> +     one at EMTRY can be assumed to stay unchanged.  */
               ^^^^^
"ENTRY"

> +  for (++bytepos; bytepos < z_byte; ++bytepos)
> +    {
> +      if (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)))
> +	++charpos;
> +
> +      if (bytepos == next_stop)
> +	{
> +	  /* Add a new index entry.  */
> +	  append_entry (ti, charpos);
> +
> +	  /* If we reached the one after the position we are interested
> +	     in, we're done since we can then scan forward and backward
> +	     to BYTEPOS.  */
> +	  if ((to.bytepos != TEXT_INDEX_INVALID_POSITION
> +	       && bytepos > to.bytepos)
> +	      || (to.charpos != TEXT_INDEX_INVALID_POSITION
> +		  && charpos > to.charpos))
> +	    break;
> +
> +	  /* Compute next stop. We are done if no next entry
> +	     can be built.  */
> +	  next_stop += TEXT_INDEX_INTERVAL;
> +	  if (next_stop >= z_byte)
> +	    break;
> +	}
> +    }

I think this loop could be optimized if it used BYTES_BY_CHAR_HEAD
instead of advancing one byte at a time.

> +static ptrdiff_t
> +charpos_forward_to_bytepos (struct buffer *b, const struct text_pos from,
> +			    const ptrdiff_t to_bytepos)
> +{
> +  eassert (from.bytepos <= to_bytepos);
> +  ptrdiff_t bytepos = from.bytepos;
> +  ptrdiff_t charpos = from.charpos;
> +  while (bytepos < to_bytepos)
> +    {
> +      ++bytepos;
> +      if (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)))
> +	++charpos;
> +    }
> +  return charpos;
> +}

Likewise here.

> +  ptrdiff_t charpos = from.charpos;

Please use CHARPOS(from) instead of accessing the struct members
directly (here and elsewhere).

> +  while (charpos < to_charpos)
> +    {
> +      ++bytepos;
> +      if (CHAR_HEAD_P (BUF_FETCH_BYTE (b, bytepos)))
> +	++charpos;
> +    }

Here we could also use BYTES_BY_CHAR_HEAD.

> +/* Improve the known bytepos bounds *PREV and *NEXT if KNOWN is closer
> +   to BYTEPOS.  */
> +
> +static void
> +narrow_charpos_bounds_1 (const struct text_pos known, struct text_pos *prev,
> +			 struct text_pos *next, const ptrdiff_t charpos)

The commentary mentions BYTEPOS, but the function doesn't have such an
argument.

> +/* Improve the known bytepos bounds *PREV and *NEXT of buffer B using
> +   known positions in B.  BYTEPOS is a byte position to convert to a
> +   character position.  */
> +
> +static void
> +narrow_charpos_bounds (struct buffer *b, struct text_pos *prev,
> +		       struct text_pos *next, const ptrdiff_t charpos)

Same here.

> +/* Return the character position in buffer B corresponding to
> +   byte position BYTEPOS.  */
> +
> +ptrdiff_t
> +buf_bytepos_to_charpos (struct buffer *b, const ptrdiff_t bytepos)
> +{
> +  /* FIXME: Can BYTEPOS ever be outside of BEGV_BYTE..ZV_BYTE?  */
> +  /* If this buffer has as many characters as bytes, each character must
> +     be one byte.  This takes care of the case where
> +     enable-multibyte-characters is nil.  */
> +  const struct text_pos z = z_pos (b);
> +  if (z.charpos == z.bytepos)
> +    return bytepos;
> +
> +  /* Begin with the interval (BEG, Z), and improve on that by taking known
> +     positions into account like PT, GPT and the cache.  This might
> +     already find the answer.  */
> +  struct text_index *ti = ensure_has_index (b);
> +  struct text_pos prev = beg_pos (b);
> +  struct text_pos next = z;
> +
> +  narrow_bytepos_bounds (b, &prev, &next, bytepos);
> +
> +  /* Z_BYTE does not have an index entry because we don't want
> +     extra entries for (Z, Z_BYTE), so short-circuit *before* looking
> +     up the index.  Changing that would be possible but leads to more
> +     code than this if-statement, so it's probably not worth it.  */
> +  if (next.bytepos == bytepos)
> +    return next.charpos;
> +
> +  ensure_bytepos_indexed (b, bytepos);
> +  const ptrdiff_t entry = index_bytepos_entry (ti, bytepos);
> +  narrow_bytepos_bounds_1 (index_text_pos (ti, entry), &prev, &next, bytepos);
> +  narrow_bytepos_bounds_1 (next_known_text_pos (b, entry),
> +			   &prev, &next, bytepos);
> +
> +  if (next.charpos - prev.charpos == next.bytepos - prev.bytepos
> +      /* Beware: NEXT and PREV can be in the middle of multibyte chars!  */
> +      && CHAR_HEAD_P (BUF_FETCH_BYTE (b, prev.bytepos)))
> +    return prev.charpos + (bytepos - prev.bytepos); /* ASCII-only!  */

AFAICT, you already handled the ASCII-only case at the beginning, so
why is this needed?

> +  /* Don't scan forward if CHARPOS is exactly on the previous know
                                                                 ^^^^
"known"

> +DEFUN ("text-index--charpos-to-bytepos", Ftext_index__charpos_to_bytepos,
> +       Stext_index__charpos_to_bytepos, 1, 1, 0,
> +       doc: /* Convert CHARPOS to a bytepos in current buffer.
> +If POSITION is out of range, the value is nil.  */)
> +  (Lisp_Object charpos)
> +{
> +  const EMACS_INT pos = fix_position (charpos);
> +  if (pos < BEG || pos > Z)
> +    return Qnil;
> +  ptrdiff_t bytepos = buf_charpos_to_bytepos (current_buffer, pos);
> +  return make_fixnum (bytepos);
> +}
> +
> +DEFUN ("text-index--bytepos-to-charpos", Ftext_index__bytepos_to_charpos,
> +       Stext_index__bytepos_to_charpos, 1, 1, 0,
> +       doc: /* Convert BYTEPOS to a charpos in current buffer.
> +If BYTEPOS is out of range, the value is nil.  */)
> +  (Lisp_Object bytepos)
> +{
> +  CHECK_FIXNUM (bytepos);
> +  const ptrdiff_t pos_byte = XFIXNUM (bytepos);
> +  if (pos_byte < BEG_BYTE || pos_byte > Z_BYTE)
> +    return Qnil;
> +  ptrdiff_t charpos = buf_bytepos_to_charpos (current_buffer, pos_byte);
> +  return make_fixnum (charpos);
> +}
> +
> +DEFUN ("text-index--charpos-to-bytepos-brute",
> +       Ftext_index__charpos_to_bytepos_brute,
> +       Stext_index__charpos_to_bytepos_brute, 1, 1, 0,
> +       doc: /* Convert CHARPOS to a bytepos in current buffer.
> +Compute with brute force.  */)
> +  (Lisp_Object pos)
> +{
> +  const EMACS_INT to_charpos = fix_position (pos);
> +  if (to_charpos < BEG || to_charpos > Z)
> +    return Qnil;
> +  ptrdiff_t charpos = BEG, bytepos = BEG_BYTE;
> +  while (charpos < to_charpos)
> +    {
> +      ++bytepos;
> +      if (CHAR_HEAD_P (FETCH_BYTE (bytepos)))
> +	++charpos;
> +    }
> +  return make_fixnum (bytepos);
> +}

Why are these functions needed?  We already have byte-to-position and
position-bytes, so at least the first two seem redundant.  When would
be the last two ones useful?

Thanks.




This bug report was last modified 104 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.