GNU bug report logs - #61726
[PATCH] Eglot: Support positionEncoding capability


Package: emacs;

Reported by: Augusto Stoffel <arstoffel <at> gmail.com>

Date: Thu, 23 Feb 2023 08:06:01 UTC

Severity: normal

Tags: patch

Done: João Távora <joaotavora <at> gmail.com>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: Augusto Stoffel <arstoffel <at> gmail.com>
Cc: 61726 <at> debbugs.gnu.org, joaotavora <at> gmail.com
Subject: bug#61726: [PATCH] Eglot: Support positionEncoding capability
Date: Thu, 23 Feb 2023 14:54:50 +0200
> From: Augusto Stoffel <arstoffel <at> gmail.com>
> Cc: 61726 <at> debbugs.gnu.org,  joaotavora <at> gmail.com
> Date: Thu, 23 Feb 2023 12:46:48 +0100
> 
> >> +(defun eglot--current-column-utf-8 ()
> >> +  "Calculate current column, counting bytes."
> >> +  (- (position-bytes (point)) (position-bytes (line-beginning-position))))
> >
> > This is subtly incorrect: position-bytes doesn't count UTF-8 bytes, it
> > counts the bytes in the internal representation Emacs uses for buffer
> > and string text.  The differences are minor and subtle, but not
> > negligible.
> 
> Right, if the buffer contains a char outside of the Unicode range, we
> lose.
> 
> But just to confirm: position-bytes and byte-to-position are always with
> respect to Emacs's internal extended UTF-8 representation and have
> nothing to do with the buffer file encoding, right?

Yes.  See bufferpos-to-filepos to get an idea of what hoops we need to
jump through to get it right, even just with UTF-8.
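
A minimal sketch of one way to count real UTF-8 bytes, assuming it is
acceptable to encode the text from line start to point on every call.
The name sketch-utf-8-column is hypothetical and not part of the patch
under review; encode-coding-region with a DESTINATION of t returns the
encoded bytes as a unibyte string instead of modifying the buffer.

```elisp
;; Hypothetical helper, for illustration only.
(defun sketch-utf-8-column ()
  "Return the number of UTF-8 bytes between line start and point."
  ;; Encode explicitly rather than subtracting `position-bytes' values,
  ;; which measure Emacs's internal representation, not UTF-8.
  (length (encode-coding-region (line-beginning-position) (point)
                                'utf-8-unix t)))

;; Usage: "a" is 1 byte, U+00E9 is 2 bytes, U+1F603 is 4 bytes.
(with-temp-buffer
  (insert "a\u00e9\U0001F603")
  (sketch-utf-8-column))   ; => 7
```

The cost is an encoding pass per call, which may matter on very long
lines.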

> > What does this stuff do with double-width or zero-width characters?
> > Emacs takes character-width into consideration when it counts columns,
> > but it is unclear to me what LSP servers do in those cases.
> > Likewise with characters that are composed on display.
> 
> `eglot-move-to-column' is supposed to count Unicode codepoints, so
> e.g. x, ⇒ and 😃 all contribute 1 unit.

But if the resulting column is then used in move-to-column etc., it
might go to the wrong column, because in Emacs each column is not
necessarily a single codepoint.  The simplest example is a TAB
character, but there are more examples, some of which are quite
complicated (see below).
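
The TAB case can be sketched as follows, assuming the default tab-width
of 8.  The helper sketch-move-to-codepoint-offset is hypothetical: it
moves by characters from the line start, which is not the same thing as
what move-to-column does with display columns.

```elisp
;; Hypothetical helper, for illustration only.
(defun sketch-move-to-codepoint-offset (n)
  "Move point N characters past line start, clamped at end of line.
Return the character offset actually reached."
  (goto-char (min (+ (line-beginning-position) n) (line-end-position)))
  (- (point) (line-beginning-position)))

;; One codepoint (the TAB) has been passed, yet the display column
;; reported by `current-column' is the tab stop, not 1.
(with-temp-buffer
  (insert "\tx")
  (list (sketch-move-to-codepoint-offset 1)
        (current-column)))
```

With the default tab-width this should evaluate to (1 8), so a server
column of 1 fed to move-to-column would land inside the TAB rather than
after it.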

> On the other hand, the Emoji
> 🧛‍♀️ contributes 4 units. This is independent of the screen display.

Not in Emacs.

> By the way, I don't understand your claim about column counting.  If I
> move point over 🧛‍♀️, the mode line column count increments by 3 units,
> which seems to make no sense: this Emoji is 4 codepoints long and
> occupies 1 screen column.  What's the logic here?

If that is what you see, it could be a bug.  Does current-column agree
with what you see in the mode line?

In general, characters (codepoints) that are composed on display into
a single glyph or "grapheme cluster" are supposed to be counted as a
single column.  Try typing this in "emacs -Q":

  a C-x 8 RET COMBINING ACUTE ACCENT RET

If your default font is capable enough, you will see a single glyph of
'a' with acute accent (á), and it will count as 1 column, although
there are 2 codepoints in the buffer.  And "M-: (move-to-column 1) RET"
will move past both codepoints.  Now imagine that we get such sequences
from the LSP server -- what will Eglot do in terms of column counting?
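
For comparison, here is a rough sketch of how such a sequence measures
in the units an LSP position can be expressed in.  It assumes that
code units are counted purely on the text, with no regard to display
composition.  sketch-utf-16-units is a hypothetical helper; the
utf-16le coding system is used because it emits no byte-order mark.

```elisp
;; Hypothetical helper, for illustration only.
(defun sketch-utf-16-units (string)
  "Return the number of UTF-16 code units needed to encode STRING."
  ;; Every UTF-16 code unit is two bytes; utf-16le adds no BOM.
  (/ (length (encode-coding-string string 'utf-16le)) 2))

(length "a\u0301")                  ; => 2 codepoints
(sketch-utf-16-units "a\u0301")     ; => 2 code units
(sketch-utf-16-units "\U0001F603")  ; => 2 code units (surrogate pair)
```

So 'a' plus COMBINING ACUTE ACCENT is 2 units in either counting, even
though it occupies a single column once composed on display.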




This bug report was last modified 2 years and 137 days ago.


