GNU bug report logs - #74963
Ambiguous treesit named and anonymous nodes in ruby-ts-mode

Package: emacs

Reported by: Juri Linkov <juri <at> linkov.net>

Date: Thu, 19 Dec 2024 07:20:02 UTC

Severity: normal

To reply to this bug, email your comments to 74963 AT debbugs.gnu.org.

Report forwarded to casouri <at> gmail.com, dmitry <at> gutov.dev, bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Thu, 19 Dec 2024 07:20:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Juri Linkov <juri <at> linkov.net>:
New bug report received and forwarded. Copy sent to casouri <at> gmail.com, dmitry <at> gutov.dev, bug-gnu-emacs <at> gnu.org. (Thu, 19 Dec 2024 07:20:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: bug-gnu-emacs <at> gnu.org
Subject: Ambiguous treesit named and anonymous nodes in ruby-ts-mode
Date: Thu, 19 Dec 2024 09:18:37 +0200

[This is a separate bug report from bug#73404]

>> While testing treesit-forward-sexp-list, I discovered that
>> thing-navigation functions are not restricted to named nodes.
>> 
>> I wonder if there is a reason to find anonymous nodes as things?
>
> We should rather ask is there any reason to not find anonymous nodes
> as things? Even ruby-ts-mode defines a bunch of anonymous nodes as
> sexp, no? In any case, excluding anonymous nodes from things doesn’t
> sound right.

Indeed, there are many anonymous nodes used in ruby-ts-mode.

>> The problem was found with the node "unless" in Ruby:
>> 
>>  unless cond
>>    a += 1
>>  else
>>    b -= 1
>>  end
>> 
>> Here the named node 'unless' has exactly the same name
>> as the anonymous node with the text "unless":
>> 
>>  (unless "unless" condition: (identifier)
>
> I feel like Ruby’s grammar should call the named node something else,
> like unless_statement.

Agreed, the problem is that nodes defined in Ruby’s grammar
are too ambiguous.  There are more such nodes with the same name
for named and anonymous: "if", "while", "until", etc.

>> Finding anonymous nodes breaks forward-sexp when point is on "unless":
>> 
>>  un-!-less cond
>>    a += 1
>>  else
>>    b -= 1
>>  end
>> 
>> because (treesit-thing-at (point) 'sexp t) finds
>> 
>>  #<treesit-node "unless" in 156-162>
>> 
>> instead of
>> 
>>  #<treesit-node unless in 156-203>
>> 
>> Also this breaks backward-sexp and backward-up-list
>> because treesit--thing-sibling finds
>> the anonymous node "unless" as a previous sibling
>> instead of the named node 'unless' as a parent.
>> 
>> Would the right solution be to check if the found thing
>> is a named node?  With something like:
>> 
>> diff --git a/lisp/treesit.el b/lisp/treesit.el
>> index 18200acf53f..9ad879ee40c 100644
>> --- a/lisp/treesit.el
>> +++ b/lisp/treesit.el
>> @@ -2711,6 +2774,7 @@ treesit--thing-sibling
>>                      (lambda (n) (>= (treesit-node-start n) pos))))
>>          (iter-pred (lambda (node)
>>                       (and (treesit-node-match-p node thing t)
>> +                           (treesit-node-check node 'named)
>>                            (funcall pos-pred node))))
>>          (sibling nil))
>>     (when cursor
>> @@ -2760,6 +2824,7 @@ treesit-thing-at
>>   (let* ((cursor (treesit-node-at pos))
>>          (iter-pred (lambda (node)
>>                       (and (treesit-node-match-p node thing t)
>> +                           (treesit-node-check node 'named)
>>                            (if strict
>>                                (< (treesit-node-start node) pos)
>>                              (<= (treesit-node-start node) pos))
>
> A better solution IMO is to add some way to distinguish between named and
> anonymous nodes. I can think of two ways, either add “and” and
> “named/anonymous” predicate, so (and named “unless”) only matches the named
> “unless” node; or we add a special syntax such that “(unless)” only matches
> named nodes, and “\”unless\”” only matches anonymous nodes.

Either predicate or a special syntax is welcome.

This would be more handy than writing a lambda with implicit calls
of treesit-node-check.
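
For illustration, the lambda-style workaround mentioned above could look roughly like the sketch below.  The helper name is hypothetical, and it assumes only the existing treesit-node-check and treesit-node-type functions:

```elisp
;; Sketch only: `my-named-unless-p' is a hypothetical helper name,
;; not something defined in treesit.el or ruby-ts-mode.el.
(defun my-named-unless-p (node)
  "Return non-nil if NODE is the named \"unless\" node.
The anonymous keyword token \"unless\" has the same type string,
so the type check alone cannot tell the two apart."
  (and (treesit-node-check node 'named)
       (equal (treesit-node-type node) "unless")))

;; A function is accepted as a thing predicate, so a mode could use it as:
;; (setq-local treesit-thing-settings
;;             `((ruby (sexp ,#'my-named-unless-p))))
```

A dedicated predicate or syntax would express the same condition without the extra function.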

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Tue, 24 Dec 2024 03:04:02 GMT) Full text and rfc822 format available.

Message #8 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Juri Linkov <juri <at> linkov.net>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Mon, 23 Dec 2024 19:02:28 -0800


> On Dec 18, 2024, at 11:18 PM, Juri Linkov <juri <at> linkov.net> wrote:
> 
> [This is a separate bug report from bug#73404]
> 
>>> While testing treesit-forward-sexp-list, I discovered that
>>> thing-navigation functions are not restricted to named nodes.
>>> 
>>> I wonder if there is a reason to find anonymous nodes as things?
>> 
>> We should rather ask is there any reason to not find anonymous nodes
>> as things? Even ruby-ts-mode defines a bunch of anonymous nodes as
>> sexp, no? In any case, excluding anonymous nodes from things doesn’t
>> sound right.
> 
> Indeed, there are many anonymous nodes used in ruby-ts-mode.
> 
>>> The problem was found with the node "unless" in Ruby:
>>> 
>>> unless cond
>>>   a += 1
>>> else
>>>   b -= 1
>>> end
>>> 
>>> Here the named node 'unless' has exactly the same name
>>> as the anonymous node with the text "unless":
>>> 
>>> (unless "unless" condition: (identifier)
>> 
>> I feel like Ruby’s grammar should call the named node something else,
>> like unless_statement.
> 
> Agreed, the problem is that nodes defined in Ruby’s grammar
> are too ambiguous.  There are more such nodes with the same name
> for named and anonymous: "if", "while", "until", etc.
> 
>>> Finding anonymous nodes breaks forward-sexp when point is on "unless":
>>> 
>>> un-!-less cond
>>>   a += 1
>>> else
>>>   b -= 1
>>> end
>>> 
>>> because (treesit-thing-at (point) 'sexp t) finds
>>> 
>>> #<treesit-node "unless" in 156-162>
>>> 
>>> instead of
>>> 
>>> #<treesit-node unless in 156-203>
>>> 
>>> Also this breaks backward-sexp and backward-up-list
>>> because treesit--thing-sibling finds
>>> the anonymous node "unless" as a previous sibling
>>> instead of the named node 'unless' as a parent.
>>> 
>>> Would the right solution be to check if the found thing
>>> is a named node?  With something like:
>>> 
>>> diff --git a/lisp/treesit.el b/lisp/treesit.el
>>> index 18200acf53f..9ad879ee40c 100644
>>> --- a/lisp/treesit.el
>>> +++ b/lisp/treesit.el
>>> @@ -2711,6 +2774,7 @@ treesit--thing-sibling
>>>                     (lambda (n) (>= (treesit-node-start n) pos))))
>>>         (iter-pred (lambda (node)
>>>                      (and (treesit-node-match-p node thing t)
>>> +                           (treesit-node-check node 'named)
>>>                           (funcall pos-pred node))))
>>>         (sibling nil))
>>>    (when cursor
>>> @@ -2760,6 +2824,7 @@ treesit-thing-at
>>>  (let* ((cursor (treesit-node-at pos))
>>>         (iter-pred (lambda (node)
>>>                      (and (treesit-node-match-p node thing t)
>>> +                           (treesit-node-check node 'named)
>>>                           (if strict
>>>                               (< (treesit-node-start node) pos)
>>>                             (<= (treesit-node-start node) pos))
>> 
>> A better solution IMO is to add some way to distinguish between named and
>> anonymous nodes. I can think of two ways, either add “and” and
>> “named/anonymous” predicate, so (and named “unless”) only matches the named
>> “unless” node; or we add a special syntax such that “(unless)” only matches
>> named nodes, and “\”unless\”” only matches anonymous nodes.
> 
> Either predicate or a special syntax is welcome.
> 
> This would be more handy than writing a lambda with implicit calls
> of treesit-node-check.

I’ll go with the (and named “unless”) route because after thinking about it more, “(unless)” will be hard to work with because the string predicate is actually a regexp.

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Tue, 24 Dec 2024 07:19:01 GMT) Full text and rfc822 format available.

Message #11 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Tue, 24 Dec 2024 09:17:33 +0200

>>> A better solution IMO is to add some way to distinguish between named and
>>> anonymous nodes. I can think of two ways, either add “and” and
>>> “named/anonymous” predicate, so (and named “unless”) only matches the named
>>> “unless” node; or we add a special syntax such that “(unless)” only matches
>>> named nodes, and “\”unless\”” only matches anonymous nodes.
>> 
>> Either predicate or a special syntax is welcome.
>> 
>> This would be more handy than writing a lambda with implicit calls
>> of treesit-node-check.
>
> I’ll go with the (and named “unless”) route because after thinking
> about it more, “(unless)” will be hard to work with because the string
> predicate is actually a regexp.

Thanks.  While addition of '(and named "unless")' would be appreciated,
I see that currently it's possible to do this by providing a predicate
like there is 'ruby-ts--sexp-p' in

  (setq-local treesit-thing-settings
              `((ruby
                 (sexp ,(cons (rx
                               bol
                               (or
                                "class"
                                ...
                                )
                               eol)
                              #'ruby-ts--sexp-p))

Then 'ruby-ts--sexp-p' could check for the named node "unless" as well.

But it seems such a solution is less efficient than adding '(and named "unless")'.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Tue, 24 Dec 2024 07:44:01 GMT) Full text and rfc822 format available.

Message #14 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Tue, 24 Dec 2024 09:41:58 +0200

>   (setq-local treesit-thing-settings
>               `((ruby
>                  (sexp ,(cons (rx
>                                bol
>                                (or
>                                 "class"
>                                 ...
>                                 )
>                                eol)
>                               #'ruby-ts--sexp-p))

BTW, I just fixed a bug in typescript-ts-mode
where "string_fragment" was mismatched by "string",
because its regexp-opt matched node names too widely,
so needed to enclose in regexp anchors.

I see that all ts-modes solve this common problem each in its own way
(here 'list' indicates a list of strings that should match node names):

  c-ts-mode:    (regexp-opt list 'symbols)
  js-ts-mode:   (concat "\\_<" (regexp-opt list t) "\\_>")
  java-ts-mode: (rx (or list))
  ruby-ts-mode: (rx bol (or list) eol)

Currently there is no uniform way to handle this frequent need.
'concat' like above looks too ugly, but 'regexp-opt' with the
'symbols' arg produces a strange regexp for matching symbols.

Maybe it would be better to create a new argument for 'regexp-opt', e.g.:

  (regexp-opt list 'complete)

that will expand to:

  (concat "^" (regexp-opt list) "$")
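
A minimal sketch of the mismatch described above, using plain string matching so that it runs without any grammar.  The node list and the two helper names are made up for the example; \` and \' are used as anchors here because they match only at the very start and end of the string:

```elisp
;; Hypothetical node list, for illustration only.
(defvar my-sexp-nodes '("string" "array")
  "Example node type names; not taken from any real ts-mode.")

(defun my-lax-regexp (strings)
  "Return an unanchored regexp matching any of STRINGS as a substring."
  (regexp-opt strings))

(defun my-strict-regexp (strings)
  "Return a regexp matching only a whole string equal to one of STRINGS."
  (concat "\\`" (regexp-opt strings) "\\'"))

;; Unanchored: "string" is found inside "string_fragment" (non-nil).
(string-match-p (my-lax-regexp my-sexp-nodes) "string_fragment")
;; Anchored: "string_fragment" no longer matches (nil).
(string-match-p (my-strict-regexp my-sexp-nodes) "string_fragment")
;; Anchored: the exact node name still matches (non-nil).
(string-match-p (my-strict-regexp my-sexp-nodes) "string")
```

Anchoring only narrows the match on the type string; it does not distinguish a named node from an anonymous one with the same type.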

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Tue, 24 Dec 2024 17:55:01 GMT) Full text and rfc822 format available.

Message #17 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Tue, 24 Dec 2024 19:52:53 +0200

> I’ll go with the (and named “unless”) route because after thinking
> about it more, “(unless)” will be hard to work with because the string
> predicate is actually a regexp.

Is it possible to mark all node names specified in treesit-thing-settings
as named?

I just discovered a new problem:

1. With typescript-ts-mode on the following snippet:

type NodeInfo =
  | (BaseNode & {
      subtypes: BaseNode[];
    })
  | (BaseNode & {
      fields: { [name: string]: ChildNode };
      children: ChildNode[];
    });

You can move point inside "string" and type C-M-f or C-M-b.
But point doesn't move.

This is because treesit-thing-settings defines a named node "string".
But an anonymous node has the same name "string":

           (index_signature [ name: (identifier) :
            index_type: (predefined_type string)

and (treesit-node-at (point)) returns
#<treesit-node "string" in 111-117>

This mismatched "string" in TypeScript is even more
unexpected than "unless" in Ruby.

So probably we need a way to mark all used nodes as named
to avoid such unexpected matches.  Maybe matching anonymous nodes
should be opt-in, and by default match only named nodes.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Tue, 24 Dec 2024 21:05:02 GMT) Full text and rfc822 format available.

Message #20 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Juri Linkov <juri <at> linkov.net>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Tue, 24 Dec 2024 13:03:45 -0800


> On Dec 24, 2024, at 9:52 AM, Juri Linkov <juri <at> linkov.net> wrote:
> 
>> I’ll go with the (and named “unless”) route because after thinking
>> about it more, “(unless)” will be hard to work with because the string
>> predicate is actually a regexp.
> 
> Is it possible to mark all node names specified in treesit-thing-settings
> as named?
> 
> I just discovered a new problem:
> 
> 1. With typescript-ts-mode on the following snippet:
> 
> type NodeInfo =
>  | (BaseNode & {
>      subtypes: BaseNode[];
>    })
>  | (BaseNode & {
>      fields: { [name: string]: ChildNode };
>      children: ChildNode[];
>    });
> 
> You can move point inside "string" and type C-M-f or C-M-b.
> But point doesn't move.
> 
> This is because treesit-thing-settings defines a named node "string".
> But an anonymous node has the same name "string":
> 
>           (index_signature [ name: (identifier) :
>            index_type: (predefined_type string)
> 
> and (treesit-node-at (point)) returns
> #<treesit-node "string" in 111-117>
> 
> This mismatched "string" in TypeScript is even more
> unexpected than "unless" in Ruby.
> 
> So probably we need a way to mark all used nodes as named
> to avoid such unexpected matches.  Maybe matching anonymous nodes
> should be opt-in, and by default match only named nodes.

IMHO this is just an unfortunate bug that needs to be fixed. I agree that this type of bug is hard to avoid, which is a bad thing, but that doesn’t mean we should try to alleviate it at any cost. Making predicates named by default just adds complexity and inflexibility for not much benefit.

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Wed, 25 Dec 2024 03:26:01 GMT) Full text and rfc822 format available.

Message #23 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dmitry <at> gutov.dev>
To: Juri Linkov <juri <at> linkov.net>, Yuan Fu <casouri <at> gmail.com>
Cc: 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Wed, 25 Dec 2024 05:25:24 +0200

Hi Juri,

On 24/12/2024 09:17, Juri Linkov wrote:
> While addition of '(and named "unless")' would be appreciated,
> I see that currently it's possible to do this by providing a predicate
> like there is 'ruby-ts--sexp-p' in
> 
>    (setq-local treesit-thing-settings
>                `((ruby
>                   (sexp ,(cons (rx
>                                 bol
>                                 (or
>                                  "class"
>                                  ...
>                                  )
>                                 eol)
>                                #'ruby-ts--sexp-p))
> 
> Then 'ruby-ts--sexp-p' could check for the named node "unless" as well.
> 
> But it seems such a solution is less efficient than adding '(and named "unless")'.

Given that we're already calling a predicate every time (in 
ruby-ts-mode), we might as well add one more check. See the patch at the 
end.

Speaking of tricky examples though, here's a definition:

  module Bar
    class Foo
      def baz
      end
    end
  end

If you move point inside the keyword "module" or "class", C-M-f wouldn't 
move forward either as of the latest master. No such problem with "def".

Adding the check for "named" fixes the first two cases, but then C-M-f 
inside "def" jumps to after "baz". Could be worked around with a 
special case, but I wonder what this difference comes from (haven't 
properly debugged yet).

diff --git a/lisp/progmodes/ruby-ts-mode.el b/lisp/progmodes/ruby-ts-mode.el
index 4ef0cb18eae..4b15c6cbf27 100644
--- a/lisp/progmodes/ruby-ts-mode.el
+++ b/lisp/progmodes/ruby-ts-mode.el
@@ -1120,6 +1120,10 @@ ruby-ts--sexp-p
       (equal (treesit-node-type (treesit-node-child node 0))
              "(")))

+(defun ruby-ts--sexp-list-p (node)
+  (when (treesit-node-check node 'named)
+    (ruby-ts--sexp-p node)))
+
 (defvar-keymap ruby-ts-mode-map
   :doc "Keymap used in Ruby mode"
   :parent prog-mode-map
@@ -1235,7 +1239,7 @@ ruby-ts-mode
                            "array"
                            "hash")
                           eol)
-                         #'ruby-ts--sexp-p))
+                         #'ruby-ts--sexp-list-p))
                  (text ,(lambda (node)
                           (or (member (treesit-node-type node)
                                       '("comment" "string_content" 
"heredoc_content"))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Wed, 25 Dec 2024 08:11:02 GMT) Full text and rfc822 format available.

Message #26 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Wed, 25 Dec 2024 09:49:24 +0200

>> This mismatched "string" in TypeScript is even more
>> unexpected than "unless" in Ruby.
>> 
>> So probably we need a way to mark all used nodes as named
>> to avoid such unexpected matches.  Maybe matching anonymous nodes
>> should be opt-in, and by default match only named nodes.
>
> IMHO this is just an unfortunate bug that needs to be fixed. I agree that
> this type of bug is hard to avoid, which is a bad thing, but that doesn’t
> mean we should try to alleviate it at any cost. Making predicates named by
> default just adds complexity and inflexibility for not much benefit.

Not sure if a possible flexibility is better than unintended matches.

When the authors of a ts-mode carefully selected a list of named nodes to match,
why should treesit try to match some random and unintended anonymous nodes?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Wed, 25 Dec 2024 08:12:03 GMT) Full text and rfc822 format available.

Message #29 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Dmitry Gutov <dmitry <at> gutov.dev>
Cc: Yuan Fu <casouri <at> gmail.com>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Wed, 25 Dec 2024 09:52:33 +0200

>> While addition of '(and named "unless")' would be appreciated,
>> I see that currently it's possible to do this by providing a predicate
>> like there is 'ruby-ts--sexp-p' in
>> 
>>    (setq-local treesit-thing-settings
>>                `((ruby
>>                   (sexp ,(cons (rx
>>                                 bol
>>                                 (or
>>                                  "class"
>>                                  ...
>>                                  )
>>                                 eol)
>>                                #'ruby-ts--sexp-p))
>> 
>> Then 'ruby-ts--sexp-p' could check for the named node "unless" as well.
>> 
>> But it seems such a solution is less efficient than adding '(and named "unless")'.
>
> Given that we're already calling a predicate every time (in 
> ruby-ts-mode), we might as well add one more check. See the patch at the 
> end.

Thanks, I tried the patch.  It was broken, so needed to edit manually.
Also the new key 'w' doesn't work in diff buffers, need to fix it as well.

> Speaking of tricky examples though, here's a definition:
>
>    module Bar
>      class Foo
>        def baz
>        end
>      end
>    end
>
> If you move point inside the keyword "module" or "class", C-M-f wouldn't 
> move forward either as of the latest master. No such problem with "def".
>
> Adding the check for "named" fixes the first two cases, but then C-M-f 
> inside "def" jumps to after "baz". Could be worked around with a 
> special case, but I wonder what this difference comes from (haven't 
> properly debugged yet).

I see no problems with your patch.  Everything works nicely.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Wed, 25 Dec 2024 09:13:02 GMT) Full text and rfc822 format available.

Message #32 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Juri Linkov <juri <at> linkov.net>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Wed, 25 Dec 2024 01:11:32 -0800

> On Dec 24, 2024, at 11:49 PM, Juri Linkov <juri <at> linkov.net> wrote:
> 
>>> This mismatched "string" in TypeScript is even more
>>> unexpected than "unless" in Ruby.
>>> 
>>> So probably we need a way to mark all used nodes as named
>>> to avoid such unexpected matches.  Maybe matching anonymous nodes
>>> should be opt-in, and by default match only named nodes.
>> 
>> IMHO this is just an unfortunate bug that needs to be fixed. I agree that
>> this type of bug is hard to avoid, which is a bad thing, but that doesn’t
>> mean we should try to alleviate it at any cost. Making predicates named by
>> default just adds complexity and inflexibility for not much benefit.
> 
> Not sure if a possible flexibility is better than unintended matches.
> 
> When the authors of a ts-mode carefully selected a list of named nodes to match,
> why should treesit try to match some random and unintended anonymous nodes?

I don’t know and can’t prove how much the flexibility is worth, but the cost on complexity is real. If everywhere else uses thing predicates as-is, but sexp navigation auto-converts thing predicates into named predicate, that’s a cognitive burden and a special case that’s guaranteed to trip people over.

OTOH, what’s the downside of wrapping the sexp predicate with (and named …), if you only want named nodes to match?

I just think the cost outweighs the benefit, if there is any to begin with.

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Wed, 25 Dec 2024 17:51:02 GMT) Full text and rfc822 format available.

Message #35 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Wed, 25 Dec 2024 19:39:28 +0200

>> Not sure if a possible flexibility is better than unintended matches.
>> 
>> When the authors of a ts-mode carefully selected a list of named nodes to match,
>> why should treesit try to match some random and unintended anonymous nodes?
>
> I don’t know and can’t prove how much the flexibility is worth, but the
> cost on complexity is real. If everywhere else uses thing predicates as-is,
> but sexp navigation auto-converts thing predicates into named predicate,
> that’s a cognitive burden and a special case that’s guaranteed to trip
> people over.
>
> OTOH, what’s the downside of wrapping the sexp predicate with (and named …),
> if you only want named nodes to match?
>
> I just think the cost outweighs the benefit, if there is any to begin with.

Actually, what I had in mind is not to enable named-only mode by default,
but only to allow the authors of ts-modes to specify this condition.
For example, if it were possible to write

  (setq-local treesit-thing-settings
              `((typescript
                 (sexp (and named ,(regexp-opt typescript-ts-mode--sexp-nodes 'symbols))))))

this should be fine.  This is similar to how the authors of ts-modes
decide whether to restrict matches to exact names by using
"^...$" with regexp-opt.

BTW, I'm thinking about adding a simple helper like this:

  (defun treesit-regexp-opt (strings)
    (concat "^" (regexp-opt strings) "$"))

to use like this:

  (setq-local treesit-thing-settings
                `((typescript
                   (sexp (and named ,(treesit-regexp-opt typescript-ts-mode--sexp-nodes))))))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Thu, 26 Dec 2024 01:01:02 GMT) Full text and rfc822 format available.

Message #38 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Dmitry Gutov <dmitry <at> gutov.dev>
To: Juri Linkov <juri <at> linkov.net>
Cc: Yuan Fu <casouri <at> gmail.com>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Thu, 26 Dec 2024 03:00:38 +0200

On 25/12/2024 09:52, Juri Linkov wrote:

>> Given that we're already calling a predicate every time (in
>> ruby-ts-mode), we might as well add one more check. See the patch at the
>> end.
> 
> Thanks, I tried the patch.  It was broken, so needed to edit manually.

Maybe something regarding whitespace at the end?

> Also the new key 'w' doesn't work in diff buffers, need to fix it as well.

The binding for 'diff-kill-ring-save'? Seems to work here, as long as 
the diff buffer is in read-only mode.

>> Adding the check for "named" fixes the first two cases, but then C-M-f
>> inside "def" jumps to after "baz". Could be worked around with a
>> special case, but I wonder what this difference comes from (haven't
>> properly debugged yet).
> 
> I see no problems with your patch.  Everything works nicely.

Hmm, I can't reproduce it either anymore.

Thanks for testing, pushed to master now (unfortunately the commit 
message refers to bug#73404).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Fri, 27 Dec 2024 07:47:02 GMT) Full text and rfc822 format available.

Message #41 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Dmitry Gutov <dmitry <at> gutov.dev>
Cc: Yuan Fu <casouri <at> gmail.com>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Fri, 27 Dec 2024 09:42:47 +0200

>>> Given that we're already calling a predicate every time (in
>>> ruby-ts-mode), we might as well add one more check. See the patch at the
>>> end.
>> Thanks, I tried the patch.  It was broken, so needed to edit manually.
>
> Maybe something regarding whitespace at the end?

Something with whitespace, but not a big problem.

>> Also the new key 'w' doesn't work in diff buffers, need to fix it as well.
>
> The binding for 'diff-kill-ring-save'? Seems to work here, as long as the
> diff buffer is in read-only mode.

Yes, 'W' with 'diff-kill-ring-save'.  Single keys are still a problem
in visited diff files.

>>> Adding the check for "named" fixes the first two cases, but then C-M-f
>>> inside "def" jumps to after "baz". Could be worked around with a
>>> special case, but I wonder what this difference comes from (haven't
>>> properly debugged yet).
>> I see no problems with your patch.  Everything works nicely.
>
> Hmm, I can't reproduce it either anymore.
>
> Thanks for testing, pushed to master now (unfortunately the commit message
> refers to bug#73404).

Thanks.  Maybe a helper for other ts-modes will be handy:

  (defun treesit-node-named (node)
    (treesit-node-check node 'named))

to be used like this

  (sexp ,(cons
          (treesit-match-nodes strings)
          'treesit-node-named))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Mon, 13 Jan 2025 07:40:02 GMT) Full text and rfc822 format available.

Message #44 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Mon, 13 Jan 2025 09:31:01 +0200

> I see that all ts-modes solve this common problem each in its own way
> (here 'list' indicates a list of strings that should match node names):
>
>   c-ts-mode:    (regexp-opt list 'symbols)
>   js-ts-mode:   (concat "\\_<" (regexp-opt list t) "\\_>")
>   java-ts-mode: (rx (or list))
>   ruby-ts-mode: (rx bol (or list) eol)
>
> Currently there is no uniform way to handle this frequent need.
> 'concat' like above looks too ugly, but 'regexp-opt' with the
> 'symbols' arg produces a strange regexp for matching symbols.

I was thinking about adding two functions treesit-regexp-strict
and treesit-regexp-lax.  But then discovered that some things
require specifying both strict and lax matches for the same thing.
For example, take treesit-thing-settings from c-ts-mode:

    (sentence
     ,(regexp-opt '("preproc"
                    "declaration"
                    "specifier"
                    "attributed_statement"
                    "labeled_statement"
                    "expression_statement"
                    "if_statement"
                    "switch_statement"
                    "do_statement"
                    "while_statement"
                    "for_statement"
                    "return_statement"
                    "break_statement"
                    "continue_statement"
                    "goto_statement"
                    "case_statement")))

"preproc" can be lax, this is fine to match all preprocessor directives.
But "declaration" should be strict and should not match "parameter_declaration".
Also "specifier" should not match "attribute_specifier" and "storage_class_specifier",
but only "enum_specifier" and "union_specifier" that end with a semicolon.
Also no need to specify all statements separately, it should be sufficient
to use lax match with "statement".

The most expressive language to specify all these requirements is the rx macro,
so let's use it in ts-modes.  Here is what the 'sentence' thing will look like:

    (sentence
     ,(rx (or (and bos (or "declaration"
                           "enum_specifier"
                           "union_specifier")
                   eos)
              (or "preproc"
                  "statement"))))
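
As a quick check of what this regexp accepts, the sketch below binds the same rx form to a hypothetical constant and tries it against a few node type strings.  It uses plain string matching only, so it says nothing about named versus anonymous nodes:

```elisp
;; The rx form is the one proposed above; the names `my-c-sentence-regexp'
;; and `my-c-sentence-p' are hypothetical and used only for this check.
(defconst my-c-sentence-regexp
  (rx (or (and bos (or "declaration"
                       "enum_specifier"
                       "union_specifier")
               eos)
          (or "preproc"
              "statement")))
  "Strict match for three exact names, lax match for two substrings.")

(defun my-c-sentence-p (type)
  "Return t if the node type string TYPE matches the sentence regexp."
  (and (string-match-p my-c-sentence-regexp type) t))

(my-c-sentence-p "declaration")            ; t, exact name
(my-c-sentence-p "parameter_declaration")  ; nil, strict part is anchored
(my-c-sentence-p "enum_specifier")         ; t, exact name
(my-c-sentence-p "attribute_specifier")    ; nil, not in the strict list
(my-c-sentence-p "if_statement")           ; t, lax match on "statement"
(my-c-sentence-p "preproc_include")        ; t, lax match on "preproc"
```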

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Mon, 13 Jan 2025 07:48:02 GMT) Full text and rfc822 format available.

Message #47 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Juri Linkov <juri <at> linkov.net>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Sun, 12 Jan 2025 23:47:33 -0800


> On Jan 12, 2025, at 11:31 PM, Juri Linkov <juri <at> linkov.net> wrote:
> 
>> I see that all ts-modes solve this common problem each in its own way
>> (here 'list' indicates a list of strings that should match node names):
>> 
>>  c-ts-mode:    (regexp-opt list 'symbols)
>>  js-ts-mode:   (concat "\\_<" (regexp-opt list t) "\\_>")
>>  java-ts-mode: (rx (or list))
>>  ruby-ts-mode: (rx bol (or list) eol)
>> 
>> Currently there is no uniform way to handle this frequent need.
>> 'concat' like above looks too ugly, but 'regexp-opt' with the
>> 'symbols' arg produces a strange regexp for matching symbols.
> 
> I was thinking about adding two functions treesit-regexp-strict
> and treesit-regexp-lax.  But then discovered that some things
> require specifying both strict and lax matches for the same thing.
> For example, take treesit-thing-settings from c-ts-mode:
> 
>    (sentence
>     ,(regexp-opt '("preproc"
>                    "declaration"
>                    "specifier"
>                    "attributed_statement"
>                    "labeled_statement"
>                    "expression_statement"
>                    "if_statement"
>                    "switch_statement"
>                    "do_statement"
>                    "while_statement"
>                    "for_statement"
>                    "return_statement"
>                    "break_statement"
>                    "continue_statement"
>                    "goto_statement"
>                    "case_statement")))
> 
> "preproc" can be lax, this is fine to match all preprocessor directives.
> But "declaration" should be strict and should not match "parameter_declaration".
> Also "specifier" should not match "attribute_specifier" and "storage_class_specifier",
> but only "enum_specifier" and "union_specifier" that end with the semicolon.
> Also no need to specify all statements separately, it should be sufficient
> to use lax match with "statement".
> 
> The most expressive language to specify all these requirements is the rx macro,
> so let's use it in ts-modes.  Here is what the 'sentence' thing will look like:
> 
>    (sentence
>     ,(rx (or (and bos (or "declaration"
>                           "enum_specifier"
>                           "union_specifier")
>                   eos)
>              (or "preproc"
>                  "statement"))))

Looks good. I’ve always used rx, it has the additional benefit of being macro expanded at compile time.

Also, I finally added support for ‘and’, ‘named’ and ‘anonymous’. I haven’t tested it yet (sorry).

Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Tue, 14 Jan 2025 07:56:02 GMT) Full text and rfc822 format available.

Message #50 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Tue, 14 Jan 2025 09:52:13 +0200

>>    (sentence
>>     ,(rx (or (and bos (or "declaration"
>>                           "enum_specifier"
>>                           "union_specifier")
>>                   eos)
>>              (or "preproc"
>>                  "statement"))))
>
> Looks good. I’ve always used rx, it has the additional benefit of being macro expanded at compile time.
>
> Also, I finally added support for ‘and’, ‘named’ and ‘anonymous’. I haven’t tested it yet (sorry).

Thanks!  Does it make sense also to add predicates to define
whether the node names should be matched completely?
Then maybe add two separate predicates for strict and lax matching,
e.g. for the same thing as above:

  (sentence
   (or (strict ,(rx (or "declaration"
                        "enum_specifier"
                        "union_specifier")))
       (lax ,(rx (or "preproc"
                     "statement")))))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Sat, 18 Jan 2025 08:01:02 GMT) Full text and rfc822 format available.

Message #53 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Juri Linkov <juri <at> linkov.net>
Cc: Dmitry Gutov <dmitry <at> gutov.dev>, 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Fri, 17 Jan 2025 23:59:52 -0800


> On Jan 13, 2025, at 11:52 PM, Juri Linkov <juri <at> linkov.net> wrote:
> 
>>>   (sentence
>>>    ,(rx (or (and bos (or "declaration"
>>>                          "enum_specifier"
>>>                          "union_specifier")
>>>                  eos)
>>>             (or "preproc"
>>>                 "statement"))))
>> 
>> Looks good. I’ve always used rx, it has the additional benefit of being macro expanded at compile time.
>> 
>> Also, I finally added support for ‘and’, ‘named’ and ‘anonymous’. I haven’t tested it yet (sorry).
> 
> Thanks!  Does it make sense also to add predicates to define
> whether the node names should be matched completely?
> Then maybe add two separate predicates for strict and lax matching,
> e.g. for the same thing as above:
> 
>  (sentence
>   (or (strict ,(rx (or "declaration"
>                        "enum_specifier"
>                        "union_specifier")))
>       (lax ,(rx (or "preproc"
>                     "statement")))))

Hmm, I don’t know. Seems messy to implement, and we already have a perfectly good solution: rx with bos or eos.
Yuan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#74963; Package emacs. (Thu, 30 Jan 2025 07:23:01 GMT) Full text and rfc822 format available.

Message #56 received at 74963 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> linkov.net>
To: Yuan Fu <casouri <at> gmail.com>
Cc: 74963 <at> debbugs.gnu.org
Subject: Re: bug#74963: Ambiguous treesit named and anonymous nodes in
 ruby-ts-mode
Date: Thu, 30 Jan 2025 09:15:58 +0200

>>> Also, I finally added support for ‘and’, ‘named’ and ‘anonymous’. I haven’t tested it yet (sorry).
>> 
>> Thanks!  Does it make sense also to add predicates to define
>> whether the node names should be matched completely?
>> Then maybe add two separate predicates for strict and lax matching,
>> e.g. for the same thing as above:
>> 
>>  (sentence
>>   (or (strict ,(rx (or "declaration"
>>                        "enum_specifier"
>>                        "union_specifier")))
>>       (lax ,(rx (or "preproc"
>>                     "statement")))))
>
> Hmm, I don’t know. Seems messy to implement, and we already have
> a perfectly good solution: rx with bos or eos.

Ok, will use rx with bos and eos.

I have another question: in c-ts-mode forward-sentence
was intended to stop after a semicolon.  So I tried
to modify the sentence thing to match semicolons
inside the for_statement, e.g.:

  for (i = 0; i < 2; ++i)
=>
  (for_statement for (
   condition: (assignment_expression left: (identifier) operator: = right: (number_literal))
   ;
   body: (binary_expression left: (identifier) operator: < right: (number_literal))
   ;
   (update_expression operator: ++ argument: (identifier))
   )

where semicolons are after field names "condition" and "body".

But I can't find a way to specify such field names for the node "for_statement".

Shouldn't treesit-thing-settings allow specifying field names as well?
Or is this achievable only by writing a lambda?

This bug report was last modified 191 days ago.
