GNU bug report logs - #64017
Wrong conversion from Emacs to Tree-sitter S-expression syntax


Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Mon, 12 Jun 2023 14:15:01 UTC

Severity: normal

Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 64017 in the body.
You can then email your comments to 64017 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Mon, 12 Jun 2023 14:15:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 12 Jun 2023 14:15:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Emacs Bug Report <bug-gnu-emacs <at> gnu.org>
Cc: Basil Contovounesios <contovob <at> tcd.ie>, Yuan Fu <casouri <at> gmail.com>
Subject: Wrong conversion from Emacs to Tree-sitter S-expression syntax
Date: Mon, 12 Jun 2023 16:14:01 +0200
`treesit-pattern-expand` converts a query pattern into tree-sitter S-expression syntax, as a string. The conversion mostly just translates certain keywords, but the main problem is that it prints strings in Emacs syntax, which differs from that of tree-sitter.

As a consequence, :match regexps cannot contain newlines:

(treesit-query-capture
 'java
 '(((identifier) @font-lock-constant-face
    (:match "hello\n" @font-lock-constant-face))))

signals a syntax error.

As far as I can tell the tree-sitter string syntax allows for the escape sequences:

\n = LF
\r = CR
\t = TAB
\0 = NUL  (only a single 0 -- no octal escapes!)
\X = the character X itself

Unescaped newlines result in a syntax error, as seen in the example above. NULs don't seem to go well either.

At the very least, the conversion should avoid literal newlines and NULs in the result (and probably CR and TAB). This cannot be done with a straight prin1-to-string.
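
To make the escaping requirement concrete, here is a minimal Emacs Lisp sketch. It is not the C change that later fixed this bug; the function name `my-treesit-quote-string` is hypothetical, and the sketch assumes the escape list given above (\n, \r, \t, \0, and backslash before any other character) is accurate.

```elisp
;; Hypothetical helper, for illustration only: quote STR as a
;; tree-sitter query string literal instead of using `prin1-to-string'.
(defun my-treesit-quote-string (str)
  "Return STR as a double-quoted tree-sitter query string.
Newline, CR, TAB and NUL are written as \\n, \\r, \\t and \\0;
double quotes and backslashes are preceded by a backslash."
  (concat "\""
          (replace-regexp-in-string
           ;; Characters that must not appear raw inside the literal.
           "[\n\r\t\0\"\\]"
           (lambda (m)
             (pcase (aref m 0)
               (?\n "\\n")
               (?\r "\\r")
               (?\t "\\t")
               (?\0 "\\0")
               ;; Double quote or backslash: escape the character itself.
               (c (string ?\\ c))))
           ;; FIXEDCASE and LITERAL are non-nil so the replacement is
           ;; inserted verbatim, without case conversion or \-expansion.
           str t t)
          "\""))
```

With this sketch, the string "hello\n" from the example above would be emitted with a two-character backslash-n sequence rather than a raw newline, which is the property the conversion needs.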

(By the way, why is the conversion written in C? Was Lisp too slow?)

Ideally we should not need to expose the tree-sitter s-exp query syntax at all. Surely Emacs s-exps should be preferable in every case?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Thu, 15 Jun 2023 10:46:01 GMT) Full text and rfc822 format available.

Message #8 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: 64017 <at> debbugs.gnu.org
Cc: Basil Contovounesios <contovob <at> tcd.ie>, Yuan Fu <casouri <at> gmail.com>
Subject: bug#64017: Wrong conversion from Emacs to Tree-sitter S-expression
 syntax
Date: Thu, 15 Jun 2023 12:45:23 +0200
I also propose that we change the documentation to describe the (Elisp) sexp-based query syntax only, or at least first and foremost, since that is what all existing code uses and is more convenient. Currently the manual starts by describing the string syntax and only then the Elisp sexp syntax.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Thu, 15 Jun 2023 22:09:01 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>,
 Bug Report Emacs <bug-gnu-emacs <at> gnu.org>
Subject: Re: Wrong conversion from Emacs to Tree-sitter S-expression syntax
Date: Thu, 15 Jun 2023 15:08:26 -0700
Thanks for catching this.

> On Jun 12, 2023, at 7:14 AM, Mattias Engdegård <mattias.engdegard <at> gmail.com> wrote:
> 
> `treesit-pattern-expand` converts a query pattern into tree-sitter S-expression syntax, as a string. The conversion mostly just translates certain keywords, but the main problem is that it prints strings in Emacs syntax, which differs from that of tree-sitter.
> 
> As a consequence, :match regexps cannot contain newlines:
> 
> (treesit-query-capture
> 'java
> '(((identifier) @font-lock-constant-face
>    (:match "hello\n" @font-lock-constant-face))))
> 
> signals a syntax error.
> 
> As far as I can tell the tree-sitter string syntax allows for the escape sequences:
> 
> \n = LF
> \r = CR
> \t = TAB
> \0 = NUL  (only a single 0 -- no octal escapes!)
> \X = the character X itself
> 
> Unescaped newlines result in a syntax error, as seen in the example above. NULs don't seem to go well either.
> 
> At the very least, the conversion should avoid literal newlines and NULs in the result (and probably CR and TAB). This cannot be done with a straight prin1-to-string.
> 
> (By the way, why is the conversion written in C? Was Lisp too slow?)

Because I wasn't sure if it’s ok for C functions to rely on Lisp functions, plus the function is simple enough. Right now if one doesn’t load treesit.el, all the C functions work fine.

> 
> Ideally we should not need to expose the tree-sitter s-exp query syntax at all. Surely Emacs s-exps should be preferable in every case?
> 

It shouldn’t hurt to expose the tree-sitter sexp. Other editors mainly use the string syntax.

Yuan



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Thu, 15 Jun 2023 22:14:01 GMT) Full text and rfc822 format available.

Message #14 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>, 64017 <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Thu, 15 Jun 2023 15:13:12 -0700

> On Jun 15, 2023, at 3:45 AM, Mattias Engdegård <mattias.engdegard <at> gmail.com> wrote:
> 
> I also propose that we change the documentation to describe the (Elisp) sexp-based query syntax only, or at least first and foremost, since that is what all existing code uses and is more convenient. Currently the manual starts by describing the string syntax and only then the Elisp sexp syntax.
> 

The difference between tree-sitter syntax and Elisp sexp syntax is pretty small (anchor, predicates), so the text describing the tree-sitter syntax is basically describing Elisp sexp syntax. That said, if someone makes it describe Elisp sexp syntax first, I wouldn’t mind.

Yuan



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Fri, 16 Jun 2023 11:26:02 GMT) Full text and rfc822 format available.

Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>,
 Bug Report Emacs <bug-gnu-emacs <at> gnu.org>
Subject: Re: Wrong conversion from Emacs to Tree-sitter S-expression syntax
Date: Fri, 16 Jun 2023 13:25:48 +0200
On 16 June 2023 at 00:08, Yuan Fu <casouri <at> gmail.com> wrote:

>> (By the way, why is the conversion written in C? Was Lisp too slow?)
> 
> Because I wasn't sure if it’s ok for C functions to rely on Lisp functions, plus the function is simple enough. Right now if one doesn’t load treesit.el, all the C functions work fine.

All right, let's keep it there for now.
I fixed the string conversion bug in 8657afac77.

>> Ideally we should not need to expose the tree-sitter s-exp query syntax at all. Surely Emacs s-exps should be preferable in every case?

> It shouldn’t hurt to expose the tree-sitter sexp. Other editors mainly use the string syntax.

Most of them probably aren't written in Lisp. But fine, let's keep it as an alternative syntax.

> The difference between tree-sitter syntax and Elisp sexp syntax is pretty small (anchor, predicates), so the text describing the tree-sitter syntax is basically describing Elisp sexp syntax.

Yes, so it seemed to me, but reading the source code (lib/src/query.c) suggests that what I thought were symbols -- *, +, ?, @thing, #thing -- are in fact special postfix and prefix operators. (Ironically, there doesn't seem to be a grammar for this language anywhere, or am I mistaken?)

Thus a structurally correct Lispish translation of

  (teet "toot"* (#equal "fie" @fum))

should arguably be something like

  (teet (* "toot") ((# equal) "fie" (@ fum)))

rather than the current

  (teet "toot" :* (:equal "fie" @fum))

but I'm not demanding that it all be changed at this stage.

> With that said if someone makes it describe Elisp sexp syntax first, I wouldn’t mind.

I'll have a look. Wouldn't it be reasonable to use the Elisp syntax, briefly state how it corresponds to the 'native' syntax, and refer to the official tree-sitter documentation for details about the latter?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Fri, 16 Jun 2023 17:04:01 GMT) Full text and rfc822 format available.

Message #20 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>,
 Bug Report Emacs <bug-gnu-emacs <at> gnu.org>
Subject: Re: Wrong conversion from Emacs to Tree-sitter S-expression syntax
Date: Fri, 16 Jun 2023 19:02:58 +0200
Here is a modification of the treesit manual to teach s-expressions first.
It's mostly a matter of straightforward substitution.

[treesit-doc-sexp-patterns.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Fri, 16 Jun 2023 17:34:02 GMT) Full text and rfc822 format available.

Message #23 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Basil Contovounesios <contovob <at> tcd.ie>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Yuan Fu <casouri <at> gmail.com>, 64017 <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Fri, 16 Jun 2023 18:33:28 +0100
Mattias Engdegård [2023-06-16 19:02 +0200] wrote:

> Here is a modification of the treesit manual to teach s-expressions first.
> It's mostly a matter of straightforward substitution.

Generally LGTM, thanks.

> diff --git a/doc/lispref/parsing.texi b/doc/lispref/parsing.texi
> index b0824faaaa2..bd81ee3c535 100644
> --- a/doc/lispref/parsing.texi
> +++ b/doc/lispref/parsing.texi
> @@ -1132,9 +1132,9 @@ Pattern Matching
>  
>  @defun treesit-query-capture node query &optional beg end node-only
>  This function matches patterns in @var{query} within @var{node}.
> -The argument @var{query} can be either a string, a s-expression, or a
> -compiled query object.  For now, we focus on the string syntax;
> -s-expression syntax and compiled query are described at the end of the
> +The argument @var{query} can be either a s-expression, a string, or a
> +compiled query object.  For now, we focus on the s-expression syntax;
> +string syntax and compiled query are described at the end of the
>  section.

I recently tweaked some of these docs in emacs-29, so you may want to
merge into master before respinning your patch.

> @@ -1341,22 +1341,23 @@ Pattern Matching
>  @noindent
>  tree-sitter only matches arrays where the first element equals to the
>  last element.  To attach a predicate to a pattern, we need to group
> -them together.  A predicate always starts with a @samp{#}.  Currently
> -there are three predicates, @code{#equal}, @code{#match}, and
> -@code{#pred}.
> +them together.  Currently
> +there are three predicates, @code{:equal}, @code{:match}, and
> +@code{:pred}.

Do you intend to refill the paragraph before merging?

>  @itemize
>  @item
> -Anchor @samp{.} is written as @code{:anchor}.
> +Anchor @code{:anchor}. is written as @samp{.}
                        ^
Unladen European full stop migrated from eol.

-- 
Basil




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Sat, 17 Jun 2023 10:49:01 GMT) Full text and rfc822 format available.

Message #26 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Basil Contovounesios <contovob <at> tcd.ie>
Cc: Yuan Fu <casouri <at> gmail.com>, Eli Zaretskii <eliz <at> gnu.org>,
 64017 <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Sat, 17 Jun 2023 12:47:51 +0200
On 16 June 2023 at 19:33, Basil Contovounesios <contovob <at> tcd.ie> wrote:

> I recently tweaked some of these docs in emacs-29, so you may want to
> merge into master before respinning your patch.

Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
Eli, would that be acceptable?

> Do you intend to refill the paragraph before merging?

I probably should (although it doesn't affect the output).

>> -Anchor @samp{.} is written as @code{:anchor}.
>> +Anchor @code{:anchor}. is written as @samp{.}
>                        ^
> Unladen European full stop migrated from eol.

So it tried to get away, that little rascal! Can't blame it for trying.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Sat, 17 Jun 2023 12:58:01 GMT) Full text and rfc822 format available.

Message #29 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: contovob <at> tcd.ie, casouri <at> gmail.com, 64017 <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Sat, 17 Jun 2023 15:57:06 +0300
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 17 Jun 2023 12:47:51 +0200
> Cc: Yuan Fu <casouri <at> gmail.com>,
>  64017 <at> debbugs.gnu.org,
>  Eli Zaretskii <eliz <at> gnu.org>
> 
> On 16 June 2023 at 19:33, Basil Contovounesios <contovob <at> tcd.ie> wrote:
> 
> > I recently tweaked some of these docs in emacs-29, so you may want to
> > merge into master before respinning your patch.
> 
> Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
> Eli, would that be acceptable?

If Yuan doesn't mind, yes.  But I'd like to hear from Yuan that he is
okay with these changes.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Sat, 17 Jun 2023 13:31:01 GMT) Full text and rfc822 format available.

Message #32 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: contovob <at> tcd.ie, casouri <at> gmail.com, 64017 <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Sat, 17 Jun 2023 15:30:04 +0200
On 17 June 2023 at 14:57, Eli Zaretskii <eliz <at> gnu.org> wrote:

>> Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
>> Eli, would that be acceptable?
> 
> If Yuan doesn't mind, yes.  But I'd like to hear from Yuan that he is
> okay with these changes.

Attached are the changes rebased to emacs-29 (fixing mistakes found by Basil).

[treesit-doc-sexp-patterns-em29.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Sat, 17 Jun 2023 22:56:02 GMT) Full text and rfc822 format available.

Message #35 received at 64017 <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>, Eli Zaretskii <eliz <at> gnu.org>,
 64017 <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Sat, 17 Jun 2023 15:55:25 -0700

> On Jun 17, 2023, at 6:30 AM, Mattias Engdegård <mattias.engdegard <at> gmail.com> wrote:
> 
> On 17 June 2023 at 14:57, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
>>> Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
>>> Eli, would that be acceptable?
>> 
>> If Yuan doesn't mind, yes.  But I'd like to hear from Yuan that he is
>> okay with these changes.
> 
> Attached are the changes rebased to emacs-29 (fixing mistakes found by Basil).
> 
> <treesit-doc-sexp-patterns-em29.diff>

LGTM!

Yuan



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#64017; Package emacs. (Sat, 17 Jun 2023 23:04:01 GMT) Full text and rfc822 format available.

Message #38 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Yuan Fu <casouri <at> gmail.com>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>,
 Bug Report Emacs <bug-gnu-emacs <at> gnu.org>
Subject: Re: Wrong conversion from Emacs to Tree-sitter S-expression syntax
Date: Sat, 17 Jun 2023 16:02:58 -0700
> 
> Yes, so it seemed to me, but reading the source code (lib/src/query.c) suggests that what I thought were symbols -- *, +, ?, @thing, #thing -- are in fact special postfix and prefix operators. (Ironically, there doesn't seem to be a grammar for this language anywhere, or am I mistaken?)
> 
> Thus a structurally correct Lispish translation of
> 
>  (teet "toot"* (#equal "fie" @fum))
> 
> should arguably be something like
> 
>  (teet (* "toot") ((# equal) "fie" (@ fum)))
> 
> rather than the current
> 
>  (teet "toot" :* (:equal "fie" @fum))
> 
> but I'm not demanding that it all be changed at this stage.

IMHO the query syntax is already pretty far from the “proper sexp” that we would expect, so changing these little things doesn’t have much benefit. For example, the field names and trailing capture names are not conventional; are we going to change them to be more sexpy too?

In a proper sexp they would have been wrapped too, like

(field-name: node) rather than field-name: node
(@fn node) rather than node @fn

Not to mention using colon and @ to distinguish field-names and capture names from nodes—not very conventional either.

Also, a more conventional sexp syntax would be much more verbose than the current one, and arguably harder to translate to the tree-sitter string syntax, which is ultimately what we feed to tree-sitter functions.

Yuan



Reply sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
You have taken responsibility. (Sun, 18 Jun 2023 08:48:02 GMT) Full text and rfc822 format available.

Notification sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
bug acknowledged by developer. (Sun, 18 Jun 2023 08:48:02 GMT) Full text and rfc822 format available.

Message #43 received at 64017-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Yuan Fu <casouri <at> gmail.com>
Cc: Basil Contovounesios <contovob <at> tcd.ie>, Eli Zaretskii <eliz <at> gnu.org>,
 64017-done <at> debbugs.gnu.org
Subject: Re: bug#64017: Wrong conversion from Emacs to Tree-sitter
 S-expression syntax
Date: Sun, 18 Jun 2023 10:47:01 +0200
On 18 June 2023 at 00:55, Yuan Fu <casouri <at> gmail.com> wrote:

> LGTM!

Thank you, these changes are now in emacs-29.

And we are done, closing the bug.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 16 Jul 2023 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 32 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.