GNU bug report logs -
#64017
Wrong conversion from Emacs to Tree-sitter S-expression syntax
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Mon, 12 Jun 2023 14:15:01 GMT)
Acknowledgement sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 12 Jun 2023 14:15:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
`treesit-pattern-expand` converts a query pattern into tree-sitter S-expression syntax, as a string. The conversion mostly translates certain keywords, but the main problem is that it prints strings in Emacs syntax, which differs from that of tree-sitter.
As a consequence, :match regexps cannot contain newlines:
(treesit-query-capture
 'java
 '(((identifier) @font-lock-constant-face
    (:match "hello\n" @font-lock-constant-face))))
signals a syntax error.
As far as I can tell the tree-sitter string syntax allows for the escape sequences:
\n = LF
\r = CR
\t = TAB
\0 = NUL (only a single 0 -- no octal escapes!)
\X = the character X itself
Unescaped newlines result in a syntax error as seen in the example above. NULs don't seem to go well either.
At the very least, the conversion should avoid literal newlines and NULs in the result (and probably CR and TAB). This cannot be done with a straight prin1-to-string.
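For illustration only, here is a minimal sketch of the kind of escaping described above, written in Lisp rather than C. The name `my-treesit-quote-string` is hypothetical and this is not the fix that was eventually committed. It assumes the escape rules listed above are the whole story, and never emits a literal newline, CR, TAB or NUL:

```elisp
;; Hypothetical helper, not part of Emacs.  Assumes tree-sitter query
;; strings accept only the escapes \n, \r, \t, \0 and \X (X itself).
(defun my-treesit-quote-string (string)
  "Return STRING as a double-quoted tree-sitter query string literal."
  (concat "\""
          (mapconcat
           (lambda (c)
             (pcase c
               (?\n "\\n")                       ; LF  -> \n
               (?\r "\\r")                       ; CR  -> \r
               (?\t "\\t")                       ; TAB -> \t
               (0 "\\0")                         ; NUL -> \0 (single 0)
               ((or ?\" ?\\) (string ?\\ c))     ; quote, backslash -> \X
               (_ (string c))))                  ; everything else as-is
           string "")
          "\""))

;; (my-treesit-quote-string "hello\n")  ; => "\"hello\\n\""
```

Unlike prin1-to-string, which leaves the newline in "hello\n" as a literal newline inside the quotes, this produces a backslash followed by the letter n.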
(By the way, why is the conversion written in C? Was Lisp too slow?)
Ideally we should not need to expose the tree-sitter s-exp query syntax at all. Surely Emacs s-exps should be preferable in every case?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Thu, 15 Jun 2023 10:46:01 GMT)
Message #8 received at 64017 <at> debbugs.gnu.org:
I also propose that we change the documentation to describe the (Elisp) sexp-based query syntax only, or at least first and foremost, since that is what all existing code uses and is more convenient. Currently the manual starts by describing the string syntax and only then the Elisp sexp syntax.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Thu, 15 Jun 2023 22:09:01 GMT)
Message #11 received at submit <at> debbugs.gnu.org:
Thanks for catching this.
> On Jun 12, 2023, at 7:14 AM, Mattias Engdegård <mattias.engdegard <at> gmail.com> wrote:
>
> `treesit-pattern-expand` converts a query pattern into tree-sitter S-expression syntax, as a string. The conversion mainly converts certain keywords but the main problem is that it prints strings in Emacs syntax which differs from that of tree-sitter.
>
> As a consequence, :match regexps cannot contain newlines:
>
> (treesit-query-capture
>  'java
>  '(((identifier) @font-lock-constant-face
>     (:match "hello\n" @font-lock-constant-face))))
>
> signals a syntax error.
>
> As far as I can tell the tree-sitter string syntax allows for the escape sequences:
>
> \n = LF
> \r = CR
> \t = TAB
> \0 = NUL (only a single 0 -- no octal escapes!)
> \X = the character X itself
>
> Unescaped newlines result in a syntax error as seen in the example above. NULs don't seem to go well either.
>
> At the very least, the conversion should avoid literal newlines and NULs in the result (and probably CR and TAB). This cannot be done with a straight prin1-to-string.
>
> (By the way, why is the conversion written in C? Was Lisp too slow?)
Because I wasn't sure if it’s ok for C functions to rely on Lisp functions, plus the function is simple enough. Right now if one doesn’t load treesit.el, all the C functions work fine.
>
> Ideally we should not need to expose the tree-sitter s-exp query syntax at all. Surely Emacs s-exps should be preferable in every case?
>
It shouldn’t hurt to expose the tree-sitter sexp. Other editors mainly use the string syntax.
Yuan
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Thu, 15 Jun 2023 22:14:01 GMT)
Message #14 received at 64017 <at> debbugs.gnu.org:
> On Jun 15, 2023, at 3:45 AM, Mattias Engdegård <mattias.engdegard <at> gmail.com> wrote:
>
> I also propose that we change the documentation to describe the (Elisp) sexp-based query syntax only, or at least first and foremost, since that is what all existing code uses and is more convenient. Currently the manual starts by describing the string syntax and only then the Elisp sexp syntax.
>
The difference between tree-sitter syntax and Elisp sexp syntax is pretty small (anchor, predicates), so the text describing the tree-sitter syntax is basically describing Elisp sexp syntax. With that said, if someone makes it describe Elisp sexp syntax first, I wouldn’t mind.
Yuan
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Fri, 16 Jun 2023 11:26:02 GMT)
Message #17 received at submit <at> debbugs.gnu.org:
On 16 June 2023 at 00:08, Yuan Fu <casouri <at> gmail.com> wrote:
>> (By the way, why is the conversion written in C? Was Lisp too slow?)
>
> Because I wasn't sure if it’s ok for C functions to rely on Lisp functions, plus the function is simple enough. Right now if one doesn’t load treesit.el, all the C functions work fine.
All right, let's keep it there for now.
I fixed the string conversion bug in 8657afac77.
>> Ideally we should not need to expose the tree-sitter s-exp query syntax at all. Surely Emacs s-exps should be preferable in every case?
> It shouldn’t hurt to expose the tree-sitter sexp. Other editors mainly use the string syntax.
Most of them probably aren't written in Lisp. But fine, let's keep it as an alternative syntax.
> The difference between tree-sitter syntax and Elisp sexp syntax is pretty small (anchor, predicates), so the text describing the tree-sitter syntax is basically describing Elisp sexp syntax.
Yes, so it seemed to me, but reading the source code (lib/src/query.c) suggests that what I thought were symbols -- *, +, ?, @thing, #thing -- are actually special postfix and prefix operators. (Ironically, there doesn't seem to be a grammar for this language anywhere, or am I mistaken?)
Thus a structurally correct Lispish translation of
(teet "toot"* (#equal "fie" @fum))
should arguably be something like
(teet (* "toot") ((# equal) "fie" (@ fum)))
rather than the current
(teet "toot" :* (:equal "fie" @fum))
but I'm not demanding that it all be changed at this stage.
> With that said if someone makes it describe Elisp sexp syntax first, I wouldn’t mind.
I'll have a look. Wouldn't it be reasonable to use the Elisp syntax, briefly state how it corresponds to the 'native' syntax, and refer to the official tree-sitter documentation for details about the latter?
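As a rough illustration of that correspondence, the keyword forms and their 'native' spellings (as of emacs-29) can be written down as a small table. This is only a sketch: the names `my-treesit-keyword-alist` and `my-treesit-keyword-to-string` are hypothetical, and the real translation is done in C by `treesit-pattern-expand`.

```elisp
;; Hypothetical table, not part of Emacs: Elisp keyword forms and the
;; tree-sitter string syntax they stand for (emacs-29 spellings).
(defconst my-treesit-keyword-alist
  '((:anchor . ".")
    (:?      . "?")
    (:*      . "*")
    (:+      . "+")
    (:equal  . "#equal")
    (:match  . "#match")
    (:pred   . "#pred"))
  "Alist mapping Elisp query keywords to tree-sitter string syntax.")

(defun my-treesit-keyword-to-string (keyword)
  "Return the tree-sitter spelling of KEYWORD, or nil if it has none."
  (alist-get keyword my-treesit-keyword-alist))
```

Everything else (node types, field names, captures, grouping and alternation) is spelled the same way in both syntaxes, apart from the string escaping discussed earlier in this report.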
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Fri, 16 Jun 2023 17:04:01 GMT)
Message #20 received at submit <at> debbugs.gnu.org:
Here is a modification of the treesit manual to teach s-expressions first.
It's mostly a matter of straightforward substitution.
[treesit-doc-sexp-patterns.diff (application/octet-stream, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Fri, 16 Jun 2023 17:34:02 GMT)
Message #23 received at 64017 <at> debbugs.gnu.org:
Mattias Engdegård [2023-06-16 19:02 +0200] wrote:
> Here is a modification of the treesit manual to teach s-expressions first.
> It's mostly a matter of straightforward substitution.
Generally LGTM, thanks.
> diff --git a/doc/lispref/parsing.texi b/doc/lispref/parsing.texi
> index b0824faaaa2..bd81ee3c535 100644
> --- a/doc/lispref/parsing.texi
> +++ b/doc/lispref/parsing.texi
> @@ -1132,9 +1132,9 @@ Pattern Matching
>
> @defun treesit-query-capture node query &optional beg end node-only
> This function matches patterns in @var{query} within @var{node}.
> -The argument @var{query} can be either a string, a s-expression, or a
> -compiled query object. For now, we focus on the string syntax;
> -s-expression syntax and compiled query are described at the end of the
> +The argument @var{query} can be either a s-expression, a string, or a
> +compiled query object. For now, we focus on the s-expression syntax;
> +string syntax and compiled query are described at the end of the
> section.
I recently tweaked some of these docs in emacs-29, so you may want to
merge into master before respinning your patch.
> @@ -1341,22 +1341,23 @@ Pattern Matching
> @noindent
> tree-sitter only matches arrays where the first element equals to the
> last element. To attach a predicate to a pattern, we need to group
> -them together. A predicate always starts with a @samp{#}. Currently
> -there are three predicates, @code{#equal}, @code{#match}, and
> -@code{#pred}.
> +them together. Currently
> +there are three predicates, @code{:equal}, @code{:match}, and
> +@code{:pred}.
Do you intend to refill the paragraph before merging?
> @itemize
> @item
> -Anchor @samp{.} is written as @code{:anchor}.
> +Anchor @code{:anchor}. is written as @samp{.}
                        ^
Unladen European full stop migrated from eol.
--
Basil
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Sat, 17 Jun 2023 10:49:01 GMT)
Message #26 received at 64017 <at> debbugs.gnu.org:
On 16 June 2023 at 19:33, Basil Contovounesios <contovob <at> tcd.ie> wrote:
> I recently tweaked some of these docs in emacs-29, so you may want to
> merge into master before respinning your patch.
Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
Eli, would that be acceptable?
> Do you intend to refill the paragraph before merging?
I probably should (although it doesn't affect the output).
>> -Anchor @samp{.} is written as @code{:anchor}.
>> +Anchor @code{:anchor}. is written as @samp{.}
>                        ^
> Unladen European full stop migrated from eol.
So it tried to get away, that little rascal! Can't blame it for trying.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Sat, 17 Jun 2023 12:58:01 GMT)
Message #29 received at 64017 <at> debbugs.gnu.org:
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 17 Jun 2023 12:47:51 +0200
> Cc: Yuan Fu <casouri <at> gmail.com>,
> 64017 <at> debbugs.gnu.org,
> Eli Zaretskii <eliz <at> gnu.org>
>
> On 16 June 2023 at 19:33, Basil Contovounesios <contovob <at> tcd.ie> wrote:
>
> > I recently tweaked some of these docs in emacs-29, so you may want to
> > merge into master before respinning your patch.
>
> Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
> Eli, would that be acceptable?
If Yuan doesn't mind, yes. But I'd like to hear from Yuan that he is
okay with these changes.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Sat, 17 Jun 2023 13:31:01 GMT)
Message #32 received at 64017 <at> debbugs.gnu.org:
On 17 June 2023 at 14:57, Eli Zaretskii <eliz <at> gnu.org> wrote:
>> Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
>> Eli, would that be acceptable?
>
> If Yuan doesn't mind, yes. But I'd like to hear from Yuan that he is
> okay with these changes.
Attached are the changes rebased to emacs-29 (fixing mistakes found by Basil).
[treesit-doc-sexp-patterns-em29.diff (application/octet-stream, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Sat, 17 Jun 2023 22:56:02 GMT)
Message #35 received at 64017 <at> debbugs.gnu.org:
> On Jun 17, 2023, at 6:30 AM, Mattias Engdegård <mattias.engdegard <at> gmail.com> wrote:
>
> On 17 June 2023 at 14:57, Eli Zaretskii <eliz <at> gnu.org> wrote:
>
>>> Will do, thank you. Since this is only about documentation, perhaps it could be done in emacs-29?
>>> Eli, would that be acceptable?
>>
>> If Yuan doesn't mind, yes. But I'd like to hear from Yuan that he is
>> okay with these changes.
>
> Attached are the changes rebased to emacs-29 (fixing mistakes found by Basil).
>
> <treesit-doc-sexp-patterns-em29.diff>
LGTM!
Yuan
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#64017; Package emacs. (Sat, 17 Jun 2023 23:04:01 GMT)
Message #38 received at submit <at> debbugs.gnu.org:
>
> Yes, so it seemed to me but reading the source code (lib/src/query.c) seems to indicate that what I thought were symbols -- *, +, ?, @thing, #thing -- appear to be special postfix and prefix operators. (Ironically, there doesn't seem to be a grammar for this language anywhere, or am I mistaken?)
>
> Thus a structurally correct Lispish translation of
>
> (teet "toot"* (#equal "fie" @fum))
>
> should arguably be something like
>
> (teet (* "toot") ((# equal) "fie" (@ fum)))
>
> rather than the current
>
> (teet "toot" :* (:equal "fie" @fum))
>
> but I'm not demanding that it all be changed at this stage.
IMHO the query syntax is already pretty far away from the “proper sexp” that we expect, so changing these little things doesn’t have much benefit. For example, the field names and trailing capture names are not conventional; are we going to change them to be more sexpy too?
In a proper sexp they would have been wrapped too, like
(field-name: node) rather than field-name: node
(@fn node) rather than node @fn
Not to mention using colon and @ to distinguish field-names and capture names from nodes—not very conventional either.
Also, a more conventional sexp syntax would be much more verbose than the current one, and arguably harder to translate to the tree-sitter string syntax, which is ultimately what we feed to tree-sitter functions.
Yuan
Reply sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>: You have taken responsibility. (Sun, 18 Jun 2023 08:48:02 GMT)
Notification sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>: bug acknowledged by developer. (Sun, 18 Jun 2023 08:48:02 GMT)
Message #43 received at 64017-done <at> debbugs.gnu.org:
On 18 June 2023 at 00:55, Yuan Fu <casouri <at> gmail.com> wrote:
> LGTM!
Thank you, these changes are now in emacs-29.
And we are done, closing the bug.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 16 Jul 2023 11:24:04 GMT)