Package: emacs
Reported by: Dominik Honnef <dominik <at> honnef.co>
Date: Sun, 22 Oct 2023 06:32:01 UTC
Severity: normal
Found in version 30.0.50
Done: Yuan Fu <casouri <at> gmail.com>
Bug is archived. No further changes may be made.
Message #25 received at 66674-done <at> debbugs.gnu.org:
From: Yuan Fu <casouri <at> gmail.com>
To: Dominik Honnef <dominik <at> honnef.co>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 66674-done <at> debbugs.gnu.org
Subject: Re: bug#66674: 30.0.50; Upstream tree-sitter and treesit disagree about fields
Date: Sun, 10 Dec 2023 17:02:48 -0800

On 12/10/23 6:28 AM, Dominik Honnef wrote:
> Yuan Fu <casouri <at> gmail.com> writes:
>
>> On 11/25/23 2:03 AM, Eli Zaretskii wrote:
>>> Ping! Ping! Yuan, please chime in.
>>>
>>>> Cc: 66674 <at> debbugs.gnu.org, dominik <at> honnef.co
>>>> Date: Sun, 19 Nov 2023 12:08:08 +0200
>>>> From: Eli Zaretskii <eliz <at> gnu.org>
>>>>
>>>> Ping! Yuan, any comments?
>>>>
>>>>> Cc: 66674 <at> debbugs.gnu.org
>>>>> Date: Wed, 25 Oct 2023 16:03:10 +0300
>>>>> From: Eli Zaretskii <eliz <at> gnu.org>
>>>>>
>>>>>> From: Dominik Honnef <dominik <at> honnef.co>
>>>>>> Date: Sat, 21 Oct 2023 22:36:30 +0200
>>>>>>
>>>>>> Using tree-sitter's CLI as well as the publicly hosted playground
>>>>>> produce different parse trees than treesit in Emacs. Specifically, the
>>>>>> assignment of nodes to named fields differs.
>>>>>>
>>>>>> Given the following C source:
>>>>>>
>>>>>> void main() {
>>>>>>   int x = // foo
>>>>>>     1+
>>>>>>     // comment
>>>>>>     2;
>>>>>> }
>>>>>>
>>>>>> treesit-explore-mode displays the following tree:
>>>>>>
>>>>>> (translation_unit
>>>>>>  (function_definition type: (primitive_type)
>>>>>>   declarator:
>>>>>>    (function_declarator declarator: (identifier)
>>>>>>     parameters: (parameter_list ( )))
>>>>>>   body:
>>>>>>    (compound_statement {
>>>>>>     (declaration type: (primitive_type)
>>>>>>      declarator:
>>>>>>       (init_declarator declarator: (identifier) = value: (comment)
>>>>>>        (binary_expression left: (number_literal) operator: + right: (comment) (number_literal)))
>>>>>>      ;)
>>>>>>     })))
>>>>>>
>>>>>> Note how in the init_declarator node, the 'value' field is a comment
>>>>>> node, and similarly for the 'right' field in the binary_expression node.
>>>>>>
>>>>>> Running 'tree-sitter parse file.c', on the other hand, produces the
>>>>>> following tree:
>>>>>>
>>>>>> (translation_unit [0, 0] - [6, 0]
>>>>>>   (function_definition [0, 0] - [5, 1]
>>>>>>     type: (primitive_type [0, 0] - [0, 4])
>>>>>>     declarator: (function_declarator [0, 5] - [0, 11]
>>>>>>       declarator: (identifier [0, 5] - [0, 9])
>>>>>>       parameters: (parameter_list [0, 9] - [0, 11]))
>>>>>>     body: (compound_statement [0, 12] - [5, 1]
>>>>>>       (declaration [1, 2] - [4, 6]
>>>>>>         type: (primitive_type [1, 2] - [1, 5])
>>>>>>         declarator: (init_declarator [1, 6] - [4, 5]
>>>>>>           declarator: (identifier [1, 6] - [1, 7])
>>>>>>           (comment [1, 10] - [1, 16])
>>>>>>           value: (binary_expression [2, 4] - [4, 5]
>>>>>>             left: (number_literal [2, 4] - [2, 5])
>>>>>>             (comment [3, 4] - [3, 14])
>>>>>>             right: (number_literal [4, 4] - [4, 5])))))))
>>>>>>
>>>>>> Here, the two comment nodes appear as unnamed nodes. IMHO the second
>>>>>> tree is a more useful one, as the named fields contain the semantically
>>>>>> important subtrees (e.g. a binary expression is made up of a left and
>>>>>> right subtree, not a left subtree, a right comment, and then some
>>>>>> unnamed subtree.)
>>>>>>
>>>>>> Emacs's tree makes writing queries less convenient, as instead of being
>>>>>> able to refer to well-defined names, one has to rely on child indices to
>>>>>> account for comments.
>>>>>>
>>>>>>
>>>>>> Further mismatch arises from repeated fields and separators.
>>>>>>
>>>>>> Consider the following Go source:
>>>>>>
>>>>>> package pkg
>>>>>>
>>>>>> var a, b, c = 1, 2, 3
>>>>>>
>>>>>> treesit-explore-mode displays the following tree:
>>>>>>
>>>>>> (source_file
>>>>>>  (package_clause package (package_identifier))
>>>>>>  \n
>>>>>>  (var_declaration var
>>>>>>   (var_spec name: (identifier) name: , (identifier) value: , (identifier) =
>>>>>>    (expression_list (int_literal) , (int_literal) , (int_literal))))
>>>>>>  \n)
>>>>>>
>>>>>> Here, the var_spec node has two fields named 'name' even though the
>>>>>> source specifies three names. Furthermore, The second 'name', as well as
>>>>>> 'value' are set to the ',' separator between identifiers. Two of the three
>>>>>> identifiers aren't named.
>>>>>>
>>>>>> 'tree-sitter parse file.go', on the other hand, produces this more
>>>>>> accurate tree:
>>>>>>
>>>>>> (source_file [0, 0] - [2, 21]
>>>>>>   (package_clause [0, 0] - [0, 11]
>>>>>>     (package_identifier [0, 8] - [0, 11]))
>>>>>>   (var_declaration [2, 0] - [2, 21]
>>>>>>     (var_spec [2, 4] - [2, 21]
>>>>>>       name: (identifier [2, 4] - [2, 5])
>>>>>>       name: (identifier [2, 7] - [2, 8])
>>>>>>       name: (identifier [2, 10] - [2, 11])
>>>>>>       value: (expression_list [2, 14] - [2, 21]
>>>>>>         (int_literal [2, 14] - [2, 15])
>>>>>>         (int_literal [2, 17] - [2, 18])
>>>>>>         (int_literal [2, 20] - [2, 21])))))
>>>>>>
>>>>>> This reproduces with 29.1 as well as 30.0.50.
>>>>> Yuan, any comments or suggestions?
>> Sorry sorry sorry, another missed report. I think this is a bug in
>> treesit-explore-mode, I'll work on fixing it!
>>
>> Yuan
> I don't think that's the case, at least not exclusively. I used
> treesit-explore-mode to debug patterns that matched in the playground
> but not in Emacs. The matching behavior seemed pretty in line with what
> treesit-explore-mode reported.

I do find that treesit-node-field-name is returning wrong field names,
which is why in the first example you see the "value" field name given to
the comment node, rather than the binary_expression behind it. In the
actual parse tree, "value" belongs to binary_expression.

With the fix I just pushed to emacs-29, the explorer parse tree for the
first example becomes

(translation_unit
 (function_definition type: (primitive_type)
  declarator:
   (function_declarator declarator: (identifier)
    parameters: (parameter_list ( )))
  body:
   (compound_statement {
    (declaration type: (primitive_type)
     declarator:
      (init_declarator declarator: (identifier) = (comment)
       value:
        (binary_expression left: (number_literal) operator: + operator: (comment) right: (number_literal)))
     ;)
    })))

which should match the playground. If you can find the pattern that
matches in the playground but doesn't in Emacs, do please post it and I
can see if there's anything wrong.

Yuan
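
As an illustration of why the field labels matter for lookups by field name, here is a toy Python sketch. It is not Emacs or tree-sitter code: it only hand-copies the children of the init_declarator node from the two explorer trees in this thread (the one in the original report and the one after the fix) as (field, type) pairs. The names `BEFORE_FIX`, `AFTER_FIX`, and `child_by_field` are made up for this example.

```python
# Children of the init_declarator node, copied by hand from the two
# treesit-explore-mode trees quoted in this thread. Each entry is
# (field_name_or_None, node_type). Illustrative data only; nothing here
# calls into Emacs or tree-sitter.
BEFORE_FIX = [
    ("declarator", "identifier"),
    (None, "="),
    ("value", "comment"),            # field label landed on the comment
    (None, "binary_expression"),
]
AFTER_FIX = [
    ("declarator", "identifier"),
    (None, "="),
    (None, "comment"),               # comment carries no field label
    ("value", "binary_expression"),
]


def child_by_field(children, field):
    """Return the type of the first child labelled with `field`, or None."""
    for name, node_type in children:
        if name == field:
            return node_type
    return None


if __name__ == "__main__":
    print(child_by_field(BEFORE_FIX, "value"))  # comment
    print(child_by_field(AFTER_FIX, "value"))   # binary_expression
```

With the labels from the original report, asking for `value` lands on the comment; with the labels from the fixed explorer output, the same lookup reaches the binary_expression, which is what the tree-sitter CLI output in the report shows.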