#76731 - C-style comment regexp example in (info "(elisp)Rx Notation") is not correct

GNU bug report logs - #76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct

Package: emacs;

Reported by: "Yue Yi" <include_yy <at> qq.com>

Date: Tue, 4 Mar 2025 04:09:02 UTC

Severity: wishlist

Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Bug is archived. No further changes may be made.

Message #16 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: "Yue Yi" <include_yy <at> qq.com> To: "Mattias Engdeg�0�2rd" <mattias.engdegard <at> gmail.com>, "Stefan Kangas" <stefankangas <at> gmail.com> Cc: 76731 <76731 <at> debbugs.gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx Notation") is not correct Date: Fri, 16 May 2025 23:12:05 +0800

[Message part 1 (text/plain, inline)]

Mattias Engdeg�0�2rd Fri, 16 May 2025 15:46:40 +0200 writes > And my sincerest thanks to you for noticing this. Everyone who writes technical > texts knows how valuable people who actually work through examples are. I encountered this problem while trying to write a simple CSS parser using PEG. My first reaction was to look for regex examples in the manual, and I��m glad I found it :p > Your proposed solution, > > > (rx "/*" > > (* (| (not "*") > > (: (1+ "*") (not (or "*" "/"))))) > > (1+ "*") "/") > > appears correct but Emacs's NFA engine will match a final run of stars twice. > Consider the text > > /*************************************/ > > The regexp will match all stars, encounter the final slash, backtrack and match > the stars again before matching the slash. A bit inelegant perhaps. More > seriously, the stack usage is such that it can't parse a 1 MB comment without > running out of stack space (on my machine). To be fair, he original regexp had > the same problem. > > (And yes, non-greedy operators can be used for a simple solution but as the > footnote in the text says that's not the point here.) > > > 2. Better alternative? > > (rx "/*" > (* (not "*")) > (+ "*") > (* (not (in "*/")) > (* (not "*")) > (+ "*")) > "/") > > is slightly more complicated but doesn't backtrack as much. > It still produces unnecessary backtrack points between runs of stars; perhaps > the analysis to eliminate them is too hard for the compiler. I'm not an expert in regular expressions, but it seems that cases like C block comments are hard to handle without introducing backtracking. Although it's unlikely to encounter such enormous and specific comments in practice, the implementation I provided does introduce some implicit performance issues. During my search, I think I might have seen the improved version you provided on Stack Overflow or somewhere else. It��s definitely better, but also harder to understand. > 2. But is it a good example? > > The purpose was never parsing C comments but to provide an example of how rx > can help. Can we find something simpler? > Here is regexp for a simple quoted string: > > (rx ?\" > (* (or (not (or ?\\ ?\")) > (: ?\\ (or ?\\ ?\")))) > ?\") > > Would that be a better example? The backslashes obscure things a bit. > > Right now I'm leaning towards using the proposed fix. Yes, I think this example is great, simpler examples are always better. C comments are indeed a case where it��s hard to come up with a solution right away. It would be better to present it as a demonstration of how to think about and implement a regular expression, rather than just a simple regex example. But I��d still like to keep it in some places, since it can actually be quite useful sometimes. [ Of course, this isn��t really a problem nowadays �� LLMs can quickly lead us to solutions for some classic problems (though they occasionally introduce some unexpected issues). ] Regards

[Message part 2 (text/html, inline)]

This bug report was last modified 52 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #76731 C-style comment regexp example in (info "(elisp)Rx Notation") is not correct

GNU bug report logs - #76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct