GNU bug report logs -
#76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct
Previous Next
Reported by: "Yue Yi" <include_yy <at> qq.com>
Date: Tue, 4 Mar 2025 04:09:02 UTC
Severity: wishlist
Done: Mattias Engdeg氓rd <mattias.engdegard <at> gmail.com>
Bug is archived. No further changes may be made.
Full log
Message #16 received at 76731 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Mattias Engdegård Fri, 16 May 2025 15:46:40 +0200 writes > And my sincerest thanks to you for noticing this. Everyone who writes technical > texts knows how valuable people who actually work through examples are. I encountered this problem while trying to write a simple CSS parser using PEG. My first reaction was to look for regex examples in the manual, and I’m glad I found it :p > Your proposed solution, > > > (rx "/*" > > (* (| (not "*") > > (: (1+ "*") (not (or "*" "/"))))) > > (1+ "*") "/") > > appears correct but Emacs's NFA engine will match a final run of stars twice. > Consider the text > > /*************************************/ > > The regexp will match all stars, encounter the final slash, backtrack and match > the stars again before matching the slash. A bit inelegant perhaps. More > seriously, the stack usage is such that it can't parse a 1 MB comment without > running out of stack space (on my machine). To be fair, he original regexp had > the same problem. > > (And yes, non-greedy operators can be used for a simple solution but as the > footnote in the text says that's not the point here.) > > > 2. Better alternative? > > (rx "/*" > (* (not "*")) > (+ "*") > (* (not (in "*/")) > (* (not "*")) > (+ "*")) > "/") > > is slightly more complicated but doesn't backtrack as much. > It still produces unnecessary backtrack points between runs of stars; perhaps > the analysis to eliminate them is too hard for the compiler. I'm not an expert in regular expressions, but it seems that cases like C block comments are hard to handle without introducing backtracking. Although it's unlikely to encounter such enormous and specific comments in practice, the implementation I provided does introduce some implicit performance issues. During my search, I think I might have seen the improved version you provided on Stack Overflow or somewhere else. It’s definitely better, but also harder to understand. > 2. But is it a good example? > > The purpose was never parsing C comments but to provide an example of how rx > can help. Can we find something simpler? > Here is regexp for a simple quoted string: > > (rx ?\" > (* (or (not (or ?\\ ?\")) > (: ?\\ (or ?\\ ?\")))) > ?\") > > Would that be a better example? The backslashes obscure things a bit. > > Right now I'm leaning towards using the proposed fix. Yes, I think this example is great, simpler examples are always better. C comments are indeed a case where it’s hard to come up with a solution right away. It would be better to present it as a demonstration of how to think about and implement a regular expression, rather than just a simple regex example. But I’d still like to keep it in some places, since it can actually be quite useful sometimes. [ Of course, this isn’t really a problem nowadays — LLMs can quickly lead us to solutions for some classic problems (though they occasionally introduce some unexpected issues). ] Regards
[Message part 2 (text/html, inline)]
This bug report was last modified 1 day ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.