GNU bug report logs - #76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct
Reported by: "Yue Yi" <include_yy <at> qq.com>
Date: Tue, 4 Mar 2025 04:09:02 UTC
Severity: wishlist
Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Bug is archived. No further changes may be made.
Message #13 received at 76731 <at> debbugs.gnu.org (full text, mbox):
"Yue Yi" via "Bug reports for GNU Emacs, the Swiss army knife of text
editors" <bug-gnu-emacs <at> gnu.org> writes:
> In Elisp Manual's Rx Notation section, we have
>
> Here is an ‘rx’ regexp(1) that matches a block comment in the C
> programming language:
>
> (rx "/*"                        ; Initial /*
>     (zero-or-more
>      (or (not "*")              ; Either non-*,
>          (seq "*"               ; or * followed by
>               (not "/"))))      ; non-/
>     (one-or-more "*")           ; At least one star,
>     "/")                        ; and the final /
>
> Sadly, this regexp is not correct, as demonstrated by this simple
> example:
> /***/ 123 /* anything else */
You are completely right! I just don't know what I was thinking. Sorry about that!
And my sincerest thanks to you for noticing this. Everyone who writes technical texts knows how valuable people who actually work through examples are.
1. How to fix it
Your proposed solution,
> (rx "/*"
>     (* (| (not "*")
>           (: (1+ "*") (not (or "*" "/")))))
>     (1+ "*") "/")
appears correct but Emacs's NFA engine will match a final run of stars twice. Consider the text
/*************************************/
The regexp will match all the stars, encounter the final slash, backtrack, and match the stars again before matching the slash. A bit inelegant, perhaps. More seriously, the stack usage is such that it can't parse a 1 MB comment without running out of stack space (on my machine). To be fair, the original regexp had the same problem.
(And yes, non-greedy operators can be used for a simple solution, but as the footnote in the text says, that's not the point here.)
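To make the difference concrete, here is a small sketch that runs the manual's regexp and the proposed fix on the reporter's example text. It assumes Emacs 27 or later (for single-character strings and 'or' forms inside 'not'). The demo- names are made up for this illustration and are not part of Emacs or the manual.

```elisp
(require 'rx)

;; The regexp currently shown in the manual (hypothetical constant name).
(defconst demo-manual-comment-re
  (rx "/*"
      (zero-or-more
       (or (not "*")
           (seq "*" (not "/"))))
      (one-or-more "*")
      "/"))

;; The fix proposed in the bug report (hypothetical constant name).
(defconst demo-fixed-comment-re
  (rx "/*"
      (* (| (not "*")
            (: (1+ "*") (not (or "*" "/")))))
      (1+ "*") "/"))

(defun demo-first-match (regexp text)
  "Return the first match of REGEXP in TEXT, or nil if there is none."
  (when (string-match regexp text)
    (match-string 0 text)))

;; The manual's regexp swallows both comments and the code between them:
(demo-first-match demo-manual-comment-re "/***/ 123 /* anything else */")
;; => "/***/ 123 /* anything else */"

;; The proposed fix stops at the end of the first comment:
(demo-first-match demo-fixed-comment-re "/***/ 123 /* anything else */")
;; => "/***/"
```

The manual's version goes wrong because its inner loop consumes the last two stars of "/***/" as "star followed by non-slash", so the closing slash is then treated as ordinary comment text.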
2. Better alternative?
(rx "/*"
    (* (not "*"))
    (+ "*")
    (* (not (in "*/"))
       (* (not "*"))
       (+ "*"))
    "/")
is slightly more complicated but doesn't backtrack as much.
It still produces unnecessary backtrack points between runs of stars; perhaps the analysis to eliminate them is too hard for the compiler.
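As a quick sanity check of this alternative, the sketch below tries it on a few inputs, including the reporter's example. It only checks what is matched and says nothing about backtracking or stack usage. The demo- names are hypothetical, and Emacs 27 or later is assumed.

```elisp
(require 'rx)

;; The alternative regexp from above (hypothetical constant name).
(defconst demo-alt-comment-re
  (rx "/*"
      (* (not "*"))            ; text up to the first run of stars
      (+ "*")                  ; a run of stars
      (* (not (in "*/"))       ; the run was not the end: any other char,
         (* (not "*"))         ; more text,
         (+ "*"))              ; and the next run of stars
      "/"))

(defun demo-alt-first-match (text)
  "Return the first block comment found in TEXT, or nil if there is none."
  (when (string-match demo-alt-comment-re text)
    (match-string 0 text)))

(demo-alt-first-match "/***/ 123 /* anything else */")   ; => "/***/"
(demo-alt-first-match "x = 1; /* a * b ** c */ y = 2;")  ; => "/* a * b ** c */"
(demo-alt-first-match "/*/")                             ; => nil
```

The last case shows that "/*/" is correctly rejected: the star that opens the comment cannot also be the star that closes it.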
3. But is it a good example?
The purpose was never parsing C comments but to provide an example of how rx can help. Can we find something simpler?
Here is a regexp for a simple quoted string:
(rx ?\"
    (* (or (not (or ?\\ ?\"))
           (: ?\\ (or ?\\ ?\"))))
    ?\")
Would that be a better example? The backslashes obscure things a bit.
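For comparison, here is a sketch of how the quoted-string regexp behaves on a string containing escaped quotes. The demo- names are hypothetical, and Emacs 27 or later is assumed. Note that the test inputs need a second level of backslash escaping because they are themselves Lisp string literals.

```elisp
(require 'rx)

;; The quoted-string regexp from above (hypothetical constant name).
(defconst demo-quoted-string-re
  (rx ?\"
      (* (or (not (or ?\\ ?\"))        ; any char except backslash or quote,
             (: ?\\ (or ?\\ ?\"))))    ; or an escaped backslash or quote
      ?\"))

(defun demo-first-quoted-string (text)
  "Return the first quoted string found in TEXT, or nil if there is none."
  (when (string-match demo-quoted-string-re text)
    (match-string 0 text)))

;; The text here is:  say "a \"b\" c" now
(demo-first-quoted-string "say \"a \\\"b\\\" c\" now")
;; => "\"a \\\"b\\\" c\""   i.e. the text  "a \"b\" c"

;; An unterminated string does not match:
(demo-first-quoted-string "\"abc\\")
;; => nil
```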
Right now I'm leaning towards using the proposed fix.