GNU bug report logs -
#76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct
Previous Next
Reported by: "Yue Yi" <include_yy <at> qq.com>
Date: Tue, 4 Mar 2025 04:09:02 UTC
Severity: wishlist
Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 76731 in the body.
You can then email your comments to 76731 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Tue, 04 Mar 2025 04:09:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
"Yue Yi" <include_yy <at> qq.com>
:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org
.
(Tue, 04 Mar 2025 04:09:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Hello Emacs, In Elisp Manual's Rx Notation section, we have ------------------------------------------------------------------- Here is an rx regexp(1) that matches a block comment in the C programming language: (rx "/*" ; Initial /* (zero-or-more (or (not "*") ; Either non-*, (seq "*" ; or * followed by (not "/")))) ; non-/ (one-or-more "*") ; At least one star, "/") ; and the final / or, using shorter synonyms and written more compactly, (rx "/*" (* (| (not "*") (: "*" (not "/")))) (+ "*") "/") In conventional string syntax, it would be written "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/" -------------------------------------------------------------------- Sadly, this regexp is not correct, as demonstated by this simple example: (Try M-x isearch-forward-regexp with /\*\(?:[^*]\|\*[^/]\)*\*+/) /***/ 123 /* anything else */ As you can see, the entire line above is highlighted by the search, meaning that the whole line has been matched. In fact, this issue occurs when the number of asterisks in /*(nstar)*/ is odd. The correct regular expression is: /\*\(?:[^*]\|\*+[^*/]\)*\*+/ The corresponding RX expression in the original document could be: (rx "/*" (zero-or-more (or (not "*") (seq (one-or-more "*") (not (or "*" "/"))))) (one-or-more "*") "/") Or: (rx "/*" (* (| (not "*") (: (1+ "*") (not (or "*" "/"))))) (1+ "*") "/") BTW, using non-greedy `*?', the simplest way might be: (rx "/*" (*? anything) "*/") "/\\*[^z-a]*?\\*/" or "/\\*\\(?:.\\|\n\\)*?\\*/" Regards.
[Message part 2 (text/html, inline)]
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Tue, 04 Mar 2025 18:11:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 76731 <at> debbugs.gnu.org (full text, mbox):
"Yue Yi" via "Bug reports for GNU Emacs, the Swiss army knife of text
editors" <bug-gnu-emacs <at> gnu.org> writes:
> Hello Emacs,
>
> In Elisp Manual's Rx Notation section, we have
>
> -------------------------------------------------------------------
>
> Here is an ‘rx’ regexp(1) that matches a block comment in the C
> programming language:
>
> (rx "/*" ; Initial /*
> (zero-or-more
> (or (not "*") ; Either non-*,
> (seq "*" ; or * followed by
> (not "/")))) ; non-/
> (one-or-more "*") ; At least one star,
> "/") ; and the final /
>
> or, using shorter synonyms and written more compactly,
>
> (rx "/*"
> (* (| (not "*")
> (: "*" (not "/"))))
> (+ "*") "/")
>
> In conventional string syntax, it would be written
>
> "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
> --------------------------------------------------------------------
>
> Sadly, this regexp is not correct, as demonstated by this simple
> example: (Try M-x isearch-forward-regexp with
> /\*\(?:[^*]\|\*[^/]\)*\*+/)
>
> /***/ 123 /* anything else */
>
> As you can see, the entire line above is highlighted by the search,
> meaning that the whole line has been matched. In fact, this issue
> occurs when the number of asterisks in /*(nstar)*/ is odd.
>
> The correct regular expression is:
>
> /\*\(?:[^*]\|\*+[^*/]\)*\*+/
>
> The corresponding RX expression in the original document could be:
>
> (rx "/*"
> (zero-or-more
> (or (not "*")
> (seq (one-or-more "*")
> (not (or "*" "/")))))
> (one-or-more "*")
> "/")
>
> Or:
>
> (rx "/*"
> (* (| (not "*")
> (: (1+ "*") (not (or "*" "/")))))
> (1+ "*") "/")
>
> BTW, using non-greedy `*?', the simplest way might be:
>
> (rx "/*"
> (*? anything)
> "*/")
>
> "/\\*[^z-a]*?\\*/" or "/\\*\\(?:.\\|\n\\)*?\\*/"
>
> Regards.
Mattias, any comments?
Severity set to 'wishlist' from 'normal'
Request was from
Stefan Kangas <stefankangas <at> gmail.com>
to
control <at> debbugs.gnu.org
.
(Mon, 10 Mar 2025 21:10:01 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Fri, 16 May 2025 13:47:04 GMT)
Full text and
rfc822 format available.
Message #13 received at 76731 <at> debbugs.gnu.org (full text, mbox):
"Yue Yi" via "Bug reports for GNU Emacs, the Swiss army knife of text
editors" <bug-gnu-emacs <at> gnu.org> writes:
> In Elisp Manual's Rx Notation section, we have
>
> Here is an ‘rx’ regexp(1) that matches a block comment in the C
> programming language:
>
> (rx "/*" ; Initial /*
> (zero-or-more
> (or (not "*") ; Either non-*,
> (seq "*" ; or * followed by
> (not "/")))) ; non-/
> (one-or-more "*") ; At least one star,
> "/") ; and the final /
>
> Sadly, this regexp is not correct, as demonstated by this simple
> example:
> /***/ 123 /* anything else */
You are completely right! I just don't know what I was thinking. Sorry about that!
And my sincerest thanks to you for noticing this. Everyone who writes technical texts knows how valuable people who actually work through examples are.
1. How to fix it
Your proposed solution,
> (rx "/*"
> (* (| (not "*")
> (: (1+ "*") (not (or "*" "/")))))
> (1+ "*") "/")
appears correct but Emacs's NFA engine will match a final run of stars twice. Consider the text
/*************************************/
The regexp will match all stars, encounter the final slash, backtrack and match the stars again before matching the slash. A bit inelegant perhaps. More seriously, the stack usage is such that it can't parse a 1 MB comment without running out of stack space (on my machine). To be fair, he original regexp had the same problem.
(And yes, non-greedy operators can be used for a simple solution but as the footnote in the text says that's not the point here.)
2. Better alternative?
(rx "/*"
(* (not "*"))
(+ "*")
(* (not (in "*/"))
(* (not "*"))
(+ "*"))
"/")
is slightly more complicated but doesn't backtrack as much.
It still produces unnecessary backtrack points between runs of stars; perhaps the analysis to eliminate them is too hard for the compiler.
2. But is it a good example?
The purpose was never parsing C comments but to provide an example of how rx can help. Can we find something simpler?
Here is regexp for a simple quoted string:
(rx ?\"
(* (or (not (or ?\\ ?\"))
(: ?\\ (or ?\\ ?\"))))
?\")
Would that be a better example? The backslashes obscure things a bit.
Right now I'm leaning towards using the proposed fix.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Fri, 16 May 2025 15:18:01 GMT)
Full text and
rfc822 format available.
Message #16 received at 76731 <at> debbugs.gnu.org (full text, mbox):
[Message part 1 (text/plain, inline)]
Mattias Engdeg02rd Fri, 16 May 2025 15:46:40 +0200 writes > And my sincerest thanks to you for noticing this. Everyone who writes technical > texts knows how valuable people who actually work through examples are. I encountered this problem while trying to write a simple CSS parser using PEG. My first reaction was to look for regex examples in the manual, and Im glad I found it :p > Your proposed solution, > > > (rx "/*" > > (* (| (not "*") > > (: (1+ "*") (not (or "*" "/"))))) > > (1+ "*") "/") > > appears correct but Emacs's NFA engine will match a final run of stars twice. > Consider the text > > /*************************************/ > > The regexp will match all stars, encounter the final slash, backtrack and match > the stars again before matching the slash. A bit inelegant perhaps. More > seriously, the stack usage is such that it can't parse a 1 MB comment without > running out of stack space (on my machine). To be fair, he original regexp had > the same problem. > > (And yes, non-greedy operators can be used for a simple solution but as the > footnote in the text says that's not the point here.) > > > 2. Better alternative? > > (rx "/*" > (* (not "*")) > (+ "*") > (* (not (in "*/")) > (* (not "*")) > (+ "*")) > "/") > > is slightly more complicated but doesn't backtrack as much. > It still produces unnecessary backtrack points between runs of stars; perhaps > the analysis to eliminate them is too hard for the compiler. I'm not an expert in regular expressions, but it seems that cases like C block comments are hard to handle without introducing backtracking. Although it's unlikely to encounter such enormous and specific comments in practice, the implementation I provided does introduce some implicit performance issues. During my search, I think I might have seen the improved version you provided on Stack Overflow or somewhere else. Its definitely better, but also harder to understand. > 2. But is it a good example? > > The purpose was never parsing C comments but to provide an example of how rx > can help. Can we find something simpler? > Here is regexp for a simple quoted string: > > (rx ?\" > (* (or (not (or ?\\ ?\")) > (: ?\\ (or ?\\ ?\")))) > ?\") > > Would that be a better example? The backslashes obscure things a bit. > > Right now I'm leaning towards using the proposed fix. Yes, I think this example is great, simpler examples are always better. C comments are indeed a case where its hard to come up with a solution right away. It would be better to present it as a demonstration of how to think about and implement a regular expression, rather than just a simple regex example. But Id still like to keep it in some places, since it can actually be quite useful sometimes. [ Of course, this isnt really a problem nowadays LLMs can quickly lead us to solutions for some classic problems (though they occasionally introduce some unexpected issues). ] Regards
[Message part 2 (text/html, inline)]
Reply sent
to
Mattias Engdegård <mattias.engdegard <at> gmail.com>
:
You have taken responsibility.
(Sat, 17 May 2025 10:23:01 GMT)
Full text and
rfc822 format available.
Notification sent
to
"Yue Yi" <include_yy <at> qq.com>
:
bug acknowledged by developer.
(Sat, 17 May 2025 10:23:02 GMT)
Full text and
rfc822 format available.
Message #21 received at 76731-done <at> debbugs.gnu.org (full text, mbox):
16 maj 2025 kl. 17.12 skrev Yue Yi <include_yy <at> qq.com>:
> I'm not an expert in regular expressions, but it seems that cases like C
> block comments are hard to handle without introducing
> backtracking.
I see no fundamental reason why they should be, as the C comment syntax can be parsed efficiently by a tiny state machine. The first "/*" encountered is always the beginning of the comment on matter what is found later, and the first "*/" after that is always the end. There is never any reason to go back and try a different parse.
Non-DFA regexp engines such as the one in Emacs need some hacks and/or carefully formulated regexps to avoid consuming stack space but that's a different matter. I still think we should be able to do better with either your or my regexps.
I kept your proposed fix instead of switching to a different example. The quoted-string case is simpler but the amount of backslashes detracted from the point of the exercise.
Fix pushed to master. Thank you again!
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Mon, 19 May 2025 22:43:02 GMT)
Full text and
rfc822 format available.
Message #24 received at 76731-done <at> debbugs.gnu.org (full text, mbox):
> I see no fundamental reason why they should be, as the C comment syntax can
> be parsed efficiently by a tiny state machine.
Our regexp engine lacks to "switch" kind of operation needed for that.
After a series of `*` we have two mutually exclusive paths, depending on whether
the next char is a `/` or something else, but we don't have a "look
ahead" so we have to push a backtrack-point before we try to match the
next char with `/`.
Stefan
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Tue, 20 May 2025 11:49:02 GMT)
Full text and
rfc822 format available.
Message #27 received at 76731 <at> debbugs.gnu.org (full text, mbox):
20 maj 2025 kl. 00.42 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:
> Our regexp engine lacks to "switch" kind of operation needed for that.
> After a series of `*` we have two mutually exclusive paths, depending on whether
> the next char is a `/` or something else, but we don't have a "look
> ahead" so we have to push a backtrack-point before we try to match the
> next char with `/`.
All true but we do have the ad-hoc non-backtracking optimisation. Right now it's only used for 'simple' patterns as in
(: (* "a") "b")
but there shouldn't be any reason not to use it for, say,
(: (* "a" ARBITRARY-PATTERN) "b")
should it? Except for ease of implementation, of course.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#76731
; Package
emacs
.
(Tue, 20 May 2025 14:22:02 GMT)
Full text and
rfc822 format available.
Message #30 received at 76731 <at> debbugs.gnu.org (full text, mbox):
>> Our regexp engine lacks to "switch" kind of operation needed for that.
>> After a series of `*` we have two mutually exclusive paths, depending on whether
>> the next char is a `/` or something else, but we don't have a "look
>> ahead" so we have to push a backtrack-point before we try to match the
>> next char with `/`.
>
> All true but we do have the ad-hoc non-backtracking optimisation. Right now
> it's only used for 'simple' patterns as in
>
> (: (* "a") "b")
>
> but there shouldn't be any reason not to use it for, say,
>
> (: (* "a" ARBITRARY-PATTERN) "b")
>
> should it? Except for ease of implementation, of course.
Yup, presumably the same `mutually_exclusive_p` test could be used to
detect when such an optimisation can be used, so it's a "small matter of
programming" (tho it includes the choice/design of appropriate
bytecodes, which is ... language design more than programming per se).
Last time I looked/thought about it, I saw two ways to do that:
- Add a bytecode to `pop` the last backtrack-point. This might(?) be
easy-ish to implement, and would indeed reduce the stack usage but
without speeding things up very much (we'd still push&pop the same
number of backtrack-points). Also it would require extra care
(e.g. punting) for cases where another backtrack-point gets pushed
in-between such as
(: (* (or "a" RE1) (or (and "abc") RE2)) "b")
- Add a kind of `switch` bytecode with a table that maps chars to
corresponding labels to jump to. This would speed up matching (and
would allow using that same regexp engine with a different
regexp-compiler that always generates a DFA) but it requires a fair
bit more work on the regexp-compiler side. I think it would require
a whole new regexp-compiler, which I would imagine implementing in
ELisp and calling explicitly at compile-time.
Stefan
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Wed, 18 Jun 2025 11:24:10 GMT)
Full text and
rfc822 format available.
This bug report was last modified today.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.