GNU bug report logs - #76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct

Package: emacs;

Reported by: "Yue Yi" <include_yy <at> qq.com>

Date: Tue, 4 Mar 2025 04:09:02 UTC

Severity: wishlist

Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 76731 in the body.
You can then email your comments to 76731 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Tue, 04 Mar 2025 04:09:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Yue Yi" <include_yy <at> qq.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 04 Mar 2025 04:09:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Yue Yi" <include_yy <at> qq.com>
To: "bug-gnu-emacs" <bug-gnu-emacs <at> gnu.org>
Subject: C-style comment regexp example in (info "(elisp)Rx Notation") is not
 correct
Date: Tue, 4 Mar 2025 11:58:36 +0800

[Message part 1 (text/plain, inline)]

Hello Emacs, In Elisp Manual's Rx Notation section, we have -------------------------------------------------------------------    Here is an ��rx�� regexp(1) that matches a block comment in the C programming language:      (rx "/*"                    ; Initial /*          (zero-or-more           (or (not "*")          ;  Either non-*,               (seq "*"           ;  or * followed by                    (not "/"))))  ;     non-/          (one-or-more "*")       ; At least one star,          "/")                    ; and the final / or, using shorter synonyms and written more compactly,      (rx "/*"          (* (| (not "*")                (: "*" (not "/"))))          (+ "*") "/") In conventional string syntax, it would be written      "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/" -------------------------------------------------------------------- Sadly, this regexp is not correct, as demonstated by this simple example: (Try M-x isearch-forward-regexp with /\*\(?:[^*]\|\*[^/]\)*\*+/) /***/ 123 /* anything else */ As you can see, the entire line above is highlighted by the search, meaning that the whole line has been matched. In fact, this issue occurs when the number of asterisks in /*(nstar)*/ is odd. The correct regular expression is: /\*\(?:[^*]\|\*+[^*/]\)*\*+/ The corresponding RX expression in the original document could be: (rx "/*"     (zero-or-more      (or (not "*") 	 (seq (one-or-more "*") 	      (not (or "*" "/")))))     (one-or-more "*")     "/") Or: (rx "/*"     (* (| (not "*") 	  (: (1+ "*") (not (or "*" "/")))))     (1+ "*") "/") BTW, using non-greedy `*?', the simplest way might be: (rx "/*"     (*? anything)     "*/") "/\\*[^z-a]*?\\*/" or "/\\*\\(?:.\\|\n\\)*?\\*/" Regards.

[Message part 2 (text/html, inline)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Tue, 04 Mar 2025 18:11:02 GMT) Full text and rfc822 format available.

Message #8 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Yue Yi <include_yy <at> qq.com>, 76731 <at> debbugs.gnu.org
Cc: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Tue, 4 Mar 2025 18:10:43 +0000

"Yue Yi" via "Bug reports for GNU Emacs, the Swiss army knife of text
editors" <bug-gnu-emacs <at> gnu.org> writes:

> Hello Emacs,
>
> In Elisp Manual's Rx Notation section, we have
>
> -------------------------------------------------------------------
>
>    Here is an ‘rx’ regexp(1) that matches a block comment in the C
> programming language:
>
>      (rx "/*"                    ; Initial /*
>          (zero-or-more
>           (or (not "*")          ;  Either non-*,
>               (seq "*"           ;  or * followed by
>                    (not "/"))))  ;     non-/
>          (one-or-more "*")       ; At least one star,
>          "/")                    ; and the final /
>
> or, using shorter synonyms and written more compactly,
>
>      (rx "/*"
>          (* (| (not "*")
>                (: "*" (not "/"))))
>          (+ "*") "/")
>
> In conventional string syntax, it would be written
>
>      "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
> --------------------------------------------------------------------
>
> Sadly, this regexp is not correct, as demonstated by this simple
> example: (Try M-x isearch-forward-regexp with
> /\*\(?:[^*]\|\*[^/]\)*\*+/)
>
> /***/ 123 /* anything else */
>
> As you can see, the entire line above is highlighted by the search,
> meaning that the whole line has been matched. In fact, this issue
> occurs when the number of asterisks in /*(nstar)*/ is odd.
>
> The correct regular expression is:
>
> /\*\(?:[^*]\|\*+[^*/]\)*\*+/
>
> The corresponding RX expression in the original document could be:
>
> (rx "/*"
>     (zero-or-more
>      (or (not "*")
>          (seq (one-or-more "*")
>               (not (or "*" "/")))))
>     (one-or-more "*")
>     "/")
>
> Or:
>
> (rx "/*"
>     (* (| (not "*")
>           (: (1+ "*") (not (or "*" "/")))))
>     (1+ "*") "/")
>
> BTW, using non-greedy `*?', the simplest way might be:
>
> (rx "/*"
>     (*? anything)
>     "*/")
>
> "/\\*[^z-a]*?\\*/" or "/\\*\\(?:.\\|\n\\)*?\\*/"
>
> Regards.

Mattias, any comments?

Severity set to 'wishlist' from 'normal' Request was from Stefan Kangas <stefankangas <at> gmail.com> to control <at> debbugs.gnu.org. (Mon, 10 Mar 2025 21:10:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Fri, 16 May 2025 13:47:04 GMT) Full text and rfc822 format available.

Message #13 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: Yue Yi <include_yy <at> qq.com>, Stefan Monnier <monnier <at> iro.umontreal.ca>,
 76731 <at> debbugs.gnu.org
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Fri, 16 May 2025 15:46:40 +0200

"Yue Yi" via "Bug reports for GNU Emacs, the Swiss army knife of text
editors" <bug-gnu-emacs <at> gnu.org> writes:

> In Elisp Manual's Rx Notation section, we have
> 
>   Here is an ‘rx’ regexp(1) that matches a block comment in the C
> programming language:
> 
>     (rx "/*"                    ; Initial /*
>         (zero-or-more
>          (or (not "*")          ;  Either non-*,
>              (seq "*"           ;  or * followed by
>                   (not "/"))))  ;     non-/
>         (one-or-more "*")       ; At least one star,
>         "/")                    ; and the final /
> 
> Sadly, this regexp is not correct, as demonstated by this simple
> example:

> /***/ 123 /* anything else */

You are completely right! I just don't know what I was thinking. Sorry about that!

And my sincerest thanks to you for noticing this. Everyone who writes technical texts knows how valuable people who actually work through examples are.

1. How to fix it

Your proposed solution,

> (rx "/*"
>    (* (| (not "*")
>          (: (1+ "*") (not (or "*" "/")))))
>    (1+ "*") "/")

appears correct but Emacs's NFA engine will match a final run of stars twice. Consider the text

  /*************************************/

The regexp will match all stars, encounter the final slash, backtrack and match the stars again before matching the slash. A bit inelegant perhaps. More seriously, the stack usage is such that it can't parse a 1 MB comment without running out of stack space (on my machine). To be fair, he original regexp had the same problem.

(And yes, non-greedy operators can be used for a simple solution but as the footnote in the text says that's not the point here.)

2. Better alternative?

(rx "/*"
    (* (not "*"))
    (+ "*")
    (* (not (in "*/"))
       (* (not "*"))
       (+ "*"))
    "/")

is slightly more complicated but doesn't backtrack as much.
It still produces unnecessary backtrack points between runs of stars; perhaps the analysis to eliminate them is too hard for the compiler.

2. But is it a good example?

The purpose was never parsing C comments but to provide an example of how rx can help. Can we find something simpler?
Here is regexp for a simple quoted string:

(rx ?\"
    (* (or (not (or ?\\ ?\"))
           (: ?\\ (or ?\\ ?\"))))
    ?\")

Would that be a better example? The backslashes obscure things a bit.

Right now I'm leaning towards using the proposed fix.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Fri, 16 May 2025 15:18:01 GMT) Full text and rfc822 format available.

Message #16 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: "Yue Yi" <include_yy <at> qq.com>
To: "Mattias Engdeg�0�2rd" <mattias.engdegard <at> gmail.com>,
 "Stefan Kangas" <stefankangas <at> gmail.com>
Cc: 76731 <76731 <at> debbugs.gnu.org>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Fri, 16 May 2025 23:12:05 +0800

[Message part 1 (text/plain, inline)]

Mattias Engdeg�0�2rd Fri, 16 May 2025 15:46:40 +0200 writes &gt; And my sincerest thanks to you for noticing this. Everyone who writes technical  &gt; texts knows how valuable people who actually work through examples are. I encountered this problem while trying to write a simple CSS parser using PEG. My first reaction was to look for regex examples in the manual, and I��m glad I found it :p &gt; Your proposed solution, &gt;  &gt; &gt; (rx "/*" &gt; &gt;    (* (| (not "*") &gt; &gt;          (: (1+ "*") (not (or "*" "/"))))) &gt; &gt;    (1+ "*") "/") &gt;  &gt; appears correct but Emacs's NFA engine will match a final run of stars twice.  &gt; Consider the text &gt;  &gt;   /*************************************/ &gt;  &gt; The regexp will match all stars, encounter the final slash, backtrack and match  &gt; the stars again before matching the slash. A bit inelegant perhaps. More  &gt; seriously, the stack usage is such that it can't parse a 1 MB comment without  &gt; running out of stack space (on my machine). To be fair, he original regexp had  &gt; the same problem. &gt;  &gt; (And yes, non-greedy operators can be used for a simple solution but as the  &gt; footnote in the text says that's not the point here.) &gt;  &gt;  &gt; 2. Better alternative? &gt;  &gt; (rx "/*" &gt;     (* (not "*")) &gt;     (+ "*") &gt;     (* (not (in "*/")) &gt;        (* (not "*")) &gt;        (+ "*")) &gt;     "/") &gt;  &gt; is slightly more complicated but doesn't backtrack as much. &gt; It still produces unnecessary backtrack points between runs of stars; perhaps  &gt; the analysis to eliminate them is too hard for the compiler. I'm not an expert in regular expressions, but it seems that cases like C block comments are hard to handle without introducing backtracking. Although it's unlikely to encounter such enormous and specific comments in practice, the implementation I provided does introduce some implicit performance issues. During my search, I think I might have seen the improved version you provided on Stack Overflow or somewhere else. It��s definitely better, but also harder to understand. &gt; 2. But is it a good example? &gt;  &gt; The purpose was never parsing C comments but to provide an example of how rx  &gt; can help. Can we find something simpler? &gt; Here is regexp for a simple quoted string: &gt;  &gt; (rx ?\" &gt;     (* (or (not (or ?\\ ?\")) &gt;            (: ?\\ (or ?\\ ?\")))) &gt;     ?\") &gt;  &gt; Would that be a better example? The backslashes obscure things a bit. &gt;  &gt; Right now I'm leaning towards using the proposed fix. Yes, I think this example is great, simpler examples are always better. C comments are indeed a case where it��s hard to come up with a solution right away. It would be better to present it as a demonstration of how to think about and implement a regular expression, rather than just a simple regex example. But I��d still like to keep it in some places, since it can actually be quite useful sometimes. [ Of course, this isn��t really a problem nowadays �� LLMs can quickly lead us to solutions for some classic problems (though they occasionally introduce some unexpected issues). ] Regards

[Message part 2 (text/html, inline)]

Reply sent to Mattias Engdegård <mattias.engdegard <at> gmail.com>:
You have taken responsibility. (Sat, 17 May 2025 10:23:01 GMT) Full text and rfc822 format available.

Notification sent to "Yue Yi" <include_yy <at> qq.com>:
bug acknowledged by developer. (Sat, 17 May 2025 10:23:02 GMT) Full text and rfc822 format available.

Message #21 received at 76731-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Yue Yi <include_yy <at> qq.com>
Cc: Stefan Kangas <stefankangas <at> gmail.com>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>, 76731-done <at> debbugs.gnu.org
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Sat, 17 May 2025 12:21:52 +0200

16 maj 2025 kl. 17.12 skrev Yue Yi <include_yy <at> qq.com>:

> I'm not an expert in regular expressions, but it seems that cases like C
> block comments are hard to handle without introducing
> backtracking.

I see no fundamental reason why they should be, as the C comment syntax can be parsed efficiently by a tiny state machine. The first "/*" encountered is always the beginning of the comment on matter what is found later, and the first "*/" after that is always the end. There is never any reason to go back and try a different parse.

Non-DFA regexp engines such as the one in Emacs need some hacks and/or carefully formulated regexps to avoid consuming stack space but that's a different matter. I still think we should be able to do better with either your or my regexps.

I kept your proposed fix instead of switching to a different example. The quoted-string case is simpler but the amount of backslashes detracted from the point of the exercise.

Fix pushed to master. Thank you again!

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Mon, 19 May 2025 22:43:02 GMT) Full text and rfc822 format available.

Message #24 received at 76731-done <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 76731-done <at> debbugs.gnu.org, Yue Yi <include_yy <at> qq.com>,
 Stefan Kangas <stefankangas <at> gmail.com>
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Mon, 19 May 2025 18:42:42 -0400

> I see no fundamental reason why they should be, as the C comment syntax can
> be parsed efficiently by a tiny state machine.

Our regexp engine lacks to "switch" kind of operation needed for that.
After a series of `*` we have two mutually exclusive paths, depending on whether
the next char is a `/` or something else, but we don't have a "look
ahead" so we have to push a backtrack-point before we try to match the
next char with `/`.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Tue, 20 May 2025 11:49:02 GMT) Full text and rfc822 format available.

Message #27 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Yue Yi <include_yy <at> qq.com>, 76731 <at> debbugs.gnu.org,
 Stefan Kangas <stefankangas <at> gmail.com>
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Tue, 20 May 2025 13:48:45 +0200

20 maj 2025 kl. 00.42 skrev Stefan Monnier <monnier <at> iro.umontreal.ca>:

> Our regexp engine lacks to "switch" kind of operation needed for that.
> After a series of `*` we have two mutually exclusive paths, depending on whether
> the next char is a `/` or something else, but we don't have a "look
> ahead" so we have to push a backtrack-point before we try to match the
> next char with `/`.

All true but we do have the ad-hoc non-backtracking optimisation. Right now it's only used for 'simple' patterns as in

(: (* "a") "b")

but there shouldn't be any reason not to use it for, say,

(: (* "a" ARBITRARY-PATTERN) "b")

should it? Except for ease of implementation, of course.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#76731; Package emacs. (Tue, 20 May 2025 14:22:02 GMT) Full text and rfc822 format available.

Message #30 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: Yue Yi <include_yy <at> qq.com>, 76731 <at> debbugs.gnu.org,
 Stefan Kangas <stefankangas <at> gmail.com>
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Tue, 20 May 2025 10:21:27 -0400

>> Our regexp engine lacks to "switch" kind of operation needed for that.
>> After a series of `*` we have two mutually exclusive paths, depending on whether
>> the next char is a `/` or something else, but we don't have a "look
>> ahead" so we have to push a backtrack-point before we try to match the
>> next char with `/`.
>
> All true but we do have the ad-hoc non-backtracking optimisation. Right now
> it's only used for 'simple' patterns as in
>
> (: (* "a") "b")
>
> but there shouldn't be any reason not to use it for, say,
>
> (: (* "a" ARBITRARY-PATTERN) "b")
>
> should it? Except for ease of implementation, of course.

Yup, presumably the same `mutually_exclusive_p` test could be used to
detect when such an optimisation can be used, so it's a "small matter of
programming" (tho it includes the choice/design of appropriate
bytecodes, which is ... language design more than programming per se).
Last time I looked/thought about it, I saw two ways to do that:

- Add a bytecode to `pop` the last backtrack-point.  This might(?) be
  easy-ish to implement, and would indeed reduce the stack usage but
  without speeding things up very much (we'd still push&pop the same
  number of backtrack-points).  Also it would require extra care
  (e.g. punting) for cases where another backtrack-point gets pushed
  in-between such as

      (: (* (or "a" RE1) (or (and "abc") RE2)) "b")

- Add a kind of `switch` bytecode with a table that maps chars to
  corresponding labels to jump to.  This would speed up matching (and
  would allow using that same regexp engine with a different
  regexp-compiler that always generates a DFA) but it requires a fair
  bit more work on the regexp-compiler side.  I think it would require
  a whole new regexp-compiler, which I would imagine implementing in
  ELisp and calling explicitly at compile-time.


        Stefan

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 18 Jun 2025 11:24:10 GMT) Full text and rfc822 format available.

This bug report was last modified 52 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #76731 C-style comment regexp example in (info "(elisp)Rx Notation") is not correct

GNU bug report logs - #76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct