GNU bug report logs -
#76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct
Reported by: "Yue Yi" <include_yy <at> qq.com>
Date: Tue, 4 Mar 2025 04:09:02 UTC
Severity: wishlist
Done: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Bug is archived. No further changes may be made.
[Message part 1 (text/plain, inline)]
Your message dated Sat, 17 May 2025 12:21:52 +0200
with message-id <74D426F1-2A7B-4F6A-BBF2-3CC1885BD138 <at> gmail.com>
and subject line Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx Notation") is not correct
has caused the debbugs.gnu.org bug report #76731,
regarding C-style comment regexp example in (info "(elisp)Rx Notation") is not correct
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
76731: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=76731
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
[Message part 3 (text/plain, inline)]
Hello Emacs,

In the Elisp Manual's Rx Notation section, we have

-------------------------------------------------------------------
Here is an ‘rx’ regexp(1) that matches a block comment in the C
programming language:

    (rx "/*"                    ; Initial /*
        (zero-or-more
         (or (not "*")          ; Either non-*,
             (seq "*"           ; or * followed by
                  (not "/"))))  ; non-/
        (one-or-more "*")       ; At least one star,
        "/")                    ; and the final /

or, using shorter synonyms and written more compactly,

    (rx "/*"
        (* (| (not "*")
              (: "*" (not "/"))))
        (+ "*") "/")

In conventional string syntax, it would be written

    "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
-------------------------------------------------------------------

Sadly, this regexp is not correct, as demonstrated by this simple example:

(Try M-x isearch-forward-regexp with /\*\(?:[^*]\|\*[^/]\)*\*+/)

    /***/ 123 /* anything else */

As you can see, the entire line above is highlighted by the search, meaning that the whole line has been matched. In fact, this issue occurs when the number of asterisks in /*(nstar)*/ is odd: the stars that follow the opening /* are then consumed in pairs by the `* followed by non-/' branch, so the closing */ is never recognized.

The correct regular expression is:

    /\*\(?:[^*]\|\*+[^*/]\)*\*+/

The corresponding RX expression in the original document could be:

    (rx "/*"
        (zero-or-more
         (or (not "*")
             (seq (one-or-more "*")
                  (not (or "*" "/")))))
        (one-or-more "*")
        "/")

Or:

    (rx "/*"
        (* (| (not "*")
              (: (1+ "*") (not (or "*" "/")))))
        (1+ "*") "/")

BTW, using non-greedy `*?', the simplest way might be:

    (rx "/*" (*? anything) "*/")

    "/\\*[^z-a]*?\\*/" or "/\\*\\(?:.\\|\n\\)*?\\*/"

Regards.
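The over-match described in the report can also be checked from Lisp rather than interactively. The following is a minimal sketch, assuming the sample line from the report; `demo-first-match' is a hypothetical helper introduced only for illustration, not part of Emacs.

```elisp
;; Hypothetical helper (not part of Emacs): return the text of the
;; first match of REGEXP in TEXT, or nil if there is no match.
(defun demo-first-match (regexp text)
  (when (string-match regexp text)
    (match-string 0 text)))

(let ((text "/***/ 123 /* anything else */")
      ;; Regexp from the manual, as quoted in the report.
      (old "/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/")
      ;; Regexp proposed in the report.
      (new "/\\*\\(?:[^*]\\|\\*+[^*/]\\)*\\*+/"))
  (list (demo-first-match old text)
        (demo-first-match new text)))
;; => ("/***/ 123 /* anything else */" "/***/")
```

The first element shows the old regexp running past the end of the first comment; the second shows the proposed regexp stopping at the first "*/".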
[Message part 5 (message/rfc822, inline)]
On 16 May 2025, at 17:12, Yue Yi <include_yy <at> qq.com> wrote:
> I'm not an expert in regular expressions, but it seems that cases like C
> block comments are hard to handle without introducing
> backtracking.
I see no fundamental reason why they should be, as the C comment syntax can be parsed efficiently by a tiny state machine. The first "/*" encountered is always the beginning of the comment no matter what is found later, and the first "*/" after that is always the end. There is never any reason to go back and try a different parse.
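The "tiny state machine" mentioned above could look roughly like the following. This is only an illustrative sketch: `demo-c-comment-bounds' and its state names are hypothetical, it is not how the Emacs regexp engine works, and it deliberately ignores string literals and "//" comments. Each character is read exactly once and the scanner never backs up.

```elisp
;; Hypothetical four-state scanner (not part of Emacs).
;; States: `code'  - outside any comment
;;         `slash' - just saw "/" outside a comment
;;         `body'  - inside a comment
;;         `star'  - inside a comment, just saw one or more "*"
(defun demo-c-comment-bounds (text)
  "Return (START . END) of the first C block comment in TEXT, or nil."
  (let ((state 'code) (i 0) (n (length text)) start result)
    (while (and (< i n) (not result))
      (let ((ch (aref text i)))
        (pcase state
          ('code  (when (eq ch ?/) (setq state 'slash start i)))
          ('slash (setq state (cond ((eq ch ?*) 'body)
                                    ;; Another "/" may itself open a comment.
                                    ((eq ch ?/) (setq start i) 'slash)
                                    (t 'code))))
          ('body  (when (eq ch ?*) (setq state 'star)))
          ('star  (cond ((eq ch ?/) (setq result (cons start (1+ i))))
                        ((eq ch ?*) nil)           ; stay in `star'
                        (t (setq state 'body))))))
      (setq i (1+ i)))
    result))

(demo-c-comment-bounds "/***/ 123 /* anything else */")
;; => (0 . 5)
```

The result covers only "/***/", the same span a correct regexp should match.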
Non-DFA regexp engines such as the one in Emacs need some hacks and/or carefully formulated regexps to avoid consuming stack space, but that's a different matter. I still think we should be able to do better with either your or my regexps.
I kept your proposed fix instead of switching to a different example. The quoted-string case is simpler, but the number of backslashes detracted from the point of the exercise.
Fix pushed to master. Thank you again!