GNU bug report logs - #76731
C-style comment regexp example in (info "(elisp)Rx Notation") is not correct

Previous Next

Package: emacs;

Reported by: "Yue Yi" <include_yy <at> qq.com>

Date: Tue, 4 Mar 2025 04:09:02 UTC

Severity: wishlist

Done: Mattias Engdeg氓rd <mattias.engdegard <at> gmail.com>

Bug is archived. No further changes may be made.

Full log


Message #16 received at 76731 <at> debbugs.gnu.org (full text, mbox):

From: "Yue Yi" <include_yy <at> qq.com>
To: "Mattias Engdegård" <mattias.engdegard <at> gmail.com>,
 "Stefan Kangas" <stefankangas <at> gmail.com>
Cc: 76731 <76731 <at> debbugs.gnu.org>,
 Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#76731: C-style comment regexp example in (info "(elisp)Rx
 Notation") is not correct
Date: Fri, 16 May 2025 23:12:05 +0800
[Message part 1 (text/plain, inline)]
Mattias Engdegård Fri, 16 May 2025 15:46:40 +0200 writes &gt; And my sincerest thanks to you for noticing this. Everyone who writes technical  &gt; texts knows how valuable people who actually work through examples are. I encountered this problem while trying to write a simple CSS parser using PEG. My first reaction was to look for regex examples in the manual, and I’m glad I found it :p &gt; Your proposed solution, &gt;  &gt; &gt; (rx "/*" &gt; &gt;    (* (| (not "*") &gt; &gt;          (: (1+ "*") (not (or "*" "/"))))) &gt; &gt;    (1+ "*") "/") &gt;  &gt; appears correct but Emacs's NFA engine will match a final run of stars twice.  &gt; Consider the text &gt;  &gt;   /*************************************/ &gt;  &gt; The regexp will match all stars, encounter the final slash, backtrack and match  &gt; the stars again before matching the slash. A bit inelegant perhaps. More  &gt; seriously, the stack usage is such that it can't parse a 1 MB comment without  &gt; running out of stack space (on my machine). To be fair, he original regexp had  &gt; the same problem. &gt;  &gt; (And yes, non-greedy operators can be used for a simple solution but as the  &gt; footnote in the text says that's not the point here.) &gt;  &gt;  &gt; 2. Better alternative? &gt;  &gt; (rx "/*" &gt;     (* (not "*")) &gt;     (+ "*") &gt;     (* (not (in "*/")) &gt;        (* (not "*")) &gt;        (+ "*")) &gt;     "/") &gt;  &gt; is slightly more complicated but doesn't backtrack as much. &gt; It still produces unnecessary backtrack points between runs of stars; perhaps  &gt; the analysis to eliminate them is too hard for the compiler. I'm not an expert in regular expressions, but it seems that cases like C block comments are hard to handle without introducing backtracking. Although it's unlikely to encounter such enormous and specific comments in practice, the implementation I provided does introduce some implicit performance issues. During my search, I think I might have seen the improved version you provided on Stack Overflow or somewhere else. It’s definitely better, but also harder to understand. &gt; 2. But is it a good example? &gt;  &gt; The purpose was never parsing C comments but to provide an example of how rx  &gt; can help. Can we find something simpler? &gt; Here is regexp for a simple quoted string: &gt;  &gt; (rx ?\" &gt;     (* (or (not (or ?\\ ?\")) &gt;            (: ?\\ (or ?\\ ?\")))) &gt;     ?\") &gt;  &gt; Would that be a better example? The backslashes obscure things a bit. &gt;  &gt; Right now I'm leaning towards using the proposed fix. Yes, I think this example is great, simpler examples are always better. C comments are indeed a case where it’s hard to come up with a solution right away. It would be better to present it as a demonstration of how to think about and implement a regular expression, rather than just a simple regex example. But I’d still like to keep it in some places, since it can actually be quite useful sometimes. [ Of course, this isn’t really a problem nowadays — LLMs can quickly lead us to solutions for some classic problems (though they occasionally introduce some unexpected issues). ] Regards
[Message part 2 (text/html, inline)]

This bug report was last modified 1 day ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.