GNU bug report logs -
#53680
Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
Previous Next
Reported by: Christian Johansson <christian <at> cvj.se>
Date: Tue, 1 Feb 2022 07:38:01 UTC
Severity: normal
Tags: confirmed, notabug
Done: Lars Ingebrigtsen <larsi <at> gnus.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 53680 in the body.
You can then email your comments to 53680 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 07:38:01 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
Christian Johansson <christian <at> cvj.se>
:
New bug report received and forwarded. Copy sent to
bug-gnu-emacs <at> gnu.org
.
(Tue, 01 Feb 2022 07:38:01 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello
Some context first, since I needed a multi-line regexp matcher that
could tell the difference of end-of-string vs newline ($ vs \n) in a
string I replace occurrences of \n with \r in a string.
In a peculiar case a endless loop happens and I can reproduce this bug
in Emacs 27.2 and 28.0.50 (have not tested any other versions), ok try
to eval following snippet:
(string-match-p·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
or
(string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
--
Hälsningar / Best Regards
Christian
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 11:01:01 GMT)
Full text and
rfc822 format available.
Message #8 received at 53680 <at> debbugs.gnu.org (full text, mbox):
Christian Johansson <christian <at> cvj.se> writes:
> In a peculiar case a endless loop happens and I can reproduce this bug
> in Emacs 27.2 and 28.0.50 (have not tested any other versions), ok try
> to eval following snippet:
>
> (string-match-p·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
> or
I guess the · are supposed to be spaces, so it's:
(string-match
"[\r\t ]*implements[\r\t ]+\\([\r\t ]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
"ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r public function __construct()\r ")
And I can reproduce this on master, too. Is it a problem with excessive
backtracking, perhaps? I let it run for a minute, and it didn't finish
in that time.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Added tag(s) confirmed.
Request was from
Lars Ingebrigtsen <larsi <at> gnus.org>
to
control <at> debbugs.gnu.org
.
(Tue, 01 Feb 2022 11:01:02 GMT)
Full text and
rfc822 format available.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 11:09:02 GMT)
Full text and
rfc822 format available.
Message #13 received at 53680 <at> debbugs.gnu.org (full text, mbox):
Lars Ingebrigtsen <larsi <at> gnus.org> writes:
> And I can reproduce this on master, too. Is it a problem with excessive
> backtracking, perhaps? I let it run for a minute, and it didn't finish
> in that time.
This simplified version also infloops:
(string-match
"implements\\([\r\t ]*[\\a-zA-Z0-9_]+,?\\)+{$"
"ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r public function __construct()\r ")
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 11:16:01 GMT)
Full text and
rfc822 format available.
Message #16 received at 53680 <at> debbugs.gnu.org (full text, mbox):
On Feb 01 2022, Lars Ingebrigtsen wrote:
> I guess the · are supposed to be spaces, so it's:
>
> (string-match
> "[\r\t ]*implements[\r\t ]+\\([\r\t ]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
> "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r public function __construct()\r ")
>
> And I can reproduce this on master, too. Is it a problem with excessive
> backtracking, perhaps?
It sure is. The nesting of the + operator without proper anchoring
makes the search space explode.
--
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 11:26:02 GMT)
Full text and
rfc822 format available.
Message #19 received at 53680 <at> debbugs.gnu.org (full text, mbox):
On Feb 01 2022, Lars Ingebrigtsen wrote:
> This simplified version also infloops:
>
> (string-match
> "implements\\([\r\t ]*[\\a-zA-Z0-9_]+,?\\)+{$"
This has the same search space explosion.
--
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 12:17:02 GMT)
Full text and
rfc822 format available.
Message #22 received at 53680 <at> debbugs.gnu.org (full text, mbox):
Andreas Schwab <schwab <at> linux-m68k.org> writes:
> It sure is. The nesting of the + operator without proper anchoring
> makes the search space explode.
Yup. So something like
(string-match
"[\r\t ]*implements\\([\r\t ]+[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
"ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r public function __construct()\r ")
won't have the same explosion (and indeed finished immediately).
So I don't think this is an Emacs bug, just a very tough regexp to
match. There's been some talk about replacing Emacs' regexp motor with
something that has better backtracking characteristics, but I don't
recall if anybody has actually made any moves to make that happen.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 12:57:02 GMT)
Full text and
rfc822 format available.
Message #25 received at 53680 <at> debbugs.gnu.org (full text, mbox):
> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:
(rx (* (in "\t\r "))
"implements"
(+ (in "\t\r "))
(+ (group
(* (in "\t\r "))
(+ (in "0-9A-Za-z" "\\_"))
(? ",")))
(* (in "\t\r "))
"{" eol)
The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:
(rx symbol-start "implements"
(+ (in "\t\r "))
(+ (* (in "\t\r "))
(+ (in "0-9A-Za-z" "\\_"))
(? ","))
(* (in "\t\r "))
"{" eol)
You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous -- all inter-word separators are optional. Let's make them mandatory:
(rx symbol-start "implements"
;; mandatory whitespace
(+ (in "\t\r "))
;; then a word
(+ (in "0-9A-Za-z" "\\_"))
;; then maybe more words, each prefixed by spaces or comma
(* (+ (in "\t\r ,")) ; fast and loose
(+ (in "0-9A-Za-z" "\\_")))
;; finally whitespace before the curly bracket
(* (in "\t\r "))
"{" eol)
which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.
Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like
(* (in "\t\r "))
","
(* (in "\t\r "))
instead.
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 13:13:01 GMT)
Full text and
rfc822 format available.
Message #28 received at 53680 <at> debbugs.gnu.org (full text, mbox):
Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings
> 1 feb. 2022 kl. 13:56 skrev Mattias Engdegård <mattiase <at> acm.org>:
>
>
>>
>> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
>
> The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:
>
> (rx (* (in "\t\r "))
> "implements"
> (+ (in "\t\r "))
> (+ (group
> (* (in "\t\r "))
> (+ (in "0-9A-Za-z" "\\_"))
> (? ",")))
> (* (in "\t\r "))
> "{" eol)
>
> The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:
>
> (rx symbol-start "implements"
> (+ (in "\t\r "))
> (+ (* (in "\t\r "))
> (+ (in "0-9A-Za-z" "\\_"))
> (? ","))
> (* (in "\t\r "))
> "{" eol)
>
> You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous -- all inter-word separators are optional. Let's make them mandatory:
>
> (rx symbol-start "implements"
> ;; mandatory whitespace
> (+ (in "\t\r "))
> ;; then a word
> (+ (in "0-9A-Za-z" "\\_"))
> ;; then maybe more words, each prefixed by spaces or comma
> (* (+ (in "\t\r ,")) ; fast and loose
> (+ (in "0-9A-Za-z" "\\_")))
> ;; finally whitespace before the curly bracket
> (* (in "\t\r "))
> "{" eol)
>
> which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.
>
> Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like
>
> (* (in "\t\r "))
> ","
> (* (in "\t\r "))
>
> instead.
>
Information forwarded
to
bug-gnu-emacs <at> gnu.org
:
bug#53680
; Package
emacs
.
(Tue, 01 Feb 2022 16:25:02 GMT)
Full text and
rfc822 format available.
Message #31 received at 53680 <at> debbugs.gnu.org (full text, mbox):
I see, thanks for the tip, I got it working now without endless-loop and
without the CR work-around
Hälsningar / Best Regards
Christian
On 01/02/2022 14:39, Mattias Engdegård wrote:
> 1 feb. 2022 kl. 14.12 skrev Christian Johansson <christian <at> cvj.se>:
>> Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings
> By the way, your translation of LF into CR in order to use "$" as end-of-string anchor is misguided: to match at the beginning or end of a string or buffer, use \` and \' (in rx, string-start and string-end or bos and eos) respectively.
>
Added tag(s) notabug.
Request was from
Lars Ingebrigtsen <larsi <at> gnus.org>
to
control <at> debbugs.gnu.org
.
(Wed, 02 Feb 2022 18:14:02 GMT)
Full text and
rfc822 format available.
bug closed, send any further explanations to
53680 <at> debbugs.gnu.org and Christian Johansson <christian <at> cvj.se>
Request was from
Lars Ingebrigtsen <larsi <at> gnus.org>
to
control <at> debbugs.gnu.org
.
(Wed, 02 Feb 2022 18:14:02 GMT)
Full text and
rfc822 format available.
bug archived.
Request was from
Debbugs Internal Request <help-debbugs <at> gnu.org>
to
internal_control <at> debbugs.gnu.org
.
(Thu, 03 Mar 2022 12:24:14 GMT)
Full text and
rfc822 format available.
This bug report was last modified 3 years and 105 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.