GNU bug report logs - #53680
Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50

Previous Next

Package: emacs;

Reported by: Christian Johansson <christian <at> cvj.se>

Date: Tue, 1 Feb 2022 07:38:01 UTC

Severity: normal

Tags: confirmed, notabug

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 53680 in the body.
You can then email your comments to 53680 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 07:38:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Christian Johansson <christian <at> cvj.se>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 01 Feb 2022 07:38:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Christian Johansson <christian <at> cvj.se>
To: bug-gnu-emacs <at> gnu.org
Subject: Endless loop in peculiar case of string-match and string-match-p
 27.02 and 28.0.50
Date: Tue, 1 Feb 2022 08:37:37 +0100
Hello

Some context first, since I needed a multi-line regexp matcher that 
could tell the difference of end-of-string vs newline ($ vs \n) in a 
string I replace occurrences of \n with \r in a string.

In a peculiar case a endless loop happens and I can reproduce this bug 
in Emacs 27.2 and 28.0.50 (have not tested any other versions), ok try 
to eval following snippet:

(string-match-p·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 


or

(string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")


-- 
Hälsningar / Best Regards
Christian





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 11:01:01 GMT) Full text and rfc822 format available.

Message #8 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Christian Johansson <christian <at> cvj.se>
Cc: 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 01 Feb 2022 12:00:14 +0100
Christian Johansson <christian <at> cvj.se> writes:

> In a peculiar case a endless loop happens and I can reproduce this bug
> in Emacs 27.2 and 28.0.50 (have not tested any other versions), ok try
> to eval following snippet:
>
> (string-match-p·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·")
> or

I guess the · are supposed to be spaces, so it's:

(string-match
 "[\r\t ]*implements[\r\t ]+\\([\r\t ]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
 "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")

And I can reproduce this on master, too.  Is it a problem with excessive
backtracking, perhaps?  I let it run for a minute, and it didn't finish
in that time.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Added tag(s) confirmed. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 01 Feb 2022 11:01:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 11:09:02 GMT) Full text and rfc822 format available.

Message #13 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Christian Johansson <christian <at> cvj.se>
Cc: 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 01 Feb 2022 12:08:40 +0100
Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> And I can reproduce this on master, too.  Is it a problem with excessive
> backtracking, perhaps?  I let it run for a minute, and it didn't finish
> in that time.

This simplified version also infloops:

(string-match
 "implements\\([\r\t ]*[\\a-zA-Z0-9_]+,?\\)+{$"
 "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 11:16:01 GMT) Full text and rfc822 format available.

Message #16 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Christian Johansson <christian <at> cvj.se>, 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 01 Feb 2022 12:15:43 +0100
On Feb 01 2022, Lars Ingebrigtsen wrote:

> I guess the · are supposed to be spaces, so it's:
>
> (string-match
>  "[\r\t ]*implements[\r\t ]+\\([\r\t ]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
>  "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")
>
> And I can reproduce this on master, too.  Is it a problem with excessive
> backtracking, perhaps?

It sure is.  The nesting of the + operator without proper anchoring
makes the search space explode.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 11:26:02 GMT) Full text and rfc822 format available.

Message #19 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Christian Johansson <christian <at> cvj.se>, 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 01 Feb 2022 12:25:43 +0100
On Feb 01 2022, Lars Ingebrigtsen wrote:

> This simplified version also infloops:
>
> (string-match
>  "implements\\([\r\t ]*[\\a-zA-Z0-9_]+,?\\)+{$"

This has the same search space explosion.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 12:17:02 GMT) Full text and rfc822 format available.

Message #22 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: Christian Johansson <christian <at> cvj.se>, 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 01 Feb 2022 13:16:30 +0100
Andreas Schwab <schwab <at> linux-m68k.org> writes:

> It sure is.  The nesting of the + operator without proper anchoring
> makes the search space explode.

Yup.  So something like 

(string-match
 "[\r\t ]*implements\\([\r\t ]+[\\a-zA-Z_0-9_]+,?\\)+[\r\t ]*{$"
 "ariable implements \\Magento\\Framework\\Event\\OberserverInterface\r{\r    public function __construct()\r ")

won't have the same explosion (and indeed finished immediately). 

So I don't think this is an Emacs bug, just a very tough regexp to
match.  There's been some talk about replacing Emacs' regexp motor with
something that has better backtracking characteristics, but I don't
recall if anybody has actually made any moves to make that happen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 12:57:02 GMT) Full text and rfc822 format available.

Message #25 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Christian Johansson <christian <at> cvj.se>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Andreas Schwab <schwab <at> linux-m68k.org>,
 53680 <at> debbugs.gnu.org
Subject: bug#53680: Endless loop in peculiar case of string-match and 
 string-match-p 27.02 and 28.0.50
Date: Tue, 1 Feb 2022 13:56:10 +0100
> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 

The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:

(rx (* (in "\t\r "))
    "implements"
    (+ (in "\t\r "))
    (+ (group
        (* (in "\t\r "))
        (+ (in "0-9A-Za-z" "\\_"))
        (? ",")))
    (* (in "\t\r "))
    "{" eol)

The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:

(rx symbol-start "implements"
    (+ (in "\t\r "))
    (+ (* (in "\t\r "))
       (+ (in "0-9A-Za-z" "\\_"))
       (? ","))
    (* (in "\t\r "))
    "{" eol)

You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous --  all inter-word separators are optional. Let's make them mandatory:

 (rx symbol-start "implements"
     ;; mandatory whitespace
     (+ (in "\t\r "))
     ;; then a word
     (+ (in "0-9A-Za-z" "\\_"))
     ;; then maybe more words, each prefixed by spaces or comma
     (* (+ (in "\t\r ,"))   ; fast and loose
        (+ (in "0-9A-Za-z" "\\_")))
     ;; finally whitespace before the curly bracket
     (* (in "\t\r "))
     "{" eol)

which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.

Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like

  (* (in "\t\r "))
  ","
  (* (in "\t\r "))

instead.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 13:13:01 GMT) Full text and rfc822 format available.

Message #28 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Christian Johansson <christian <at> cvj.se>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Andreas Schwab <schwab <at> linux-m68k.org>,
 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 1 Feb 2022 14:12:19 +0100
Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings

> 1 feb. 2022 kl. 13:56 skrev Mattias Engdegård <mattiase <at> acm.org>:
> 
> 
>> 
>> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 
> 
> The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:
> 
> (rx (* (in "\t\r "))
>    "implements"
>    (+ (in "\t\r "))
>    (+ (group
>        (* (in "\t\r "))
>        (+ (in "0-9A-Za-z" "\\_"))
>        (? ",")))
>    (* (in "\t\r "))
>    "{" eol)
> 
> The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:
> 
> (rx symbol-start "implements"
>    (+ (in "\t\r "))
>    (+ (* (in "\t\r "))
>       (+ (in "0-9A-Za-z" "\\_"))
>       (? ","))
>    (* (in "\t\r "))
>    "{" eol)
> 
> You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous --  all inter-word separators are optional. Let's make them mandatory:
> 
> (rx symbol-start "implements"
>     ;; mandatory whitespace
>     (+ (in "\t\r "))
>     ;; then a word
>     (+ (in "0-9A-Za-z" "\\_"))
>     ;; then maybe more words, each prefixed by spaces or comma
>     (* (+ (in "\t\r ,"))   ; fast and loose
>        (+ (in "0-9A-Za-z" "\\_")))
>     ;; finally whitespace before the curly bracket
>     (* (in "\t\r "))
>     "{" eol)
> 
> which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.
> 
> Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like
> 
>  (* (in "\t\r "))
>  ","
>  (* (in "\t\r "))
> 
> instead.
> 




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#53680; Package emacs. (Tue, 01 Feb 2022 16:25:02 GMT) Full text and rfc822 format available.

Message #31 received at 53680 <at> debbugs.gnu.org (full text, mbox):

From: Christian Johansson <christian <at> cvj.se>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Andreas Schwab <schwab <at> linux-m68k.org>,
 53680 <at> debbugs.gnu.org
Subject: Re: bug#53680: Endless loop in peculiar case of string-match and
 string-match-p 27.02 and 28.0.50
Date: Tue, 1 Feb 2022 17:23:59 +0100
I see, thanks for the tip, I got it working now without endless-loop and 
without the CR work-around

Hälsningar / Best Regards
Christian

On 01/02/2022 14:39, Mattias Engdegård wrote:
> 1 feb. 2022 kl. 14.12 skrev Christian Johansson <christian <at> cvj.se>:
>> Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings
> By the way, your translation of LF into CR in order to use "$" as end-of-string anchor is misguided: to match at the beginning or end of a string or buffer, use \` and \' (in rx, string-start and string-end or bos and eos) respectively.
>




Added tag(s) notabug. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Wed, 02 Feb 2022 18:14:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 53680 <at> debbugs.gnu.org and Christian Johansson <christian <at> cvj.se> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Wed, 02 Feb 2022 18:14:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 03 Mar 2022 12:24:14 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 105 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.