GNU bug report logs - #53680
Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50

Previous Next

Package: emacs;

Reported by: Christian Johansson <christian <at> cvj.se>

Date: Tue, 1 Feb 2022 07:38:01 UTC

Severity: normal

Tags: confirmed, notabug

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Christian Johansson <christian <at> cvj.se>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Andreas Schwab <schwab <at> linux-m68k.org>, 53680 <at> debbugs.gnu.org
Subject: bug#53680: Endless loop in peculiar case of string-match and string-match-p 27.02 and 28.0.50
Date: Tue, 1 Feb 2022 14:12:19 +0100
Alright, thanks for helping me find the correct regexp, this issue only appeared on some peculiar strings

> 1 feb. 2022 kl. 13:56 skrev Mattias Engdegård <mattiase <at> acm.org>:
> 
> 
>> 
>> (string-match·"[\r\t·]*implements[\r\t·]+\\([\r\t·]*[\\a-zA-Z_0-9_]+,?\\)+[\r\t·]*{$"·"ariable·implements·\\Magento\\Framework\\Event\\OberserverInterface\r{\r····public·function·__construct()\r·") 
> 
> The diagnostics by Lars and Andreas is correct. Let's look at it more closely, first translating the regexp to rx for ease of reasoning, and see if we can make it work:
> 
> (rx (* (in "\t\r "))
>    "implements"
>    (+ (in "\t\r "))
>    (+ (group
>        (* (in "\t\r "))
>        (+ (in "0-9A-Za-z" "\\_"))
>        (? ",")))
>    (* (in "\t\r "))
>    "{" eol)
> 
> The first line is meaningless since it can match the empty string, but you probably want to anchor the start of "implements" so that it doesn't match "house_implements". Let's also drop the capture group, and we get:
> 
> (rx symbol-start "implements"
>    (+ (in "\t\r "))
>    (+ (* (in "\t\r "))
>       (+ (in "0-9A-Za-z" "\\_"))
>       (? ","))
>    (* (in "\t\r "))
>    "{" eol)
> 
> You clearly want to match a non-empty sequence of 'words' separated with whitespace and/or commas, but the pattern is ambiguous --  all inter-word separators are optional. Let's make them mandatory:
> 
> (rx symbol-start "implements"
>     ;; mandatory whitespace
>     (+ (in "\t\r "))
>     ;; then a word
>     (+ (in "0-9A-Za-z" "\\_"))
>     ;; then maybe more words, each prefixed by spaces or comma
>     (* (+ (in "\t\r ,"))   ; fast and loose
>        (+ (in "0-9A-Za-z" "\\_")))
>     ;; finally whitespace before the curly bracket
>     (* (in "\t\r "))
>     "{" eol)
> 
> which is reasonably efficient, since all ambiguity is now gone: the regexp can (almost) only match in one way.
> 
> Note the "fast and loose" pattern where we accept any number of spaces or commas. Here it depends on your grammar but if you want exactly one comma separating each word, that subexpression would be something like
> 
>  (* (in "\t\r "))
>  ","
>  (* (in "\t\r "))
> 
> instead.
> 




This bug report was last modified 3 years and 105 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.