GNU bug report logs - #19873
Ill-formed regular expression is constructed in forward-paragraph.

Previous Next

Packages: cc-mode, emacs;

Reported by: Alan Mackenzie <acm <at> muc.de>

Date: Sun, 15 Feb 2015 10:39:01 UTC

Severity: normal

Merged with 19846

Found in version 25.0.50

To reply to this bug, email your comments to 19873 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Sun, 15 Feb 2015 10:39:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Alan Mackenzie <acm <at> muc.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 15 Feb 2015 10:39:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: bug-gnu-emacs <at> gnu.org
Subject: Ill-formed regular expression is constructed in forward-paragraph.
Date: Sun, 15 Feb 2015 10:31:22 +0000

Hello, Emacs!

In forward-paragraph, L37, a regular expression is constructed as
follows:

(let* ...
 (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
 ...)

.  Here parstart and parsep are, more or less,
paragraph-{start,separate}.

The problem is that parstart and parsep themselves are likely to begin
with "[ \t]*" (the default values certainly do), so we have two
consecutive matchers for an arbitrary amount of whitespace.  This causes
the regexp engine to run very slowly when a line starts with lots of WS
but doesn't match.

This problem seems to be the cause of bug # 19846 (where holding down the
spacebar inside a C comment causes Emacs to seize up when auto-fill mode
is enabled).

-- 
Alan Mackenzie (Nuremberg, Germany).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Sun, 26 Feb 2017 16:45:01 GMT) Full text and rfc822 format available.

Message #8 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Marcin Borkowski <mbork <at> amu.edu.pl>
To: Alan Mackenzie <acm <at> muc.de>
Cc: 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Sun, 26 Feb 2017 17:44:51 +0100

On 2015-02-15, at 10:31, Alan Mackenzie <acm <at> muc.de> wrote:

> Hello, Emacs!
>
> In forward-paragraph, L37, a regular expression is constructed as
> follows:
>
> (let* ...
>  (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
>  ...)
>
> .  Here parstart and parsep are, more or less,
> paragraph-{start,separate}.
>
> The problem is that parstart and parsep themselves are likely to begin
> with "[ \t]*" (the default values certainly do), so we have two
> consecutive matchers for an arbitrary amount of whitespace.  This causes
> the regexp engine to run very slowly when a line starts with lots of WS
> but doesn't match.
>
> This problem seems to be the cause of bug # 19846 (where holding down the
> spacebar inside a C comment causes Emacs to seize up when auto-fill mode
> is enabled).

Hi Alan, hi all,

I put this bug on my todo-list some time ago and decided now to revisit
it.

I'm wondering what could be done about it.  First of all, my Emacs has
this as paragraph-start:

"\\|[ 	]*$"

and this as paragraph-separate:

"[ 	]*$"

and frankly speaking, I'm not sure why they differ at all (by default).
Also, even though forward-paragraph checks for "^" at their beginning,
they actually don't begin with that character (again, by default).

My first thought is to add a check whether paragraph-start and
paragraph-sep match something like

"^\\^?\\[[[:space:]]+\\][+*]?"

and if yes, make parstart/parsep equal to them, but without the matching
part.

WDYT?

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Sun, 26 Feb 2017 16:58:02 GMT) Full text and rfc822 format available.

Message #11 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Marcin Borkowski <mbork <at> amu.edu.pl>
Cc: acm <at> muc.de, 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Sun, 26 Feb 2017 18:57:05 +0200

> From: Marcin Borkowski <mbork <at> amu.edu.pl>
> Date: Sun, 26 Feb 2017 17:44:51 +0100
> Cc: 19873 <at> debbugs.gnu.org
> 
> First of all, my Emacs has this as paragraph-start:
> 
> "\\|[ 	]*$"
> 
> and this as paragraph-separate:
> 
> "[ 	]*$"
> 
> and frankly speaking, I'm not sure why they differ at all (by default).

I believe this is explained in the Emacs manual, node "Paragraphs".
In a nutshell, these two regexps have different purposes.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Sun, 26 Feb 2017 18:48:02 GMT) Full text and rfc822 format available.

Message #14 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Marcin Borkowski <mbork <at> amu.edu.pl>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: acm <at> muc.de, 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Sun, 26 Feb 2017 19:48:10 +0100

On 2017-02-26, at 17:57, Eli Zaretskii <eliz <at> gnu.org> wrote:

>> From: Marcin Borkowski <mbork <at> amu.edu.pl>
>> Date: Sun, 26 Feb 2017 17:44:51 +0100
>> Cc: 19873 <at> debbugs.gnu.org
>> 
>> First of all, my Emacs has this as paragraph-start:
>> 
>> "\\|[ 	]*$"
>> 
>> and this as paragraph-separate:
>> 
>> "[ 	]*$"
>> 
>> and frankly speaking, I'm not sure why they differ at all (by default).
>
> I believe this is explained in the Emacs manual, node "Paragraphs".
> In a nutshell, these two regexps have different purposes.

My bad, you're right.  Still, this has little to do with the bug itself.

Best,

-- 
Marcin Borkowski
http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
Faculty of Mathematics and Computer Science
Adam Mickiewicz University

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Tue, 07 Mar 2017 16:49:01 GMT) Full text and rfc822 format available.

Message #17 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Marcin Borkowski <mbork <at> amu.edu.pl>
Cc: acm <at> muc.de, 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Tue, 07 Mar 2017 18:47:47 +0200

> From: Marcin Borkowski <mbork <at> amu.edu.pl>
> Date: Sun, 26 Feb 2017 17:44:51 +0100
> Cc: 19873 <at> debbugs.gnu.org
> 
> My first thought is to add a check whether paragraph-start and
> paragraph-sep match something like
> 
> "^\\^?\\[[[:space:]]+\\][+*]?"
> 
> and if yes, make parstart/parsep equal to them, but without the matching
> part.
> 
> WDYT?

Can you suggest a patch along these lines?

Alan, any comments?

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Thu, 09 Mar 2017 21:06:02 GMT) Full text and rfc822 format available.

Message #20 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: Marcin Borkowski <mbork <at> amu.edu.pl>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Thu, 9 Mar 2017 21:04:45 +0000

Hello, Marcin.

On Sun, Feb 26, 2017 at 17:44:51 +0100, Marcin Borkowski wrote:
> On 2015-02-15, at 10:31, Alan Mackenzie <acm <at> muc.de> wrote:

> > Hello, Emacs!

> > In forward-paragraph, L37, a regular expression is constructed as
> > follows:

> > (let* ...
> >  (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
> >  ...)

> > .  Here parstart and parsep are, more or less,
> > paragraph-{start,separate}.

> > The problem is that parstart and parsep themselves are likely to begin
> > with "[ \t]*" (the default values certainly do), so we have two
> > consecutive matchers for an arbitrary amount of whitespace.  This causes
> > the regexp engine to run very slowly when a line starts with lots of WS
> > but doesn't match.

> > This problem seems to be the cause of bug # 19846 (where holding down the
> > spacebar inside a C comment causes Emacs to seize up when auto-fill mode
> > is enabled).

> Hi Alan, hi all,

> I put this bug on my todo-list some time ago and decided now to revisit
> it.

> I'm wondering what could be done about it.  First of all, my Emacs has
> this as paragraph-start:

> "\\|[ 	]*$"

> and this as paragraph-separate:

> "[ 	]*$"

> and frankly speaking, I'm not sure why they differ at all (by default).
> Also, even though forward-paragraph checks for "^" at their beginning,
> they actually don't begin with that character (again, by default).

> My first thought is to add a check whether paragraph-start and
> paragraph-sep match something like

> "^\\^?\\[[[:space:]]+\\][+*]?"

> and if yes, make parstart/parsep equal to them, but without the matching
> part.

> WDYT?

My first reaction is "This is a good idea, but be very careful!".  For
example, if paragraph-start and/or paragraph-separate begin with
"[ \t]+" (i.e. the paragraph start requires space at BOL), you will miss
it by removing matches of "^\\^?\\[[[:space:]]+\\][+*]?" from them.

I think this idea is workable, but you'll have to check for one or both
of paragraph-s{tart,eparate} starting with "[ \t]+".  A good strategy
here might be to begin the target regexp with "^[ \t]*", then begin one
or both components with "[ \t]" (without the "*").

There may be other gotchas which I haven't thought about yet.

One needs a twisted mind to do this sort of thing properly, so I offer my
services to review your upcoming patch.  ;-)

> -- 
> Marcin Borkowski
> http://octd.wmi.amu.edu.pl/en/Marcin_Borkowski
> Faculty of Mathematics and Computer Science
> Adam Mickiewicz University

-- 
Alan Mackenzie (Nuremberg, Germany).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Thu, 02 Dec 2021 10:41:01 GMT) Full text and rfc822 format available.

Message #23 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Alan Mackenzie <acm <at> muc.de>
Cc: Marcin Borkowski <mbork <at> amu.edu.pl>, 19873 <at> debbugs.gnu.org,
 Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Thu, 02 Dec 2021 11:39:51 +0100

Alan Mackenzie <acm <at> muc.de> writes:

> I think this idea is workable, but you'll have to check for one or both
> of paragraph-s{tart,eparate} starting with "[ \t]+".  A good strategy
> here might be to begin the target regexp with "^[ \t]*", then begin one
> or both components with "[ \t]" (without the "*").
>
> There may be other gotchas which I haven't thought about yet.
>
> One needs a twisted mind to do this sort of thing properly, so I offer my
> services to review your upcoming patch.  ;-)

The problem seems rather intractable to me.  Is there really any way to
examine a regexp to determine "does this in practice match [ \t]*"?

I wonder whether instead of trying to construct a better overall regexp
could rewrite the loop.  That is, instead of searching for sp-parstart,
search for parstart "\\|" parsep, and then check whether
(match-beginning 0) of that comes after "^[ \t]*".  Or something along
those lines.

But I don't know whether that'd be any faster in practice.

Do you have a test case that demonstrates the slowness?  In that case I
could try to see whether there's any alternate approach here that's
faster.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Thu, 02 Dec 2021 10:45:02 GMT) Full text and rfc822 format available.

Message #26 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Alan Mackenzie <acm <at> muc.de>
Cc: Marcin Borkowski <mbork <at> amu.edu.pl>, 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Thu, 02 Dec 2021 11:44:37 +0100

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> Do you have a test case that demonstrates the slowness?  In that case I
> could try to see whether there's any alternate approach here that's
> faster.

Ah, there was a reproducer in bug#19846 (and it still reproduces).

I'll poke around here, then.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Thu, 02 Dec 2021 11:19:02 GMT) Full text and rfc822 format available.

Message #29 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Alan Mackenzie <acm <at> muc.de>
Cc: Marcin Borkowski <mbork <at> amu.edu.pl>, 19873 <at> debbugs.gnu.org
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Thu, 02 Dec 2021 12:17:54 +0100

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> Ah, there was a reproducer in bug#19846 (and it still reproduces).
>
> I'll poke around here, then.

The following is about 100x faster in the test case, and doesn't seem to
lead to any obvious regressions.

But it's kinda ugly.

Any comments?

An alternate approach would be to just go line by line, and see whether
the regexp matches at the start of the line (either before or after
skipping past any leading whitespace).  But I'm not sure that's any
prettier.

diff --git a/lisp/textmodes/paragraphs.el b/lisp/textmodes/paragraphs.el
index 59b15e82a8..e0d633d49b 100644
--- a/lisp/textmodes/paragraphs.el
+++ b/lisp/textmodes/paragraphs.el
@@ -238,7 +238,7 @@ forward-paragraph
 		      fill-prefix-regexp "[ \t]*$")
 	    parsep))
 	 ;; This is used for searching.
-	 (sp-parstart (concat "^[ \t]*\\(?:" parstart "\\|" parsep "\\)"))
+	 (sp-parstart (concat "\\(?:" parstart "\\|" parsep "\\)"))
 	 start found-start)
     (while (and (< arg 0) (not (bobp)))
       (if (and (not (looking-at parsep))
@@ -278,7 +278,7 @@ forward-paragraph
 		    ;;      multiple-lines
 		    ;;      (forward-line 1))
 		    (not (bobp)))
-		(while (and (re-search-backward sp-parstart nil 1)
+		(while (and (paragraph--line-search nil sp-parstart)
 			    (setq found-start t)
 			    ;; Found a candidate, but need to check if it is a
 			    ;; REAL parstart.
@@ -328,7 +328,7 @@ forward-paragraph
 		      (not (looking-at parsep))
 		      (looking-at fill-prefix-regexp))
 	    (forward-line 1))
-	(while (and (re-search-forward sp-parstart nil 1)
+	(while (and (paragraph--line-search t sp-parstart)
 		    (progn (setq start (match-beginning 0))
 			   (goto-char start)
 			   (not (eobp)))
@@ -344,6 +344,34 @@ forward-paragraph
     ;; Return the number of steps that could not be done.
     arg))
 
+;; We do it this way because the regexp commonly starts with optional
+;; whitespace.
+(defun paragraph--line-search (forward regexp)
+  "Look for REGEXP starting on a line.
+If FORWARD, search forward.  If not, go backward."
+  (catch 'found
+    (while (and (or (and forward
+                         (not (eobp)))
+                    (and (not forward)
+                         (not (bobp))))
+                (funcall (if forward
+                             #'re-search-forward
+                           #'re-search-backward)
+                         regexp nil 1))
+      (save-excursion
+        (goto-char (match-beginning 0))
+        (beginning-of-line)
+        (save-restriction
+          (narrow-to-region
+           (point) (match-beginning 0))
+          (when (looking-at-p "[ \t]*$")
+            (throw 'found t))))
+      (if forward
+          (unless (eobp)
+            (forward-char 1))
+        (unless (bobp)
+          (backward-char 1))))))
+
 (defun backward-paragraph (&optional arg)
   "Move backward to start of paragraph.
 With argument ARG, do it ARG times;


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Thu, 02 Dec 2021 20:46:01 GMT) Full text and rfc822 format available.

Message #32 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Alan Mackenzie <acm <at> muc.de>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Marcin Borkowski <mbork <at> amu.edu.pl>, 19873 <at> debbugs.gnu.org,
 Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Thu, 2 Dec 2021 20:45:17 +0000

Hello, Lars.

On Thu, Dec 02, 2021 at 11:39:51 +0100, Lars Ingebrigtsen wrote:
> Alan Mackenzie <acm <at> muc.de> writes:

> > I think this idea is workable, but you'll have to check for one or both
> > of paragraph-s{tart,eparate} starting with "[ \t]+".  A good strategy
> > here might be to begin the target regexp with "^[ \t]*", then begin one
> > or both components with "[ \t]" (without the "*").

> > There may be other gotchas which I haven't thought about yet.

> > One needs a twisted mind to do this sort of thing properly, so I offer my
> > services to review your upcoming patch.  ;-)

> The problem seems rather intractable to me.  Is there really any way to
> examine a regexp to determine "does this in practice match [ \t]*"?

Back when the bug was new, I started writing a library to analyse a
regular expression and convert it into an equivalent well formed regular
expression.  It's actually working, but is incomplete.  It's currently
2757 lines long, including pretty complete unit testing.  I actually
looked at it again at the start of November.

> I wonder whether instead of trying to construct a better overall regexp
> could rewrite the loop.  That is, instead of searching for sp-parstart,
> search for parstart "\\|" parsep, and then check whether
> (match-beginning 0) of that comes after "^[ \t]*".  Or something along
> those lines.

> But I don't know whether that'd be any faster in practice.

It strikes me as one of these things which needs to be done
systematically, which, as I said, I've already tried (and not yet given
up).  The question presents itself, would the effort be better spent
improving Emacs's regexp engine?

> Do you have a test case that demonstrates the slowness?  In that case I
> could try to see whether there's any alternate approach here that's
> faster.

Martin Rudalics had the original testcase.  The slowness was exponential
with the number of spaces typed, I think.

> -- 
> (domestic pets only, the antidote for overdose, milk.)
>    bloggy blog: http://lars.ingebrigtsen.no

-- 
Alan Mackenzie (Nuremberg, Germany).

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19873; Package emacs. (Fri, 03 Dec 2021 16:16:02 GMT) Full text and rfc822 format available.

Message #35 received at 19873 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Alan Mackenzie <acm <at> muc.de>
Cc: Marcin Borkowski <mbork <at> amu.edu.pl>, 19873 <at> debbugs.gnu.org,
 Eli Zaretskii <eliz <at> gnu.org>
Subject: Re: bug#19873: Ill-formed regular expression is constructed in
 forward-paragraph.
Date: Fri, 03 Dec 2021 17:15:18 +0100

Alan Mackenzie <acm <at> muc.de> writes:

> Back when the bug was new, I started writing a library to analyse a
> regular expression and convert it into an equivalent well formed regular
> expression.  It's actually working, but is incomplete.  It's currently
> 2757 lines long, including pretty complete unit testing.  I actually
> looked at it again at the start of November.

Interesting.

> It strikes me as one of these things which needs to be done
> systematically, which, as I said, I've already tried (and not yet given
> up).  The question presents itself, would the effort be better spent
> improving Emacs's regexp engine?

Indeed -- making the Emacs regexp engine transform these complex regexps
into simpler, equivalent forms would be great.  We wouldn't have to use
it on all regexps -- the problem usually rears its head when we're
combining several user-defined regexps into a large one, so if we had a
`simplify-regexp' function that we could stick in here, that'd solve the
issue here (and in similar circumstances elsewhere).

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

bug reassigned from package 'emacs' to 'emacs,cc-mode'. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 01 May 2022 14:58:02 GMT) Full text and rfc822 format available.

Forcibly Merged 19846 19873. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 01 May 2022 14:58:02 GMT) Full text and rfc822 format available.

Removed tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 02 May 2022 08:43:01 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 144 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #19873 Ill-formed regular expression is constructed in forward-paragraph.

GNU bug report logs - #19873
Ill-formed regular expression is constructed in forward-paragraph.