GNU bug report logs - #15107
24.3; replace-regexp-in-string wrong on \`

Package: emacs;

Reported by: Kevin Ryde <user42 <at> zip.com.au>

Date: Thu, 15 Aug 2013 22:17:02 UTC

Severity: normal

Tags: confirmed, patch

Merged with 44861

Found in versions 24.3, 25.1, 27.1

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 15107 in the body.
You can then email your comments to 15107 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Thu, 15 Aug 2013 22:17:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Kevin Ryde <user42 <at> zip.com.au>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 15 Aug 2013 22:17:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Kevin Ryde <user42 <at> zip.com.au>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.3; replace-regexp-in-string wrong on \`
Date: Fri, 16 Aug 2013 08:15:53 +1000
replace-regexp-in-string behaves incorrectly if a regexp has \` among
its matches.

    (replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)
    =>
    "Z--ZXZX--"

where I expected

    "Z--ZZ--"

This seems to be due to the optimization in replace-regexp-in-string
which re-matches on the matched substring.  \` can match the substring
where it did not match in the middle of the full string.  In the example
above "X" is the match in the full string, but once that "X" is taken as
a substring, "\\`" matches at its start.

Probably similar mismatches on the substring occur for things like \' ^
$ \b \< etc.  Maybe munging the match data directly, as the comment in
the code mentions, would be a better way.
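
A minimal sketch of the mismatch described above.  The helper
my/match-bounds is hypothetical, introduced only for illustration, and
is not part of Emacs.  It reports where the first match lies, so the
result of searching the full string can be compared with the result of
re-matching on the extracted substring.

```elisp
;; Hypothetical helper (not part of Emacs): return (BEGINNING END) of the
;; first match for REGEXP in STRING at or after START, or nil if none.
(defun my/match-bounds (regexp string &optional start)
  (save-match-data
    (when (string-match regexp string start)
      (list (match-beginning 0) (match-end 0)))))

;; Searching the full string from position 1 finds the first "X".
(my/match-bounds "\\`\\|X" "--XX--" 1)   ; => (2 3)

;; Re-matching on that one-character substring hits \` first,
;; giving an empty match at 0 instead of the "X".
(my/match-bounds "\\`\\|X" "X")          ; => (0 0)
```

The second call is what the substring re-match effectively does, which
would explain why "Z" gets inserted before each "X" rather than
replacing it.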





In GNU Emacs 24.3.1 (i486-pc-linux-gnu, X toolkit, Xaw3d scroll bars)
 of 2013-05-29 on blah.blah, modified by Debian
System Description:	Debian GNU/Linux testing/unstable

Configured using:
 `configure '--build' 'i486-linux-gnu' '--build' 'i486-linux-gnu'
 '--prefix=/usr' '--sharedstatedir=/var/lib' '--libexecdir=/usr/lib'
 '--localstatedir=/var/lib' '--infodir=/usr/share/info'
 '--mandir=/usr/share/man' '--with-pop=yes'
 '--enable-locallisppath=/etc/emacs24:/etc/emacs:/usr/local/share/emacs/24.3/site-lisp:/usr/local/share/emacs/site-lisp:/usr/share/emacs/24.3/site-lisp:/usr/share/emacs/site-lisp'
 '--with-crt-dir=/usr/lib/i386-linux-gnu' '--with-x=yes'
 '--with-x-toolkit=lucid' '--with-toolkit-scroll-bars' '--without-gconf'
 'build_alias=i486-linux-gnu' 'CFLAGS=-g -O2 -fstack-protector
 --param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wall'
 'LDFLAGS=-Wl,-z,relro -Wl,-znocombreloc'
 'CPPFLAGS=-D_FORTIFY_SOURCE=2''

Important settings:
  value of $LANG: en_AU
  locale-coding-system: iso-latin-1-unix
  default enable-multibyte-characters: t




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Sun, 06 Mar 2016 06:36:01 GMT) Full text and rfc822 format available.

Message #8 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Michael Wright <wrightmikea <at> gmail.com>
To: 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: reproduced on emacs-25 branch
Date: Sat, 5 Mar 2016 20:18:50 -0800
Kevin Ryde <user42 <at> zip.com.au> writes:

> replace-regexp-in-string behaves incorrectly if a regexp has \` among
> its matches.
>
>     (replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)
>     =>
>     "Z--ZXZX--"
>
> where I expected
>
>     "Z--ZZ--"
>
> This seems to be due to the optimization in replace-regexp-in-string
> which re-matches on the matched substring.  \` can match the substring
> where it did not match in the middle of the full string.  In the example
> above "X" is the match in the full string, but on taking that "X" as a
> substring it can match "\\`".

I built the emacs-25 git branch and reproduced the above bug today.

GNU Emacs 25.0.92.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21 Version
10.9.5 (Build 13F1507))
 of 2016-03-05

Added tag(s) confirmed. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Sat, 06 Aug 2016 00:31:01 GMT) Full text and rfc822 format available.

bug Marked as found in versions 25.1. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Sat, 06 Aug 2016 00:31:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 00:09:01 GMT) Full text and rfc822 format available.

Message #15 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Erik Anderson <erikbpanderson <at> gmail.com>
To: 15107 <at> debbugs.gnu.org
Subject: [PATCH] Add replace-regexp-in-string regression test
Date: Tue, 30 Aug 2016 23:57:35 +0000
I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for Kevin's
example as well as:

(replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> "Foo bar"  (should be "Foo Bar")

Some close variants which behave correctly:

(replace-regexp-in-string "^F\\| ." #'upcase "foo bar")
> "Foo Bar"
(replace-regexp-in-string "^.K\\| ." #'upcase "ok corral")
> "OK Corral"
(replace-regexp-in-string "^..\\| ." #'upcase "ok corral")
> "OK Corral"

This was discussed here:
http://emacs.stackexchange.com/questions/26590/replace-regexp-in-string-stops-replacement-with

Here is a regression test for when someone has a chance to tackle this:

---
 test/lisp/subr-tests.el | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index ce21290..969d5c2 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -224,5 +224,8 @@
               (error-message-string (should-error (version-to-list "beta22_8alpha3")))
               "Invalid version syntax: `beta22_8alpha3' (must start with a number)"))))

+(ert-deftest replace-regexp-in-string-test ()
+  (should (equal (replace-regexp-in-string "^.\\| ." #'upcase "foo bar") "Foo Bar")))
+
 (provide 'subr-tests)
 ;;; subr-tests.el ends here
-- 

Regards,
Erik Anderson.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 14:26:02 GMT) Full text and rfc822 format available.

Message #18 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Erik Anderson <erikbpanderson <at> gmail.com>
Cc: 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Wed, 31 Aug 2016 17:24:39 +0300
> From: Erik Anderson <erikbpanderson <at> gmail.com>
> Date: Tue, 30 Aug 2016 23:57:35 +0000
> 
> I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for Kevin's example as well as:
> 
> (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > "Foo bar"  (should be "Foo Bar")

Maybe I'm missing something, but I don't see why this is a bug.  The
input string "foo bar" matches the "^." alternative in its entirety,
so there's no reason to expect Emacs to apply 'upcase' twice.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 14:37:01 GMT) Full text and rfc822 format available.

Message #21 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Erik Anderson <erikbpanderson <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Wed, 31 Aug 2016 14:36:06 +0000
Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP
with REP in STRING."

My email was a comment to an existing open bug from 2013-08-15:
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=15107

On Wed, Aug 31, 2016 at 9:25 AM Eli Zaretskii <eliz <at> gnu.org> wrote:

> > From: Erik Anderson <erikbpanderson <at> gmail.com>
> > Date: Tue, 30 Aug 2016 23:57:35 +0000
> >
> > I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for
> > Kevin's example as well as:
> >
> > (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > > "Foo bar"  (should be "Foo Bar")
>
> Maybe I'm missing something, but I don't see why this is a bug.  The
> input string "foo bar" matches the "^." alternative in its entirety,
> so there's no reason to expect Emacs to apply 'upcase' twice.
>

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 15:02:02 GMT) Full text and rfc822 format available.

Message #24 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Erik Anderson <erikbpanderson <at> gmail.com>
Cc: 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Wed, 31 Aug 2016 18:01:12 +0300
> From: Erik Anderson <erikbpanderson <at> gmail.com>
> Date: Wed, 31 Aug 2016 14:36:06 +0000
> Cc: 15107 <at> debbugs.gnu.org
> 
> Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP with REP in STRING."

Yes, and there is a single match in this case, so a single
replacement.  The _entire_ input string matches the regexp, so after
that match there's nothing else left to match.

What am I missing?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 15:14:01 GMT) Full text and rfc822 format available.

Message #27 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Noam Postavsky <npostavs <at> users.sourceforge.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Erik Anderson <erikbpanderson <at> gmail.com>, 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Wed, 31 Aug 2016 11:13:05 -0400
On Wed, Aug 31, 2016 at 11:01 AM, Eli Zaretskii <eliz <at> gnu.org> wrote:
>> From: Erik Anderson <erikbpanderson <at> gmail.com>
>> Date: Wed, 31 Aug 2016 14:36:06 +0000
>> Cc: 15107 <at> debbugs.gnu.org
>>
>> Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP with REP in STRING."
>
> Yes, and there is a single match in this case, so a single
> replacement.  The _entire_ input string matches the regexp, so after
> that match there's nothing else left to match.
>
> What am I missing?

"^." matches only the first character of "foo bar", but maybe you have
a different idea of "matches" than I do. I would consider "^..*" to
match the whole string.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 15:34:01 GMT) Full text and rfc822 format available.

Message #30 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Erik Anderson <erikbpanderson <at> gmail.com>
To: Noam Postavsky <npostavs <at> users.sourceforge.net>,
 Eli Zaretskii <eliz <at> gnu.org>
Cc: 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Wed, 31 Aug 2016 15:32:58 +0000
I suspect that since ".*" is such a commonly used term in regexps, Eli
might be misreading the regexp.

From the Emacs manual on regular expression special characters:
"‘.’ (Period)

is a special character that matches any single character except a newline.
Using concatenation, we can make regular expressions like ‘a.b’, which
matches any three-character string that begins with ‘a’ and ends with ‘b’."

You can verify the behavior of ".":

(string-match "^." "No greedy modifiers here")
(match-data)
> (0 1)

(string-match "^.*" "This has a greedy modifier")
(match-data)
> (0 26)

This is a helpful document:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Special.html#Regexp-Special

Further discussion should be moved off this list.

-Erik.

On Wed, Aug 31, 2016 at 10:13 AM Noam Postavsky <
npostavs <at> users.sourceforge.net> wrote:

> On Wed, Aug 31, 2016 at 11:01 AM, Eli Zaretskii <eliz <at> gnu.org> wrote:
> >> From: Erik Anderson <erikbpanderson <at> gmail.com>
> >> Date: Wed, 31 Aug 2016 14:36:06 +0000
> >> Cc: 15107 <at> debbugs.gnu.org
> >>
> >> Per the replace-regexp-in-string docstring: "Replace all matches for
> >> REGEXP with REP in STRING."
> >
> > Yes, and there is a single match in this case, so a single
> > replacement.  The _entire_ input string matches the regexp, so after
> > that match there's nothing else left to match.
> >
> > What am I missing?
>
> "^." matches only the first character of "foo bar", but maybe you have
> a different idea of "matches" than I do. I would consider "^..*" to
> match the whole string.
>

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Wed, 31 Aug 2016 16:06:02 GMT) Full text and rfc822 format available.

Message #33 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Noam Postavsky <npostavs <at> users.sourceforge.net>
Cc: erikbpanderson <at> gmail.com, 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Wed, 31 Aug 2016 19:04:49 +0300
> From: Noam Postavsky <npostavs <at> users.sourceforge.net>
> Date: Wed, 31 Aug 2016 11:13:05 -0400
> Cc: Erik Anderson <erikbpanderson <at> gmail.com>, 15107 <at> debbugs.gnu.org
> 
> "^." matches only the first character of "foo bar", but maybe you have
> a different idea of "matches" than I do. I would consider "^..*" to
> match the whole string.

OMG, I was sure the * was there!

Sorry.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#15107; Package emacs. (Thu, 01 Sep 2016 15:47:02 GMT) Full text and rfc822 format available.

Message #36 received at 15107 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Erik Anderson <erikbpanderson <at> gmail.com>
Cc: 15107 <at> debbugs.gnu.org
Subject: Re: bug#15107: [PATCH] Add replace-regexp-in-string regression test
Date: Thu, 01 Sep 2016 18:46:35 +0300
> From: Erik Anderson <erikbpanderson <at> gmail.com>
> Date: Tue, 30 Aug 2016 23:57:35 +0000
> 
> (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > "Foo bar"  (should be "Foo Bar")

It looks like an algorithmic design flaw.  Here's the relevant part of
replace-regexp-in-string:

      (while (and (< start l) (string-match regexp string start))
	(setq mb (match-beginning 0)
	      me (match-end 0))
	;; If we matched the empty string, make sure we advance by one char
	(when (= me mb) (setq me (min l (1+ mb))))
	;; Generate a replacement for the matched substring.
	;; Operate only on the substring to minimize string consing.
	;; Set up match data for the substring for replacement;
	;; presumably this is likely to be faster than munging the
	;; match data directly in Lisp.
	(string-match regexp (setq str (substring string mb me)))
	(setq matches
	      (cons (replace-match (if (stringp rep)
				       rep
				     (funcall rep (match-string 0 str)))
				   fixedcase literal str subexp)

As you see, it first matches the (rest of the) string against REGEXP,
then takes the substring that matched, and matches that substring
again.  But the evident assumption that the match in the substring
will yield the same result is false.  In this case, the substring of
"oo bar" that matches "^.\\| ." is " b", but matching it again against
the same regexp yields just " ", because the first alternative
matches.  So 'upcase' is applied to the blank, and the rest is
history.
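
One way to avoid the second match, sketched under assumptions: the
function my/replace-regexp-in-string below is hypothetical and is not
the code in subr.el.  Instead of re-matching on the substring, it
shifts the match data from the full-string match so that the positions
are relative to the substring, then calls replace-match on that
substring.  It omits the SUBEXP and START arguments of the real
function for brevity.

```elisp
;; Hypothetical sketch, not the code in subr.el: replace all matches for
;; REGEXP in STRING with REP (a string or a function), running the
;; regexp only once per match.
(defun my/replace-regexp-in-string (regexp rep string &optional fixedcase literal)
  (let ((len (length string))
        (start 0)
        (pieces nil))
    (save-match-data
      (while (and (< start len) (string-match regexp string start))
        (let* ((mb (match-beginning 0))
               (me (match-end 0))
               ;; Shift every recorded position so that it is relative to
               ;; the substring starting at MB; unmatched groups stay nil.
               (shifted (mapcar (lambda (pos) (and pos (- pos mb)))
                                (match-data)))
               str)
          ;; If we matched the empty string, still advance by one char.
          (when (= me mb) (setq me (min len (1+ mb))))
          (setq str (substring string mb me))
          ;; Keep the unmatched text before this match.
          (push (substring string start mb) pieces)
          ;; Install the shifted match data instead of matching again.
          (set-match-data shifted)
          (push (replace-match (if (stringp rep)
                                   rep
                                 ;; Protect the match data from REP.
                                 (save-match-data
                                   (funcall rep (match-string 0 str))))
                               fixedcase literal str)
                pieces)
          (setq start me)))
      ;; Append the tail after the last match and join the pieces.
      (push (substring string start) pieces)
      (apply #'concat (nreverse pieces)))))

(my/replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)     ; => "Z--ZZ--"
(my/replace-regexp-in-string "^.\\| ." #'upcase "foo bar")   ; => "Foo Bar"
```

Because the regexp is only ever run against the full string, anchors
such as \` and ^ are evaluated in their original context.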




Forcibly Merged 15107 44861. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Wed, 25 Nov 2020 14:59:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 28 Dec 2020 12:24:05 GMT) Full text and rfc822 format available.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.