GNU bug report logs - #15107
24.3; replace-regexp-in-string wrong on \`
Reported by: Kevin Ryde <user42 <at> zip.com.au>
Date: Thu, 15 Aug 2013 22:17:02 UTC
Severity: normal
Tags: confirmed, patch
Merged with 44861
Found in versions 24.3, 25.1, 27.1
Done: Mattias Engdegård <mattiase <at> acm.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 15107 in the body.
You can then email your comments to 15107 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Thu, 15 Aug 2013 22:17:03 GMT)
Acknowledgement sent to Kevin Ryde <user42 <at> zip.com.au>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 15 Aug 2013 22:17:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
replace-regexp-in-string behaves incorrectly if a regexp has \` among
its matches.
(replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)
=>
"Z--ZXZX--"
where I expected
"Z--ZZ--"
This seems to be due to the optimization in replace-regexp-in-string
which re-matches on the matched substring. \` can match the substring
where it did not match in the middle of the full string. In the example
above "X" is the match in the full string, but on taking that "X" as a
substring it can match "\\`".
Probably similar mismatches on the substring occur for things like \' ^
$ \b \< etc. Maybe the comment in the code about munging the match data
would be a better way.
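A minimal sketch of the mismatch described above, using only `string-match' and the match data. The helper name `my-match-span' is hypothetical and exists only for illustration. It compares the match found in the full string with the match found when the extracted substring is matched again on its own:

```elisp
(defun my-match-span (regexp string &optional start)
  "Return (BEG END) of the first match of REGEXP in STRING from START, or nil.
Hypothetical helper for illustration only."
  (when (string-match regexp string start)
    (list (match-beginning 0) (match-end 0))))

;; In the full string, searching from index 1, \` cannot match
;; (we are past the start of the string), so the match is the "X" at 2..3.
(my-match-span "\\`\\|X" "--XX--" 1)   ; => (2 3)

;; Matching the extracted substring "X" on its own: now \` does match,
;; as an empty match at index 0, and it wins over the "X" alternative.
(my-match-span "\\`\\|X" "X")          ; => (0 0)
```

The two spans differ, so a replacement computed from the second match does not replace the text found by the first.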
In GNU Emacs 24.3.1 (i486-pc-linux-gnu, X toolkit, Xaw3d scroll bars)
of 2013-05-29 on blah.blah, modified by Debian
System Description: Debian GNU/Linux testing/unstable
Configured using:
`configure '--build' 'i486-linux-gnu' '--build' 'i486-linux-gnu'
'--prefix=/usr' '--sharedstatedir=/var/lib' '--libexecdir=/usr/lib'
'--localstatedir=/var/lib' '--infodir=/usr/share/info'
'--mandir=/usr/share/man' '--with-pop=yes'
'--enable-locallisppath=/etc/emacs24:/etc/emacs:/usr/local/share/emacs/24.3/site-lisp:/usr/local/share/emacs/site-lisp:/usr/share/emacs/24.3/site-lisp:/usr/share/emacs/site-lisp'
'--with-crt-dir=/usr/lib/i386-linux-gnu' '--with-x=yes'
'--with-x-toolkit=lucid' '--with-toolkit-scroll-bars' '--without-gconf'
'build_alias=i486-linux-gnu' 'CFLAGS=-g -O2 -fstack-protector
--param=ssp-buffer-size=4 -Wformat -Werror=format-security -Wall'
'LDFLAGS=-Wl,-z,relro -Wl,-znocombreloc'
'CPPFLAGS=-D_FORTIFY_SOURCE=2''
Important settings:
value of $LANG: en_AU
locale-coding-system: iso-latin-1-unix
default enable-multibyte-characters: t
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Sun, 06 Mar 2016 06:36:01 GMT)
Message #8 received at 15107 <at> debbugs.gnu.org (full text, mbox):
Kevin Ryde <user42 <at> zip.com.au> writes:
> replace-regexp-in-string behaves incorrectly if a regexp has \` among
> its matches.
>
> (replace-regexp-in-string "\\`\\|X" "Z" "--XX--" t t)
> =>
> "Z--ZXZX--"
>
> where I expected
>
> "Z--ZZ--"
>
> This seems to be due to the optimization in replace-regexp-in-string
> which re-matches on the matched substring. \' can match the substring
> where it did not match in the middle of the full string. In the example
> above "X" is the match in the full string, but on taking that "X" as a
> substring it can match "\\`".
I built the emacs-25 git branch and reproduced the above bug today.
GNU Emacs 25.0.92.1 (x86_64-apple-darwin13.4.0, NS appkit-1265.21 Version
10.9.5 (Build 13F1507))
of 2016-03-05
Added tag(s) confirmed. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Sat, 06 Aug 2016 00:31:01 GMT)
Bug marked as found in versions 25.1. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Sat, 06 Aug 2016 00:31:01 GMT)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 00:09:01 GMT)
Message #15 received at 15107 <at> debbugs.gnu.org (full text, mbox):
I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for Kevin's
example as well as:
(replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> "Foo bar" (should be "Foo Bar")
Some close variants which behave correctly:
(replace-regexp-in-string "^F\\| ." #'upcase "foo bar")
> "Foo Bar"
(replace-regexp-in-string "^.K\\| ." #'upcase "ok corral")
> "OK Corral"
(replace-regexp-in-string "^..\\| ." #'upcase "ok corral")
> "OK Corral"
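A rough sketch of why these variants happen to work, assuming the cause is the substring re-match described in the original report. The helper `my-rematch-consistent-p' is hypothetical; it checks whether matching the extracted substring again covers the same text as the match found in the full string:

```elisp
(defun my-rematch-consistent-p (regexp string start)
  "Return non-nil if re-matching the matched substring gives the same span.
Finds REGEXP in STRING from START, extracts the matched text, matches
REGEXP against that text alone, and compares the two spans.
Hypothetical helper for illustration only."
  (when (string-match regexp string start)
    (let* ((mb (match-beginning 0))
           (me (match-end 0))
           (sub (substring string mb me)))
      (and (string-match regexp sub)
           (= (match-beginning 0) 0)
           (= (match-end 0) (- me mb))))))

;; The failing case: " b" is found in the full string, but "^." matches
;; only the blank when " b" is matched on its own.
(my-rematch-consistent-p "^.\\| ." "foo bar" 1)     ; => nil

;; The variants: either the first alternative cannot match " b" / " c",
;; or it happens to cover exactly the same two characters.
(my-rematch-consistent-p "^F\\| ." "foo bar" 1)     ; => t
(my-rematch-consistent-p "^.K\\| ." "ok corral" 2)  ; => t
(my-rematch-consistent-p "^..\\| ." "ok corral" 2)  ; => t
```

If this reading is right, the variants are correct only by coincidence, not because the regexps are handled differently.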
This was discussed here:
http://emacs.stackexchange.com/questions/26590/replace-regexp-in-string-stops-replacement-with
Here is a regression test for when someone has a chance to tackle this:
---
test/lisp/subr-tests.el | 3 +++
1 file changed, 3 insertions(+)
diff --git a/test/lisp/subr-tests.el b/test/lisp/subr-tests.el
index ce21290..969d5c2 100644
--- a/test/lisp/subr-tests.el
+++ b/test/lisp/subr-tests.el
@@ -224,5 +224,8 @@
 (error-message-string (should-error (version-to-list "beta22_8alpha3")))
 "Invalid version syntax: `beta22_8alpha3' (must start with a number)"))))
+(ert-deftest replace-regexp-in-string-test ()
+  (should (equal (replace-regexp-in-string "^.\\| ." #'upcase "foo bar") "Foo Bar")))
+
(provide 'subr-tests)
;;; subr-tests.el ends here
--
Regards,
Erik Anderson.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 14:26:02 GMT)
Message #18 received at 15107 <at> debbugs.gnu.org (full text, mbox):
> From: Erik Anderson <erikbpanderson <at> gmail.com>
> Date: Tue, 30 Aug 2016 23:57:35 +0000
>
> I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for Kevin's example as well as:
>
> (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > "Foo bar" (should be "Foo Bar")
Maybe I'm missing something, but I don't see why this is a bug. The
input string "foo bar" matches the "^." alternative in its entirety,
so there's no reason to expect Emacs to apply 'upcase' twice.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 14:37:01 GMT)
Message #21 received at 15107 <at> debbugs.gnu.org (full text, mbox):
Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP
with REP in STRING."
My email was a comment to an existing open bug from 2013-08-15:
http://debbugs.gnu.org/cgi/bugreport.cgi?bug=15107
On Wed, Aug 31, 2016 at 9:25 AM Eli Zaretskii <eliz <at> gnu.org> wrote:
> > From: Erik Anderson <erikbpanderson <at> gmail.com>
> > Date: Tue, 30 Aug 2016 23:57:35 +0000
> >
> > I can confirm the buggy behavior on emacs 24.5.1 and 25.1.50.1 for
> Kevin's example as well as:
> >
> > (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > > "Foo bar" (should be "Foo Bar")
>
> Maybe I'm missing something, but I don't see why this is a bug. The
> input string "foo bar" matches the "^." alternative in its entirety,
> so there's no reason to expect Emacs to apply 'upcase' twice.
>
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 15:02:02 GMT)
Message #24 received at 15107 <at> debbugs.gnu.org (full text, mbox):
> From: Erik Anderson <erikbpanderson <at> gmail.com>
> Date: Wed, 31 Aug 2016 14:36:06 +0000
> Cc: 15107 <at> debbugs.gnu.org
>
> Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP with REP in STRING."
Yes, and there is a single match in this case, so a single
replacement. The _entire_ input string matches the regexp, so after
that match there's nothing else left to match.
What am I missing?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 15:14:01 GMT)
Message #27 received at 15107 <at> debbugs.gnu.org (full text, mbox):
On Wed, Aug 31, 2016 at 11:01 AM, Eli Zaretskii <eliz <at> gnu.org> wrote:
>> From: Erik Anderson <erikbpanderson <at> gmail.com>
>> Date: Wed, 31 Aug 2016 14:36:06 +0000
>> Cc: 15107 <at> debbugs.gnu.org
>>
>> Per the replace-regexp-in-string docstring: "Replace all matches for REGEXP with REP in STRING."
>
> Yes, and there is a single match in this case, so a single
> replacement. The _entire_ input string matches the regexp, so after
> that match there's nothing else left to match.
>
> What am I missing?
"^." matches only the first character of "foo bar", but maybe you have
a different idea of "matches" than I do. I would consider "^..*" to
match the whole string.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 15:34:01 GMT)
Message #30 received at 15107 <at> debbugs.gnu.org (full text, mbox):
I suspect that since ".*" is such a commonly used term in regexps, Eli
might be misreading the regexp.
From the Emacs manual on regular expression special characters:
"‘.’ (Period)
is a special character that matches any single character except a newline.
Using concatenation, we can make regular expressions like ‘a.b’, which
matches any three-character string that begins with ‘a’ and ends with ‘b’."
You can verify the behavior of "."
(string-match "^." "No greedy modifiers here")
(match-data)
> (0 1)
(string-match "^.*" "This has a greedy modifier")
(match-data)
> (0 26)
This is a helpful document:
https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Special.html#Regexp-Special
Further discussion should be moved off this list.
-Erik.
On Wed, Aug 31, 2016 at 10:13 AM Noam Postavsky <
npostavs <at> users.sourceforge.net> wrote:
> On Wed, Aug 31, 2016 at 11:01 AM, Eli Zaretskii <eliz <at> gnu.org> wrote:
> >> From: Erik Anderson <erikbpanderson <at> gmail.com>
> >> Date: Wed, 31 Aug 2016 14:36:06 +0000
> >> Cc: 15107 <at> debbugs.gnu.org
> >>
> >> Per the replace-regexp-in-string docstring: "Replace all matches for
> REGEXP with REP in STRING."
> >
> > Yes, and there is a single match in this case, so a single
> > replacement. The _entire_ input string matches the regexp, so after
> > that match there's nothing else left to match.
> >
> > What am I missing?
>
> "^." matches only the first character of "foo bar", but maybe you have
> a different idea of "matches" than I do. I would consider "^..*" to
> match the whole string.
>
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Wed, 31 Aug 2016 16:06:02 GMT)
Message #33 received at 15107 <at> debbugs.gnu.org (full text, mbox):
> From: Noam Postavsky <npostavs <at> users.sourceforge.net>
> Date: Wed, 31 Aug 2016 11:13:05 -0400
> Cc: Erik Anderson <erikbpanderson <at> gmail.com>, 15107 <at> debbugs.gnu.org
>
> "^." matches only the first character of "foo bar", but maybe you have
> a different idea of "matches" than I do. I would consider "^..*" to
> match the whole string.
OMG, I was sure the * was there!
Sorry.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#15107; Package emacs. (Thu, 01 Sep 2016 15:47:02 GMT)
Message #36 received at 15107 <at> debbugs.gnu.org (full text, mbox):
> From: Erik Anderson <erikbpanderson <at> gmail.com>
> Date: Tue, 30 Aug 2016 23:57:35 +0000
>
> (replace-regexp-in-string "^.\\| ." #'upcase "foo bar")
> > "Foo bar" (should be "Foo Bar")
It looks like an algorithmic design flaw. Here's the relevant part of
replace-regexp-in-string:
      (while (and (< start l) (string-match regexp string start))
        (setq mb (match-beginning 0)
              me (match-end 0))
        ;; If we matched the empty string, make sure we advance by one char
        (when (= me mb) (setq me (min l (1+ mb))))
        ;; Generate a replacement for the matched substring.
        ;; Operate only on the substring to minimize string consing.
        ;; Set up match data for the substring for replacement;
        ;; presumably this is likely to be faster than munging the
        ;; match data directly in Lisp.
        (string-match regexp (setq str (substring string mb me)))
        (setq matches
              (cons (replace-match (if (stringp rep)
                                       rep
                                     (funcall rep (match-string 0 str)))
                                   fixedcase literal str subexp)
As you see, it first matches the (rest of the) string against REGEXP,
then takes the substring that matched, and matches that substring
again. But the evident assumption that the match in the substring
will yield the same result is false. In this case, the substring of
"oo bar" that matches "^.\\| ." is " b", but matching it again against
the same regexp yields just " ", because the first alternative
matches. So 'upcase' is applied to the blank, and the rest is
history.
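One way to avoid the second match altogether is sketched below. This is only an illustration under simplifying assumptions, not the code in subr.el and not necessarily how the bug was eventually fixed: the function name `my-replace-all' is hypothetical, REP must be a literal string or a function, and \N back-references, SUBEXP and case conversion are not handled. The matched text is taken directly from the full string, so anchors such as ^ and \` are evaluated only once, in their real context:

```elisp
(defun my-replace-all (regexp rep string)
  "Replace every match of REGEXP in STRING with REP.
REP is a literal string or a function called with the matched text.
Hypothetical sketch: no \\N back-references, SUBEXP, or case conversion."
  (let ((start 0)
        (len (length string))
        (pieces '()))
    (while (and (< start len) (string-match regexp string start))
      (let* ((mb (match-beginning 0))
             (me (match-end 0))
             ;; Take the matched text from the full string; never re-match it.
             (matched (substring string mb me))
             (new (if (functionp rep) (funcall rep matched) rep)))
        (push (substring string start mb) pieces) ; unmatched prefix
        (push new pieces)
        (when (= me mb)
          ;; Empty match: copy one character through so the loop advances.
          (setq me (min len (1+ mb)))
          (push (substring string mb me) pieces))
        (setq start me)))
    (push (substring string start) pieces) ; unmatched tail
    (apply #'concat (nreverse pieces))))

(my-replace-all "\\`\\|X" "Z" "--XX--")        ; => "Z--ZZ--"
(my-replace-all "^.\\| ." #'upcase "foo bar")  ; => "Foo Bar"
```

Both results are the ones the reporters expected. The cost is that the convenience of `replace-match' (back-references and case handling) is given up, which is presumably why the original code re-matched the substring.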
Forcibly Merged 15107 44861. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Wed, 25 Nov 2020 14:59:02 GMT)
Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 28 Dec 2020 12:24:05 GMT)
This bug report was last modified 4 years and 170 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.