GNU bug report logs - #44861
27.1; [PATCH] signal in `replace-regexp-in-string'

Previous Next

Package: emacs;

Reported by: Shigeru Fukaya <shigeru.fukaya <at> gmail.com>

Date: Wed, 25 Nov 2020 04:03:02 UTC

Severity: normal

Tags: confirmed, patch

Merged with 15107

Found in versions 24.3, 25.1, 27.1

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 44861 in the body.
You can then email your comments to 44861 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Wed, 25 Nov 2020 04:03:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Shigeru Fukaya <shigeru.fukaya <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 25 Nov 2020 04:03:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Wed, 25 Nov 2020 13:02:11 +0900
[Message part 1 (text/plain, inline)]
`replace-regexp-in-string' sometimes signals an error when REGEXP
contains some bondary sequence.  Difference of searches between
against an original passed string and against an extracted substring
causes the incident.


(replace-regexp-in-string-simple "a\\B" "A" "a aaaa")
	error --> cons: Args out of range: 2, 3
    expected
	==> "a AAAa"

(replace-regexp-in-string "\\Ba" "A" "a aaaa")
	error --> cons: Args out of range: 3, 4
    expected
	==> "a aAAA"

--
Shigeru
[subr.diff (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Wed, 25 Nov 2020 10:59:02 GMT) Full text and rfc822 format available.

Message #8 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Cc: 44861 <at> debbugs.gnu.org
Subject: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string' 
Date: Wed, 25 Nov 2020 11:58:08 +0100
Thank you, and I agree, we probably want to do something like your suggested patch.
We would need to write a test suite, of course, but it looks like the general approach would solve bug#15107 as well which looks like the same bug.

Some benchmarking would also be in order.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Wed, 25 Nov 2020 14:59:01 GMT) Full text and rfc822 format available.

Message #11 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Cc: 44861 <at> debbugs.gnu.org
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string' 
Date: Wed, 25 Nov 2020 15:58:22 +0100
[Message part 1 (text/plain, inline)]
forcemerge 15107 44861
stop

Suggested patch attached. A small test suite for replace-regexp-in-string has already been pushed to master -- very rudimentary, but better than nothing -- and the patch amends it with some new relevant cases that didn't work before.

It is basically your patch but slightly optimised; it turned out that the function call and allocation overhead of the original patch made it a tad too expensive (a pity, because it was very neat). Now performance is about the same as before when the pattern contains no submatches, and slightly above (< 10% slower) with one submatch. It seems worth the correctness.

[0001-Fix-replace-regexp-in-string-substring-match-data-tr.patch (application/octet-stream, attachment)]

Forcibly Merged 15107 44861. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Wed, 25 Nov 2020 14:59:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Wed, 25 Nov 2020 21:40:01 GMT) Full text and rfc822 format available.

Message #16 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Mattias Engdegård <mattiase <at> acm.org>, 
 Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Cc: 44861 <at> debbugs.gnu.org
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Wed, 25 Nov 2020 16:39:10 -0500
Mattias Engdegård <mattiase <at> acm.org> writes:

> It is basically your patch but slightly optimised; it turned out that
> the function call and allocation overhead of the original patch made
> it a tad too expensive (a pity, because it was very neat). Now
> performance is about the same as before when the pattern contains no
> submatches, and slightly above (< 10% slower) with one submatch. It
> seems worth the correctness.

Presumably this hasn't worked correctly for a long time, if ever.  Is
that correct?

I personally worry about the performance here.  Since we use regexps
heavily all over, it is not clear (to me) that 10 % overall performance
drop with subexpressions is worth it to work correctly in these rare
edge-cases.  I suppose we do have to fix the bug here, but is it
feasible to solve this in a way that has less performance impact?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 12:59:01 GMT) Full text and rfc822 format available.

Message #19 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 44861 <at> debbugs.gnu.org, Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string' 
Date: Thu, 26 Nov 2020 13:57:59 +0100
[Message part 1 (text/plain, inline)]
25 nov. 2020 kl. 22.39 skrev Stefan Kangas <stefankangas <at> gmail.com>:

> I personally worry about the performance here.  Since we use regexps
> heavily all over, it is not clear (to me) that 10 % overall performance
> drop with subexpressions is worth it to work correctly in these rare
> edge-cases.  I suppose we do have to fix the bug here, but is it
> feasible to solve this in a way that has less performance impact?

We can't really let it remain buggy, especially as the consequence can be an error or silently wrong results. Also remember that one man's edge case is another's reasonable use.

However, unlike Boris we can eat our cake and have it! The attached patch performs the match-data translation in a C function, which obviously is much faster and indeed speeds up replace-regexp-in-string in all cases (as long as there is any match at all). The new primitive is a bit ad-hoc, but does one well-defined thing and isn't intended for use by the general public anyway.

[0001-Fix-replace-regexp-in-string-substring-match-data-tr.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 13:13:02 GMT) Full text and rfc822 format available.

Message #22 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 44861 <at> debbugs.gnu.org, Stefan Kangas <stefankangas <at> gmail.com>,
 Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 14:12:35 +0100
Mattias Engdegård <mattiase <at> acm.org> writes:

> However, unlike Boris we can eat our cake and have it! The attached
> patch performs the match-data translation in a C function, which
> obviously is much faster and indeed speeds up replace-regexp-in-string
> in all cases (as long as there is any match at all).

I'm all for speeding up replace-regexp-in-string (which is used all over
the place), so your change looks reasonable to me.

But I wonder -- would it make sense to move the entire
replace-regexp-in-string function to C?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Thu, 26 Nov 2020 13:40:02 GMT) Full text and rfc822 format available.

Notification sent to Shigeru Fukaya <shigeru.fukaya <at> gmail.com>:
bug acknowledged by developer. (Thu, 26 Nov 2020 13:40:02 GMT) Full text and rfc822 format available.

Message #27 received at 44861-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Dmitry Gutov <dgutov <at> yandex.ru>, 44861-done <at> debbugs.gnu.org,
 Stefan Kangas <stefankangas <at> gmail.com>,
 Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 14:39:01 +0100
26 nov. 2020 kl. 14.12 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> I'm all for speeding up replace-regexp-in-string (which is used all over
> the place), so your change looks reasonable to me.

Thank you! Pushed to master.

> But I wonder -- would it make sense to move the entire
> replace-regexp-in-string function to C?

Probably, but that would be a pure performance improvement. Most of the time is currently consumed in primitives (string-match, replace-match, substring, concat) so don't expect huge savings unless a substantially different approach is taken.

(Dmitry Gutov asked for a C implementation in bug#20273 for improving the speed of json encoding; is that still relevant?)

A bigger saving yet would be to use the much faster string-replace wherever possible. A little sweeping refactoring project perhaps? It would also improve readability -- no regexp quoting, fewer mysterious arguments like LITERAL and FIXEDCASE to worry about, etc.





Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Thu, 26 Nov 2020 13:40:02 GMT) Full text and rfc822 format available.

Notification sent to Kevin Ryde <user42 <at> zip.com.au>:
bug acknowledged by developer. (Thu, 26 Nov 2020 13:40:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 13:44:02 GMT) Full text and rfc822 format available.

Message #35 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Lars Ingebrigtsen <larsi <at> gnus.org>,
 Mattias Engdegård <mattiase <at> acm.org>
Cc: 44861 <at> debbugs.gnu.org, Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 05:43:29 -0800
Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> But I wonder -- would it make sense to move the entire
> replace-regexp-in-string function to C?

Before we undertake any major changes in that direction, perhaps we
should benchmark the relevant functions on the native-comp branch?  It
changes the benchmarks by quite a lot in some cases.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 14:04:01 GMT) Full text and rfc822 format available.

Message #38 received at 44861-done <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Dmitry Gutov <dgutov <at> yandex.ru>, 44861-done <at> debbugs.gnu.org,
 Stefan Kangas <stefankangas <at> gmail.com>,
 Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 15:03:30 +0100
Mattias Engdegård <mattiase <at> acm.org> writes:

> Probably, but that would be a pure performance improvement. Most of
> the time is currently consumed in primitives (string-match,
> replace-match, substring, concat) so don't expect huge savings unless
> a substantially different approach is taken.

Yeah, perhaps there's isn't a lot to be gained there, unless a lot of
the re-checking of all the arguments (etc.) (which is unnecessary once
we've ascertained that everything is, indeed, a string) can be done by
refactoring some of the underlying primitives.

> (Dmitry Gutov asked for a C implementation in bug#20273 for improving
> the speed of json encoding; is that still relevant?)

No, probably not, since it's now done by Jansson?  So I'm closing that
one.

> A bigger saving yet would be to use the much faster string-replace
> wherever possible. A little sweeping refactoring project perhaps? It
> would also improve readability -- no regexp quoting, fewer mysterious
> arguments like LITERAL and FIXEDCASE to worry about, etc.

I started looking at that, and there's a huge pile of calls like

(replace-regexp-in-string ":" ";" string)

that can be rewritten to use string-replace.  But!  Every single case
requires careful analysis, exactly because replace-regexp-in-string sets
the match data.  Perhaps five lines later, there's a reference to
(match-string 0 string)?  Perhaps the reference is in the function that
called this function?

So most changes are fraught with possible unforeseen breakages, the code
is super-duper straightforward like

(setq string (replace-regexp-in-string ":" ";" string))
(setq string (replace-regexp-in-string "a" "b" string))

Then you know that you can replace the first one without any danger.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 14:05:02 GMT) Full text and rfc822 format available.

Message #41 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: Mattias Engdegård <mattiase <at> acm.org>,
 44861 <at> debbugs.gnu.org, Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 15:03:54 +0100
Stefan Kangas <stefankangas <at> gmail.com> writes:

> Before we undertake any major changes in that direction, perhaps we
> should benchmark the relevant functions on the native-comp branch?  It
> changes the benchmarks by quite a lot in some cases.

Yes, that's true.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 14:42:01 GMT) Full text and rfc822 format available.

Message #44 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: mattiase <at> acm.org, larsi <at> gnus.org, 44861 <at> debbugs.gnu.org,
 shigeru.fukaya <at> gmail.com
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 16:41:32 +0200
> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Thu, 26 Nov 2020 05:43:29 -0800
> Cc: 44861 <at> debbugs.gnu.org, Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
> 
> Lars Ingebrigtsen <larsi <at> gnus.org> writes:
> 
> > But I wonder -- would it make sense to move the entire
> > replace-regexp-in-string function to C?
> 
> Before we undertake any major changes in that direction, perhaps we
> should benchmark the relevant functions on the native-comp branch?  It
> changes the benchmarks by quite a lot in some cases.

Benchmarking is always welcome, but I don't think we should dismiss
performance improvements by assuming everyone will use the
natively-compiled Lisp code VSN.  I'm quite sure *.elc files will be
used in the observable future by many people.  I also expect C code to
run faster than natively-compiled Lisp, when significant
implementation changes, such as those suggested by Mattias, are
involved.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Thu, 26 Nov 2020 14:56:01 GMT) Full text and rfc822 format available.

Message #47 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Dmitry Gutov <dgutov <at> yandex.ru>, 44861 <at> debbugs.gnu.org,
 Stefan Kangas <stefankangas <at> gmail.com>,
 Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Thu, 26 Nov 2020 15:54:56 +0100
26 nov. 2020 kl. 15.03 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> I started looking at that, and there's a huge pile of calls like
> 
> (replace-regexp-in-string ":" ";" string)
> 
> that can be rewritten to use string-replace.  But!  Every single case
> requires careful analysis, exactly because replace-regexp-in-string sets
> the match data.

No it doesn't; the entire body is wrapped in save-match-data.
It does set the match data locally for use in the replacement function, if any, but then string-replace cannot be used anyway.

There are other things that may need investigating for a switch to string-replace: whether case-folding is relevant, and whether a nil-or-absent argument to FIXEDCASE is intended, an oversight, or irrelevant.

In my experience, most code does not take case-folding into account at all or tacitly assume it does not apply. (Having a global variable controlling it is a terrible interface, by the way.) Similarly, most calls that omit FIXEDCASE do it without any thought that the replacement would be anything but literal. Thus, the risk isn't very big for either but these are still issues requiring some consideration.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#44861; Package emacs. (Sun, 29 Nov 2020 13:30:02 GMT) Full text and rfc822 format available.

Message #50 received at 44861 <at> debbugs.gnu.org (full text, mbox):

From: "Basil L. Contovounesios" <contovob <at> tcd.ie>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Mattias Engdegård <mattiase <at> acm.org>,
 Dmitry Gutov <dgutov <at> yandex.ru>, 44861 <at> debbugs.gnu.org,
 Stefan Kangas <stefankangas <at> gmail.com>,
 Shigeru Fukaya <shigeru.fukaya <at> gmail.com>
Subject: Re: bug#44861: 27.1; [PATCH] signal in `replace-regexp-in-string'
Date: Sun, 29 Nov 2020 13:28:55 +0000
Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> Mattias Engdegård <mattiase <at> acm.org> writes:
>
>> (Dmitry Gutov asked for a C implementation in bug#20273 for improving
>> the speed of json encoding; is that still relevant?)
>
> No, probably not, since it's now done by Jansson?  So I'm closing that
> one.

Also, bug#20273 was prompted by bug#20154, which was addressed by
avoiding replace-regexp-in-string, and further optimised in bug#40693.

-- 
Basil




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 28 Dec 2020 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 168 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.