GNU bug report logs - #43598
replace-in-string: finishing touches

Package: emacs

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Thu, 24 Sep 2020 20:53:02 UTC

Severity: normal

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 43598 in the body.
You can then email your comments to 43598 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#43598; Package emacs. (Thu, 24 Sep 2020 20:53:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 24 Sep 2020 20:53:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>
Subject: replace-in-string: finishing touches
Date: Thu, 24 Sep 2020 22:52:06 +0200

The new replace-in-string function is welcome but needs a few tweaks before we can call it done:

1. It doesn't quite work correctly with raw bytes:

  (replace-in-string "\377" "x" "a\377b")
  => "axb"
  (replace-in-string "\377" "x" "a\377ø")
  => "a\377ø"

The easiest solution is to reimplement it in terms of replace-regexp-in-string for now, and optimise it later (although I feel a bit bad undoing Lars's pretty handiwork...)

We have messy semantics here, because string-equal does not equate "\377" and (string-to-multibyte "\377"), but string-match-p does...

2. It is documented always to return a new string, but that's a tad over-generous nowadays; very few string functions do that. If we drop that guarantee, we get some optimisation opportunities:

- it can return the input string itself if no matches were found (a fairly common case)
- it can be marked pure, not just side-effect-free, so that the byte compiler can constant-propagate through calls to it

3. The name is somewhat unfortunate since a function by that name in XEmacs uses regexp matching.
In fact, the new function probably broke prolog-mode because of that (see prolog-replace-in-string).
While we can fix prolog-mode, we can't easily fix code outside the Emacs tree that may have similar problems.

Perhaps we should rename it to string-replace, in line with the modern naming convention discussed some time ago.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#43598; Package emacs. (Thu, 24 Sep 2020 21:13:02 GMT) Full text and rfc822 format available.

Message #8 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Thu, 24 Sep 2020 23:12:29 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> The new replace-in-string function is welcome but needs a few tweaks
> before we can call it done:
>
> 1. It doesn't quite work correctly with raw bytes:
>
>   (replace-in-string "\377" "x" "a\377b")
>   => "axb"
>   (replace-in-string "\377" "x" "a\377ø")
>   => "a\377ø"
>
> The easiest solution is to reimplement it in terms of
> replace-regexp-in-string for now, and optimise it later (although I
> feel a bit bad undoing Lars's pretty handiwork...)

The point of the function is to have something very lightweight, so if
it's reimplemented on top of replace-regexp-in-string, there's not much
point to the function.

> We have messy semantics here, because string-equal does not equate
> "\377" and (string-to-multibyte "\377"), but string-match-p does...

Yes, I don't even know what the semantics should be.

(string-replace "\377" "x" "a\377ø")
=> "axø"

would make sense, but what about

(string-replace "\270" "x" "a\377ø")
=> ?

(\270 is the last byte in the ø.)

Doing anything here wouldn't make much sense at all, which means...  we
could just throw up our hands and say "don't do that, then", which is
approx. what string-equal does.

> 2. It is documented always to return a new string, but that's a tad
> over-generous nowadays; very few string functions do that. If we drop
> that guarantee, we get some optimisation opportunities:
>
> - it can return the input string itself if no matches were found (a
> fairly common case)
> - it can be marked pure, not just side-effect-free, so that the byte
> compiler can constant-propagate through calls to it

Yup, good idea.

> 3. The name is somewhat unfortunate since a function by that name in
> XEmacs uses regexp matching.
> In fact, the new function probably broke prolog-mode because of that
> (see prolog-replace-in-string).
> While we can fix prolog-mode, we can't easily fix code outside the
> Emacs tree that may have similar problems.
>
> Perhaps we should rename it to string-replace, in line with the modern
> naming convention discussed some time ago.

string-replace seems like a good name.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Message #11 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Thu, 24 Sep 2020 23:19:26 +0200

That is, we could just say "the results are undefined if the strings
contain raw bytes".  Well, rather, if both strings are raw bytes, or
none of them are, then it's well-defined, but not otherwise.

Message #14 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 01:18:13 +0200

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> That is, we could just say "the results are undefined if the strings
> contain raw bytes".  Well, rather, if both strings are raw bytes, or
> none of them are, then it's well-defined, but not otherwise.

Or...  OK, I've never actually looked at the strings this closely, I've
just used the various accessors which hide all the complexity.

So: "a\377ø" is a multibyte string with five bytes (the "raw byte" is in
the private plane).

"a\377a" is a unibyte string with three bytes.

So searching for "\377" (one-byte unibyte string) and (make-string 1
255) (two-byte multibyte string) should be well-defined in either
combination here?

"\377" is in both "a\377ø" and "a\377a".

(make-string 1 255) is in neither "a\377ø", nor "a\377a".

And:

(eq (elt (make-string 1 255) 0) (elt "\377" 0))
=> t

But, like, whatevs.

Message #17 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 01:54:46 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> The new replace-in-string function is welcome but needs a few tweaks
> before we can call it done:
>
> 1. It doesn't quite work correctly with raw bytes:
>
>   (replace-in-string "\377" "x" "a\377b")
>   => "axb"
>   (replace-in-string "\377" "x" "a\377ø")
>   => "a\377ø"

I went ahead and checked in a new C-level function string-search, which
should be an efficient way to search for strings in strings (using
memmem, which Emacs has via Gnulib?), and this fixed these corner cases.

Message #20 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: mattiase <at> acm.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 12:21:45 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Date: Fri, 25 Sep 2020 01:18:13 +0200
> Cc: 43598 <at> debbugs.gnu.org
> 
> So: "a\377ø" is a multibyte string with five bytes (the "raw byte" is in
> the private plane).
> 
> "a\377a" is a unibyte string with three bytes.
> 
> So searching for "\377" (one-byte unibyte string) and (make-string 1
> 255) (two-byte multibyte string) should be well-defined in either
> combination here?
> 
> "\377" is in both "a\377ø" and "a\377a".
> 
> (make-string 1 255) is in neither "a\377ø", nor "a\377a".
> 
> And:
> 
> (eq (elt (make-string 1 255) 0) (elt "\377" 0))
> => t

Would it help to always convert the first argument of
replace-in-string to a multibyte string, before replacing?

Message #23 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: mattiase <at> acm.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 12:09:16 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> Would it help to always convert the first argument of
> replace-in-string to a multibyte string, before replacing?

Yes, but not when the third argument is a unibyte string.

I've now done the conversion in the new string-search C-level function,
converting the search string both ways, depending on what the HAYSTACK
string is.  I'm not 100% sure that I'm doing the right thing here,
though, but it seems to pass all the test cases I could come up with.  I
wrote it very late last night, though, so...  :-/

Message #26 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 12:42:06 +0200

25 sep. 2020 kl. 01.54 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> I went ahead and checked in a new C-level function string-search, which
> should be an efficient way to search for strings in strings (using
> memmem, which Emacs has via Gnulib?), and this fixed these corner cases.

Thank you! Here are some proposed tweaks (diff attached):

1. Check the range of the START-POS argument so that we don't crash.
The permitted range is [0..N] where N is (length HAYSTACK), thus we permit a start right after the last character but no further.
We could also return nil in these cases but I think an error is more useful.

2. Make the docs more precise about various things.

3. Slight simplification of the implementation logic to avoid testing the same conditions multiple times.

4. More tests, especially for edge cases. Can't have too many!
One test still fails:

 (string-search "ø" "\303\270")

which should return nil but currently matches.
I think it's wrong to convert the needle to unibyte (using Fstring_as_unibyte) in this case, but I haven't decided what the best solution would be.

We should also consider the optimisations:
- If SCHARS(needle)>SCHARS(haystack) then no match is possible.
- If either needle or haystack is all-ASCII (all bytes in 0..127), then we can use memmem without conversion.
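Editorial sketch of the START-POS range check and the length shortcut discussed above. It is a minimal illustration on plain byte buffers, not the code that was committed: byte_search and its return codes are hypothetical names, a naive memcmp loop stands in for memmem to keep the example portable, and the length test compares the needle against the remaining haystack rather than using SCHARS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for string-search on plain byte buffers.
   Returns the offset of the first match at or after START,
   -1 if there is no match, and -2 if START is outside [0, haystack_len]
   (where the real function would signal an error).  */
ptrdiff_t
byte_search (const unsigned char *haystack, ptrdiff_t haystack_len,
             const unsigned char *needle, ptrdiff_t needle_len,
             ptrdiff_t start)
{
  assert (haystack_len >= 0 && needle_len >= 0);

  /* Range check: a start right after the last byte is allowed,
     anything further is rejected.  */
  if (start < 0 || start > haystack_len)
    return -2;

  /* Length shortcut: a needle longer than what is left cannot match.  */
  if (needle_len > haystack_len - start)
    return -1;

  /* Naive scan; the real code would call memmem here.  */
  for (ptrdiff_t i = start; i + needle_len <= haystack_len; i++)
    if (memcmp (haystack + i, needle, (size_t) needle_len) == 0)
      return i;
  return -1;
}
```

With this shape, an empty needle searched from START equal to the haystack length matches at that position, while START one past that is reported as out of range.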

[string-search.diff (application/octet-stream, attachment)]

Message #29 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 13:11:15 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> 1. Check the range of the START-POS argument so that we don't crash.
> The permitted range is [0..N] where N is (length HAYSTACK), thus we
> permit a start right after the last character but no further.
> We could also return nil in these cases but I think an error is more useful.

Good point.  :-)

> 2. Make the docs more precise about various things.
>
> 3. Slight simplification of the implementation logic to avoid testing
> the same conditions multiple times.
>
> 4. More tests, especially for edge cases. Can't have too many!

It all looks good to me; please apply.

> One test still fails:
>
>  (string-search "ø" "\303\270")
>
> which should return nil but currently matches.
> I think it's wrong to convert the needle to unibyte (using
> Fstring_as_unibyte) in this case, but I haven't decided what the best
> solution would be.

Yeah, that's the bit I was most unsure about, because it just didn't
look quite correct to me, but I couldn't come up with the correct test
case last night; thanks.

> We should also consider the optimisations:
> - If SCHARS(needle)>SCHARS(haystack) then no match is possible.

Yup.

> - If either needle or haystack is all-ASCII (all bytes in 0..127),
> then we can use memmem without conversion.

Right, so if the multibyteness differs, then do another check to see
whether both strings are all-ASCII anyway, and do the comparison without
conversion...  Yes, makes sense to me.

Message #32 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 13:22:41 +0200

25 sep. 2020 kl. 13.11 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> It all looks good to me; please apply.

Thanks, will do shortly.

> Right, so if the multibyteness differs, then do another check to see
> whether both strings are all-ASCII anyway, and do the comparison without
> conversion...

Both strings don't need to be all-ASCII; one of them suffices.

By the way, I added an argument check to replace-in-string to prevent it from entering an infinite loop.

Message #35 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Fri, 25 Sep 2020 13:32:38 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

>> Right, so if the multibyteness differs, then do another check to see
>> whether both strings are all-ASCII anyway, and do the comparison without
>> conversion...
>
> Both strings don't need to be all-ASCII; one of them suffices.

Hm, yes, that's true...  and I guess a further micro-optimisation would
be if NEEDLE is non-ASCII and HAYSTACK is all-ASCII, then there's no
point in memmem-ing at all. 

Message #38 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 00:25:41 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> 3. The name is somewhat unfortunate since a function by that name in
> XEmacs uses regexp matching.

[...]

> Perhaps we should rename it to string-replace, in line with the modern
> naming convention discussed some time ago.

Yup; now done.

Message #41 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 00:44:38 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> We should also consider the optimisations:
> - If SCHARS(needle)>SCHARS(haystack) then no match is possible.

I've now done this.

> - If either needle or haystack is all-ASCII (all bytes in 0..127),
> then we can use memmem without conversion.

I thought that surely there'd be a function like that in Emacs, but I
can't find it?

Instead there's code like

          && (STRING_MULTIBYTE (string)
              ? (chars == bytes) : string_ascii_p (string))
[...]
/* Whether STRING only contains chars in the 0..127 range.  */
static bool
string_ascii_p (Lisp_Object string)
{
  ptrdiff_t nbytes = SBYTES (string);
  for (ptrdiff_t i = 0; i < nbytes; i++)
    if (SREF (string, i) > 127)
      return false;
  return true;
}

and

	  unsigned char *p = SDATA (name);
	  while (*p && ASCII_CHAR_P (*p))
	    p++;

sprinkled around the code base.

Would it make sense to add a new utility function that does the right
thing for both multibyte and unibyte strings?  (The multibyte case is
just chars == bytes, but the unibyte case would be a loop.)
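Editorial sketch of the utility function asked about above, under stated assumptions: struct toy_string is a hypothetical model of a Lisp string (data pointer, byte count, character count, multibyte flag), not the real Emacs object layout, and toy_string_ascii_p is an illustrative name. It relies on the fact that every non-ASCII character in a multibyte string occupies at least two bytes, so chars == bytes implies all-ASCII.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a Lisp string; not the real Emacs layout.  */
struct toy_string
{
  const unsigned char *data;
  ptrdiff_t nbytes;   /* What SBYTES would return.  */
  ptrdiff_t nchars;   /* What SCHARS would return.  */
  bool multibyte;     /* What STRING_MULTIBYTE would return.  */
};

/* Whether S contains only characters in the 0..127 range,
   for both multibyte and unibyte strings.  */
bool
toy_string_ascii_p (const struct toy_string *s)
{
  assert (s->nchars <= s->nbytes);

  /* Multibyte: constant-time test, no scan needed.  */
  if (s->multibyte)
    return s->nchars == s->nbytes;

  /* Unibyte: one byte per character, so every byte must be inspected.  */
  for (ptrdiff_t i = 0; i < s->nbytes; i++)
    if (s->data[i] > 127)
      return false;
  return true;
}
```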

Message #44 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 02:03:53 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> Both strings don't need to be all-ASCII; one of them suffices.

I've now added that, and after that, everything fell into place for the
multibyte-needle/unibyte-haystack case, too.

That can only match if the needle contains nothing but ASCII and
eighth-bit chars, so I've altered it to return Qnil if there are any other
chars, and then convert to unibyte and do memmem otherwise.

*phew*

Is that all cases covered now?
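Editorial sketch of the multibyte-needle/unibyte-haystack rule described above, not the committed code. It assumes the Emacs internal representation in which a raw byte 0x80..0xFF is stored as a lead byte 0xC0 or 0xC1 followed by one trailing byte 0x80..0xBF; needle_to_unibyte is a hypothetical helper that either converts the needle to unibyte bytes or reports that no match is possible.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a multibyte needle SRC of NBYTES bytes into unibyte form in DST
   (which must have room for NBYTES bytes).  Returns the number of bytes
   written, or -1 if SRC contains a character that is neither ASCII nor a
   raw byte, in which case it cannot occur in a unibyte haystack.
   Assumption: raw byte B is encoded as 0xC0 | ((B >> 6) & 1) followed by
   0x80 | (B & 0x3F).  */
ptrdiff_t
needle_to_unibyte (const unsigned char *src, ptrdiff_t nbytes,
                   unsigned char *dst)
{
  assert (nbytes >= 0);
  ptrdiff_t out = 0;
  for (ptrdiff_t i = 0; i < nbytes; i++)
    {
      unsigned char c = src[i];
      if (c < 0x80)
        dst[out++] = c;                 /* ASCII: copied unchanged.  */
      else if ((c == 0xC0 || c == 0xC1) && i + 1 < nbytes
               && (src[i + 1] & 0xC0) == 0x80)
        {
          /* Raw byte: fold the two-byte form back into one byte.  */
          unsigned char trail = src[++i];
          dst[out++] = (unsigned char) (0x80 | ((c & 1) << 6) | (trail & 0x3F));
        }
      else
        return -1;                      /* Any other char: no match.  */
    }
  return out;
}
```

A caller would return nil on -1 and otherwise run the byte search with the converted needle.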

Message #47 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 02:34:27 +0200

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> Is that all cases covered now?

Well, I can't think of any more, but I said that before.  :-/

Anyway, we now have this slightly amusing situation:

(string-search (string-to-multibyte "o\303\270") "o\303\270")
=> 0

(string-match (string-to-multibyte "o\303\270") "o\303\270")
=> 0

(equal (string-to-multibyte "o\303\270") "o\303\270")
=> nil

But I guess we've lived with this for...  decades?  So that's probably
OK.

Message #50 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 10:45:15 +0200

27 sep. 2020 kl. 02.34 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> (string-search (string-to-multibyte "o\303\270") "o\303\270")
> => 0
> 
> (string-match (string-to-multibyte "o\303\270") "o\303\270")
> => 0
> 
> (equal (string-to-multibyte "o\303\270") "o\303\270")
> => nil

(compare-strings (string-to-multibyte "o\303\270") nil nil "o\303\270" nil nil)
=> t

I'd say it's equal, string-equal, string-lessp and string-greaterp that are odd and that we probably should fix if it can be done without making them slower. Unless, of course, we can come up with an alternative theory of operation that is satisfactory.

> But I guess we've lived with this for...  decades?  So that's probably
> OK.

Yes, there is no hurry.

Message #53 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 13:12:04 +0200

27 sep. 2020 kl. 02.03 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> *phew*

Not bad! This seems to work all right.
Here are some minor optimisations:

- Do the fast all-ASCII test (bytes == chars) before iterating through the bytes to check for non-ASCII chars.
- Faster check for non-ASCII non-raw bytes (no need for the complex code in string_char_advance).

It is tempting to vectorise the all-ASCII loop. Maybe another day...

The patch also adds some more test cases for completeness.

[string-search-tweaks.diff (application/octet-stream, attachment)]

Message #56 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 13:48:12 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> Here are some minor optimisations:
>
> - Do the fast all-ASCII test (bytes == chars) before iterating through
> the bytes to check for non-ASCII chars.

[...]

> -  if (STRING_MULTIBYTE (string))
> -    return SBYTES (string) == SCHARS (string);

[...]

> -  if (STRING_MULTIBYTE (haystack) == STRING_MULTIBYTE (needle)
> -      || string_ascii_p (needle)
> -      || string_ascii_p (haystack))
> +  /* We can do a direct byte-string search if both strings have the
> +     same multibyteness, or if at least one of them consists of ASCII
> +     characters only.  */
> +  if (STRING_MULTIBYTE (haystack)
> +      ? (STRING_MULTIBYTE (needle)
> +         || SCHARS (haystack) == SBYTES (haystack) || string_ascii_p (needle))
> +      : (!STRING_MULTIBYTE (needle)
> +         || SCHARS (needle) == SBYTES (needle) || string_ascii_p (haystack)))

Didn't you just move the STRING_MULTIBYTE bits of the test from the
string_ascii_p function and open-code it into Fstring_search function
here?  I'm not sure how that's an optimisation? 

> +      ptrdiff_t nbytes = SBYTES (needle);
> +      for (ptrdiff_t i = 0; i < nbytes; i++)
> +        {
> +          int c = SREF (needle, i);
> +          if (CHAR_BYTE8_HEAD_P (c))
> +            i++;                /* Skip raw byte.  */
> +          else if (!ASCII_CHAR_P (c))
> +            return Qnil;  /* Found a char that can't be in the haystack.  */
> +        }

Looks good.

Message #59 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 13:57:35 +0200

27 sep. 2020 kl. 13.48 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> Didn't you just move the STRING_MULTIBYTE bits of the test from the
> string_ascii_p function and open-code it into Fstring_search function
> here?

No, look again. Previously, we would loop through all bytes of a unibyte needle before checking the lengths of a multibyte haystack. With the patch, we always do the cheap (length) check first. That's why that check had to be moved out of the helper function.
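Editorial sketch of the ordering argument above: when the two strings differ in multibyteness, the constant-time chars == bytes test on the multibyte string runs before the linear scan of the unibyte one. struct toy_string and direct_search_ok are hypothetical names used only for illustration; this mirrors the condition in the quoted patch but is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a Lisp string; not the real Emacs layout.  */
struct toy_string
{
  const unsigned char *data;
  ptrdiff_t nbytes;
  ptrdiff_t nchars;
  bool multibyte;
};

/* Linear scan: true if every byte is in 0..127.  */
static bool
bytes_ascii_p (const struct toy_string *s)
{
  for (ptrdiff_t i = 0; i < s->nbytes; i++)
    if (s->data[i] > 127)
      return false;
  return true;
}

/* Whether HAYSTACK can be searched for NEEDLE byte-for-byte,
   without converting either string.  */
bool
direct_search_ok (const struct toy_string *haystack,
                  const struct toy_string *needle)
{
  assert (haystack->nchars <= haystack->nbytes);
  assert (needle->nchars <= needle->nbytes);

  /* Same multibyteness: always fine.  */
  if (haystack->multibyte == needle->multibyte)
    return true;

  const struct toy_string *multi = haystack->multibyte ? haystack : needle;
  const struct toy_string *uni = haystack->multibyte ? needle : haystack;

  /* Cheap length comparison first...  */
  if (multi->nchars == multi->nbytes)
    return true;
  /* ...and only then the byte-by-byte scan of the unibyte string.  */
  return bytes_ascii_p (uni);
}
```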

Message #62 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 14:02:50 +0200

Mattias Engdegård <mattiase <at> acm.org> writes:

> No, look again. Previously, we would loop through all bytes of a
> unibyte needle before checking the lengths of a multibyte
> haystack.

Duh; you're right.  Please go ahead and apply.

Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Sun, 27 Sep 2020 16:15:02 GMT) Full text and rfc822 format available.

Notification sent to Mattias Engdegård <mattiase <at> acm.org>:
bug acknowledged by developer. (Sun, 27 Sep 2020 16:15:02 GMT) Full text and rfc822 format available.

Message #67 received at 43598-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 43598-done <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 18:14:36 +0200

27 sep. 2020 kl. 14.02 skrev Lars Ingebrigtsen <larsi <at> gnus.org>:

> Please go ahead and apply.

Applied, thank you.

Looks like we are done now. string-replace seems to be substantially faster than replace-regexp-in-string.
Good job!

Message #70 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: mattiase <at> acm.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 19:19:52 +0300

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Sun, 27 Sep 2020 18:14:36 +0200
> Cc: 43598-done <at> debbugs.gnu.org
> 
> Looks like we are done now. string-replace seems to be substantially faster than replace-regexp-in-string.
> Good job!

Thanks.  Is it possible to have some speed comparison for these two?

Message #73 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Mattias Engdegård <mattiase <at> acm.org>,
 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 18:41:39 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> Thanks.  Is it possible to have some speed comparison for these two?

This is what I used:

(let ((elems (mapcar (lambda (s)
		       (let ((start (random 80)))
			 (cons (substring s start (+ start (random 20)))
			       s)))
		     (cl-loop repeat 1000
			      collect (cl-coerce
				       (cl-loop repeat 100
						collect (+ (random 26) ?a))
				       'string)))))
  (list
   (benchmark-run 10000 (dolist (elem elems)
			  (string-search (car elem) (cdr elem))))
   (benchmark-run 10000 (dolist (elem elems)
			  (string-match (car elem) (cdr elem))))))

=>
((7.47099299 29 3.773541741999992)
 (19.673036086 74 9.616665831000006))

This is rather geared towards the weaknesses of string-match, though --
we're blowing through the regexp cache.

If you decrease the number of regexps to 10 and increase the run to 1000000, we get:

((7.818917279000001 37 4.791844609999998)
 (11.049133279 37 4.713127558000011))

And to compare with a "do-nothing" version:

   (benchmark-run 10000 (dolist (elem elems)
			  elem))))

=>

((5.74714395 28 3.722243896000009))

Using that as a baseline, the difference is 2s vs 5.2s.  

Message #76 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: mattiase <at> acm.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 19:48:56 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: Mattias Engdegård <mattiase <at> acm.org>,
>   43598 <at> debbugs.gnu.org
> Date: Sun, 27 Sep 2020 18:41:39 +0200
> 
> Using that as a baseline, the difference is 2s vs 5.2s.  

Thanks.  So a factor of 2.5, not bad.

Message #79 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Richard Stallman <rms <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: mattiase <at> acm.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Sun, 27 Sep 2020 23:41:59 -0400

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > (string-match (string-to-multibyte "o\303\270") "o\303\270")
  > => 0

  > (equal (string-to-multibyte "o\303\270") "o\303\270")
  > => nil

It is paradoxical, but I think it is correct.
Equal compares the type of the string, not just the
characters in it.

-- 
Dr Richard Stallman
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#43598; Package emacs. (Mon, 28 Sep 2020 09:41:02 GMT) Full text and rfc822 format available.

Message #82 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Richard Stallman <rms <at> gnu.org>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Mon, 28 Sep 2020 11:40:02 +0200

On 28 Sep 2020, at 05:41, Richard Stallman <rms <at> gnu.org> wrote:

> It is paradoxical, but I think it is correct.
> Equal compares the type of the string, not just the
> characters in it.

No it doesn't. (equal (string-to-multibyte "A") "A") => t.

There is no deep reason for the current behaviour. It's just how things came to be, and nobody has been sufficiently annoyed to change it. The implementation is efficient and good enough for most purposes.

It is inconsistent and confusing though, and occasionally it does break down. One such case is when two strings that are not 'equal' become 'equal' after printing and reading them back, since unibyte and multibyte strings have the same printed representation. This can arise in conjunction with byte compilation.

Again, I have no plans to do anything about it right now.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#43598; Package emacs. (Tue, 29 Sep 2020 03:30:01 GMT) Full text and rfc822 format available.

Message #85 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Richard Stallman <rms <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: larsi <at> gnus.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Mon, 28 Sep 2020 23:29:34 -0400

[[[ To any NSA and FBI agents reading my email: please consider    ]]]
[[[ whether defending the US Constitution against all enemies,     ]]]
[[[ foreign or domestic, requires you to follow Snowden's example. ]]]

  > > It is paradoxical, but I think it is correct.
  > > Equal compares the type of the string, not just the
  > > characters in it.

  > No it doesn't. (equal (string-to-multibyte "A") "A") => t.

I am puzzled, then.  Why DOES the other example return nil?

-- 
Dr Richard Stallman
Chief GNUisance of the GNU Project (https://gnu.org)
Founder, Free Software Foundation (https://fsf.org)
Internet Hall-of-Famer (https://internethalloffame.org)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#43598; Package emacs. (Tue, 29 Sep 2020 04:13:02 GMT) Full text and rfc822 format available.

Message #88 received at 43598 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: rms <at> gnu.org, Richard Stallman <rms <at> gnu.org>,
 Mattias Engdegård <mattiase <at> acm.org>
Cc: larsi <at> gnus.org, 43598 <at> debbugs.gnu.org
Subject: Re: bug#43598: replace-in-string: finishing touches
Date: Tue, 29 Sep 2020 07:12:16 +0300

On September 29, 2020 6:29:34 AM GMT+03:00, Richard Stallman <rms <at> gnu.org> wrote:
> 
>   > > It is paradoxical, but I think it is correct.
>   > > Equal compares the type of the string, not just the
>   > > characters in it.
> 
>   > No it doesn't. (equal (string-to-multibyte "A") "A") => t.
> 
> I am puzzled, then.  Why DOES the other example return nil?

Because the byte sequences of the two strings are different when there are non-ASCII bytes in the original unibyte string.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 27 Oct 2020 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 295 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
