GNU bug report logs - #58168
string-lessp glitches and inconsistencies


Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal


From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 58168 <at> debbugs.gnu.org
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 6 Oct 2022 11:05:04 +0200

On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz <at> gnu.org> wrote:

> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.

(It's not just string= but `equal`, since they use the same comparison.)
But that is only one part of a set of related problems:

* string< / string= inconsistency (see the sketch after this list)
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance
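
To make the first point concrete, here is a small stand-alone C model of the mismatch (illustrative only, not Emacs code). It assumes that string= / `equal` compare the underlying byte representations directly, while string< decodes characters and treats unibyte bytes as Latin-1, as described above; Emacs's internal multibyte encoding is a superset of UTF-8, so multibyte "å" is the byte sequence C3 A5 whereas unibyte "å" is the single byte E5.

#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned char unibyte[] = { 0xE5 };          /* unibyte "å": one Latin-1 byte  */
  unsigned char multibyte[] = { 0xC3, 0xA5 };  /* multibyte "å": UTF-8 sequence  */

  /* string=-style comparison: the byte representations differ => not equal.  */
  int equal = (sizeof unibyte == sizeof multibyte
               && memcmp (unibyte, multibyte, sizeof unibyte) == 0);

  /* string<-style comparison: decode to character codes.  The unibyte byte
     E5 is taken as Latin-1 U+00E5; C3 A5 also decodes to U+00E5, so neither
     string is less than the other, i.e. they are "equal" for ordering.  */
  unsigned int c1 = unibyte[0];
  unsigned int c2 = ((multibyte[0] & 0x1F) << 6) | (multibyte[1] & 0x3F);

  printf ("equal per string=: %s\n", equal ? "yes" : "no");        /* no */
  printf ("distinct per string<: %s\n", c1 != c2 ? "yes" : "no");  /* no */
  return 0;
}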

Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,

Yes, it would: better performance and wider applicability. Even when the order isn't defined, the user expects there to be some consistent order between distinct strings.

> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison.  memchr is fast enough to not slow
> down the "usual" case.  Would that be a good solution?
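
Concretely, I read that proposal as something like the sketch below (not actual Emacs code; string_cmp_slow is a hypothetical stand-in for the current character-by-character path, and both arguments are assumed to be multibyte data in the internal encoding, where raw bytes are the only source of C0/C1 lead bytes):

#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the existing character-wise comparison.  */
extern int string_cmp_slow (const unsigned char *, ptrdiff_t,
                            const unsigned char *, ptrdiff_t);

int
string_cmp_fast (const unsigned char *s1, ptrdiff_t n1,
                 const unsigned char *s2, ptrdiff_t n2)
{
  /* Pre-scan: raw-byte sequences (lead byte C0 or C1) do not sort
     correctly under plain byte comparison, so punt to the slow path.
     Note that this reads all of both strings before comparing anything.  */
  if (memchr (s1, 0xC0, n1) || memchr (s1, 0xC1, n1)
      || memchr (s2, 0xC0, n2) || memchr (s2, 0xC1, n2))
    return string_cmp_slow (s1, n1, s2, n2);

  /* Without raw bytes, UTF-8 byte order equals character order, so a
     single memcmp up to the first mismatch decides the result.  */
  ptrdiff_t n = n1 < n2 ? n1 : n2;
  int d = memcmp (s1, s2, n);
  return d ? d : (n1 > n2) - (n1 < n2);
}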

There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

The best we can hope for is hand-written vectorised code that does everything in one pass, but even that is slower than a plain memcmp.
Even then, our chosen semantics make it more difficult (and slower) than it needs to be: for example, when comparing unibyte strings with multibyte ones we cannot assume that a byte with the high bit set indicates a mismatch, since we equate unibyte chars with Latin-1. That is a decision we will keep paying for.
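
For comparison, this is roughly what a unibyte-versus-multibyte compare has to do under the current Latin-1 semantics (again a simplified sketch, not Emacs code: the decoder handles only the two-byte sequences needed for the Latin-1 range and assumes well-formed input):

#include <stddef.h>

int
compare_unibyte_multibyte (const unsigned char *uni, ptrdiff_t nuni,
                           const unsigned char *multi, ptrdiff_t nmulti)
{
  ptrdiff_t i = 0, j = 0;
  while (i < nuni && j < nmulti)
    {
      unsigned int c1 = uni[i++];            /* unibyte byte taken as Latin-1 */
      unsigned int c2;
      if (multi[j] < 0x80)                   /* ASCII character */
        c2 = multi[j++];
      else                                   /* two-byte sequence, e.g. C3 A5 */
        {
          c2 = ((multi[j] & 0x1F) << 6) | (multi[j + 1] & 0x3F);
          j += 2;
        }
      /* A high-bit byte in `uni' is not automatically a mismatch:
         it may decode to the very same character as a multibyte sequence.  */
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  /* The shorter string sorts first.  */
  return (j < nmulti) ? -1 : (i < nuni) ? 1 : 0;
}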

> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.

That, or variants thereof, would indeed be an alternative, but users could be forgiven for wondering why we don't just make what we already have fast instead.

> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?

They do, because we have no efficient way of ascertaining that they are pure-ASCII. The long-term solution is to make multibyte strings the default in more cases, but I'm not proposing such a change right now.
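
The check itself is trivial but inherently linear, roughly the sketch below, so an ASCII-only fast path would pay this extra pass on every call unless the result were cached with the string (which, as far as I know, we don't do today):

#include <stdbool.h>
#include <stddef.h>

/* Return true if the N bytes at S are all ASCII (high bit clear).  */
bool
pure_ascii_p (const unsigned char *s, ptrdiff_t n)
{
  for (ptrdiff_t i = 0; i < n; i++)
    if (s[i] & 0x80)
      return false;
  return true;
}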

I'll see where further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
And thank you for your comments!




