GNU bug report logs - #58168
string-lessp glitches and inconsistencies


Package: emacs;

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal


From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 58168 <at> debbugs.gnu.org
Subject: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 6 Oct 2022 11:05:04 +0200

On 4 Oct 2022, at 07:55, Eli Zaretskii <eliz <at> gnu.org> wrote:

> If the fact that string= says strings are not equal, but string-lessp
> says they are equal, is what bothers you, we could document that
> results of comparing unibyte and multibyte strings are unspecified, or
> document explicitly that string= and string-lessp behave differently
> in this case.

(It's not just string= but `equal`, since they use the same comparison.)
But that is only one part of a set of related problems:

* string< / string= inconsistency (see the sketch after this list)
* undesirable string< ordering (unibyte strings are treated as Latin-1)
* bad string< performance
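
To make the first point concrete, here is a small stand-alone C model of the mismatch (illustrative only, not Emacs code). It assumes that string= / `equal` compare the underlying byte representations directly, while string< decodes characters and treats unibyte bytes as Latin-1, as described above; Emacs's internal multibyte encoding is a superset of UTF-8, so multibyte "å" is the byte sequence C3 A5 whereas unibyte "å" is the single byte E5.

#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned char unibyte[] = { 0xE5 };          /* unibyte "å": one Latin-1 byte  */
  unsigned char multibyte[] = { 0xC3, 0xA5 };  /* multibyte "å": UTF-8 sequence  */

  /* string=-style comparison: the byte representations differ => not equal.  */
  int equal = (sizeof unibyte == sizeof multibyte
               && memcmp (unibyte, multibyte, sizeof unibyte) == 0);

  /* string<-style comparison: decode to character codes.  The unibyte byte
     E5 is taken as Latin-1 U+00E5; C3 A5 also decodes to U+00E5, so neither
     string is less than the other, i.e. they are "equal" for ordering.  */
  unsigned int c1 = unibyte[0];
  unsigned int c2 = ((multibyte[0] & 0x1F) << 6) | (multibyte[1] & 0x3F);

  printf ("equal per string=: %s\n", equal ? "yes" : "no");        /* no */
  printf ("distinct per string<: %s\n", c1 != c2 ? "yes" : "no");  /* no */
  return 0;
}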

Ideally we should be able to do something about all three at the same time since they are interrelated. At the very least it's worth a try.

Just documenting the annoying parts won't make them go away -- they still have to be coded around by the user, and it doesn't solve any performance problems either.

> I see no reason to worry about 100% consistency here: the order
> is _really_ undefined in these cases, and trying to make it defined
> will not produce any tangible gains,

Yes, it would: better performance and wider applicability. Even when the order isn't defined, the user expects there to be some consistent order between distinct strings.

> Once again, slowing down string-lessp when raw-bytes are involved
> shouldn't be a problem.  So, if memchr finds a C0 or C1 in a string,
> fall back to a slower comparison.  memchr is fast enough to not slow
> down the "usual" case.  Would that be a good solution?
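
Concretely, I read that proposal as something like the sketch below (not actual Emacs code; string_cmp_slow is a hypothetical stand-in for the current character-by-character path, and both arguments are assumed to be multibyte data in the internal encoding, where raw bytes are the only source of C0/C1 lead bytes):

#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the existing character-wise comparison.  */
extern int string_cmp_slow (const unsigned char *, ptrdiff_t,
                            const unsigned char *, ptrdiff_t);

int
string_cmp_fast (const unsigned char *s1, ptrdiff_t n1,
                 const unsigned char *s2, ptrdiff_t n2)
{
  /* Pre-scan: raw-byte sequences (lead byte C0 or C1) do not sort
     correctly under plain byte comparison, so punt to the slow path.
     Note that this reads all of both strings before comparing anything.  */
  if (memchr (s1, 0xC0, n1) || memchr (s1, 0xC1, n1)
      || memchr (s2, 0xC0, n2) || memchr (s2, 0xC1, n2))
    return string_cmp_slow (s1, n1, s2, n2);

  /* Without raw bytes, UTF-8 byte order equals character order, so a
     single memcmp up to the first mismatch decides the result.  */
  ptrdiff_t n = n1 < n2 ? n1 : n2;
  int d = memcmp (s1, s2, n);
  return d ? d : (n1 > n2) - (n1 < n2);
}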

There is no reason a comparison should need to look beyond the first mismatch; anything else is just algorithmically slow. Long strings are likely to differ early on. Any hack that has to special-case raw bytes will add costs.

The best we can hope for is hand-written vectorised code that does everything in one pass, but even that is slower than a plain memcmp.
Even then, our chosen semantics make it more difficult (and slower) than it needs to be: for example, when comparing unibyte strings with multibyte ones we cannot assume that a byte with the high bit set indicates a mismatch, since we equate unibyte chars with Latin-1. That is a decision we will keep paying for.
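
For comparison, this is roughly what a unibyte-versus-multibyte compare has to do under the current Latin-1 semantics (again a simplified sketch, not Emacs code: the decoder handles only the two-byte sequences needed for the Latin-1 range and assumes well-formed input):

#include <stddef.h>

int
compare_unibyte_multibyte (const unsigned char *uni, ptrdiff_t nuni,
                           const unsigned char *multi, ptrdiff_t nmulti)
{
  ptrdiff_t i = 0, j = 0;
  while (i < nuni && j < nmulti)
    {
      unsigned int c1 = uni[i++];            /* unibyte byte taken as Latin-1 */
      unsigned int c2;
      if (multi[j] < 0x80)                   /* ASCII character */
        c2 = multi[j++];
      else                                   /* two-byte sequence, e.g. C3 A5 */
        {
          c2 = ((multi[j] & 0x1F) << 6) | (multi[j + 1] & 0x3F);
          j += 2;
        }
      /* A high-bit byte in `uni' is not automatically a mismatch:
         it may decode to the very same character as a multibyte sequence.  */
      if (c1 != c2)
        return c1 < c2 ? -1 : 1;
    }
  /* The shorter string sorts first.  */
  return (j < nmulti) ? -1 : (i < nuni) ? 1 : 0;
}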

> Alternatively, we could introduce a new primitive which could assume
> multibyte or plain-ASCII unibyte strings without checking, and then
> code which is sure raw-bytes cannot happen, and needs to compare long
> strings, could use that for speed.

That, or variants thereof, would indeed be an alternative, but users could be forgiven for wondering why we don't just make what we already have fast instead.

> E.g., are you saying that unibyte strings that are
> pure-ASCII also cause performance problems?

They do, because we have no efficient way of ascertaining that they are pure-ASCII. The long-term solution is to make multibyte strings the default in more cases, but I'm not proposing such a change right now.
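
The check itself is trivial but inherently linear, roughly the sketch below, so an ASCII-only fast path would pay this extra pass on every call unless the result were cached with the string (which, as far as I know, we don't do today):

#include <stdbool.h>
#include <stddef.h>

/* Return true if the N bytes at S are all ASCII (high bit clear).  */
bool
pure_ascii_p (const unsigned char *s, ptrdiff_t n)
{
  for (ptrdiff_t i = 0; i < n; i++)
    if (s[i] & 0x80)
      return false;
  return true;
}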

I'll see where further performance tweaking of the existing code can take us with reasonable effort, but there are hard limits to what can be done.
And thank you for your comments!




