#58168 - string-lessp glitches and inconsistencies

GNU bug report logs - #58168
string-lessp glitches and inconsistencies

Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal

Message #59 received at 58168 <at> debbugs.gnu.org:

From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 58168 <at> debbugs.gnu.org
Subject: Re: bug#58168: string-lessp glitches and inconsistencies
Date: Mon, 3 Oct 2022 21:48:14 +0200

On 2 Oct 2022 at 07:36, Eli Zaretskii <eliz <at> gnu.org> wrote:

>> Comparison between objects is not only useful when someone cares
>> about their order, as in presenting a sorted list to the user. Often
>> what is important is an ability to impose an order, preferably total,
>> for use in building and searching data structures. I came across this
>> bug when implementing a string set.
>
> Always converting to multibyte handles this case, doesn't it?

I don't think it does -- string= treats raw bytes in unibyte and multibyte strings as distinct; converting to multibyte does not preserve (in)equality.

>> Actually I was talking about multibyte-multibyte comparisons.
>
> Then why did you mention raw bytes? their multibyte representation
> presents no performance problems

In a way they do -- the way raw bytes are represented (they start with C0 or C1) causes memcmp to sort them between U+007F and U+0080. If we accept that, then comparisons are fast, since memcmp will compare many characters per data-dependent branch. The current code requires several data-dependent branches for each character. While we could probably bring down the comparison cost slightly by clever hand-coding, it's unlikely to be even nearly as fast as a memcmp, and it would be much messier.

Since users are unlikely to care much about the ordering between raw bytes and something else (as long as there is an order), it would be a cheap way to improve performance while at the same time fixing the string< / string= mismatch.

> You can compare under the assumption that a unibyte string is
> pure-ASCII until you bump into the first non-ASCII one. If that
> happens, abandon the comparison, convert the unibyte string to its
> multibyte representation, and compare again.

I don't quite see how that would improve performance, but I may be missing something.
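
For illustration, here is a rough Python sketch of the ordering that a plain bytewise comparison would produce. It is not Emacs code. The helper names are made up, and the exact bit layout of the two-byte raw-byte form is an assumption that merely agrees with the "start with C0 or C1" description above. Ordinary characters are encoded as UTF-8, and Python's bytes comparison stands in for memcmp.

```python
def encode_char(cp: int) -> bytes:
    """Encode an ordinary Unicode code point as UTF-8."""
    return chr(cp).encode("utf-8")


def encode_raw_byte(b: int) -> bytes:
    """Two-byte form for a raw byte 0x80..0xFF (hypothetical helper).

    Assumed layout: lead byte C0 or C1 carrying bit 6 of the raw byte,
    then a trailing byte carrying the low six bits.
    """
    if not 0x80 <= b <= 0xFF:
        raise ValueError("raw byte must be in the range 0x80..0xFF")
    return bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])


def encode(item) -> bytes:
    """Encode a ("char", n) or ("raw", n) pair to its byte sequence."""
    kind, value = item
    return encode_raw_byte(value) if kind == "raw" else encode_char(value)


items = [("char", 0x80), ("raw", 0xFF), ("char", 0x7F),
         ("raw", 0x80), ("char", 0x41)]

# Python compares bytes objects lexicographically by unsigned byte value,
# which is the same order memcmp gives (with length as the tie-breaker).
ordered = sorted(items, key=encode)

for kind, value in ordered:
    print(f"{kind:4} {value:#04x} -> {encode((kind, value)).hex()}")
```

Under these assumptions every raw byte lands after U+007F (encoded as 7F) and before U+0080 (encoded as C2 80), because its lead byte is C0 or C1. That is a total order, just not the one given by comparing character codes.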

This bug report was last modified 2 years and 324 days ago.

