#58168 - string-lessp glitches and inconsistencies

GNU bug report logs - #58168
string-lessp glitches and inconsistencies

Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal

Message #53 received at 58168 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 58168 <at> debbugs.gnu.org
Subject: Re: bug#58168: string-lessp glitches and inconsistencies
Date: Sun, 02 Oct 2022 08:36:46 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Sat, 1 Oct 2022 21:57:45 +0200
> Cc: 58168 <at> debbugs.gnu.org
>
> On 1 Oct 2022, at 07:22, Eli Zaretskii <eliz <at> gnu.org> wrote:
>
> > It depends on the use case, but in general I see no problem with
> > signaling errors when we cannot produce reasonably correct results.
> > For example, string-to-unibyte does signal an error in some cases.
>
> That's fine because that function is documented to do so and always
> has, but making previously possible comparisons raise errors
> shouldn't be done lightly.

I didn't say "lightly", nor do I think so.  We need to discuss
specific use cases.

An alternative is to always convert unibyte non-ASCII strings to their
multibyte representation before comparing.

> Comparison between objects is not only useful when someone cares
> about their order, as in presenting a sorted list to the user. Often
> what is important is an ability to impose an order, preferably total,
> for use in building and searching data structures. I came across this
> bug when implementing a string set.

Always converting to multibyte handles this case, doesn't it?

> >> It's also a matter of performance -- string< has been improved
> >> recently but currently we compare text in Latin and Swahili much
> >> faster than French and Arabic; it would be nice to close that gap.
> >> UTF-8 is designed so that comparing strings by scalar values can
> >> be done byte-wise, but the way we encode raw bytes makes them sort
> >> right between ASCII and Latin-1. Given that the specific order
> >> doesn't matter much, we could just run with that.
> >
> > I see no reason to make comparison of unibyte and multibyte strings
> > perform better.
>
> Actually I was talking about multibyte-multibyte comparisons.

Then why did you mention raw bytes?  Their multibyte representation
presents no performance problems, AFAIU.
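
The byte-wise ordering point in the quoted text can be illustrated with a small Python model. This is only a sketch under stated assumptions: it assumes a raw byte b in 0x80..0xFF is the character b + 0x3FFF00 and is stored as a two-byte sequence led by 0xC0 or 0xC1, while all other characters are stored as UTF-8. `encode_internal` and `bytewise_less` are hypothetical helpers for this illustration, not Emacs functions.

```python
RAW_BYTE_BASE = 0x3FFF00  # assumption: raw byte b is the character b + 0x3FFF00


def encode_internal(chars):
    """Encode a list of code points to bytes.

    Unicode scalar values use UTF-8; raw-byte characters use an assumed
    two-byte form whose lead byte is 0xC0 or 0xC1.
    """
    out = bytearray()
    for c in chars:
        if c >= RAW_BYTE_BASE + 0x80:
            b = c - RAW_BYTE_BASE
            out += bytes([0xC0 | ((b >> 6) & 1), 0x80 | (b & 0x3F)])
        else:
            out += chr(c).encode("utf-8")
    return bytes(out)


def bytewise_less(a, b):
    """Compare two code-point lists by their encoded bytes only."""
    return encode_internal(a) < encode_internal(b)


ascii_z = [ord("z")]             # ASCII
raw_80 = [RAW_BYTE_BASE + 0x80]  # raw byte 0x80
latin_e = [0xE9]                 # U+00E9, in the Latin-1 range
arabic_m = [0x645]               # U+0645
```

In this model, byte-wise comparison agrees with code-point order for ordinary text, but places raw bytes after ASCII and before U+0080 and above, whereas comparing by character code would place them last.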

> You were probably thinking about comparisons between unibyte strings
> that contain raw bytes and multibyte strings, and those are indeed
> not very performance-sensitive. However there is no way to detect
> whether a unibyte string contains non-ASCII chars without looking at
> every byte, and comparing unibyte ASCII with multibyte is definitely
> of interest. Strings are still unibyte by default.

You can compare under the assumption that a unibyte string is
pure-ASCII until you bump into the first non-ASCII one.  If that
happens, abandon the comparison, convert the unibyte string to its
multibyte representation, and compare again.
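
The optimistic strategy described above can be sketched in Python as follows. This is a minimal model, not the Emacs implementation: a unibyte string is modelled as `bytes`, a multibyte string as a list of integer character codes, and the offset 0x3FFF00 for raw bytes is an assumption. `to_multibyte` and `less_unibyte_multibyte` are hypothetical names.

```python
RAW_BYTE_OFFSET = 0x3FFF00  # assumption: raw byte b converts to the character b + 0x3FFF00


def to_multibyte(unibyte):
    """Keep ASCII bytes as they are; map bytes >= 0x80 to raw-byte characters."""
    return [b if b < 0x80 else b + RAW_BYTE_OFFSET for b in unibyte]


def less_unibyte_multibyte(uni, multi):
    """Return True if the unibyte string `uni` sorts before `multi`."""
    # Fast path: treat `uni` as pure ASCII and compare element by element.
    for b, c in zip(uni, multi):
        if b >= 0x80:
            # The ASCII assumption failed: abandon this pass, convert,
            # and compare again from the start.
            return to_multibyte(uni) < multi
        if b != c:
            return b < c
    # One string is a prefix of the other; the shorter one sorts first.
    return len(uni) < len(multi)
```

For example, `less_unibyte_multibyte(b"abc", [97, 98, 100])` is decided on the fast path, while `less_unibyte_multibyte(b"a\xff", [97, 0xE9])` falls back to the converted comparison. The fallback only runs when a non-ASCII byte is reached before the comparison has been decided, so pure-ASCII unibyte strings never pay for a conversion.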

This bug report was last modified 2 years and 324 days ago.
