#58168 - string-lessp glitches and inconsistencies

GNU bug report logs - #58168
string-lessp glitches and inconsistencies

Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal

Message #23 received at 58168 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 58168 <at> debbugs.gnu.org
Subject: Re: bug#58168: string-lessp glitches and inconsistencies
Date: Sat, 01 Oct 2022 08:22:03 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Fri, 30 Sep 2022 22:04:47 +0200
> Cc: 58168 <at> debbugs.gnu.org
>
> On 29 Sep 2022 at 19:11, Eli Zaretskii <eliz <at> gnu.org> wrote:
>
> > Unibyte strings should never be compared with
> > multibyte, unless they are both pure-ASCII.
>
> It's perfectly fine to compare "Madrid" (unibyte) with "Málaga" (non-ASCII multibyte).

Not relevant: I meant unibyte non-ASCII strings. The ASCII case is easy and un-problematic, and is really just a straw-man here.

> If you mean that all strings (literals in particular) should be multibyte by default then I agree and at some point we should take that step, but it would be quite a breaking change. Perhaps less in practice than we fear, though...

That's not what I meant. I think unibyte strings are with us for the observable future.

> > Unibyte characters don't belong to this order. They
> > should be converted to multibyte representation to be sensibly
> > comparable.
>
> Oh I agree to some extent but we can't really raise an error if someone tries so we might as well return something reasonable and coherent.

It depends on the use case, but in general I see no problem with signaling errors when we cannot produce reasonably correct results. For example, string-to-unibyte does signal an error in some cases.

> Besides, there are more good reasons for ordering strings (both multibyte and unibyte) than might be apparent at first.

Examples, please.

> Working from the assumption that we can't change string= to equate raw bytes in unibyte and multibyte strings, we need to invent an order between normally incommensurate values

I don't agree with the conclusion. It is not the only possible conclusion. Signaling an error is another one, and I'm sure we could think of more.

> It's also a matter of performance -- string< has been improved recently but currently we compare text in Latin and Swahili much faster than French and Arabic; it would be nice to close that gap.
> UTF-8 is designed so that comparing strings by scalar values can be done byte-wise, but the way we encode raw bytes makes them sort right between ASCII and Latin-1. Given that the specific order doesn't matter much, we could just run with that.

I see no reason to make comparison of unibyte and multibyte strings perform better.
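
A minimal sketch of the byte-order point raised in the message above. Python is used only because it compares strings by code point. The helper names `utf8_key` and `raw_byte_internal` are hypothetical. The two-byte form for raw bytes (lead byte 0xC0 or 0xC1) is an assumed model of Emacs's internal representation, not code taken from Emacs.

```python
# Illustration only: Python stands in for the C-level comparison.

def utf8_key(s: str) -> bytes:
    """Sort key for byte-wise comparison: the string's UTF-8 encoding."""
    return s.encode("utf-8")


def raw_byte_internal(b: int) -> bytes:
    """Assumed model of the internal two-byte form of a raw byte 0x80..0xFF."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("raw bytes are in the range 0x80..0xFF")
    # Lead byte is 0xC0 or 0xC1; the trailing byte carries the low six bits.
    return bytes([0xC0 | ((b >> 6) & 0x01), 0x80 | (b & 0x3F)])


words = ["Málaga", "Madrid", "zebra", "École", "مدريد", "Zürich"]

# Python orders str values by code point, so this is scalar-value order.
by_code_point = sorted(words)
# Ordering by encoded bytes gives the same result for valid UTF-8.
by_utf8_bytes = sorted(words, key=utf8_key)
assert by_code_point == by_utf8_bytes

# Under the assumed model, every raw byte sorts after all of ASCII
# and before U+0080, the first code point of the Latin-1 range.
highest_ascii = b"\x7f"
first_latin1 = "\u0080".encode("utf-8")  # b"\xc2\x80"
assert highest_ascii < raw_byte_internal(0x80)
assert raw_byte_internal(0xFF) < first_latin1
```

Under these assumptions a plain byte-wise comparison yields a total order in which raw bytes fall between ASCII and the Latin-1 range. Whether that order is a sensible one to expose is the question the thread is debating.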

This bug report was last modified 2 years and 324 days ago.
