#58168 - string-lessp glitches and inconsistencies

GNU bug report logs - #58168
string-lessp glitches and inconsistencies

Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal

Message #11 received at 58168 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 58168 <at> debbugs.gnu.org
Subject: Re: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 29 Sep 2022 20:11:57 +0300

> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 29 Sep 2022 18:24:04 +0200
>
> We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications.
> This means that for any pair of strings A and B, we should either have A<B, B<A or A=B.
>
> Unfortunately:
>
>   (let* ((a "ü")
>          (b "\xfc"))
>     (list (string= a b)
>           (string< a b)
>           (string< b a)))
>   => (nil nil nil)
>
> because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different.

Why do we care?  Unibyte strings should never be compared with multibyte, unless they are both pure-ASCII.

> So, what can be done? The current string< implementation uses the character order
>
>   ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF
>
> in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1.

It would be unimaginable to unify raw bytes with Latin-1.  Raw bytes are not Latin-1 characters, they can stand for any characters, or for no characters at all.

> It suggests the following alternative collation orders:
>
> A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
>
> which puts all non-ASCII multibyte chars after unibyte.
>
> B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
>
> which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, and a slow decoding loop only required for unibyte against non-ASCII multibyte strings.
>
> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF

Neither, IMNSHO.  Unibyte characters don't belong to this order.  They should be converted to multibyte representation to be sensibly comparable.

> Otherwise, I'll go with B or C, depending on what the resulting code looks like.

Please don't.  Let's first decide that we want to change this, and what are the reasons for that.  Theoretical "impurity" doesn't count, IMO.
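
Editorial illustration (not part of the original message): the quoted orders can be compared with a small model. The sketch below is written in Python rather than Emacs Lisp, so that it does not depend on the behaviour of any particular Emacs version. Every name in it (EStr, model_equal, current_key, order_b_key, report) is hypothetical. It assumes that a multibyte raw byte 0xNN can be modelled as the character code 0x3FFF00 + 0xNN, and that string= treats unibyte and multibyte ASCII as the same while keeping all other elements distinct by representation, as the quoted text describes. It models only the orderings as written in the thread, not the actual C implementation.

```python
from typing import NamedTuple, Tuple

RAW_MIN = 0x3FFF80  # modelling assumption: multibyte raw bytes 80..FF start here


class EStr(NamedTuple):
    """Hypothetical stand-in for an Emacs string."""
    multibyte: bool
    units: Tuple[int, ...]  # bytes if unibyte, character codes if multibyte


def equal_key(s: EStr):
    # string= as described in the thread: ASCII is unified across
    # representations; anything else is tagged with its representation.
    return tuple(("ascii", c) if c < 0x80 else (s.multibyte, c) for c in s.units)


def model_equal(a: EStr, b: EStr) -> bool:
    return equal_key(a) == equal_key(b)


def current_key(s: EStr):
    # "Current" order from the thread: a unibyte byte 80..FF compares equal
    # to the multibyte char with the same number; multibyte raw bytes sort last.
    return s.units


def order_b_key(s: EStr):
    # Proposal B: ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
    def rank(c: int):
        if c < 0x80:
            return (0, c)
        if not s.multibyte:
            return (1, c)
        if c >= RAW_MIN:
            return (2, c)
        return (3, c)
    return tuple(rank(c) for c in s.units)


def report(key, a: EStr, b: EStr):
    # Mirrors (list (string= a b) (string< a b) (string< b a)).
    return (model_equal(a, b), key(a) < key(b), key(b) < key(a))


a = EStr(True, (0xFC,))   # multibyte "ü", U+00FC
b = EStr(False, (0xFC,))  # unibyte raw byte 0xFC

print(report(current_key, a, b))  # no relation holds
print(report(order_b_key, a, b))  # exactly one relation holds
```

In this model the current order leaves the pair unrelated, while order B places the unibyte raw byte before the multibyte character, so exactly one of the three relations holds. Whether mixed unibyte/multibyte comparisons should be given a defined order at all is the point contested in the reply above.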

This bug report was last modified 2 years and 324 days ago.

