GNU bug report logs - #58168
string-lessp glitches and inconsistencies

Package: emacs

Reported by: Mattias Engdegård <mattias.engdegard <at> gmail.com>

Date: Thu, 29 Sep 2022 16:25:01 UTC

Severity: normal


Message #11 received at 58168 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattias.engdegard <at> gmail.com>
Cc: 58168 <at> debbugs.gnu.org
Subject: Re: bug#58168: string-lessp glitches and inconsistencies
Date: Thu, 29 Sep 2022 20:11:57 +0300
> From: Mattias Engdegård <mattias.engdegard <at> gmail.com>
> Date: Thu, 29 Sep 2022 18:24:04 +0200
> 
> We really want string< to be consistent with string= and itself since this is fundamental for string ordering in searching and sorting applications.
> This means that for any pair of strings A and B, exactly one of A<B, B<A or A=B should hold.
> 
> Unfortunately:
> 
>   (let* ((a "ü")
>          (b "\xfc"))
>     (list (string= a b)
>           (string< a b)
>           (string< b a)))
> => (nil nil nil)
> 
> because string< considers the unibyte raw byte 0xFC and the multibyte char U+00FC to be the same, but string= thinks they are different.
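
The consistency property described above can be checked mechanically. The following is a minimal sketch; `my-string-trichotomy-p` is a hypothetical helper name, not an Emacs function. It only counts how many of the three predicates hold for a given pair.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-string-trichotomy-p (a b)
  "Return non-nil if exactly one of A<B, B<A and A=B holds."
  (= 1 (+ (if (string< a b) 1 0)
          (if (string< b a) 1 0)
          (if (string= a b) 1 0))))

;; Pure-ASCII strings are expected to satisfy the property.
(my-string-trichotomy-p "abc" "abd")

;; The pair from the report: multibyte U+00FC against unibyte raw byte 0xFC.
;; With the behaviour reported above (all three predicates return nil),
;; this call would return nil.
(my-string-trichotomy-p "\u00fc" "\xfc")
```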

Why do we care?  Unibyte strings should never be compared with
multibyte, unless they are both pure-ASCII.

> So, what can be done? The current string< implementation uses the character order
> 
>  ASCII < ub raw 80..FF = mb U+0080..U+00FF < U+0100..10FFFF < mb raw 80..FF
> 
> in conflict with string= which unifies unibyte and multibyte ASCII but not raw bytes and Latin-1.

It would be unimaginable to unify raw bytes with Latin-1.  Raw bytes
are not Latin-1 characters, they can stand for any characters, or for
no characters at all.
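
The distinction between a raw byte and a Latin-1 character can be made visible by converting the same unibyte string in two ways. This is only an illustrative sketch; `my-raw-vs-latin-1` is a hypothetical name. `string-to-multibyte` keeps the byte as a raw byte, while `decode-coding-string` with `latin-1` interprets it as a character.

```elisp
;; Hypothetical demonstration function, not part of Emacs.
(defun my-raw-vs-latin-1 ()
  "Show two multibyte readings of the unibyte string \"\\xfc\"."
  (let* ((raw "\xfc")                         ; unibyte, one raw byte
         ;; Keep the byte as a raw byte in a multibyte string.
         (as-raw (string-to-multibyte raw))
         ;; Interpret the byte as a Latin-1 character instead.
         (as-latin-1 (decode-coding-string raw 'latin-1)))
    (list (multibyte-string-p raw)
          (aref as-raw 0)
          (aref as-latin-1 0))))

(my-raw-vs-latin-1)
;; => (nil #x3ffffc #xfc), printed in decimal as (nil 4194300 252)
```

The two conversions yield different character codes, which is why treating raw bytes as Latin-1 is a choice of interpretation rather than a neutral conversion.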

> This suggests the following alternative collation orders:
> 
> A. ASCII < ub raw 80..FF < mb U+0080..10FFFF < mb raw 80..FF
> 
> which puts all non-ASCII multibyte chars after unibyte.
> 
> B. ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
> 
> which inserts multibyte raw bytes after the unibyte ones, permitting any ub-ub and mb-mb comparisons to be made using memcmp, with a slow decoding loop required only when comparing unibyte against non-ASCII multibyte strings.
> 
> C. ASCII < mb U+0080..10FFFF < mb raw 80..FF < ub raw 80..FF
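
To make the alternatives concrete, order B can be modelled in Lisp by mapping each character to a numeric sort key. This is a rough sketch for illustration only, not the proposed C implementation; `my-order-b-key` and `my-order-b-lessp` are hypothetical names, and the key offsets are arbitrary as long as the four ranges do not overlap.

```elisp
;; Hypothetical model of order B:
;;   ASCII < ub raw 80..FF < mb raw 80..FF < mb U+0080..10FFFF
(defun my-order-b-key (ch multibyte)
  "Return a sort key for CH taken from a string that is MULTIBYTE or not."
  (cond ((< ch #x80) ch)                  ; ASCII, same in both kinds
        ((not multibyte) ch)              ; unibyte raw byte 80..FF
        ((>= ch #x3FFF80)                 ; multibyte raw byte 80..FF
         (+ #x100 (- ch #x3FFF80)))
        (t (+ #x180 ch))))                ; other non-ASCII multibyte char

(defun my-order-b-lessp (a b)
  "Return non-nil if string A sorts before string B under order B."
  (let ((ma (multibyte-string-p a))
        (mb (multibyte-string-p b))
        (n (min (length a) (length b)))
        (i 0)
        (result 'undecided))
    ;; Compare key by key until the first difference.
    (while (and (eq result 'undecided) (< i n))
      (let ((ka (my-order-b-key (aref a i) ma))
            (kb (my-order-b-key (aref b i) mb)))
        (cond ((< ka kb) (setq result t))
              ((> ka kb) (setq result nil))))
      (setq i (1+ i)))
    ;; No difference in the common prefix: the shorter string sorts first.
    (if (eq result 'undecided)
        (< (length a) (length b))
      result)))

;; Under this model the unibyte raw byte sorts before multibyte U+00FC.
(my-order-b-lessp "\xfc" "\u00fc")
```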

Neither, IMNSHO.  Unibyte characters don't belong to this order.  They
should be converted to multibyte representation to be sensibly
comparable.
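
One way to read this suggestion is to normalize both arguments with `string-to-multibyte` before comparing, so that only multibyte strings ever meet. A minimal sketch, assuming that reading; `my-multibyte-lessp` is a hypothetical wrapper, not an Emacs function.

```elisp
;; Hypothetical wrapper, not part of Emacs.
(defun my-multibyte-lessp (a b)
  "Compare A and B with `string<' after converting both to multibyte.
Unibyte raw bytes become multibyte raw-byte characters; strings that
are already multibyte are returned unchanged by `string-to-multibyte'."
  (string< (string-to-multibyte a) (string-to-multibyte b)))

;; The pair from the report now compares as two multibyte strings:
;; U+00FC against the raw-byte character for 0xFC.
(my-multibyte-lessp "\u00fc" "\xfc")
```

Since raw-byte characters have codes above every Unicode character, U+00FC sorts first here, and the converted strings are not `string=`, so exactly one of the three relations holds for this pair.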

> Otherwise, I'll go with B or C, depending on what the resulting code looks like.

Please don't.  Let's first decide that we want to change this, and
what are the reasons for that.  Theoretical "impurity" doesn't count,
IMO.




This bug report was last modified 2 years and 276 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.