GNU bug report logs - #11309
24.1.50; Case problems with [:upper:] and Cyrillic, Greek

Package: emacs;

Reported by: Aidan Kehoe <kehoea <at> parhasard.net>

Date: Sun, 22 Apr 2012 10:13:02 UTC

Severity: normal

Tags: patch

Found in version 24.1.50

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 11309 in the body.
You can then email your comments to 11309 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Sun, 22 Apr 2012 10:13:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Aidan Kehoe <kehoea <at> parhasard.net>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 22 Apr 2012 10:13:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Aidan Kehoe <kehoea <at> parhasard.net>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.1.50; Case problems with [:upper:] and Cyrillic, Greek
Date: Sun, 22 Apr 2012 11:11:30 +0100

This bug report will be sent to the Bug-GNU-Emacs mailing list
and the GNU bug tracker at debbugs.gnu.org.  Please check that
the From: line contains a valid email address.  After a delay of up
to one day, you should receive an acknowledgement at that address.

Please write in English if possible, as the Emacs maintainers
usually do not have translators for other languages.

Please describe exactly what actions triggered the bug, and
the precise symptoms of the bug.  If you can, give a recipe
starting from `emacs -Q':

The Lisp manual says this when describing character classes:

  `[:lower:]'
       This matches any lower-case letter, as determined by the current
       case table (*note Case Tables::).  If `case-fold-search' is
       non-`nil', this also matches any upper-case letter.

And:

  `[:upper:]'
       This matches any upper-case letter, as determined by the current
       case table (*note Case Tables::).  If `case-fold-search' is
       non-`nil', this also matches any lower-case letter.
  
OK, so let's test this:

(let ((case-fold-search t))
  (string-match "[[:upper:]]" "a\u0686"))
=> 0 ;; As documented

(upcase "\u0430") ;; CYRILLIC SMALL LETTER A
=> "А" ;; "\u0410", so it's in the case table

(let ((case-fold-search t))
  (string-match "[[:upper:]]" "\u0430\u0686"))
=> nil ;; Ah, this is unexpected.

(let ((case-fold-search t))
  (string-match "[[:lower:]]" "\u0410\u0686"))
=> 0 ;; But this works as documented. 

(upcase "\u03b2") ;; GREEK SMALL LETTER BETA
=> "Β" ;; "\u0392", it's in the case table

(let ((case-fold-search t))
  (string-match "[[:upper:]]" "\u03b2\u5357"))
=> nil ;; Oops

(let ((case-fold-search t))
  (string-match "[[:lower:]]" "\u0392\u5357"))
=> 0 ;; But this works, again. 

If Emacs crashed, and you have the Emacs process in the gdb debugger,
please include the output from the following gdb commands:
    `bt full' and `xbacktrace'.
For information about debugging Emacs, please read the file
/Sources/emacs/nextstep/Emacs.app/Contents/Resources/etc/DEBUG.


In GNU Emacs 24.1.50.1 (i386-apple-darwin10.8.0, NS apple-appkit-1038.36)
 of 2012-04-22 on bonbon
Windowing system distributor `Apple', version 10.3.1038
Configured using:
 `configure '--with-ns''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: de_DE.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t

Major mode: Info

Minor modes in effect:
  tooltip-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
C-b C-b C-b C-b C-b C-b C-b C-f SPC \ x 7 f C-e C-j 
C-p C-f C-f C-f C-x = C-a ( SPC C-f C-x = C-a C-f s 
t <backspace> <backspace> m u l t <backspace> i b y 
t e - s t r i n g - p C-a C-f C-f C-f C-f t C-e ) C-j 
C-p C-p C-p C-n C-f C-f C-f C-f C-f C-f C-f C-f C-f 
C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f C-f 
C-f C-f C-f C-b C-b C-b C-f C-x = C-x 1 C-f C-f C-f 
C-b C-k <escape> b <left> C-k C-p C-p C-p C-p C-p C-p 
C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p C-p 
C-p C-e C-b C-b C-b C-y C-k ) C-j C-p C-p C-e C-b C-b 
C-b C-b C-d C-e C-j C-p C-p C-e C-b C-b C-b C-t C-e 
C-j C-p C-p C-e C-x C-b C-x o C-n C-n C-n RET C-x 1 
C-x b <return> C-x b * s c <tab> <return> C-n C-p C-n 
C-n e n a b l e - m u l t i b y t e - c h a r a c t 
e r s C-j C-x b <return> C-p C-n RET C-v l C-a C-n 
C-n C-n C-e C-x 2 C-x o C-x b * s c <backspace> <backspace> 
<backspace> C-g C-x C-b C-x o C-n C-n C-n C-n RET C-p 
C-p C-p C-x o C-p C-p C-a C-n C-SPC C-n C-n C-n C-n 
<escape> w <escape> x r e p o r t - e m a c s - b u 
g s <tab> C-g <escape> x r e p o r t - e m a c s - 
b u g <return>

Recent messages:
insert-file-contents-literally: Opening input file: no such file or directory, /Sources/emacs/nextstep/Emacs.app/Contents/Resources/etc/DOC-24.1.50.1
Mark set
Char: ä (228, #o344, #xe4, file ...) point=499 of 612 (81%) column=1 [2 times]
Char: DEL (127, #o177, #x7f) point=466 of 623 (75%) column=3
Char: ä (228, #o344, #xe4, file ...) point=466 of 625 (74%) column=3
Char: DEL (127, #o177, #x7f) point=486 of 647 (75%) column=23
Mark set
Quit
byte-code: Beginning of buffer [2 times]
Mark set
Quit

Load-path shadows:
None found.

Features:
(shadow sort gnus-util mail-extr emacsbug message format-spec rfc822 mml
mml-sec mm-decode mm-bodies mm-encode mail-parse rfc2231 mailabbrev
gmm-utils mailheader sendmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils find-func vc-git cc-mode cc-fonts cc-guess
cc-menus cc-cmds cc-styles cc-align cc-engine cc-vars cc-defs mule-util
multi-isearch info help-mode easymenu view help-fns byte-opt warnings cl
compile comint ansi-color ring bytecomp byte-compile cconv macroexp
vc-hg time-date tooltip ediff-hook vc-hooks lisp-float-type mwheel
ns-win tool-bar dnd fontset image regexp-opt fringe lisp-mode register
page menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock
font-lock syntax facemenu font-core frame cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese hebrew
greek romanian slovak czech european ethiopic indian cyrillic chinese
case-table epa-hook jka-cmpr-hook help simple abbrev minibuffer loaddefs
button faces cus-face files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote make-network-process dbusbind ns multi-tty emacs)

-- 
‘Iodine deficiency was endemic in parts of the UK until, through what has been
described as “an unplanned and accidental public health triumph”, iodine was
added to cattle feed to improve milk production in the 1930s.’
(EN Pearce, Lancet, June 2011)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Mon, 07 Dec 2020 17:25:02 GMT) Full text and rfc822 format available.

Message #8 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Aidan Kehoe <kehoea <at> parhasard.net>
Cc: 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Mon, 07 Dec 2020 18:24:34 +0100

Aidan Kehoe <kehoea <at> parhasard.net> writes:

> (let ((case-fold-search t))
>   (string-match "[[:upper:]]" "a\u0686"))
> => 0 ;; As documented
>
> (upcase "\u0430") ;; CYRILLIC SMALL LETTER A
> => "А" ;; "\u0410", so it's in the case table
>
> (let ((case-fold-search t))
>   (string-match "[[:upper:]]" "\u0430\u0686"))
> => nil ;; Ah, this is unexpected.

I tried this in Emacs 28, and I can confirm that this behaviour is still
present.

> (let ((case-fold-search t))
>   (string-match "[[:lower:]]" "\u0410\u0686"))
> => 0 ;; But this works as documented. 
>
> (upcase "\u03b2") ;; GREEK SMALL LETTER BETA
> => "Β" ;; "\u0392", it's in the case table
>
> (let ((case-fold-search t))
>   (string-match "[[:upper:]]" "\u03b2\u5357"))
> => nil ;; Oops
>
> (let ((case-fold-search t))
>   (string-match "[[:lower:]]" "\u0392\u5357"))
> => 0 ;; But this works, again. 

And this, too.

Anybody have any insight here?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Mon, 07 Dec 2020 22:15:02 GMT) Full text and rfc822 format available.

Message #11 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>, Aidan Kehoe <kehoea <at> parhasard.net>
Cc: 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Mon, 7 Dec 2020 23:14:45 +0100

Not surprising in the least given the broken logic:

	  ((class_bits & BIT_UPPER) &&
	   (ISUPPER (c) || (corig != c &&
			    c == downcase (corig) && ISLOWER (c)))) ||
	  ((class_bits & BIT_LOWER) &&
	   (ISLOWER (c) || (corig != c &&
			    c == upcase (corig) && ISUPPER(c)))) ||

where corig is the character being matched and c is corig after canonicalising, which appears to mean downcasing in practice.
This means that the second case (BIT_LOWER means [:lower:]) works more or less as intended (by accident) but the [:upper:] case is less lucky and doesn't, as observed.

ASCII characters aren't affected by this bug since they are handled by a separate bitmap.

This has probably never worked properly.
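
The asymmetry can be reproduced outside Emacs. What follows is only a minimal sketch under stated assumptions, not the engine's code: the toy_* helpers and the old_*_matches names are hypothetical, the case tables hold a single Cyrillic pair, and canonicalising is modelled as plain downcasing.

```c
#include <stdbool.h>

/* Toy case tables holding only the Cyrillic pair U+0410/U+0430.
   Hypothetical stand-ins for the real tables; ASCII is left out
   because the engine handles it through a separate bitmap.  */
static int toy_downcase (int c) { return c == 0x0410 ? 0x0430 : c; }
static int toy_upcase (int c) { return c == 0x0430 ? 0x0410 : c; }

static bool toy_isupper (int c) { return toy_downcase (c) != c; }
static bool toy_islower (int c)
{
  return !toy_isupper (c) && toy_upcase (c) != c;
}

/* Assumption: canonicalising is modelled as downcasing when FOLD is set.  */
static int toy_canon (int corig, bool fold)
{
  return fold ? toy_downcase (corig) : corig;
}

/* Same shape as the [:upper:] test quoted above.  */
bool old_upper_matches (int corig, bool fold)
{
  int c = toy_canon (corig, fold);
  return toy_isupper (c)
         || (corig != c && c == toy_downcase (corig) && toy_islower (c));
}

/* Same shape as the [:lower:] test quoted above.  */
bool old_lower_matches (int corig, bool fold)
{
  int c = toy_canon (corig, fold);
  return toy_islower (c)
         || (corig != c && c == toy_upcase (corig) && toy_isupper (c));
}
```

In this model a lower-case input with folding enabled gives c == corig, so the second disjunct of the [:upper:] test can never fire, while the [:lower:] test succeeds on its first disjunct because c has already been downcased.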

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 14:49:02 GMT) Full text and rfc822 format available.

Message #14 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>, Aidan Kehoe <kehoea <at> parhasard.net>
Cc: 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Tue, 8 Dec 2020 15:48:42 +0100

[Message part 1 (text/plain, inline)]

tags 11309 patch
stop

The attached patch should fix the bug for all characters except ß, which is still matched by neither [:lower:] nor [:upper:], whatever the value of case-fold-search.

The remaining problem seems to be that the upcase table maps ß to itself, which is wrong -- as long as we don't upcase ß to U+1E9E, it should not have an upcase table entry at all. I'll see what can be done about that.

[0001-Fix-upper-and-lower-for-Unicode-characters-bug-11309.patch (application/octet-stream, attachment)]

Added tag(s) patch. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Tue, 08 Dec 2020 14:49:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 16:03:01 GMT) Full text and rfc822 format available.

Message #19 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: kehoea <at> parhasard.net, larsi <at> gnus.org, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50;
 Case problems with [:upper:] and Cyrillic,  Greek
Date: Tue, 08 Dec 2020 18:02:05 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Tue, 8 Dec 2020 15:48:42 +0100
> Cc: 11309 <at> debbugs.gnu.org
> 
> The remaining problem seems to be that the upcase table maps ß to itself, which is wrong -- as long as we don't upcase ß to U+1E9E, it should not have an upcase table entry at all. I'll see what can be done about that.

Why is this a problem?  AFAIR characters that don't have an upper-case
form map to themselves when upcased.  E.g.

  (upcase ?1) => ?1

Why should ß violate this convention?

> * src/regex-emacs.c (execute_charset): Add canon_table argument to
> allow expression of a correct predicate for [:upper:] and [:lower:].
> (mutually_exclusive_p, re_match_2_internal): Pass extra argument.
> * test/src/regex-emacs-tests.el (regexp-case-fold, regexp-eszett):
> New tests.  Parts of regexp-eszett still fail and are commented out.

Thanks, LGTM.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 16:11:02 GMT) Full text and rfc822 format available.

Message #22 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Aidan Kehoe <kehoea <at> parhasard.net>, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Tue, 08 Dec 2020 17:10:22 +0100

On Dec 08 2020, Mattias Engdegård wrote:

> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index 971a5f6374..6b5dded8e5 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -3575,9 +3575,11 @@ skip_noops (re_char *p, re_char *pend)
>     opcode.  When the function finishes, *PP will be advanced past that opcode.
>     C is character to test (possibly after translations) and CORIG is original
>     character (i.e. without any translations).  UNIBYTE denotes whether c is
> -   unibyte or multibyte character. */
> +   unibyte or multibyte character.
> +   CANON_TABLE is the canonicalisation table for case folding or Qnil.  */

The function uses that only as a boolean, so why not pass it as that?

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 16:20:01 GMT) Full text and rfc822 format available.

Message #25 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Andreas Schwab <schwab <at> linux-m68k.org>
Cc: Aidan Kehoe <kehoea <at> parhasard.net>, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Tue, 8 Dec 2020 17:19:38 +0100

On 8 Dec 2020, at 17:10, Andreas Schwab <schwab <at> linux-m68k.org> wrote:

> The function uses that only as a boolean, so why not pass it as that?

Thanks for reading the patch! It's a micro-optimisation: passing it as a boolean would entail an unconditional comparison against Qnil, but it is only used for [:lower:] and [:upper:] which are used in a small fraction of character alternatives. Maybe there is a cleaner way to do this without making the code slower.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 16:58:01 GMT) Full text and rfc822 format available.

Message #28 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: kehoea <at> parhasard.net, larsi <at> gnus.org, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Tue, 8 Dec 2020 17:57:32 +0100

On 8 Dec 2020, at 17:02, Eli Zaretskii <eliz <at> gnu.org> wrote:

> AFAIR characters that don't have an upper-case
> form map to themselves when upcased.  E.g.
> 
>  (upcase ?1) => ?1

This is not about the Lisp (upcase x) function but the C upcase(x) function, which uses the upcase table directly.
They affect the uppercasep and lowercasep functions which are used in the regexp engine. Thus we get uppercasep(ß)=lowercasep(ß)=false which is wrong.

The logic of 'lowercasep' may need to be changed because of its use of upcase and downcase, which return their argument if the respective table has no entry for it. Let's see what can be done.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 17:03:02 GMT) Full text and rfc822 format available.

Message #31 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: "Basil L. Contovounesios" <contovob <at> tcd.ie>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: Aidan Kehoe <kehoea <at> parhasard.net>, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Tue, 08 Dec 2020 17:01:53 +0000

Mattias Engdegård <mattiase <at> acm.org> writes:

> @@ -3617,11 +3619,9 @@ execute_charset (re_char **pp, int c, int corig, bool unibyte)
>            (class_bits & BIT_BLANK && ISBLANK (c)) ||
>  	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
>  	  ((class_bits & BIT_UPPER) &&
> -	   (ISUPPER (c) || (corig != c &&
> -			    c == downcase (corig) && ISLOWER (c)))) ||
> +	   (ISUPPER (corig) || (canon_table != Qnil && ISLOWER (corig)))) ||
>  	  ((class_bits & BIT_LOWER) &&
> -	   (ISLOWER (c) || (corig != c &&
> -			    c == upcase (corig) && ISUPPER(c)))) ||
> +	   (ISLOWER (corig) || (canon_table != Qnil && ISUPPER (corig)))) ||

Just curious: why not NILP?

Thanks,

-- 
Basil

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 17:05:02 GMT) Full text and rfc822 format available.

Message #34 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: "Basil L. Contovounesios" <contovob <at> tcd.ie>
Cc: Aidan Kehoe <kehoea <at> parhasard.net>, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Tue, 8 Dec 2020 18:04:00 +0100

On 8 Dec 2020, at 18:01, Basil L. Contovounesios <contovob <at> tcd.ie> wrote:

> Just curious: why not NILP?

Momentary amnesia. Will change, thank you!
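
For reference, the patched predicate has the shape sketched here. This is a simplified model, not the committed code: the toy_* helpers are hypothetical one-pair case tables (Greek beta only), and the boolean FOLD stands in for the check that a canonicalisation table is present.

```c
#include <stdbool.h>

/* Toy case tables holding only the Greek pair U+0392/U+03B2.
   Hypothetical stand-ins for the real tables.  */
static int toy_downcase (int c) { return c == 0x0392 ? 0x03B2 : c; }
static int toy_upcase (int c) { return c == 0x03B2 ? 0x0392 : c; }

static bool toy_isupper (int c) { return toy_downcase (c) != c; }
static bool toy_islower (int c)
{
  return !toy_isupper (c) && toy_upcase (c) != c;
}

/* [:upper:] after the patch: test the original character, and when
   folding is on accept a lower-case letter as well.  */
bool new_upper_matches (int corig, bool fold)
{
  return toy_isupper (corig) || (fold && toy_islower (corig));
}

/* [:lower:] after the patch, by symmetry.  */
bool new_lower_matches (int corig, bool fold)
{
  return toy_islower (corig) || (fold && toy_isupper (corig));
}
```

Because both tests now look only at the original character, the two classes behave symmetrically, and a caseless character such as U+5357 matches neither class regardless of folding.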

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Tue, 08 Dec 2020 17:07:01 GMT) Full text and rfc822 format available.

Message #37 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: kehoea <at> parhasard.net, larsi <at> gnus.org, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Tue, 08 Dec 2020 19:05:53 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Tue, 8 Dec 2020 17:57:32 +0100
> Cc: larsi <at> gnus.org, kehoea <at> parhasard.net, 11309 <at> debbugs.gnu.org
> 
> This is not about the Lisp (upcase x) function but the C upcase(x) function, which uses the upcase table directly.
> They affect the uppercasep and lowercasep functions which are used in the regexp engine. Thus we get uppercasep(ß)=lowercasep(ß)=false which is wrong.

Why is it wrong, and what practical problems does this cause?

> The logic of 'lowercasep' may need to be changed because of its use of upcase and downcase, which return their argument if the respective table has no entry for it. Let's see what can be done.

I don't want us to change the logic of such basic functions for the
benefit of a single obscure character.  Let's first see what problems
with this character we have in practice, and then discuss what is the
best way of solving those problems.

TIA

Reply sent to Mattias Engdegård <mattiase <at> acm.org>:
You have taken responsibility. (Wed, 09 Dec 2020 14:38:01 GMT) Full text and rfc822 format available.

Notification sent to Aidan Kehoe <kehoea <at> parhasard.net>:
bug acknowledged by developer. (Wed, 09 Dec 2020 14:38:01 GMT) Full text and rfc822 format available.

Message #42 received at 11309-done <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Aidan Kehoe <kehoea <at> parhasard.net>, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309-done <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Wed, 9 Dec 2020 15:37:19 +0100

Eli, thanks for looking at the patch, now pushed to master (with Basil's suggested tweak).

> Why is it wrong, and what practical problems does this cause?

ß is a lower case letter so lowercasep(ß)=false is wrong. As a consequence, matching ß with [:lower:] and [:upper:] doesn't work correctly: ß should be matched by [:lower:] when case-fold-search is nil, and by both [:lower:] and [:upper:] when case-fold-search is non-nil.

The problem stems from the fact that uppercasep and lowercasep don't use the Unicode case information directly (which perhaps they should) but derive the case indirectly from the upcase and downcase tables, and there is no way to state that a char is lower case but cannot be upcased or downcased. (Below I'm going to use the notation T[C] for the table T indexed by character C.)

Currently, characters missing from or self-mapping in the upcase and downcase tables are considered to be caseless. For instance, upcase[*]=downcase[*]=* and upcase[中]=downcase[中]=nil. However, we also have upcase[ß]=downcase[ß]=ß, causing the incorrect lowercasep result.

The solution that I ended up applying was the simplest possible: set upcase[ß]=ẞ (U+1E9E). The special-uppercase properties ensure that (upcase "ß") => "SS", and now all tests pass.

(An acceptable alternative would have been to set upcase[ß]=nil and adapt lowercasep accordingly. I tried that and it works flawlessly, but involves slightly more changes.)

And that concludes the resolution of this bug.
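
The effect of the table entry can be illustrated with a small model. It is a sketch built from the description above, not a copy of the Emacs source: the model_* functions are hypothetical, the tables cover only ä/Ä and ß/ẞ, the FIXED flag selects between the old and new upcase entry for ß, and the downcase entry for ẞ is an assumption.

```c
#include <stdbool.h>

enum { SHARP_S = 0x00DF, CAPITAL_SHARP_S = 0x1E9E };

/* Toy upcase table.  FIXED selects the entry for ß: the old
   self-mapping, or the new mapping to U+1E9E.  */
static int model_upcase (int c, bool fixed)
{
  if (c == SHARP_S)
    return fixed ? CAPITAL_SHARP_S : SHARP_S;
  if (c == 0x00E4)              /* ä -> Ä */
    return 0x00C4;
  return c;
}

/* Toy downcase table.  Assumption: ẞ downcases to ß.  */
static int model_downcase (int c)
{
  if (c == 0x00C4)              /* Ä -> ä */
    return 0x00E4;
  if (c == CAPITAL_SHARP_S)
    return SHARP_S;
  return c;
}

/* Case is derived from the tables alone: a character is upper case
   if downcasing changes it, and lower case if it is not upper case
   and upcasing changes it.  */
bool model_uppercasep (int c)
{
  return model_downcase (c) != c;
}

bool model_lowercasep (int c, bool fixed)
{
  return !model_uppercasep (c) && model_upcase (c, fixed) != c;
}
```

With the self-mapping entry, ß comes out as neither upper nor lower case in this model; with the U+1E9E entry it is classified as lower case, and ordinary pairs such as ä/Ä are unaffected.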

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Wed, 09 Dec 2020 15:47:03 GMT) Full text and rfc822 format available.

Message #45 received at 11309-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: kehoea <at> parhasard.net, larsi <at> gnus.org, 11309-done <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Wed, 09 Dec 2020 17:46:10 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Wed, 9 Dec 2020 15:37:19 +0100
> Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Aidan Kehoe <kehoea <at> parhasard.net>,
>         11309-done <at> debbugs.gnu.org
> 
> ß is a lower case letter so lowercasep(ß)=false is wrong. As a consequence, matching ß with [:lower:] and [:upper:] doesn't work correctly: ß should be matched by [:lower:] when case-fold-search is nil, and by both [:lower:] and [:upper:] when case-fold-search is non-nil.
> 
> The problem stems from the fact that uppercasep and lowercasep don't use the Unicode case information directly (which perhaps they should) but derive the case indirectly from the upcase and downcase tables, and there is no way to state that a char is lower case but cannot be upcased or downcased. (Below I'm going to use the notation T[C] for the table T indexed by character C.)
> 
> Currently, characters missing from or self-mapping in the upcase and downcase tables are considered to be caseless. For instance, upcase[*]=downcase[*]=* and upcase[中]=downcase[中]=nil. However, we also have upcase[ß]=downcase[ß]=ß, causing the incorrect lowercasep result.
> 
> The solution that I ended up applying was the simplest possible: set upcase[ß]=ẞ (U+1E9E). The special-uppercase properties ensure that (upcase "ß") => "SS", and now all tests pass.
> 
> (An acceptable alternative would have been to set upcase[ß]=nil and adapt lowercasep accordingly. I tried that and it works flawlessly, but involves slightly more changes.)
> 
> And that concludes the resolution of this bug.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Thu, 10 Dec 2020 09:37:01 GMT) Full text and rfc822 format available.

Message #48 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Aidan Kehoe <kehoea <at> parhasard.net>, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Thu, 10 Dec 2020 10:36:12 +0100

As it turns out I had completely forgotten about Fupcase with a character argument -- (upcase ?ß) previously returned ?ß but ?ẞ after the change -- which was caught by casefiddle-tests. Now, what to do about it?

One solution would be the previous plan B: set upcase[ß]=nil, modify the lowercasep logic, and we will have (upcase ?ß) => ?ß again. However, I would argue that the current state is actually preferable:

Upcasing ß to ß never really makes sense. Words containing ß are written with SS in upper case: groß -> GROSS - which is one reason why the character-to-character use of Fupcase normally cannot be used for text containing the letter. The capital ß, ?ẞ, is still not widely employed but one of its purposes is when it is important to preserve the exact spelling of proper names when written in all caps: Gauß -> GAUẞ, not GAUSS. (I wouldn't be surprised if this will eventually become the general convention for all text, but we are getting ahead of society here.)

For these reasons, I'm adapting casefiddle-tests and calling it a feature.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Thu, 10 Dec 2020 14:19:02 GMT) Full text and rfc822 format available.

Message #51 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: kehoea <at> parhasard.net, larsi <at> gnus.org, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Thu, 10 Dec 2020 16:17:41 +0200

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 10 Dec 2020 10:36:12 +0100
> Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, Aidan Kehoe <kehoea <at> parhasard.net>,
>         11309 <at> debbugs.gnu.org
> 
> Upcasing ß to ß never really makes sense. Words containing ß are written with SS in upper case: groß -> GROSS - which is one reason why the character-to-character use of Fupcase normally cannot be used for text containing the letter. The capital ß, ?ẞ, is still not widely employed but one of its purposes is when it is important to preserve the exact spelling of proper names when written in all caps: Gauß -> GAUẞ, not GAUSS. (I wouldn't be surprised if this will eventually become the general convention for all text, but we are getting ahead of society here.)

Wouldn't it be confusing that upcase treats ?ß and "ß" differently?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Thu, 10 Dec 2020 15:49:01 GMT) Full text and rfc822 format available.

Message #54 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: kehoea <at> parhasard.net, Lars Ingebrigtsen <larsi <at> gnus.org>,
 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic, 
 Greek
Date: Thu, 10 Dec 2020 16:48:03 +0100

On 10 Dec 2020, at 15:17, Eli Zaretskii <eliz <at> gnu.org> wrote:

> Wouldn't it be confusing that upcase treats ?ß and "ß" differently?

Well it already did so before (returning ?ß and "SS", respectively) and it's not as if we have much of a choice since
(1) upcase is documented to return a value of the same type as its argument, and
(2) "SS" is definitely the right return value for "ß".
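
The two behaviours can be put side by side in a small sketch. It is illustrative only and does not reflect how Fupcase is implemented: sketch_upcase_char and sketch_upcase_string are hypothetical names, strings are plain arrays of code points, and only ASCII letters and ß are handled.

```c
#include <stddef.h>
#include <stdint.h>

enum { SHARP_S = 0x00DF, CAPITAL_SHARP_S = 0x1E9E };

/* Character in, character out: the result must be a single code
   point, so ß can only map to ẞ (or to itself).  */
uint32_t sketch_upcase_char (uint32_t c)
{
  if (c == SHARP_S)
    return CAPITAL_SHARP_S;
  if (c >= 'a' && c <= 'z')
    return c - ('a' - 'A');
  return c;
}

/* String in, string out: the result may be longer than the input,
   so ß can expand to "SS".  Writes at most OUT_CAP code points to
   OUT and returns the number of code points the result needs.  */
size_t sketch_upcase_string (const uint32_t *in, size_t n,
                             uint32_t *out, size_t out_cap)
{
  size_t len = 0;
  for (size_t i = 0; i < n; i++)
    {
      if (in[i] == SHARP_S)
        {
          /* Special uppercasing: one character becomes two.  */
          if (len < out_cap)
            out[len] = 'S';
          len++;
          if (len < out_cap)
            out[len] = 'S';
          len++;
        }
      else
        {
          if (len < out_cap)
            out[len] = sketch_upcase_char (in[i]);
          len++;
        }
    }
  return len;
}
```

The return type is what forces the difference: a function that must hand back one character has no way to express "SS", whereas the string variant is free to grow its result.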

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Thu, 10 Dec 2020 15:54:02 GMT) Full text and rfc822 format available.

Message #57 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: kehoea <at> parhasard.net, Eli Zaretskii <eliz <at> gnu.org>, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Thu, 10 Dec 2020 16:53:07 +0100

Mattias Engdegård <mattiase <at> acm.org> writes:

> Well it already did so before (returning ?ß and "SS", respectively)
> and it's not as if we have much of a choice since
> (1) upcase is documented to return a value of the same type as its argument, and
> (2) "SS" is definitely the right return value for "ß".

I can only vaguely read German, but doesn't that depend on the locale?
That is, whether an upcase of ß should be SS or ẞ depends on...  what
time and place we're at?

So returning either, or both (as after your patch), sounds fine to me --
it's an improvement on what Emacs did before.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Fri, 11 Dec 2020 09:19:02 GMT) Full text and rfc822 format available.

Message #60 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: kehoea <at> parhasard.net, Eli Zaretskii <eliz <at> gnu.org>, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Fri, 11 Dec 2020 10:18:03 +0100

On 10 Dec 2020, at 16:53, Lars Ingebrigtsen <larsi <at> gnus.org> wrote:

> I can only vaguely read German, but doesn't that depend on the locale?
> That is, whether an upcase of ß should be SS or ẞ depends on...  what
> time and place we're at?

I suppose, but upcasing to ẞ is not standard practice (at least not yet) in any German-speaking country. The Swiss prefer not using ß at all and write ss instead, but that doesn't affect the case-conversion rules.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11309; Package emacs. (Fri, 11 Dec 2020 15:27:01 GMT) Full text and rfc822 format available.

Message #63 received at 11309 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: kehoea <at> parhasard.net, Eli Zaretskii <eliz <at> gnu.org>, 11309 <at> debbugs.gnu.org
Subject: Re: bug#11309: 24.1.50; Case problems with [:upper:] and Cyrillic,
 Greek
Date: Fri, 11 Dec 2020 16:26:33 +0100

Mattias Engdegård <mattiase <at> acm.org> writes:

> On 10 Dec 2020, at 16:53, Lars Ingebrigtsen <larsi <at> gnus.org> wrote:
>
>> I can only vaguely read German, but doesn't that depend on the locale?
>> That is, whether an upcase of ß should be SS or ẞ depends on...  what
>> time and place we're at?
>
> I suppose, but upcasing to ẞ is not standard practice (at least not
> yet) in any German-speaking country. The Swiss prefer not using ß at
> all and write ss instead, but that doesn't affect the case-conversion
> rules.

I thought I vaguely remembered somebody somewhere making ẞ a standard
upcase, but it seems I remembered wrong.  They only say that it's "also
possible":

"According to the council’s 2017 spelling manual: When writing the
uppercase [of ß], write SS. It’s also possible to use the uppercase
ẞ. Example: Straße — STRASSE — STRAẞE"

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 09 Jan 2021 12:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 223 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
