#38235 - string-foldcase bug for trailing sigma

GNU bug report logs - #38235
string-foldcase bug for trailing sigma

Package: guile

Reported by: Andy Wingo <wingo <at> pobox.com>

Date: Sat, 16 Nov 2019 20:42:02 UTC

Severity: normal


From: John Cowan <cowan <at> ccil.org>
To: Andy Wingo <wingo <at> pobox.com>, tomas <at> tuxteam.de
Cc: 38235 <at> debbugs.gnu.org
Subject: bug#38235: string-foldcase bug for trailing sigma
Date: Sun, 17 Nov 2019 13:13:42 -0500


On Sat, Nov 16, 2019 at 3:42 PM Andy Wingo <wingo <at> pobox.com> wrote:

> The expected result is "μέλοσ"; see R6RS libraries section 1.2. However
> instead Guile's result is "μέλος". Note that although Σ usually
> downcases to σ, at the end of a string it's ς.

More precisely, it downcases to σ if a letter follows and to ς if not (being at the end of a string is a particular case). However, this is not actually always Greekly correct: the string "ΦΙΛΟΣ." with a period at the end downcases to "φιλος." if it is the word φίλος 'friend' (without its proper accent) at the end of a sentence, but to "φιλοσ." if it is an abbreviation for φιλοσοφία 'philosophy'. For this reason, R7RS does not require mapping to ς in this situation as R6RS does.

> This test shows a
> limitation of defining string-foldcase as simply (string-downcase
> (string-upcase str)).

As explained in Unicode section 5.18, the foldcase mappings (in <https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>, the lines with status C and F) actually create a set of equivalence classes that are closed under {upper,lower,title}case mapping, and then choose a single character to represent each class. This is usually the unique lowercase character, but not always: in Cherokee it is the uppercase character, and in the set {Σ, σ, ς} it is σ.

On Sun, Nov 17, 2019 at 6:20 AM <tomas <at> tuxteam.de> wrote:

> Good catch. I think there's even a worse example: dotless
> and dotted I [1]. Here it seems even impossible to do
> up- and downcase correctly without knowing the language
> context.

Language-specific case mappings are explicitly out of Scheme's remit: they have to be performed by specialized libraries. There is an additional situation in Lithuanian dictionaries (but not running text): an "i" with a tone accent is represented as "i" + dot above + accent, like this: "i̇́". However, this dot above must be dropped when uppercasing, producing ordinary "Í".
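The distinctions drawn above can be checked outside Guile. The following is a minimal sketch in Python, chosen only because its built-in `str.lower`, `str.upper` and `str.casefold` follow the language-independent Unicode default mappings; it says nothing about how Guile implements or should implement `string-foldcase`. The helper names `naive_foldcase` and `unicode_foldcase` are hypothetical and exist only for this illustration.

```python
def naive_foldcase(s: str) -> str:
    """Fold case as downcase(upcase(s)), the definition the report calls limited."""
    return s.upper().lower()


def unicode_foldcase(s: str) -> str:
    """Fold case with the Unicode case-folding mappings (status C and F)."""
    return s.casefold()


FINAL_SIGMA = "\u03c2"  # ς
SIGMA = "\u03c3"        # σ

word = "μέλος"                # ends in final sigma
medial = word[:-1] + SIGMA    # same letters, but ending in σ

# 1. Folding: the round trip restores ς, while real folding picks σ,
#    the representative of the class {Σ, σ, ς}.
print(naive_foldcase(word) == word)                         # True
print(unicode_foldcase(word) == medial)                     # True
print(unicode_foldcase(word) == unicode_foldcase(medial))   # True

# 2. Downcasing: Σ becomes σ when a letter follows and ς otherwise.
#    A trailing period does not count as a letter, so the abbreviation
#    reading of "ΦΙΛΟΣ." cannot be produced without outside knowledge.
sentence_end = "ΦΙΛΟΣ.".lower()
longer_word = "ΦΙΛΟΣΟΦΙΑ".lower()
print(sentence_end[4] == FINAL_SIGMA)   # True
print(longer_word[4] == SIGMA)          # True

# 3. Cherokee: the folded representative is the uppercase letter.
cherokee_small_a = "\uab70"   # CHEROKEE SMALL LETTER A
cherokee_a = "\u13a0"         # CHEROKEE LETTER A
print(cherokee_small_a.casefold() == cherokee_a)   # True

# 4. No language context: dotless i does not survive a round trip, and the
#    Lithuanian dot above is kept when uppercasing.
dotless_round_trip = "\u0131".upper().lower()   # ı -> I -> i
lithuanian = "i\u0307\u0301"                    # i + dot above + acute
print(dotless_round_trip == "i")                # True
print(lithuanian.upper() == "I\u0307\u0301")    # True
```

Under these assumptions the first group reproduces the reported discrepancy: a downcase-of-upcase definition returns the string ending in ς, while the case-folding tables yield the one ending in σ. The last group shows why the Turkish and Lithuanian cases need a language-aware library: the default mappings have no way to know which language the text is in.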


This bug report was last modified 5 years and 298 days ago.

