From unknown Fri Jun 13 10:11:46 2025
From: bug#38235 <38235@debbugs.gnu.org>
To: bug#38235 <38235@debbugs.gnu.org>
Subject: Status: string-foldcase bug for trailing sigma
Reply-To: bug#38235 <38235@debbugs.gnu.org>
Date: Fri, 13 Jun 2025 17:11:46 +0000
Content-Type: text/plain; charset=utf-8

retitle 38235 string-foldcase bug for trailing sigma
reassign 38235 guile
submitter 38235 Andy Wingo
severity 38235 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 16 15:41:36 2019
From: Andy Wingo
To: bug-guile@gnu.org
Subject: string-foldcase bug for trailing sigma
Date: Sat, 16 Nov 2019 21:41:05 +0100
Message-ID: <87tv73mu5a.fsf@pobox.com>
Content-Type: text/plain; charset=utf-8

Given the following example, using (rnrs unicode):

  (string-foldcase "ΜΈΛΟΣ")

The expected result is "μέλοσ"; see R6RS libraries section 1.2.
However instead Guile's result is "μέλος".  Note that although Σ
usually downcases to σ, at the end of a string it's ς.  This test
shows a limitation of defining string-foldcase as simply
(string-downcase (string-upcase str)).

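The difference reported here can be reproduced outside Guile. The following is a minimal sketch in Python, which is an assumption of this note rather than anything used in the report; it relies on CPython 3.3 or later, where `str.lower()` applies the final-sigma rule and `str.casefold()` applies Unicode default case folding. The variable names are illustrative only.

```python
# Cross-check of the report with Python's built-in Unicode case operations.
# Python stands in for the Unicode default mappings here; this is not Guile code.

word = "\u039c\u0388\u039b\u039f\u03a3"  # ΜΈΛΟΣ

# Lowercasing is context-sensitive: a sigma that ends the word becomes ς.
lowered = word.lower()

# Case folding is context-free: every sigma becomes σ.
folded = word.casefold()

# Analogue of (string-downcase (string-upcase str)): it inherits the
# final-sigma rule from lowercasing, so it cannot serve as a fold.
down_of_up = word.upper().lower()

print(lowered, folded, down_of_up)
print(lowered[-1] == "\u03c2")   # ends in ς, GREEK SMALL LETTER FINAL SIGMA
print(folded[-1] == "\u03c3")    # ends in σ, GREEK SMALL LETTER SIGMA
print(down_of_up == lowered)     # upcase-then-downcase matches lowercasing
```

The last comparison is the point of the report: composing upcase and downcase yields the lowercased string, not the folded one.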
From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 17 06:19:34 2019
Date: Sun, 17 Nov 2019 12:19:18 +0100
To: bug-guile@gnu.org
Subject: Re: bug#38235: string-foldcase bug for trailing sigma
Message-ID: <20191117111918.GA15143@tuxteam.de>
References: <87tv73mu5a.fsf@pobox.com>
In-Reply-To: <87tv73mu5a.fsf@pobox.com>
From: <tomas@tuxteam.de>

On Sat, Nov 16, 2019 at 09:41:05PM +0100, Andy Wingo wrote:
> Given the following example, using (rnrs unicode):
>
>   (string-foldcase "ΜΈΛΟΣ")

Good catch. I think there's even a worse example: dotless
and dotted I [1]. Here it seems even impossible to do
up- and downcase correctly without knowing the language
context.

Cheers

[1] https://en.wikipedia.org/wiki/%C4%B0

-- 
tomás

From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 17 13:14:00 2019
References: <87tv73mu5a.fsf@pobox.com>
In-Reply-To: <87tv73mu5a.fsf@pobox.com>
From: John Cowan
Date: Sun, 17 Nov 2019 13:13:42 -0500
Subject: Re: bug#38235: string-foldcase bug for trailing sigma
To: Andy Wingo, tomas@tuxteam.de
Cc: 38235@debbugs.gnu.org

On Sat, Nov 16, 2019 at 3:42 PM Andy Wingo wrote:

> The expected result is "μέλοσ"; see R6RS libraries section 1.2.  However
> instead Guile's result is "μέλος".  Note that although Σ usually
> downcases to σ, at the end of a string it's ς.

More precisely, it downcases to σ if a letter follows and to ς if not
(being at the end of a string is a particular case).  However, this is
not actually always Greekly correct: the string "ΦΙΛΟΣ." with a period
at the end downcases to "φιλος." if it is the word φίλος 'friend'
(without its proper accent) at the end of a sentence, but to "φιλοσ."
if it is an abbreviation for φιλοσοφία 'philosophy'.  For this reason,
R7RS does not require mapping to ς in this situation as R6RS does.

> This test shows a
> limitation of defining string-foldcase as simply (string-downcase
> (string-upcase str)).

As explained in Unicode section 5.18, the foldcase mappings (in
<https://www.unicode.org/Public/UNIDATA/CaseFolding.txt>, the lines with
status C and F) actually create a set of equivalence classes that are
closed under {upper,lower,title}case mapping, and then choose a single
character to represent each class.  This is usually the unique lowercase
character, but not always: in Cherokee it is the uppercase character,
and in the set {Σ, σ, ς} it is σ.

On Sun, Nov 17, 2019 at 6:20 AM <tomas@tuxteam.de> wrote:

> Good catch. I think there's even a worse example: dotless
> and dotted I [1]. Here it seems even impossible to do
> up- and downcase correctly without knowing the language
> context.

Language-specific case mappings are explicitly out of Scheme's remit:
they have to be performed by specialized libraries.  There is an
additional situation in Lithuanian dictionaries (but not running text):
an "i" with a tone accent is represented as "i" + dot above + accent,
like this: "i̇́".  However, this dot above must be dropped when
uppercasing, producing ordinary "Í".
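
The equivalence-class description of case folding given in the last message can be illustrated with a short sketch. It is written in Python as a stand-in for the Unicode default mappings and is not Guile code; the helper `fold_representatives` and the constant names are hypothetical, introduced only for this illustration. The Cherokee check assumes an interpreter whose Unicode database is version 8.0 or newer (CPython 3.5 or later), since Cherokee small letters were added in that version.

```python
# Sketch: case folding picks one representative per equivalence class.

SIGMA_CLASS = ["\u03a3", "\u03c3", "\u03c2"]  # Σ, σ, ς


def fold_representatives(chars):
    """Return the set of case-folded forms of the given characters."""
    return {c.casefold() for c in chars}


# All three sigmas share a single representative, σ (U+03C3).
sigma_reps = fold_representatives(SIGMA_CLASS)
print(sigma_reps == {"\u03c3"})

# The class is closed under case mapping: upcasing or downcasing a member
# and then folding lands on the same representative as folding directly.
closed = all(
    c.upper().casefold() == c.casefold() == c.lower().casefold()
    for c in SIGMA_CLASS
)
print(closed)

# Cherokee: the uppercase letter is the representative, not the lowercase one.
CHEROKEE_A = "\u13a0"        # CHEROKEE LETTER A
CHEROKEE_SMALL_A = "\uab70"  # CHEROKEE SMALL LETTER A
cherokee_reps = fold_representatives([CHEROKEE_A, CHEROKEE_SMALL_A])
print(cherokee_reps == {CHEROKEE_A})

# Lowercasing, unlike folding, depends on what follows the sigma.
sigma_before_letter = "\u03a3\u039f\u03a6\u0399\u0391".lower()   # ΣΟΦΙΑ
sigma_before_period = "\u03a6\u0399\u039b\u039f\u03a3.".lower()  # ΦΙΛΟΣ.
print(sigma_before_letter, sigma_before_period)
```

Python's default lowercasing always produces ς before the period, so it cannot distinguish the word from the abbreviation; that ambiguity is the reason given above for R7RS not requiring the final-sigma mapping.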