From unknown Sun Jun 22 03:59:19 2025
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
To: 24425@debbugs.gnu.org
X-Debbugs-Original-To: bug-gnu-emacs@gnu.org
From: Michal Nazarewicz
Date: Tue, 13 Sep 2016 00:46:07 +0200
Message-Id: <1473720367-2807-1-git-send-email-mina86@mina86.com>
X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020

Currently, when operating on unibyte strings and buffers, if casing an
ASCII character results in a Unicode character, the result is forcefully
converted to 8-bit by masking all but the eight least significant bits.
This has awkward results such as:

    (let ((table (make-char-table 'case-table)))
      (set-char-table-parent table (current-case-table))
      (set-case-syntax-pair ?I ?ı table)
      (set-case-syntax-pair ?İ ?i table)
      (with-case-table table
        (concat (upcase "istanbul") " " (downcase "IRMA"))))
    => "0STANBUL 1rma"

Change the code so that ASCII characters being cased to Unicode
characters are left unchanged when operating on unibyte data.  In other
words, the aforementioned example will produce:

    => "iSTANBUL Irma"

Arguably this isn’t correct either but it’s less wrong and there’s not
much we can do when the strings are unibyte.

Note that casify_object had a ‘(c >= 0 && c < 256)’ condition but since
CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
characters to 8-bit (i.e. c & 0xFF), this never triggered for the
discussed case.

* src/casefiddle.c (casify_object, casify_region): When dealing with
unibyte data, don’t attempt to store Unicode characters in the result.
---
 src/casefiddle.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

Unless there are objections, I’ll commit it in a few days.

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..247cc6f 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -71,8 +71,8 @@ casify_object (enum case_action flag, Lisp_Object obj)
         {
           if (! inword)
             c = upcase1 (c1);
-          if (! multibyte)
-            MAKE_CHAR_UNIBYTE (c);
+          if (! multibyte && CHAR_BYTE8_P (c))
+            c = CHAR_TO_BYTE8 (c);
           XSETFASTINT (obj, c | flags);
         }
       return obj;
@@ -93,18 +93,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
           c1 = c;
           if (inword && flag != CASE_CAPITALIZE_UP)
             c = downcase (c);
-          else if (!uppercasep (c)
-                   && (!inword || flag != CASE_CAPITALIZE_UP))
-            c = upcase1 (c1);
+          else if (!inword || flag != CASE_CAPITALIZE_UP)
+            c = upcase (c1);
           if ((int) flag >= (int) CASE_CAPITALIZE)
             inword = (SYNTAX (c) == Sword);
           if (c != c1)
             {
-              MAKE_CHAR_UNIBYTE (c);
-              /* If the char can't be converted to a valid byte, just don't
-                 change it. */
-              if (c >= 0 && c < 256)
-                SSET (obj, i, c);
+              if (CHAR_BYTE8_P (c))
+                c = CHAR_TO_BYTE8 (c);
+              else if (!ASCII_CHAR_P (c))
+                /* If the char can't be converted to a valid byte, just don't
+                   change it. */
+                continue;
+              SSET (obj, i, c);
             }
         }
       return obj;
@@ -250,8 +251,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
               if (! multibyte)
                 {
-                  MAKE_CHAR_UNIBYTE (c);
-                  FETCH_BYTE (start_byte) = c;
+                  /* If the char can't be converted to a valid byte, just don't
+                     change it. */
+                  if (ASCII_CHAR_P (c) ||
+                      (CHAR_BYTE8_P (c) && ((c = CHAR_TO_BYTE8 (c)), true)))
+                    FETCH_BYTE (start_byte) = c;
                 }
               else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
                 FETCH_BYTE (start_byte) = c;
-- 
2.8.0.rc3.226.g39d4020

From unknown Sun Jun 22 03:59:19 2025
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
To: Michal Nazarewicz
Cc: 24425@debbugs.gnu.org
Reply-To: Eli Zaretskii
Date: Tue, 13 Sep 2016 17:33:02 +0300
Message-Id: <83mvjb98f5.fsf@gnu.org>
From: Eli Zaretskii
In-reply-to: <1473720367-2807-1-git-send-email-mina86@mina86.com>
	(message from Michal Nazarewicz on Tue, 13 Sep 2016 00:46:07 +0200)
References: <1473720367-2807-1-git-send-email-mina86@mina86.com>

> From: Michal Nazarewicz
> Date: Tue, 13 Sep 2016 00:46:07 +0200
>
> Currently, when operating on unibyte strings and buffers, if casing an
> ASCII character results in a Unicode character, the result is forcefully
> converted to 8-bit by masking all but the eight least significant bits.
> This has awkward results such as:
>
>     (let ((table (make-char-table 'case-table)))
>       (set-char-table-parent table (current-case-table))
>       (set-case-syntax-pair ?I ?ı table)
>       (set-case-syntax-pair ?İ ?i table)
>       (with-case-table table
>         (concat (upcase "istanbul") " " (downcase "IRMA"))))
>     => "0STANBUL 1rma"
>
> Change the code so that ASCII characters being cased to Unicode
> characters are left unchanged when operating on unibyte data.  In other
> words, the aforementioned example will produce:
>
>     => "iSTANBUL Irma"
>
> Arguably this isn’t correct either but it’s less wrong and there’s not
> much we can do when the strings are unibyte.

Thanks, but I don't think it's TRT to fix this in a way that produces
a semi-broken result.  Second-guessing what the user/caller means and
silently producing results that only make sense if the guess was
correct is about the worst thing we could do in these dark-corner
situations.

Currently, case changes in unibyte characters and strings are only
well defined for pure ASCII text; if the input or the result is not
pure ASCII, we produce "undefined behavior".  In particular, case
tables are not set at all for unibyte characters, because it's not
text, it's a byte stream.

Either we decide that we don't want to support case changes in unibyte
non-ASCII characters, and we stick to the current behavior (or maybe
even signal an error, except that I'm afraid that would break too many
things); or we decide we want to support this use case, but then do it
properly.

Properly means that upcasing "istanbul" in the above example will
produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
"ırma".  Yes, these are multibyte strings produced from unibyte input,
but I think it's the only result we can claim to be correct for a
supported use case.  (Such a change could still break some code
somewhere, but at least it's a defendable breakage.)

> Note that casify_object had a ‘(c >= 0 && c < 256)’ condition but since
> CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
> characters to 8-bit (i.e. c & 0xFF), this never triggered for the
> discussed case.

We could convert that condition into an eassert, if we are certain the
condition should never trigger.  But that's an aside.

Thanks.
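
To make the masking discussed above concrete, here is a minimal C sketch. It is an illustration under stated assumptions, not code from src/casefiddle.c: the function names `is_ascii`, `mask_to_byte`, `fits_in_byte_after_mask` and `cased_or_unchanged` are hypothetical, and `mask_to_byte` models only the `c & 0xFF` behaviour described in the thread, not the real CHAR_TO_BYTE8 macro. It shows why İ (U+0130) and ı (U+0131) come out as "0" and "1", why a range check placed after the masking can never fail, and what "leave the byte unchanged" would look like.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only; these are NOT the
   Emacs macros with similar names.  */

/* True if C is in the ASCII range.  */
bool
is_ascii (int c)
{
  return 0 <= c && c < 0x80;
}

/* The lossy conversion described in the thread: keep only the eight
   least significant bits of the character code.  */
int
mask_to_byte (int c)
{
  return c & 0xFF;
}

/* The range check applied after masking.  It cannot fail, because
   masking with 0xFF always yields a value in 0..255.  */
bool
fits_in_byte_after_mask (int c)
{
  int b = mask_to_byte (c);
  return b >= 0 && b < 256;
}

/* A guard in the spirit of the first patch: store the cased character
   only if it is ASCII, otherwise keep the original byte.  (The patch
   also converts raw 8-bit characters; that branch is omitted here.)  */
int
cased_or_unchanged (int original, int cased)
{
  return is_ascii (cased) ? cased : original;
}
```

With these definitions, `mask_to_byte (0x130)` is 0x30, the digit "0", and `mask_to_byte (0x131)` is 0x31, the digit "1", which matches the "0STANBUL 1rma" result in the report, while `cased_or_unchanged ('i', 0x130)` keeps the original "i".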

From unknown Sun Jun 22 03:59:19 2025
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
To: Eli Zaretskii
Cc: 24425@debbugs.gnu.org
From: Michal Nazarewicz
In-Reply-To: <83mvjb98f5.fsf@gnu.org>
Organization: http://mina86.com/
References: <1473720367-2807-1-git-send-email-mina86@mina86.com> <83mvjb98f5.fsf@gnu.org>
Date: Thu, 15 Sep 2016 16:23:54 +0200

On Tue, Sep 13 2016, Eli Zaretskii wrote:
> Currently, case changes in unibyte characters and strings are only
> well defined for pure ASCII text; if the input or the result is not
> pure ASCII, we produce "undefined behavior".

Would the following (not tested) make sense then:

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..4dc2357 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -89,23 +89,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
       for (i = 0; i < size; i++)
         {
           c = SREF (obj, i);
-          MAKE_CHAR_MULTIBYTE (c);
           c1 = c;
-          if (inword && flag != CASE_CAPITALIZE_UP)
-            c = downcase (c);
-          else if (!uppercasep (c)
-                   && (!inword || flag != CASE_CAPITALIZE_UP))
-            c = upcase1 (c1);
-          if ((int) flag >= (int) CASE_CAPITALIZE)
-            inword = (SYNTAX (c) == Sword);
-          if (c != c1)
+          if (ASCII_CHAR_P (c))
             {
-              MAKE_CHAR_UNIBYTE (c);
-              /* If the char can't be converted to a valid byte, just don't
-                 change it. */
-              if (c >= 0 && c < 256)
-                SSET (obj, i, c);
+              if (inword && flag != CASE_CAPITALIZE_UP)
+                c = downcase (c);
+              else if (!uppercasep (c)
+                       && (!inword || flag != CASE_CAPITALIZE_UP))
+                c = upcase1 (c1);
             }
+          if ((int) flag >= (int) CASE_CAPITALIZE)
+            inword = (SYNTAX (c) == Sword);
+          if (c != c1 && ASCII_CHAR_P (c))
+            SSET (obj, i, c);
         }
       return obj;
     }
@@ -230,8 +226,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
           else
             {
               c = FETCH_BYTE (start_byte);
-              MAKE_CHAR_MULTIBYTE (c);
               len = 1;
+              if (!ASCII_CHAR_P (c))
+                goto done;
             }
           c2 = c;
           if (inword && flag != CASE_CAPITALIZE_UP)
@@ -239,9 +236,6 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
           else if (!uppercasep (c)
                    && (!inword || flag != CASE_CAPITALIZE_UP))
             c = upcase1 (c);
-          if ((int) flag >= (int) CASE_CAPITALIZE)
-            inword = ((SYNTAX (c) == Sword)
-                      && (inword || !syntax_prefix_flag_p (c)));
           if (c != c2)
             {
               last = start;
@@ -250,8 +244,8 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
               if (! multibyte)
                 {
-                  MAKE_CHAR_UNIBYTE (c);
-                  FETCH_BYTE (start_byte) = c;
+                  if (ASCII_CHAR_P (c))
+                    FETCH_BYTE (start_byte) = c;
                 }
               else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
                 FETCH_BYTE (start_byte) = c;
@@ -280,6 +274,10 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
                 }
             }
         }
+    done:
+      if ((int) flag >= (int) CASE_CAPITALIZE)
+        inword = ((SYNTAX (c) == Sword)
+                  && (inword || !syntax_prefix_flag_p (c)));
       start++;
       start_byte += len;
     }

If working on non-ASCII characters isn’t supported we might just as
well skip all the logic that handles non-ASCII unibyte characters.

> Properly means that upcasing "istanbul" in the above example will
> produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
> "ırma".

I thought about that but then another corner case is "istanbul\xff"
which is a unibyte string with 8-bit bytes.

I have no strong feelings either way so I’m happy just leaving it as is
as well.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

From unknown Sun Jun 22 03:59:19 2025
Subject: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
To: Michal Nazarewicz
Cc: 24425@debbugs.gnu.org
Reply-To: Eli Zaretskii
Date: Thu, 15 Sep 2016 21:55:20 +0300
Message-Id: <83twdh56xz.fsf@gnu.org>
From: Eli Zaretskii
In-reply-to: (message from Michal Nazarewicz on Thu, 15 Sep 2016 16:23:54 +0200)
References: <1473720367-2807-1-git-send-email-mina86@mina86.com> <83mvjb98f5.fsf@gnu.org>

> From: Michal Nazarewicz
> Cc: 24425@debbugs.gnu.org
> Date: Thu, 15 Sep 2016 16:23:54 +0200
>
> On Tue, Sep 13 2016, Eli Zaretskii wrote:
> > Currently, case changes in unibyte characters and strings are only
> > well defined for pure ASCII text; if the input or the result is not
> > pure ASCII, we produce "undefined behavior".
>
> Would the following (not tested) make sense then:

AFAIU, it would disallow handling unibyte text by setting up case
tables for 8-bit characters in their multibyte representation,
i.e. above #x3FFF00.  I'd rather not lose that, although I don't think
I've ever seen that used.

> > Properly means that upcasing "istanbul" in the above example will
> > produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
> > "ırma".
>
> I thought about that but then another corner case is "istanbul\xff"
> which is a unibyte string with 8-bit bytes.

And what is the problem in that case?

> I have no strong feelings either way so I’m happy just leaving it as is
> as well.

That is fine with me.

Was there some real-life use case where you bumped into this?  If so,
maybe we should discuss that use case, perhaps the solution, if we
need one, is something other than what we talked about until now.

From unknown Sun Jun 22 03:59:19 2025
MIME-Version: 1.0
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Michal Nazarewicz
Subject: bug#24425: closed (Re: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings)
References: <1473720367-2807-1-git-send-email-mina86@mina86.com>
Reply-To: 24425@debbugs.gnu.org
Date: Fri, 16 Sep 2016 17:42:01 +0000
Content-Type: multipart/mixed; boundary="----------=_1474047721-14461-1"

This is a multi-part message in MIME format...

------------=_1474047721-14461-1
Content-Disposition: inline
Content-Type: text/plain; charset="utf-8"

Your bug report

#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 24425@debbugs.gnu.org.

-- 
24425: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24425
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1474047721-14461-1
Content-Type: message/rfc822
Content-Disposition: inline

From: Michal Nazarewicz
To: Eli Zaretskii
Cc: 24425-done@debbugs.gnu.org
Subject: Re: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
In-Reply-To: <83twdh56xz.fsf@gnu.org>
Organization: http://mina86.com/
References: <1473720367-2807-1-git-send-email-mina86@mina86.com> <83mvjb98f5.fsf@gnu.org> <83twdh56xz.fsf@gnu.org>
Date: Fri, 16 Sep 2016 19:41:44 +0200

>> I thought about that but then another corner case is "istanbul\xff"
>> which is a unibyte string with 8-bit bytes.

On Thu, Sep 15 2016, Eli Zaretskii wrote:
> And what is the problem in that case?

Disregard.  It’s actually fine.

>> I have no strong feelings either way so I’m happy just leaving it as
>> is as well.

> That is fine with me.
>
> Was there some real-life use case where you bumped into this?  If so,
> maybe we should discuss that use case, perhaps the solution, if we
> need one, is something other than what we talked about until now.

There’s no real-life use case I’ve stumbled upon.  I’m playing around
with src/casefiddle.c adding support for various corner cases (such as
ﬁsh becoming Fish or FISH) and was surprised by (upcase "istanbul")
when testing Turkish support.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

------------=_1474047721-14461-1
Content-Type: message/rfc822
Content-Disposition: inline

([2001:4830:134:3::11]:40286) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bja0X-0008PS-7i for submit@debbugs.gnu.org; Mon, 12 Sep 2016 18:47:21 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:44419) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bja0U-0007GM-OH for bug-gnu-emacs@gnu.org; Mon, 12 Sep 2016 18:47:19 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bja0Q-0008OS-IC for bug-gnu-emacs@gnu.org; Mon, 12 Sep 2016 18:47:17 -0400 Received: from mail-wm0-f46.google.com ([74.125.82.46]:37688) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bja0Q-0008OM-8U for bug-gnu-emacs@gnu.org; Mon, 12 Sep 2016 18:47:14 -0400 Received: by mail-wm0-f46.google.com with SMTP id c131so82453676wmh.0 for ; Mon, 12 Sep 2016 15:47:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:date:message-id:mime-version :content-transfer-encoding; bh=1zhxQVmV81Y0CHINHSPKoCDPnbmWwiLDIxlYu45NrNc=; b=gK5OJjKCsBPBhuQdj2KxisacVV/FK0xLCsMLWMkm5nCgwvZzAhcCWSM0QUz5t9bcJd UvkKHlD8+KhYmdhsEJxGiZEq9sY/N4I6B8+kuRs3yPfisn43fZz8/pAN/tIE1qHN+a8Q xAHon/9JlruP3UeDj9tNq1HlXNXNnZyV0iZfnX0FqaNhGqbTuZ8MM7XrBqxjnBmlJJeb bnY3AQj2lPrZThynKcQ23YxGWUpZB2BbYpO+pwymTcg/oL9+BMkC1t5AYDZF9ZM7RhoN pYK4lItT14Q8gNPPyzSZMwfPPW+ikfRQVXTpLFP9jMgiex5eTITscrOdqk1T4GBjiJlz tDOQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:date:message-id :mime-version:content-transfer-encoding; bh=1zhxQVmV81Y0CHINHSPKoCDPnbmWwiLDIxlYu45NrNc=; b=liw7lJTWuUPxZqXTsD7a/0AxFxLNNGR+lhSOOtGUOkAXhgg2imEEHsdt+cCIzVcOX7 6xl1rloWLOrm4wVsUWQMF3tKDwjFUcBahr0JEtr8o5B6/aSP5MaNK/K9uyDdM6qeFLw1 eHFSRPbx6hi0HTHJ52yIq/sOBzj1d1oaZp2a00v93ApPnmOAviPBSs8DHhojZDt2Iqr3 pT5aV6/Hzr8MXqs+wNsBRd84dS85COX354AB61kfwj5uCP8K+L6V1nQuD1kUWXyT7iHn v2oUks6dq0JxOsa2tZIz2ZrMPWQDbVAMOc9w94xXhfUQxXNg4WQIdDo49GRH4TbEKFfZ xMng== 
X-Gm-Message-State: AE9vXwMK//BkqT1mY6aC8d9Pr/fu/34Bvbtr7RSwF3D/sg+NoweIko5L+flu1tM49Cmah+Id X-Received: by 10.28.146.133 with SMTP id u127mr1881802wmd.21.1473720372962; Mon, 12 Sep 2016 15:46:12 -0700 (PDT) Received: from mpn.zrh.corp.google.com ([172.16.113.135]) by smtp.gmail.com with ESMTPSA id d62sm19988523wmd.7.2016.09.12.15.46.11 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 12 Sep 2016 15:46:11 -0700 (PDT) Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id B043F1E0208; Tue, 13 Sep 2016 00:46:10 +0200 (CEST) From: Michal Nazarewicz To: bug-gnu-emacs@gnu.org Subject: [PATCH] =?UTF-8?q?Don=E2=80=99t=20cast=20Unicode=20to=208-bit=20w?= =?UTF-8?q?hen=20casing=20unibyte=20strings?= Date: Tue, 13 Sep 2016 00:46:07 +0200 Message-Id: <1473720367-2807-1-git-send-email-mina86@mina86.com> X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -3.5 (---) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.5 (---) Currently, when operating on unibyte strings and buffers, if casing ASCII character results in a Unicode character the result is forcefully converted to 8-bit by masking all but the eight least significant bits. 
This has awkward results such as:

    (let ((table (make-char-table 'case-table)))
      (set-char-table-parent table (current-case-table))
      (set-case-syntax-pair ?I ?ı table)
      (set-case-syntax-pair ?İ ?i table)
      (with-case-table table
        (concat (upcase "istanabul") " " (downcase "IRMA"))))
    => "0STANABUL 1rma"

Change the code so that ASCII characters being cased to Unicode
characters are left unchanged when operating on unibyte data.  In other
words, the aforementioned example will produce:

    => "iSTANABUL Irma"

Arguably this isn’t correct either but it’s less wrong and there’s not
much we can do when the strings are unibyte.

Note that casify_object had a ‘(c >= 0 && c < 256)’ condition but since
CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
characters to 8-bit (i.e. c & 0xFF), this never triggered for the
discussed case.

* src/casefiddle.c (casify_object, casify_region): When dealing with
unibyte data, don’t attempt to store Unicode characters in the result.
---
 src/casefiddle.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

Unless there are objections, I’ll commit it in a few days.

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..247cc6f 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -71,8 +71,8 @@ casify_object (enum case_action flag, Lisp_Object obj)
         {
           if (! inword)
             c = upcase1 (c1);
-          if (! multibyte)
-            MAKE_CHAR_UNIBYTE (c);
+          if (! multibyte && CHAR_BYTE8_P (c))
+            c = CHAR_TO_BYTE8 (c);
           XSETFASTINT (obj, c | flags);
         }
       return obj;
@@ -93,18 +93,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
           c1 = c;
           if (inword && flag != CASE_CAPITALIZE_UP)
             c = downcase (c);
-          else if (!uppercasep (c)
-                   && (!inword || flag != CASE_CAPITALIZE_UP))
-            c = upcase1 (c1);
+          else if (!inword || flag != CASE_CAPITALIZE_UP)
+            c = upcase (c1);
           if ((int) flag >= (int) CASE_CAPITALIZE)
             inword = (SYNTAX (c) == Sword);
           if (c != c1)
             {
-              MAKE_CHAR_UNIBYTE (c);
-              /* If the char can't be converted to a valid byte, just don't
-                 change it.  */
-              if (c >= 0 && c < 256)
-                SSET (obj, i, c);
+              if (CHAR_BYTE8_P (c))
+                c = CHAR_TO_BYTE8 (c);
+              else if (!ASCII_CHAR_P (c))
+                /* If the char can't be converted to a valid byte, just don't
+                   change it.  */
+                continue;
+              SSET (obj, i, c);
             }
         }
       return obj;
@@ -250,8 +251,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
           if (! multibyte)
             {
-              MAKE_CHAR_UNIBYTE (c);
-              FETCH_BYTE (start_byte) = c;
+              /* If the char can't be converted to a valid byte, just don't
+                 change it.  */
+              if (ASCII_CHAR_P (c) ||
+                  (CHAR_BYTE8_P (c) && ((c = CHAR_TO_BYTE8 (c)), true)))
+                FETCH_BYTE (start_byte) = c;
             }
           else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
             FETCH_BYTE (start_byte) = c;
-- 
2.8.0.rc3.226.g39d4020

------------=_1474047721-14461-1--
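
The "0" and "1" in the awkward result quoted in the commit message follow from the masking arithmetic: İ is U+0130 and ı is U+0131, and keeping only the low eight bits yields 0x30 ('0') and 0x31 ('1').  Below is a minimal standalone sketch of that arithmetic and of the "leave it unchanged" rule.  The function names mask_to_byte and case_unibyte_byte are hypothetical and do not exist in src/casefiddle.c.  The sketch assumes ASCII input only and ignores the raw 8-bit (CHAR_BYTE8_P) case that the patch also handles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the old behaviour: keep only the eight
   least significant bits of the cased character.  */
unsigned char
mask_to_byte (uint32_t cased)
{
  return (unsigned char) (cased & 0xFF);
}

/* Hypothetical stand-in for the new behaviour, simplified to ASCII:
   store the cased character only when it still fits in ASCII,
   otherwise keep the original byte unchanged.  */
unsigned char
case_unibyte_byte (unsigned char original, uint32_t cased)
{
  assert (original < 0x80);	/* This sketch covers ASCII input only.  */
  return cased < 0x80 ? (unsigned char) cased : original;
}
```

With these definitions, mask_to_byte (0x0130) gives '0' while case_unibyte_byte ('i', 0x0130) gives 'i', which matches the difference between the two results shown in the commit message.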