GNU bug report logs - #24425
[PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings


Package: emacs;

Reported by: Michal Nazarewicz <mina86 <at> mina86.com>

Date: Mon, 12 Sep 2016 22:48:02 UTC

Severity: normal

Tags: patch

Done: Michal Nazarewicz <mina86 <at> mina86.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Michal Nazarewicz <mina86 <at> mina86.com>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#24425: closed ([PATCH] Don’t cast Unicode
 to 8-bit when casing unibyte strings)
Date: Fri, 16 Sep 2016 17:42:01 +0000

[Message part 1 (text/plain, inline)]

Your message dated Fri, 16 Sep 2016 19:41:44 +0200
with message-id <xa1th99f3fon.fsf <at> mina86.com>
and subject line Re: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
has caused the debbugs.gnu.org bug report #24425,
regarding [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
24425: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24425
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date: Tue, 13 Sep 2016 00:46:07 +0200

Currently, when operating on unibyte strings and buffers, if casing an
ASCII character results in a Unicode character, the result is forcibly
converted to 8-bit by masking all but the eight least significant bits.
This has awkward results such as:

	(let ((table (make-char-table 'case-table)))
	  (set-char-table-parent table (current-case-table))
	  (set-case-syntax-pair ?I ?ı table)
	  (set-case-syntax-pair ?İ ?i table)
	  (with-case-table table
	    (concat (upcase "istanabul") " " (downcase "IRMA"))))
	=> "0STANABUL 1rma"
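
The “0” and “1” in that result follow directly from the masking.  As a
minimal sketch (the helper name mask_to_byte is hypothetical and only
models the arithmetic; it is not code from casefiddle.c): İ is U+0130
and ı is U+0131, so keeping the low eight bits yields 0x30 and 0x31,
the ASCII digits “0” and “1”.

```c
#include <stdint.h>

/* Hypothetical helper modelling the old behaviour described above:
   keep only the eight least significant bits of a code point.  */
static unsigned char
mask_to_byte (uint32_t c)
{
  return (unsigned char) (c & 0xFF);
}
```

For instance, mask_to_byte (0x0130) gives 0x30 ('0') and
mask_to_byte (0x0131) gives 0x31 ('1').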

Change the code so that ASCII characters being cased to Unicode
characters are left unchanged when operating on unibyte data.  In other
words, the aforementioned example will produce:

	=> "iSTANABUL Irma"

Arguably this isn’t correct either, but it’s less wrong and there’s not
much we can do when the strings are unibyte.

Note that casify_object had a ‘(c >= 0 && c < 256)’ condition but since
CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
characters to 8-bit (i.e. c & 0xFF), this never triggered for the case
discussed here.
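
The intended rule for unibyte data can be sketched outside of Emacs as
follows.  This is only a model under stated assumptions: the model_*
names are hypothetical, and the constants 0x3FFF7F and 0x3FFF00 are my
recollection of how src/character.h represents raw 8-bit bytes as
characters; check them against the tree before relying on them.

```c
#include <stdbool.h>

/* Assumed values, recalled from src/character.h: characters above
   MAX_5_BYTE_CHAR stand for raw bytes 0x80..0xFF, offset by 0x3FFF00.  */
enum
{
  MODEL_MAX_5_BYTE_CHAR = 0x3FFF7F,
  MODEL_BYTE8_OFFSET = 0x3FFF00
};

/* Stand-in for ASCII_CHAR_P.  */
static bool
model_ascii_char_p (int c)
{
  return c >= 0 && c < 0x80;
}

/* Stand-in for CHAR_BYTE8_P.  */
static bool
model_char_byte8_p (int c)
{
  return c > MODEL_MAX_5_BYTE_CHAR;
}

/* Return the byte to store when ORIG has been cased to C in unibyte
   data.  If C is neither a raw-byte character nor ASCII, it cannot be
   represented, so ORIG is returned unchanged.  */
static int
model_store_unibyte (int orig, int c)
{
  if (model_char_byte8_p (c))
    return c - MODEL_BYTE8_OFFSET;
  if (model_ascii_char_p (c))
    return c;
  return orig;
}
```

With this model, model_store_unibyte ('i', 0x0130) leaves 'i' alone,
while model_store_unibyte ('s', 'S') stores 'S' as before.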

* src/casefiddle.c (casify_object, casify_region): When dealing with
unibyte data, don’t attempt to store Unicode characters in the result.
---
 src/casefiddle.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

 Unless there are objections, I’ll commit it in a few days.

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..247cc6f 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -71,8 +71,8 @@ casify_object (enum case_action flag, Lisp_Object obj)
 	{
 	  if (! inword)
 	    c = upcase1 (c1);
-	  if (! multibyte)
-	    MAKE_CHAR_UNIBYTE (c);
+	  if (! multibyte && CHAR_BYTE8_P (c))
+	    c = CHAR_TO_BYTE8 (c);
 	  XSETFASTINT (obj, c | flags);
 	}
       return obj;
@@ -93,18 +93,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
 	  c1 = c;
 	  if (inword && flag != CASE_CAPITALIZE_UP)
 	    c = downcase (c);
-	  else if (!uppercasep (c)
-		   && (!inword || flag != CASE_CAPITALIZE_UP))
-	    c = upcase1 (c1);
+	  else if (!inword || flag != CASE_CAPITALIZE_UP)
+	    c = upcase (c1);
 	  if ((int) flag >= (int) CASE_CAPITALIZE)
 	    inword = (SYNTAX (c) == Sword);
 	  if (c != c1)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      /* If the char can't be converted to a valid byte, just don't
-		 change it.  */
-	      if (c >= 0 && c < 256)
-		SSET (obj, i, c);
+	      if (CHAR_BYTE8_P (c))
+		c = CHAR_TO_BYTE8 (c);
+	      else if (!ASCII_CHAR_P (c))
+		/* If the char can't be converted to a valid byte, just don't
+		   change it.  */
+		continue;
+	      SSET (obj, i, c);
 	    }
 	}
       return obj;
@@ -250,8 +251,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
 	  if (! multibyte)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      FETCH_BYTE (start_byte) = c;
+	      /* If the char can't be converted to a valid byte, just don't
+		 change it.  */
+	      if (ASCII_CHAR_P (c) ||
+		  (CHAR_BYTE8_P (c) && ((c = CHAR_TO_BYTE8 (c)), true)))
+		FETCH_BYTE (start_byte) = c;
 	    }
 	  else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
 	    FETCH_BYTE (start_byte) = c;
-- 
2.8.0.rc3.226.g39d4020

[Message part 3 (message/rfc822, inline)]

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 24425-done <at> debbugs.gnu.org
Subject: Re: bug#24425: [PATCH] Don’t cast Unicode to
 8-bit when casing unibyte strings
Date: Fri, 16 Sep 2016 19:41:44 +0200

>> I thought about that but then another corner case is "istanbul\xff"
>> which is a unibyte string with 8-bit bytes.

On Thu, Sep 15 2016, Eli Zaretskii wrote:
> And what is the problem in that case?

Disregard.  It’s actually fine.

>> I have no strong feelings either way so I’m happy just leaving it as
>> is as well.

> That is fine with me.
>
> Was there some real-life use case where you bumped into this?  If so,
> maybe we should discuss that use case, perhaps the solution, if we
> need one, is something other than what we talked about until now.

There’s no real-life use case I’ve stumbled upon.

I’m playing around with src/casefiddle.c adding support for various
corner cases (such as ﬁsh becoming Fish or FISH) and was surprised by
(upcase "istanbul") when testing Turkish support.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

This bug report was last modified 8 years and 308 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.