GNU bug report logs - #24425
[PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings

Package: emacs

Reported by: Michal Nazarewicz <mina86 <at> mina86.com>

Date: Mon, 12 Sep 2016 22:48:02 UTC

Severity: normal

Tags: patch

Done: Michal Nazarewicz <mina86 <at> mina86.com>

Bug is archived. No further changes may be made.


Message #11 received at 24425 <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 24425 <at> debbugs.gnu.org
Subject: Re: bug#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date: Thu, 15 Sep 2016 16:23:54 +0200

On Tue, Sep 13 2016, Eli Zaretskii wrote:
> Currently, case changes in unibyte characters and strings are only
> well defined for pure ASCII text; if the input or the result is not
> pure ASCII, we produce "undefined behavior".

Would the following (untested) patch make sense, then?

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..4dc2357 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -89,23 +89,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
       for (i = 0; i < size; i++)
 	{
 	  c = SREF (obj, i);
-	  MAKE_CHAR_MULTIBYTE (c);
 	  c1 = c;
-	  if (inword && flag != CASE_CAPITALIZE_UP)
-	    c = downcase (c);
-	  else if (!uppercasep (c)
-		   && (!inword || flag != CASE_CAPITALIZE_UP))
-	    c = upcase1 (c1);
-	  if ((int) flag >= (int) CASE_CAPITALIZE)
-	    inword = (SYNTAX (c) == Sword);
-	  if (c != c1)
+	  if (ASCII_CHAR_P (c))
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      /* If the char can't be converted to a valid byte, just don't
-		 change it.  */
-	      if (c >= 0 && c < 256)
-		SSET (obj, i, c);
+	      if (inword && flag != CASE_CAPITALIZE_UP)
+		c = downcase (c);
+	      else if (!uppercasep (c)
+		       && (!inword || flag != CASE_CAPITALIZE_UP))
+		c = upcase1 (c1);
 	    }
+	  if ((int) flag >= (int) CASE_CAPITALIZE)
+	    inword = (SYNTAX (c) == Sword);
+	  if (c != c1 && ASCII_CHAR_P (c))
+	    SSET (obj, i, c);
 	}
       return obj;
     }
@@ -230,8 +226,9 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
       else
 	{
 	  c = FETCH_BYTE (start_byte);
-	  MAKE_CHAR_MULTIBYTE (c);
 	  len = 1;
+	  if (!ASCII_CHAR_P (c))
+	    goto done;
 	}
       c2 = c;
       if (inword && flag != CASE_CAPITALIZE_UP)
@@ -239,9 +236,6 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
       else if (!uppercasep (c)
 	       && (!inword || flag != CASE_CAPITALIZE_UP))
 	c = upcase1 (c);
-      if ((int) flag >= (int) CASE_CAPITALIZE)
-	inword = ((SYNTAX (c) == Sword)
-		  && (inword || !syntax_prefix_flag_p (c)));
       if (c != c2)
 	{
 	  last = start;
@@ -250,8 +244,8 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
 	  if (! multibyte)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      FETCH_BYTE (start_byte) = c;
+	      if (ASCII_CHAR_P (c))
+		FETCH_BYTE (start_byte) = c;
 	    }
 	  else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
 	    FETCH_BYTE (start_byte) = c;
@@ -280,6 +274,10 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 		}
 	    }
 	}
+    done:
+      if ((int) flag >= (int) CASE_CAPITALIZE)
+	inword = ((SYNTAX (c) == Sword)
+		  && (inword || !syntax_prefix_flag_p (c)));
       start++;
       start_byte += len;
     }

If operating on non-ASCII characters isn’t supported, we might as well
skip all the logic that handles non-ASCII unibyte characters.
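
To illustrate the idea outside Emacs, here is a minimal standalone sketch
in plain C. It does not use any Emacs internals; `ascii_byte_p` and
`upcase_ascii_bytes` are hypothetical names made up for this example. It
changes case only for ASCII bytes and passes every other byte through
unchanged:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Emacs: true for bytes 0x00..0x7F.  */
static int
ascii_byte_p (unsigned char c)
{
  return c < 0x80;
}

/* Upcase ASCII letters in BUF in place.  Bytes >= 0x80 are never
   inspected for case and never modified.  */
static void
upcase_ascii_bytes (unsigned char *buf, size_t len)
{
  assert (buf != NULL);
  for (size_t i = 0; i < len; i++)
    {
      unsigned char c = buf[i];
      if (ascii_byte_p (c) && c >= 'a' && c <= 'z')
	buf[i] = (unsigned char) (c - 'a' + 'A');
    }
}
```

Applied to the nine bytes of "istanbul\xff", this upcases the eight ASCII
letters and leaves the trailing 0xff byte as it was.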

> Properly means that upcasing "istanbul" in the above example will
> produce "İSTANBUL", not "iSTANBUL", and downcasing "IRMA" will produce
> "ırma".

I thought about that, but then another corner case is "istanbul\xff",
which is a unibyte string containing an 8-bit byte.

I have no strong feelings either way, so I’m also happy to leave it as
is.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»




This bug report was last modified 8 years and 252 days ago.

