GNU bug report logs - #24425
[PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings

Package: emacs;

Reported by: Michal Nazarewicz <mina86 <at> mina86.com>

Date: Mon, 12 Sep 2016 22:48:02 UTC

Severity: normal

Tags: patch

Done: Michal Nazarewicz <mina86 <at> mina86.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Michal Nazarewicz <mina86 <at> mina86.com>
Subject: bug#24425: closed (Re: bug#24425: [PATCH] Don’t
 cast Unicode to 8-bit when casing unibyte strings)
Date: Fri, 16 Sep 2016 17:42:01 +0000

[Message part 1 (text/plain, inline)]

Your bug report

#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 24425 <at> debbugs.gnu.org.

-- 
24425: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24425
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 24425-done <at> debbugs.gnu.org
Subject: Re: bug#24425: [PATCH] Don’t cast Unicode to
 8-bit when casing unibyte strings
Date: Fri, 16 Sep 2016 19:41:44 +0200

>> I thought about that but then another corner case is "istanbul\xff"
>> which is a unibyte string with 8-bit bytes.

On Thu, Sep 15 2016, Eli Zaretskii wrote:
> And what is the problem in that case?

Disregard.  It’s actually fine.

>> I have no strong feelings either way so I’m happy just leaving it as
>> is as well.

> That is fine with me.
>
> Was there some real-life use case where you bumped into this?  If so,
> maybe we should discuss that use case, perhaps the solution, if we
> need one, is something other than what we talked about until now.

There’s no real-life use case I’ve stumbled upon.

I’m playing around with src/casefiddle.c adding support for various
corner cases (such as ﬁsh becoming Fish or FISH) and was surprised by
(upcase "istanbul") when testing Turkish support.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

[Message part 3 (message/rfc822, inline)]

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Date: Tue, 13 Sep 2016 00:46:07 +0200

Currently, when operating on unibyte strings and buffers, if casing an
ASCII character results in a Unicode character, the result is forcibly
converted to 8-bit by masking off all but the eight least significant
bits.  This has awkward results such as:

	(let ((table (make-char-table 'case-table)))
	  (set-char-table-parent table (current-case-table))
	  (set-case-syntax-pair ?I ?ı table)
	  (set-case-syntax-pair ?İ ?i table)
	  (with-case-table table
	    (concat (upcase "istanbul") " " (downcase "IRMA"))))
	=> "0STANBUL 1rma"

Change the code so that ASCII characters being cased to Unicode
characters are left unchanged when operating on unibyte data.  In other
words, the aforementioned example will produce:

	=> "iSTANBUL Irma"

Arguably this isn’t correct either, but it’s less wrong and there’s not
much we can do when the strings are unibyte.
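
The arithmetic behind the two results can be checked in isolation.  The
following is only a sketch: sketch_old_store and sketch_new_store are
hypothetical stand-ins, not functions from src/casefiddle.c, and raw
8-bit byte characters are deliberately left out.  It assumes only what
the message states, namely that the old path keeps the low eight bits
(c & 0xFF) and the new path leaves the byte alone when the cased
character is not ASCII.

```c
#include <assert.h>

/* Hypothetical stand-in for the old unibyte path: keep only the low
   eight bits of the cased character.  */
unsigned char
sketch_old_store (unsigned int cased)
{
  return (unsigned char) (cased & 0xFF);
}

/* Hypothetical stand-in for the new unibyte path: store the cased
   character only if it is ASCII, otherwise keep the original byte.
   Raw 8-bit byte characters are not modelled here.  */
unsigned char
sketch_new_store (unsigned char original, unsigned int cased)
{
  return cased < 0x80 ? (unsigned char) cased : original;
}

/* U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE and U+0131 is
   LATIN SMALL LETTER DOTLESS I, the two characters from the example.  */
void
sketch_check_example (void)
{
  assert (sketch_old_store (0x0130) == '0');  /* 0x130 & 0xFF == 0x30 */
  assert (sketch_old_store (0x0131) == '1');  /* 0x131 & 0xFF == 0x31 */
  assert (sketch_new_store ('i', 0x0130) == 'i');
  assert (sketch_new_store ('I', 0x0131) == 'I');
}
```

Calling sketch_check_example from a test program passes silently: the
masked values are the ASCII digits seen in the old output, while the
new rule keeps the original ‘i’ and ‘I’.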

Note that casify_object had a ‘(c >= 0 && c < 256)’ condition, but since
CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
characters to 8-bit (i.e. c & 0xFF) before the condition is evaluated,
the check never rejected anything in the case discussed.
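
To see why that range check could not catch the problem, here is a small
sketch under the same assumption as the text (the conversion behaves
like c & 0xFF for these characters).  The names sketch_mask_to_byte and
sketch_old_check_accepts are hypothetical and exist only for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the conversion described above.  */
int
sketch_mask_to_byte (int c)
{
  return c & 0xFF;
}

/* Old order of operations: mask first, then range-check.  The check
   only ever sees the masked value, which always lies in 0..255.  */
bool
sketch_old_check_accepts (int c)
{
  c = sketch_mask_to_byte (c);
  return c >= 0 && c < 256;
}

void
sketch_check_range_test (void)
{
  /* U+0130 and U+0131 are both "accepted" after masking.  */
  assert (sketch_old_check_accepts (0x0130));
  assert (sketch_old_check_accepts (0x0131));
  /* So is any other character, however large.  */
  assert (sketch_old_check_accepts (0x10FFFF));
}
```

The patch does not simply move the range check before the conversion;
it tests ASCII_CHAR_P and CHAR_BYTE8_P instead, since a character in
the range 128..255 is not the same thing as a raw byte.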

* src/casefiddle.c (casify_object, casify_region): When dealing with
unibyte data, don’t attempt to store Unicode characters in the result.
---
 src/casefiddle.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

 Unless there are objections, I’ll commit it in a few days.

diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..247cc6f 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -71,8 +71,8 @@ casify_object (enum case_action flag, Lisp_Object obj)
 	{
 	  if (! inword)
 	    c = upcase1 (c1);
-	  if (! multibyte)
-	    MAKE_CHAR_UNIBYTE (c);
+	  if (! multibyte && CHAR_BYTE8_P (c))
+	    c = CHAR_TO_BYTE8 (c);
 	  XSETFASTINT (obj, c | flags);
 	}
       return obj;
@@ -93,18 +93,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
 	  c1 = c;
 	  if (inword && flag != CASE_CAPITALIZE_UP)
 	    c = downcase (c);
-	  else if (!uppercasep (c)
-		   && (!inword || flag != CASE_CAPITALIZE_UP))
-	    c = upcase1 (c1);
+	  else if (!inword || flag != CASE_CAPITALIZE_UP)
+	    c = upcase (c1);
 	  if ((int) flag >= (int) CASE_CAPITALIZE)
 	    inword = (SYNTAX (c) == Sword);
 	  if (c != c1)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      /* If the char can't be converted to a valid byte, just don't
-		 change it.  */
-	      if (c >= 0 && c < 256)
-		SSET (obj, i, c);
+	      if (CHAR_BYTE8_P (c))
+		c = CHAR_TO_BYTE8 (c);
+	      else if (!ASCII_CHAR_P (c))
+		/* If the char can't be converted to a valid byte, just don't
+		   change it.  */
+		continue;
+	      SSET (obj, i, c);
 	    }
 	}
       return obj;
@@ -250,8 +251,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
 
 	  if (! multibyte)
 	    {
-	      MAKE_CHAR_UNIBYTE (c);
-	      FETCH_BYTE (start_byte) = c;
+	      /* If the char can't be converted to a valid byte, just don't
+		 change it.  */
+	      if (ASCII_CHAR_P (c) ||
+		  (CHAR_BYTE8_P (c) && ((c = CHAR_TO_BYTE8 (c)), true)))
+		FETCH_BYTE (start_byte) = c;
 	    }
 	  else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
 	    FETCH_BYTE (start_byte) = c;
-- 
2.8.0.rc3.226.g39d4020
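
The condition added to casify_region uses the comma operator to convert
and test in one expression.  The sketch below mimics that shape outside
Emacs.  The SKETCH_ macros are stand-ins written for this illustration,
and the constants in them (raw bytes 0x80..0xFF represented as
characters 0x3FFF80..0x3FFFFF) are an assumption about Emacs’s internal
character layout, not something stated in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the Emacs macros; the raw-byte constants are assumed.  */
#define SKETCH_ASCII_CHAR_P(c)  ((unsigned) (c) < 0x80)
#define SKETCH_CHAR_BYTE8_P(c)  ((c) > 0x3FFF7F)
#define SKETCH_CHAR_TO_BYTE8(c) ((c) - 0x3FFF00)

/* Store C into *BYTE if it is ASCII or a raw-byte character, and
   return true.  Otherwise leave *BYTE untouched and return false.  */
bool
sketch_store_unibyte (int c, unsigned char *byte)
{
  /* "(c = ..., true)" performs the conversion and then yields true, so
     the assignment only happens when SKETCH_CHAR_BYTE8_P held.  */
  if (SKETCH_ASCII_CHAR_P (c)
      || (SKETCH_CHAR_BYTE8_P (c)
	  && ((c = SKETCH_CHAR_TO_BYTE8 (c)), true)))
    {
      *byte = (unsigned char) c;
      return true;
    }
  return false;
}

void
sketch_check_store (void)
{
  unsigned char b = 'i';
  assert (!sketch_store_unibyte (0x0130, &b) && b == 'i');
  assert (sketch_store_unibyte ('S', &b) && b == 'S');
  assert (sketch_store_unibyte (0x3FFFFF, &b) && b == 0xFF);
}
```

Because || and && short-circuit, an ASCII character is stored as is, a
raw-byte character is converted before being stored, and anything else
leaves the buffer byte unchanged.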

This bug report was last modified 8 years and 307 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
