GNU bug report logs -
#24425
[PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
Reported by: Michal Nazarewicz <mina86 <at> mina86.com>
Date: Mon, 12 Sep 2016 22:48:02 UTC
Severity: normal
Tags: patch
Done: Michal Nazarewicz <mina86 <at> mina86.com>
Bug is archived. No further changes may be made.
[Message part 1 (text/plain, inline)]
Your bug report
#24425: [PATCH] Don’t cast Unicode to 8-bit when casing unibyte strings
which was filed against the emacs package, has been closed.
The explanation is attached below, along with your original report.
If you require more details, please reply to 24425 <at> debbugs.gnu.org.
--
24425: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24425
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
>> I thought about that but then another corner case is "istanbul\xff"
>> which is a unibyte string with 8-bit bytes.
On Thu, Sep 15 2016, Eli Zaretskii wrote:
> And what is the problem in that case?
Disregard. It’s actually fine.
>> I have no strong feelings either way so I’m happy just leaving it as
>> is as well.
> That is fine with me.
>
> Was there some real-life use case where you bumped into this? If so,
> maybe we should discuss that use case, perhaps the solution, if we
> need one, is something other than what we talked about until now.
There’s no real-life use case I’ve stumbled upon.
I’m playing around with src/casefiddle.c, adding support for various
corner cases (such as ﬁsh, spelled with the ‘ﬁ’ ligature, becoming Fish
or FISH), and was surprised by (upcase "istanbul") when testing Turkish
support.
--
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»
[Message part 3 (message/rfc822, inline)]
Currently, when operating on unibyte strings and buffers, if casing an
ASCII character results in a Unicode character, the result is forcefully
converted to 8-bit by masking all but the eight least significant bits.
This has awkward results such as:
    (let ((table (make-char-table 'case-table)))
      (set-char-table-parent table (current-case-table))
      (set-case-syntax-pair ?I ?ı table)
      (set-case-syntax-pair ?İ ?i table)
      (with-case-table table
        (concat (upcase "istanbul") " " (downcase "IRMA"))))
    => "0STANBUL 1rma"
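To see where the "0" and "1" come from, here is a minimal sketch of the masking step in isolation. The helper name mask_to_byte is hypothetical and only models the "c & 0xFF" behaviour described in this message; it is not the Emacs macro itself.

```c
/* Hypothetical stand-in for the old unibyte conversion as described in
   this message: keep only the eight least significant bits.  */
static int
mask_to_byte (int c)
{
  return c & 0xFF;
}

/* Code points used in the example above.  */
enum
{
  CAPITAL_I_WITH_DOT = 0x130,   /* İ, the upcase of i in the custom table */
  SMALL_DOTLESS_I = 0x131       /* ı, the downcase of I in the custom table */
};
```

Under this model, mask_to_byte (CAPITAL_I_WITH_DOT) is 0x30, the ASCII digit '0', and mask_to_byte (SMALL_DOTLESS_I) is 0x31, the ASCII digit '1'. Plain ASCII letters pass through unchanged.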
Change the code so that ASCII characters being cased to Unicode
characters are left unchanged when operating on unibyte data. In other
words, the aforementioned example will produce:
    => "iSTANBUL Irma"
Arguably this isn’t correct either, but it’s less wrong and there’s not
much we can do when the strings are unibyte.
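The rule can be sketched outside Emacs roughly as follows. Everything here is an illustrative assumption: toy_upcase is a hypothetical stand-in for a Turkish-style case table, and the sketch ignores raw 8-bit characters, which the patch handles separately via CHAR_BYTE8_P.

```c
#include <stddef.h>

/* Toy upcase under a Turkish-style case table (hypothetical, not the
   Emacs case-table machinery): 'i' maps to U+0130, other ASCII
   lowercase letters map to their ASCII uppercase form.  */
static int
toy_upcase (int c)
{
  if (c == 'i')
    return 0x130;
  if (c >= 'a' && c <= 'z')
    return c - 'a' + 'A';
  return c;
}

/* Upcase a unibyte buffer in place.  A cased result that is not ASCII
   cannot be stored in one byte, so the original byte is kept.  */
static void
toy_upcase_unibyte (unsigned char *s, size_t len)
{
  for (size_t i = 0; i < len; i++)
    {
      int c = toy_upcase (s[i]);
      if (c < 0x80)
        s[i] = (unsigned char) c;
      /* Otherwise leave s[i] unchanged.  */
    }
}
```

Calling toy_upcase_unibyte on the eight bytes of "istanbul" leaves the leading i alone and upcases the rest, giving "iSTANBUL".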
Note that casify_object had a ‘(c >= 0 && c < 256)’ condition, but since
CHAR_TO_BYTE8 (and thus MAKE_CHAR_UNIBYTE) happily casts Unicode
characters to 8-bit (i.e. c & 0xFF), this never triggered for the
discussed case.
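A small model of why that guard was vacuous, assuming (as stated above) that the conversion amounts to c & 0xFF. The function name old_check_passes is hypothetical.

```c
/* Model of the old sequence in casify_object: convert first, then
   range-check.  After masking, c is always within 0..255, so the
   check cannot reject anything.  */
static int
old_check_passes (int c)
{
  c = c & 0xFF;                 /* MAKE_CHAR_UNIBYTE modelled as a mask */
  return c >= 0 && c < 256;     /* the old guard before SSET */
}
```

In this model the function returns 1 for every input, including 0x130, which is why the "just don't change it" branch was never taken.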
* src/casefiddle.c (casify_object, casify_region): When dealing with
unibyte data, don’t attempt to store Unicode characters in the result.
---
src/casefiddle.c | 28 ++++++++++++++++------------
1 file changed, 16 insertions(+), 12 deletions(-)
Unless there are objections, I’ll commit it in a few days.
diff --git a/src/casefiddle.c b/src/casefiddle.c
index 2d32f49..247cc6f 100644
--- a/src/casefiddle.c
+++ b/src/casefiddle.c
@@ -71,8 +71,8 @@ casify_object (enum case_action flag, Lisp_Object obj)
         {
           if (! inword)
             c = upcase1 (c1);
-          if (! multibyte)
-            MAKE_CHAR_UNIBYTE (c);
+          if (! multibyte && CHAR_BYTE8_P (c))
+            c = CHAR_TO_BYTE8 (c);
           XSETFASTINT (obj, c | flags);
         }
       return obj;
@@ -93,18 +93,19 @@ casify_object (enum case_action flag, Lisp_Object obj)
           c1 = c;
           if (inword && flag != CASE_CAPITALIZE_UP)
             c = downcase (c);
-          else if (!uppercasep (c)
-                   && (!inword || flag != CASE_CAPITALIZE_UP))
-            c = upcase1 (c1);
+          else if (!inword || flag != CASE_CAPITALIZE_UP)
+            c = upcase (c1);
           if ((int) flag >= (int) CASE_CAPITALIZE)
             inword = (SYNTAX (c) == Sword);
           if (c != c1)
             {
-              MAKE_CHAR_UNIBYTE (c);
-              /* If the char can't be converted to a valid byte, just don't
-                 change it. */
-              if (c >= 0 && c < 256)
-                SSET (obj, i, c);
+              if (CHAR_BYTE8_P (c))
+                c = CHAR_TO_BYTE8 (c);
+              else if (!ASCII_CHAR_P (c))
+                /* If the char can't be converted to a valid byte, just don't
+                   change it. */
+                continue;
+              SSET (obj, i, c);
             }
         }
       return obj;
@@ -250,8 +251,11 @@ casify_region (enum case_action flag, Lisp_Object b, Lisp_Object e)
           if (! multibyte)
             {
-              MAKE_CHAR_UNIBYTE (c);
-              FETCH_BYTE (start_byte) = c;
+              /* If the char can't be converted to a valid byte, just don't
+                 change it. */
+              if (ASCII_CHAR_P (c) ||
+                  (CHAR_BYTE8_P (c) && ((c = CHAR_TO_BYTE8 (c)), true)))
+                FETCH_BYTE (start_byte) = c;
             }
           else if (ASCII_CHAR_P (c2) && ASCII_CHAR_P (c))
             FETCH_BYTE (start_byte) = c;
--
2.8.0.rc3.226.g39d4020
This bug report was last modified 8 years and 251 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.