GNU bug report logs - #37580
26.3; setting buffer as unibyte temporarily may change buffer contents
Reported by: ynyaaa <at> gmail.com
Date: Wed, 2 Oct 2019 09:44:01 UTC
Severity: normal
Tags: notabug
Found in version 26.3
Done: Stefan Kangas <stefan <at> marxist.se>
Bug is archived. No further changes may be made.
> From: ynyaaa <at> gmail.com
> Cc: 37580 <at> debbugs.gnu.org
> Date: Sun, 06 Oct 2019 02:18:08 +0900
>
> Sometimes I find broken utf-8 texts on the Internet.
> Some characters are split into surrogate pairs, and each surrogate
> character is encoded as if it were a normal BMP character.
>
> The utf-8 coding system does not decode such sequences.
> Changing multibyte-ness converts them to surrogate characters.
> An encode-decode round trip through utf-16be then outputs the intended characters.
>
> Suppose the character is #x10000;
> the corresponding pair is (#xD800 #xDC00).
> The mis-encoded sequence is:
> (encode-coding-string "\xD800\xDC00" 'utf-8)
> => "\355\240\200\355\260\200"
>
> It is not decoded with utf-8.
> (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-8)
>                       'utf-8)
> => "\355\240\200\355\260\200"
>
> When the multibyte-ness is changed, the sequence is converted into
> surrogate characters.
> (with-temp-buffer
>   (insert (encode-coding-string "\xD800\xDC00" 'utf-8))
>   (set-buffer-multibyte nil)
>   (set-buffer-multibyte t)
>   (buffer-string))
> => "\xD800\xDC00"
>
> The surrogate pair can be converted into the original character.
> (decode-coding-string (encode-coding-string "\xD800\xDC00" 'utf-16be)
>                       'utf-16be)
> => "\x10000"
So where's the problem in all this? AFAIU, you describe a sequence of
actions that successfully recovers text in an obscure situation.
I think the problem is that you enable undo. So in that case, just
don't do that.
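
For reference, the arithmetic behind the (#xD800 #xDC00) <-> #x10000
correspondence in the quoted report can be sketched as below. This is only
an illustration of the standard UTF-16 surrogate formula, not code from the
report or from Emacs itself; the names my-surrogates-to-char and
my-char-to-surrogates are hypothetical helpers.

```elisp
;; Hypothetical helpers (not part of Emacs) illustrating UTF-16
;; surrogate-pair arithmetic.

(defun my-surrogates-to-char (high low)
  "Combine surrogate code units HIGH and LOW into one code point."
  (unless (and (<= #xD800 high #xDBFF) (<= #xDC00 low #xDFFF))
    (error "Not a surrogate pair: #x%X #x%X" high low))
  ;; Each surrogate carries 10 bits of the offset above #x10000.
  (+ #x10000 (ash (- high #xD800) 10) (- low #xDC00)))

(defun my-char-to-surrogates (char)
  "Split supplementary-plane CHAR into a (HIGH LOW) surrogate pair."
  (unless (<= #x10000 char #x10FFFF)
    (error "Not a supplementary-plane character: #x%X" char))
  (let ((offset (- char #x10000)))
    (list (+ #xD800 (ash offset -10))            ; upper 10 bits
          (+ #xDC00 (logand offset #x3FF)))))    ; lower 10 bits
```

With these definitions, (my-surrogates-to-char #xD800 #xDC00) evaluates to
#x10000, and (my-char-to-surrogates #x10000) evaluates to (#xD800 #xDC00),
matching the pair given in the report.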
This bug report was last modified 5 years and 264 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.