GNU bug report logs - #66760
29.1; [BUG] GB18030 Incorrect Encoding


Package: emacs

Reported by: Ruijie Yu <yuruijie <at> sics.ac.cn>

Date: Thu, 26 Oct 2023 13:18:01 UTC

Severity: normal

Tags: confirmed, help

Found in version 29.1

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> suse.de>
Cc: 66760 <at> debbugs.gnu.org, yuruijie <at> sics.ac.cn
Subject: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Sat, 04 Nov 2023 10:25:07 +0200
> Cc: 66760 <at> debbugs.gnu.org
> From: Andreas Schwab <schwab <at> suse.de>
> Date: Thu, 26 Oct 2023 16:20:59 +0200
> 
> On Oct 26 2023, Ruijie Yu wrote:
> 
> > I have noticed that in GB18030 encoding, certain ranges of characters
> > have incorrect encodings.
> >
> > One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
> > encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
> > and verified from other programs such as iconv and MySQL), whereas the
> > observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
> > offset.
> 
> This is a bug in the generation of GB180304.map.  The gb180304.awk
> script assumes that the 4-byte encodings of GB18030 fill the holes
> in the sequence of characters with a 2-byte encoding, in Unicode
> codepoint order, but in some places codepoints from the PUA are
> inserted into the sequence.  For example, U+1E3E maps to 81 35
> F4 36, and the next codepoint not mapped to a 2-byte code is U+1E40, but
> that maps to 81 35 F4 38, whereas 81 35 F4 37 is the encoding of U+E7C7.
> So the output gets out of sync.
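
The byte sequences quoted above are easier to compare as linear indexes.
The sketch below is only an illustration, not code from Emacs: the helper
names gb18030_linear and gb18030_bytes are made up here. It relies on the
layout of a GB18030 4-byte code (bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4
in 0x30..0x39) and checks that the expected and observed encodings of U+217A
are one slot apart, and that the three codes around U+1E3E are consecutive.

```python
def gb18030_linear(seq: bytes) -> int:
    """Linear index of a 4-byte GB18030 code, counted from 81 30 81 30."""
    b1, b2, b3, b4 = seq
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def gb18030_bytes(index: int) -> bytes:
    """Inverse of gb18030_linear: rebuild the 4 bytes from a linear index."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return bytes([d1 + 0x81, d2 + 0x30, d3 + 0x81, d4 + 0x30])


# U+217A: expected versus observed encoding from the report.
expected = gb18030_linear(bytes.fromhex("8136C530"))  # 8240
observed = gb18030_linear(bytes.fromhex("8136C439"))  # 8239
print(expected - observed)  # 1, the one-slot offset described in the report

# The three codes around U+1E3E occupy consecutive slots.
slots = [gb18030_linear(bytes.fromhex(h)) for h in ("8135F436", "8135F437", "8135F438")]
print(slots)  # [7456, 7457, 7458]
```

An offset of exactly one slot at U+217A is what one would expect if a single
slot earlier in the sequence was skipped by the generator, which fits the
U+E7C7 example.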

Thanks.  I don't think I understand the issue well enough, so patches
are welcome to fix this problem in the Awk script.
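
For readers who want to see the failure mode in isolation, here is a toy
model in Python.  It is not the Awk script and not a proposed patch: the
function names, the two_byte set, and the inserted table are all
hypothetical, and the data is hand-written from the three mappings quoted
above (that U+1E3F has a 2-byte code is inferred from "the next codepoint
not mapped to a 2-byte code is U+1E40").  It only shows how a sequential
hole-filling pass drifts by one slot, and how reserving the slot for the
out-of-order codepoint keeps it in sync.

```python
def to_bytes(index: int) -> str:
    """Format a linear index as a GB18030 4-byte code, e.g. '81 35 F4 36'."""
    index, d4 = divmod(index, 10)
    index, d3 = divmod(index, 126)
    d1, d2 = divmod(index, 10)
    return " ".join(f"{b:02X}" for b in (d1 + 0x81, d2 + 0x30, d3 + 0x81, d4 + 0x30))


def fill_naive(first_cp, last_cp, first_index, two_byte):
    """Assign 4-byte slots strictly in codepoint order, skipping 2-byte codepoints."""
    mapping, index = {}, first_index
    for cp in range(first_cp, last_cp + 1):
        if cp in two_byte:
            continue
        mapping[cp] = index
        index += 1
    return mapping


def fill_with_insertions(first_cp, last_cp, first_index, two_byte, inserted):
    """Same pass, but slots listed in `inserted` go to their out-of-order codepoint."""
    mapping, index = {}, first_index
    for cp in range(first_cp, last_cp + 1):
        if cp in two_byte:
            continue
        while index in inserted:  # slot already taken by an inserted codepoint
            mapping[inserted[index]] = index
            index += 1
        mapping[cp] = index
        index += 1
    return mapping


two_byte = {0x1E3F}        # assumed: has a 2-byte code, so it leaves no hole
inserted = {7457: 0xE7C7}  # slot 81 35 F4 37 belongs to U+E7C7
start = 7456               # linear index of 81 35 F4 36 (U+1E3E)

naive = fill_naive(0x1E3E, 0x1E40, start, two_byte)
fixed = fill_with_insertions(0x1E3E, 0x1E40, start, two_byte, inserted)

print(to_bytes(naive[0x1E40]))  # 81 35 F4 37 (one slot too early)
print(to_bytes(fixed[0x1E40]))  # 81 35 F4 38 (matches the report)
print(to_bytes(fixed[0xE7C7]))  # 81 35 F4 37
```

In a real fix the table of inserted slots would have to come from the
GB18030 charmap itself rather than being written by hand.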




This bug report was last modified 1 year and 217 days ago.
