GNU bug report logs - #66760
29.1; [BUG] GB18030 Incorrect Encoding

Package: emacs

Reported by: Ruijie Yu <yuruijie <at> sics.ac.cn>

Date: Thu, 26 Oct 2023 13:18:01 UTC

Severity: normal

Tags: confirmed, help

Found in version 29.1

To reply to this bug, email your comments to 66760 AT debbugs.gnu.org.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#66760; Package emacs. (Thu, 26 Oct 2023 13:18:02 GMT)

Acknowledgement sent to Ruijie Yu <yuruijie <at> sics.ac.cn>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 26 Oct 2023 13:18:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Ruijie Yu" <yuruijie <at> sics.ac.cn>
To: bug-gnu-emacs <at> gnu.org
Subject: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 19:43:54 +0800

Hello,

I have noticed that in GB18030 encoding, certain ranges of characters
have incorrect encodings.

One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
and verified from other programs such as iconv and MySQL), whereas the
observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
offset.

This behavior can be reproduced by the following recipe under both
GNU/Linux and Windows:

--8<---------------cut here---------------start------------->8---
$ emacs
C-x h DEL
C-x C-m f gb18030 RET
C-x 8 RET 217a RET
M-<
C-u C-x =
;; observe the "file code":
;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
--8<---------------cut here---------------end--------------->8---

In contrast, this is what I get on MySQL (which I have also verified
against the GB18030 standard):

--8<---------------cut here---------------start------------->8---
> CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> INSERT INTO gb VALUES (0, 'ⅺ');
> SELECT HEX(c) FROM gb;

+----------+
| hex(c)   |
+----------+
| 8136C530 |
+----------+
--8<---------------cut here---------------end--------------->8---

Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
has the encoding 82 36 B9 36 on Emacs, whereas MySQL has 82 36 BA 35,
which has an offset of 9 codepoints.
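
The two offsets quoted above can be checked with a little arithmetic.  The
following Python sketch assumes the usual layout of GB18030 four-byte codes
(first and third bytes in 0x81..0xFE, second and fourth bytes in 0x30..0x39)
and treats each code as a mixed-radix number.  The helper name
`gb18030_linear` is made up for this illustration; it is not part of Emacs,
glibc, or MySQL.

```python
def gb18030_linear(code: bytes) -> int:
    """Return the position of a 4-byte GB18030 code in enumeration order.

    Assumes the four-byte layout described above: bytes 1 and 3 range over
    0x81..0xFE (126 values), bytes 2 and 4 over 0x30..0x39 (10 values).
    """
    if len(code) != 4:
        raise ValueError("expected a 4-byte code")
    b1, b2, b3, b4 = code
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError("not a valid 4-byte GB18030 code")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


# U+217A: code expected by the report vs. code observed in Emacs.
offset_217a = (gb18030_linear(bytes.fromhex("8136C530"))
               - gb18030_linear(bytes.fromhex("8136C439")))

# U+A642: code reported by MySQL vs. code observed in Emacs.
offset_a642 = (gb18030_linear(bytes.fromhex("8236BA35"))
               - gb18030_linear(bytes.fromhex("8236B936")))

print(offset_217a, offset_a642)  # prints "1 9"
```

The differences of 1 and 9 match the offsets stated in the report, which
suggests the Emacs codes drift further from the expected ones as the
codepoints increase.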

Could someone with more expertise and time look into why there is a
mismatch between Emacs' GB18030 data and the standard?

[1]:
https://openstd.samr.gov.cn/bzgk/gb/newGbInfo?hcno=A1931A578FE14957104988029B0833D3
(200+MB PDF.  Unfortunately this is the only official source which I can find, and it
requires a captcha.)

-- 

Best,

RY

In GNU Emacs 29.1 (build 2, x86_64-w64-mingw32) of 2023-08-02 built on
 AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.19045
System Description: Microsoft Windows 10 Enterprise (v10.0.2009.19045.3086)

Configured using:
 'configure --with-modules --without-dbus --with-native-compilation=aot
 --without-compress-install --with-tree-sitter CFLAGS=-O2'

Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB

(NATIVE_COMP present but libgccjit not available)

Important settings:
  value of $LANG: CHS
  locale-coding-system: cp936

Major mode: Lisp Interaction

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66760; Package emacs. (Thu, 26 Oct 2023 13:28:01 GMT)

Message #8 received at 66760 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Ruijie Yu <yuruijie <at> sics.ac.cn>
Cc: 66760 <at> debbugs.gnu.org
Subject: Re: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 16:26:52 +0300

> Date: Thu, 26 Oct 2023 19:43:54 +0800
> From: "Ruijie Yu" <yuruijie <at> sics.ac.cn>
> 
> Hello,
> 
> I have noticed that in GB18030 encoding, certain ranges of characters
> have incorrect encodings.
> 
> One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
> encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
> and verified from other programs such as iconv and MySQL), whereas the
> observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
> offset.
> 
> This behavior can be reproduced by the following recipe under both
> GNU/Linux and Windows:
> 
> --8<---------------cut here---------------start------------->8---
> $ emacs
> C-x h DEL
> C-x C-m f gb18030 RET
> C-x 8 RET 217a RET
> M-<
> C-u C-x =
> ;; observe the "file code":
> ;; file code: #x81 #x36 #xC4 #x39 (encoded by coding system chinese-gb18030-dos)
> --8<---------------cut here---------------end--------------->8---
> 
> In contrast, this is what I get on MySQL (which I have also verified
> against the GB18030 standard):
> 
> --8<---------------cut here---------------start------------->8---
> > CREATE TABLE gb (id INT, c TEXT CHARACTER SET GB18030);
> > INSERT INTO gb VALUES (0, 'ⅺ');
> > SELECT HEX(c) FROM gb;
> 
> +----------+
> | hex(c)   |
> +----------+
> | 8136C530 |
> +----------+
> --8<---------------cut here---------------end--------------->8---
> 
> Beyond this, I also noticed that U+A642 (CYRILLIC CAPITAL LETTER DZELO)
> has the encoding 82 36 B9 36 on Emacs, whereas MySQL has 82 36 BA 35,
> which has an offset of 9 codepoints.
> 
> Could someone with more expertise and time look into why there is a
> mismatch between Emacs' GB18030 data and the standard?

Alas, we don't have such experts on board, not anymore.  So we must do
it on our own somehow.

The mapping of GB18030 to Unicode is taken from glibc, see
etc/charsets/GB180302.map and etc/charsets/GB180304.map.  It is
possible that you are talking about a newer version of the GB18030
standard than the one these two mappings follow.  It is also possible
that glibc has since updated the mappings, and we failed to follow
suit.  If so, we need either to update the existing mappings or to add
newer mappings.  Could you please see what needs to be done in this
regard?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66760; Package emacs. (Thu, 26 Oct 2023 14:22:02 GMT)

Message #11 received at 66760 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> suse.de>
To: "Ruijie Yu" <yuruijie <at> sics.ac.cn>
Cc: 66760 <at> debbugs.gnu.org
Subject: Re: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Thu, 26 Oct 2023 16:20:59 +0200

On Oct 26 2023, Ruijie Yu wrote:

> I have noticed that in GB18030 encoding, certain ranges of characters
> have incorrect encodings.
>
> One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
> encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
> and verified from other programs such as iconv and MySQL), whereas the
> observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
> offset.

This is a bug in the generation of GB180304.map.  The gb180303.awk
script assumes that the 4-byte encodings of GB18030 are assigned, in
Unicode codepoint order, to the codepoints that have no 2-byte
encoding, but in some places codepoints from the PUA are inserted into
the sequence.  For example, U+1E3E maps to 81 35 F4 36, and the next
codepoint not mapped to a 2-byte code is U+1E40, but that maps to
81 35 F4 38, whereas 81 35 F4 37 is the encoding of U+E7C7.  So the
output gets out of sync.
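
The drift described above can be reproduced with a toy model.  The Python
sketch below is not the logic of gb180303.awk; it only imitates the
assumption attributed to it (consecutive 4-byte codes for consecutive
codepoints lacking a 2-byte code).  The function names are hypothetical, the
byte ranges are the usual GB18030 four-byte layout, and the only data used
are the three mappings quoted in this message, plus the implied fact that
U+1E3F has a 2-byte code.

```python
def gb18030_to_linear(code: bytes) -> int:
    """Position of a 4-byte GB18030 code in enumeration order."""
    b1, b2, b3, b4 = code
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


def linear_to_gb18030(index: int) -> bytes:
    """Inverse of gb18030_to_linear."""
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30])


def naive_fill(first_cp: int, first_code: bytes, last_cp: int, has_two_byte: set) -> dict:
    """Give consecutive 4-byte codes to codepoints without a 2-byte code.

    This is the simplifying assumption under discussion, not a correct
    GB18030 table generator.
    """
    index = gb18030_to_linear(first_code)
    table = {}
    for cp in range(first_cp, last_cp + 1):
        if cp in has_two_byte:
            continue  # already covered by a 2-byte code, no 4-byte code needed
        table[cp] = linear_to_gb18030(index)
        index += 1
    return table


# From the message: U+1E3E -> 81 35 F4 36 and U+1E40 -> 81 35 F4 38.
# U+1E3F is skipped because it is said to have a 2-byte code.
naive = naive_fill(0x1E3E, bytes.fromhex("8135F436"), 0x1E40, {0x1E3F})
expected_1e40 = bytes.fromhex("8135F438")

print(naive[0x1E40].hex(), "vs", expected_1e40.hex())
```

In this model U+1E40 receives 81 35 F4 37, the code that actually belongs
to U+E7C7, so every later codepoint is shifted by one.  Each further PUA
insertion would add another shift, which is consistent with the offsets of
1 and 9 seen in the original report.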

-- 
Andreas Schwab, SUSE Labs, schwab <at> suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#66760; Package emacs. (Sat, 04 Nov 2023 08:27:02 GMT)

Message #14 received at 66760 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Schwab <schwab <at> suse.de>
Cc: 66760 <at> debbugs.gnu.org, yuruijie <at> sics.ac.cn
Subject: Re: bug#66760: 29.1; [BUG] GB18030 Incorrect Encoding
Date: Sat, 04 Nov 2023 10:25:07 +0200

> Cc: 66760 <at> debbugs.gnu.org
> From: Andreas Schwab <schwab <at> suse.de>
> Date: Thu, 26 Oct 2023 16:20:59 +0200
> 
> On Oct 26 2023, Ruijie Yu wrote:
> 
> > I have noticed that in GB18030 encoding, certain ranges of characters
> > have incorrect encodings.
> >
> > One example is U+217A (SMALL ROMAN NUMERAL ELEVEN).  The expected
> > encoding is 81 36 C5 30 (as can be seen from the GB18030 standard [1]
> > and verified from other programs such as iconv and MySQL), whereas the
> > observed encoding within Emacs is 81 36 C4 39, with a 1-codepoint
> > offset.
> 
> This is a bug in the generation of GB180304.map.  The gb180303.awk
> script assumes that the 4-byte encodings of GB18030 are filling the
> holes in sequence of characters with a 2-byte encoding by Unicode
> codepoint order, but there are some places where codepoints from the PUA
> area are inserted into the sequence.  For example, U+1E3E maps to 81 35
> F4 36, the next codepoint not mapped to a 2-byte code is U+1E40, but
> that maps to 81 35 F4 38, whereas 81 35 F4 37 is the encoding of U+E7C7.
> So the output gets out of sync.

Thanks.  I don't think I understand the issue well enough, so patches
are welcome to fix this problem in the Awk script.

Added tag(s) help and confirmed. Request was from Stefan Kangas <stefankangas <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 10 Jan 2024 18:01:02 GMT)

This bug report was last modified 1 year and 217 days ago.
