GNU bug report logs - #23814
24.5; bug of hz coding-system


Package: emacs;

Reported by: ynyaaa <at> gmail.com

Date: Tue, 21 Jun 2016 12:23:02 UTC

Severity: normal

Found in version 24.5

Fixed in version 26.1

Done: Glenn Morris <rgm <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23814 in the body.
You can then email your comments to 23814 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Tue, 21 Jun 2016 12:23:02 GMT)

Acknowledgement sent to ynyaaa <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 21 Jun 2016 12:23:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: bug-gnu-emacs <at> gnu.org
Subject: 24.5; bug of hz coding-system
Date: Tue, 21 Jun 2016 21:22:32 +0900
The hz coding-system should encode chinese-gb2312 characters,
but it may fail to encode text that has no charset property.

current-language-environment
=>"Japanese"

;; wrong
(encode-coding-string "\x4E00" 'hz)
=>"\e$B0l~}"

;; correct
(encode-coding-string (propertize "\x4E00" 'charset 'chinese-gb2312) 'hz)
=>"~{R;~}"


When the second byte of a chinese-gb2312 character is ?~,
the hz coding-system may fail to decode it.

(encode-coding-string (propertize "\x670D" 'charset 'chinese-gb2312) 'hz)
=>"~{7~~}"

;; wrong
(decode-coding-string "~{7~~}" 'hz)
=>"\300\267"
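
A minimal sketch of why a decoder has to consume GB2312 bytes in pairs,
assuming a hypothetical helper `my-hz-segments' that is not part of
china-util.el.  It only splits an HZ string into ASCII and GB segments and
does no charset decoding.  Inside "~{...~}" every character is two bytes, so
a "~" in second-byte position belongs to the character, and only a "~}" at a
pair boundary ends the segment.

```elisp
;; Hypothetical helper for illustration; not the code used by Emacs.
(defun my-hz-segments (hz)
  "Split the HZ string HZ into a list of (MODE . TEXT) segments.
MODE is `ascii' or `gb'.  Empty segments are dropped."
  (let ((i 0)
        (len (length hz))
        (gb nil)                        ; non-nil while inside "~{ ... ~}"
        (chunk "")
        (result '()))
    (while (< i len)
      (let ((two (substring hz i (min len (+ i 2)))))
        (cond
         ;; ASCII mode: "~{" switches to GB mode.
         ((and (not gb) (string= two "~{"))
          (push (cons 'ascii chunk) result)
          (setq chunk "" gb t i (+ i 2)))
         ;; GB mode: "~}" at a pair boundary switches back to ASCII.
         ((and gb (string= two "~}"))
          (push (cons 'gb chunk) result)
          (setq chunk "" gb nil i (+ i 2)))
         ;; GB mode: consume one two-byte character as a unit, so a
         ;; second byte of ?~ is never read as the start of an escape.
         (gb
          (setq chunk (concat chunk two) i (+ i 2)))
         ;; ASCII mode: "~~" stands for a literal tilde.
         ((string= two "~~")
          (setq chunk (concat chunk "~") i (+ i 2)))
         ;; ASCII mode: any other byte is copied through.
         (t
          (setq chunk (concat chunk (substring hz i (1+ i))) i (1+ i))))))
    (push (cons (if gb 'gb 'ascii) chunk) result)
    (delq nil (mapcar (lambda (seg) (and (> (length (cdr seg)) 0) seg))
                      (nreverse result)))))
```

With this sketch, (my-hz-segments "~{7~~}") returns ((gb . "7~")): the
bytes "7~" stay together as one character and the final "~}" closes the
segment.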



In GNU Emacs 24.5.1 (i686-pc-mingw32)
 of 2015-04-11 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.0.6002
Configured using:
 `configure --prefix=/c/usr --host=i686-pc-mingw32'

Important settings:
  value of $LANG: JPN
  locale-coding-system: cp932

Major mode: Lisp Interaction

Minor modes in effect:
  tooltip-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:

Load-path shadows:
None found.

Features:
(network-stream starttls tls mailalias smtpmail auth-source eieio
byte-opt bytecomp byte-compile cl-extra cl-loaddefs cl-lib cconv
eieio-core password-cache rect warnings china-util misearch
multi-isearch pp shadow sort gnus-util mail-extr emacsbug message
format-spec rfc822 mml mml-sec mm-decode mm-bodies mm-encode mail-parse
rfc2231 mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045
ietf-drums mm-util mail-prsvr mail-utils help-mode easymenu advice
help-fns time-date japan-util tooltip electric uniquify ediff-hook
vc-hooks lisp-float-type mwheel dos-w32 ls-lisp w32-common-fns
disp-table w32-win w32-vars tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode prog-mode register page menu-bar
rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax
facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak
czech european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer nadvice loaddefs button
faces cus-face macroexp files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote make-network-process w32notify w32 multi-tty emacs)

Memory information:
((conses 8 94845 27098)
 (symbols 32 19573 0)
 (miscs 32 77 279)
 (strings 16 16482 13821)
 (string-bytes 1 462365)
 (vectors 8 12746)
 (vector-slots 4 519456 11240)
 (floats 8 62 556)
 (intervals 28 606 13)
 (buffers 508 18))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Tue, 21 Jun 2016 13:00:02 GMT)

Message #8 received at 23814 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Tue, 21 Jun 2016 15:58:39 +0300
> From: ynyaaa <at> gmail.com
> Date: Tue, 21 Jun 2016 21:22:32 +0900
> 
> hz coding-system should encode chinese-gb2312 characters,
> it may fail to encode text without charset property.

This is by design, and mentioned in the doc string of that
coding-system.  Since Emacs is Unicode based, the _only_ way of having
"chinese-gb2312 characters" is by using that text property.

IOW, I don't think this is a bug.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 22 Jun 2016 13:48:01 GMT)

Message #11 received at 23814 <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 22 Jun 2016 22:47:00 +0900
Eli Zaretskii <eliz <at> gnu.org> writes:

> This is by design, and mentioned in the doc string of that
> coding-system.  Since Emacs is Unicode based, the _only_ way of having
> "chinese-gb2312 characters" is by using that text property.

`encode-hz-region' uses the `iso-2022-7bit' coding-system internally;
replacing it with the coding-system below will work.

(define-coding-system 'iso-2022-cn-gb
  "ISO 2022 based 7bit encoding only for Chinese GB2312."
  :coding-type 'iso-2022
  :mnemonic ?C
  :charset-list '(ascii chinese-gb2312)
  :designation [(ascii chinese-gb2312) nil nil nil]
  :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
  )




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 22 Jun 2016 15:30:02 GMT)

Message #14 received at 23814 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 22 Jun 2016 18:28:15 +0300
> From: ynyaaa <at> gmail.com
> Cc: 23814 <at> debbugs.gnu.org
> Date: Wed, 22 Jun 2016 22:47:00 +0900
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > This is by design, and mentioned in the doc string of that
> > coding-system.  Since Emacs is Unicode based, the _only_ way of having
> > "chinese-gb2312 characters" is by using that text property.
> 
> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> replacing it with the coding-system below will work.
> 
> (define-coding-system 'iso-2022-cn-gb
>   "ISO 2022 based 7bit encoding only for Chinese GB2312."
>   :coding-type 'iso-2022
>   :mnemonic ?C
>   :charset-list '(ascii chinese-gb2312)
>   :designation [(ascii chinese-gb2312) nil nil nil]
>   :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
>   )

What advantages does this change have?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 22 Jun 2016 17:05:01 GMT)

Message #17 received at 23814 <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Thu, 23 Jun 2016 02:04:18 +0900
Eli Zaretskii <eliz <at> gnu.org> writes:

>> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
>> replacing it with the coding-system below will work.
>> 
>> (define-coding-system 'iso-2022-cn-gb
>>   "ISO 2022 based 7bit encoding only for Chinese GB2312."
>>   :coding-type 'iso-2022
>>   :mnemonic ?C
>>   :charset-list '(ascii chinese-gb2312)
>>   :designation [(ascii chinese-gb2312) nil nil nil]
>>   :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
>>   )
>
> What advantages does this change have?

`iso-2022-7bit' may encode the same character to different strings,
while `iso-2022-cn-gb' always encodes the same character to the same string.

(mapcar (lambda (cs) (encode-coding-string
                      (propertize "\x4e00" 'charset cs)
                      'iso-2022-7bit))
        '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
                         chinese-cns11643-1))
=>("\e$AR;\e(B"
   "\e$B0l\e(B"
   "\e$(Cli\e(B"
   "\e$(GD!\e(B")

(mapcar (lambda (cs) (encode-coding-string
                      (propertize "\x4e00" 'charset cs)
                      'iso-2022-cn-gb))
        '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
                         chinese-cns11643-1))
=>("\e$AR;\e(B"
   "\e$AR;\e(B"
   "\e$AR;\e(B"
   "\e$AR;\e(B")

`encode-hz-region' expects `chinese-gb2312' characters to be encoded
with "\e$A" sequences, and replaces those sequences with "~{".
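
The dependence on the "\e$A" designation can be illustrated with a small
sketch.  The helper name `my-iso2022-gb-to-hz' is hypothetical, and the code
is a simplified stand-in for this step, not the actual body of
`encode-hz-region': it rewrites only the GB2312 designation as "~{" and the
ASCII designation as "~}".

```elisp
;; Hypothetical helper for illustration; not the code used by Emacs.
(defun my-iso2022-gb-to-hz (s)
  "Rewrite ISO-2022 designations in S as HZ escapes.
Only \"\\e$A\" (GB2312) and \"\\e(B\" (ASCII) are recognized."
  (let ((s (replace-regexp-in-string (regexp-quote "\e$A") "~{" s t t)))
    (replace-regexp-in-string (regexp-quote "\e(B") "~}" s t t)))

;; A string that iso-2022-7bit produced with the GB2312 designation
;; converts cleanly.
(my-iso2022-gb-to-hz "\e$AR;\e(B")      ; => "~{R;~}"

;; A string that used the JIS X 0208 designation keeps its "\e$B",
;; while the trailing ASCII designation is still rewritten.
(my-iso2022-gb-to-hz "\e$B0l\e(B")      ; => "\e$B0l~}"
```

The second result has the same shape as the wrong output in the original
report, which is why the charset chosen by the intermediate encoding matters.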




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 22 Jun 2016 17:28:02 GMT)

Message #20 received at 23814 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ynyaaa <at> gmail.com, Kenichi Handa <handa <at> gnu.org>
Cc: 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 22 Jun 2016 20:26:53 +0300
> From: ynyaaa <at> gmail.com
> Cc: 23814 <at> debbugs.gnu.org
> Date: Thu, 23 Jun 2016 02:04:18 +0900
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> >> replacing it with the coding-system below will work.
> >> 
> >> (define-coding-system 'iso-2022-cn-gb
> >>   "ISO 2022 based 7bit encoding only for Chinese GB2312."
> >>   :coding-type 'iso-2022
> >>   :mnemonic ?C
> >>   :charset-list '(ascii chinese-gb2312)
> >>   :designation [(ascii chinese-gb2312) nil nil nil]
> >>   :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
> >>   )
> >
> > What advantages does this change have?
> 
> `iso-2022-7bit' may encode same character to various strings,
> while `iso-2022-cn-gb' encodes same charcter to same string.
> 
> (mapcar (lambda (cs) (encode-coding-string
>                       (propertize "\x4e00" 'charset cs)
>                       'iso-2022-7bit))
>         '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
>                          chinese-cns11643-1))
> =>("\e$AR;\e(B"
>    "\e$B0l\e(B"
>    "\e$(Cli\e(B"
>    "\e$(GD!\e(B")
> 
> (mapcar (lambda (cs) (encode-coding-string
>                       (propertize "\x4e00" 'charset cs)
>                       'iso-2022-cn-gb))
>         '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
>                          chinese-cns11643-1))
> =>("\e$AR;\e(B"
>    "\e$AR;\e(B"
>    "\e$AR;\e(B"
>    "\e$AR;\e(B")
> 
> `encode-hz-region' expects `chinese-gb2312' characters are encoded
> with "\e$A" sequences, and replaces them to "~{".

I understand, but as I said, I think this is by design, and should not
be changed.  However, maybe I'm missing something, so I'll CC
Handa-san and ask him to comment on this proposal and the issue in
general.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Sat, 09 Jul 2016 11:21:01 GMT)

Message #23 received at 23814 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: handa <at> gnu.org
Cc: ynyaaa <at> gmail.com, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Sat, 09 Jul 2016 14:20:19 +0300
Ping!  Could you please comment on this issue?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 13 Jul 2016 14:14:01 GMT)

Message #26 received at 23814 <at> debbugs.gnu.org:

From: handa <handa <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: ynyaaa <at> gmail.com, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 13 Jul 2016 23:12:47 +0900
In article <83d1mngirw.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> Ping!  Could you please comment on this issue?

Sorry, I've overlooked that mail.

> > > >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> > > >> replacing it with the coding-system below will work.
> > > >> 
> > > >> (define-coding-system 'iso-2022-cn-gb
> > > >>   "ISO 2022 based 7bit encoding only for Chinese GB2312."
> > > >>   :coding-type 'iso-2022
> > > >>   :mnemonic ?C
> > > >>   :charset-list '(ascii chinese-gb2312)
> > > >>   :designation [(ascii chinese-gb2312) nil nil nil]
> > > >>   :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
> > > >>   )

Right.  But, as there are already so many iso-2022 based coding systems,
I'd like to avoid adding a new one just for encode-hz-region.  I think
the attached patch is sufficient.  Could you please try it?  It also
fixes the problem of incorrect decoding of "~{7~~}".

---
K. Handa
handa <at> gnu.org

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..9735bd6 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -95,7 +95,9 @@ decode-hz-region
 	(goto-char (point-min))
 	(while (search-forward "~" nil t)
 	  (setq ch (following-char))
-	  (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+          (if (= ch ?{)
+              (search-forward "~}" nil 'move)
+            (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))))
 
 	;; "^zW...\n" -> Chinese GB2312
 	;; "~{...~}"  -> Chinese GB2312
@@ -141,7 +143,7 @@ encode-hz-region
   (save-excursion
     (save-restriction
       (narrow-to-region beg end)
-
+      (put-text-property beg end 'charset 'chinese-gb2312)
       ;; "~" -> "~~"
       (goto-char (point-min))
       (while (search-forward "~" nil t)	(insert ?~))





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Sat, 23 Jul 2016 17:48:02 GMT)

Message #29 received at 23814 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: handa <handa <at> gnu.org>, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Sat, 23 Jul 2016 20:47:27 +0300
Ping!  Could you please try this patch and see if it solves the
problem?





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Sun, 24 Jul 2016 08:22:02 GMT)

Message #32 received at 23814 <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: handa <handa <at> gnu.org>, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Sun, 24 Jul 2016 17:21:08 +0900
Eli Zaretskii <eliz <at> gnu.org> writes:

> Ping!  Could you please try this patch and see if it solves the
> problem?

The patch seems to give better results.

But I found other bugs in the decoding of "~" escapes.
"~~" and "~{!!~}" should be encoded and decoded as below.
    "~~" -> "~~~~" -> "~~"
    "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"

In reality they are encoded properly, but decoded the wrong way.
    (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
    => "~"
    (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
    => #("\x3000" 0 1 (charset chinese-gb2312))

These behaviors are not affected by the patch.





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Tue, 26 Jul 2016 15:10:02 GMT)

Message #35 received at 23814 <at> debbugs.gnu.org:

From: handa <handa <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: eliz <at> gnu.org, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 27 Jul 2016 00:09:24 +0900
In article <87twffigzv.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:

> But I found other bugs about decodings of "~" escape.
> "~~" and "~{!!~}" should be encoded and decoded as below.
>     "~~" -> "~~~~" -> "~~"
>     "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"

> In really they are encoded properly, but decoded in wrong way.
>     (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
>     => "~"
>     (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
>     => #("\x3000" 0 1 (charset chinese-gb2312))

Thank you for finding those bugs.  Could you please try the attached
patch instead?

---
K. Handa
handa <at> gnu.org

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..9abdae1 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -95,7 +95,12 @@ decode-hz-region
 	(goto-char (point-min))
 	(while (search-forward "~" nil t)
 	  (setq ch (following-char))
-	  (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+          (if (= ch ?{)
+              (search-forward "~}" nil 'move)
+            (when (or (= ch ?\n) (= ch ?~))
+              (delete-char -1)
+              (put-text-property (point) (1+ (point)) 'hz-decoded t)
+              (forward-char 1))))
 
 	;; "^zW...\n" -> Chinese GB2312
 	;; "~{...~}"  -> Chinese GB2312
@@ -104,6 +109,8 @@ decode-hz-region
 	(while (re-search-forward hz/zw-start-gb nil t)
 	  (setq pos (match-beginning 0)
 		ch (char-after pos))
+          (if (and (= ch ?~) (get-text-property pos 'hz-decoded))
+              (forward-char 1)
 	  ;; Record the first position to start conversion.
 	  (or beg (setq beg pos))
 	  (end-of-line)
@@ -122,9 +129,10 @@ decode-hz-region
 				  t)
 		  (delete-char -2))
 	      (setq end (point))
-	      (translate-region pos (point) hz-set-msb-table))))
+	      (translate-region pos (point) hz-set-msb-table)))))
 	(if beg
 	    (decode-coding-region beg end 'euc-china)))
+      (remove-text-properties (point-min) (point-max) '(hz-decoded nil))
       (- (point-max) (point-min)))))
 
 ;;;###autoload
@@ -142,6 +150,7 @@ encode-hz-region
     (save-restriction
       (narrow-to-region beg end)
 
+      (put-text-property beg end 'charset 'chinese-gb2312)
       ;; "~" -> "~~"
       (goto-char (point-min))
       (while (search-forward "~" nil t)	(insert ?~))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Fri, 29 Jul 2016 01:06:01 GMT)

Message #38 received at 23814 <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: handa <handa <at> gnu.org>
Cc: eliz <at> gnu.org, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Fri, 29 Jul 2016 10:05:14 +0900
handa <handa <at> gnu.org> writes:

> In article <87twffigzv.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
>
>> But I found other bugs about decodings of "~" escape.
>> "~~" and "~{!!~}" should be encoded and decoded as below.
>>     "~~" -> "~~~~" -> "~~"
>>     "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"
>
>> In really they are encoded properly, but decoded in wrong way.
>>     (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
>>     => "~"
>>     (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
>>     => #("\x3000" 0 1 (charset chinese-gb2312))
>
> Thank you for finding those bugs.  Could you please try the attached
> patch instead?
>
> ---
> K. Handa
> handa <at> gnu.org

If there are unencodable characters, encodable characters may be broken.
In this example, the second ?\x4E00 character disappears.
    (set-language-environment 'Chinese-GB)
    (decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
    => "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"

To avoid this behavior, there are some solutions.
(a) While decoding, replace "~{...~}" with "\e$A...\e(B"
    and decode with iso-2022-7bit.
(b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding
    and insert "\e$)A" at the beginning of the temp buffer
    and decode with iso-2022-8bit-ss2.
    (8bit data are decoded as euc-cn.)
(c) While encoding, use euc-cn instead of iso-2022-7bit
    and translate each consecutive 8bit data to 7bit data
    prefixed by "~{" and postfixed by "~}".


By the way, RFC1843 describes:
    The escape sequence '~\n' is a line-continuation marker to be
    consumed with no output produced.

This form should return "AB".
    (decode-coding-string "A~\nB" 'hz)
    => "A\nB"
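
The two ASCII-mode escapes quoted from RFC 1843 can be sketched as follows.
The helper name `my-hz-unescape-ascii' is hypothetical, and the sketch
assumes the input contains no "~{...~}" segments, so it covers only "~~" and
"~\n" and is not a full HZ decoder.

```elisp
;; Hypothetical helper for illustration; not the code used by Emacs.
(defun my-hz-unescape-ascii (s)
  "Apply the RFC 1843 ASCII-mode escapes to S.
\"~~\" becomes a literal tilde and \"~\\n\" is consumed with no output.
S is assumed to contain no \"~{\" segments."
  (replace-regexp-in-string
   "~\\(~\\|\n\\)"
   ;; Matches are found left to right, so in "~~\n" the pair "~~" is
   ;; taken first and the newline that follows is kept.
   (lambda (m) (if (string= m "~~") "~" ""))
   s t t))

(my-hz-unescape-ascii "A~\nB")          ; => "AB"
(my-hz-unescape-ascii "A~~\nB")         ; => "A~\nB"
```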





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Sun, 14 Aug 2016 11:23:02 GMT)

Message #41 received at 23814 <at> debbugs.gnu.org:

From: handa <handa <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: eliz <at> gnu.org, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Sun, 14 Aug 2016 20:22:25 +0900
Hi, sorry for the late response.  I've just noticed that my reply mail
didn't go out successfully.  I'm trying to re-send it.

I wrote:

> In article <871t2dz22d.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
> > If there are unencodable characters, encodable characters may be broken.
> > In this example, the second ?\x4E00 character disappears.
> >     (set-language-environment 'Chinese-GB)
> >     (decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
> >     => "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"
> 
> How to treat unencodable characters on encoding is a difficult problem.
> As HZ is designed for 7-bit environment, I think it's important to keep
> 7-bit on encoding.  So, the new code uses \uXXXX for those characters.
> Another way is to use UTF-8 sequence for them, then we can decode it
> back.  Which do you think is better?
> 
> > To avoid this behavior, there are some solutions.
> > (a) While decoding, replace "~{...~}" with "\e$A...\e(B"
> >     and decode with iso-2022-7bit.
> > (b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding
> >     and insert "\e$)A" at the beginning of the temp buffer
> >     and decode with iso-2022-8bit-ss2.
> >     (8bit data are decoded as euc-cn.)
> > (c) While encoding, use euc-cn instead of iso-2022-7bit
> >     and translate each consecutive 8bit data to 7bit data
> >     prefixed by "~{" and postfixed by "~}".
> 
> I adopted method (a) for decoding, and fixed bugs in the encoding code.
> 
> > By the way, RFC1843 describes:
> >     The escape sequence '~\n' is a line-continuation marker to be
> >     consumed with no output produced.
> 
> The variable decode-hz-line-continuation controls this feature.  I don't
> remember why the default is nil (i.e. do not decode ~\n), perhaps some
> Chinese people I was discussing with on implementing HZ support
> suggested that.
> 
> Attached is the full china-util.el (not a diff).
> 
> ---
> K. Handa
> handa <at> gnu.org

[china-util.el (application/emacs-lisp, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 17 Aug 2016 06:34:01 GMT)

Message #44 received at 23814 <at> debbugs.gnu.org:

From: ynyaaa <at> gmail.com
To: handa <handa <at> gnu.org>
Cc: eliz <at> gnu.org, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 17 Aug 2016 15:33:29 +0900
Hi, I tried the new china-util.el. It works very well.

handa <handa <at> gnu.org> writes:
> Hi, sorry for the late response.  I've just noticed that my reply mail
> didn't go out successfully.  I'm trying to re-send it.

>> How to treat unencodable characters on encoding is a difficult problem.
>> As HZ is designed for 7-bit environment, I think it's important to keep
>> 7-bit on encoding.  So, the new code uses \uXXXX for those characters.
>> Another way is to use UTF-8 sequence for them, then we can decode it
>> back.  Which, do yo think, is better?

I prefer 7bit encoding to use only 7bit data, too.
As for elisp, "\u12345" is treated as "\u1234\ 5".
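
One way to avoid that ambiguity is to use a fixed-width eight-digit form for
characters above U+FFFF.  The sketch below assumes a hypothetical helper
`my-hz-escape-char'; the uppercase hex digits are an assumption, not
something confirmed for china-util.el.

```elisp
;; Hypothetical helper for illustration; not the code used by Emacs.
(defun my-hz-escape-char (ch)
  "Return a 7-bit escape for the character code CH.
Characters up to U+FFFF use four hex digits after \\u; all others use
eight hex digits after \\U, so the escape always has a known length."
  (if (<= ch #xFFFF)
      (format "\\u%04X" ch)
    (format "\\U%08X" ch)))

(my-hz-escape-char #xB7)                ; => "\\u00B7"
(my-hz-escape-char #x12345)             ; => "\\U00012345"
```

Because each form has a fixed number of digits, a reader never has to guess
where the escape ends and the following text begins.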




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 17 Aug 2016 14:44:01 GMT)

Message #47 received at 23814 <at> debbugs.gnu.org:

From: handa <handa <at> gnu.org>
To: ynyaaa <at> gmail.com
Cc: eliz <at> gnu.org, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 17 Aug 2016 23:43:13 +0900
In article <87oa4rdhvq.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:

> Hi, I tried new china-util.el. It works very well.

Thank you for testing it.

> I prefer 7bit encoding to use only 7bit data, too.
> As for elisp, "\u12345" is treated as "\u1234\ 5".

Ah, OK, I changed it to encode characters outside the BMP as \UXXXXXXXX.

I've just committed the attached change.

---
K. Handa
handa <at> gnu.org

2016-08-17  handa  <handa <at> gnu.org>

	* lisp/language/china-util.el (decode-hz-region): Pay
	attention to "~~}" sequence at the end of Chinese character
	range.
	(hz-category-table): New variable.
	(encode-hz-region): Convert non-encodable characters to
	\u... and \U...  Preserve ESC on encoding.  Put
	`chinese-gb2312' `charset' text property in advance to force
	iso-2022-encoding to select chinese-gb2312 designation.

diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..6505fb8 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -88,43 +88,34 @@ decode-hz-region
       (let (pos ch)
 	(narrow-to-region beg end)
 
-	;; We, at first, convert HZ/ZW to `euc-china',
+	;; We, at first, convert HZ/ZW to `iso-2022-7bit',
 	;; then decode it.
 
-	;; "~\n" -> "\n", "~~" -> "~"
+	;; "~\n" -> "", "~~" -> "~"
 	(goto-char (point-min))
 	(while (search-forward "~" nil t)
 	  (setq ch (following-char))
-	  (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+	  (cond ((= ch ?{)
+		 (delete-region (1- (point)) (1+ (point)))
+		 (setq pos (point))
+		 (insert iso2022-gb-designation)
+		 (if (looking-at "\\([!-}][!-~]\\)*")
+		     (goto-char (match-end 0)))
+		 (if (looking-at hz-ascii-designation)
+		     (delete-region (match-beginning 0) (match-end 0)))
+		 (insert iso2022-ascii-designation)
+		 (decode-coding-region pos (point) 'iso-2022-7bit))
+
+		((= ch ?~)
+		 (delete-char 1))
+
+		((and (= ch ?\n)
+		      decode-hz-line-continuation)
+		 (delete-region (1- (point)) (1+ (point))))
+
+		(t
+		 (forward-char 1)))))
 
-	;; "^zW...\n" -> Chinese GB2312
-	;; "~{...~}"  -> Chinese GB2312
-	(goto-char (point-min))
-	(setq beg nil)
-	(while (re-search-forward hz/zw-start-gb nil t)
-	  (setq pos (match-beginning 0)
-		ch (char-after pos))
-	  ;; Record the first position to start conversion.
-	  (or beg (setq beg pos))
-	  (end-of-line)
-	  (setq end (point))
-	  (if (>= ch 128)		; 8bit GB2312
-	      nil
-	    (goto-char pos)
-	    (delete-char 2)
-	    (setq end (- end 2))
-	    (if (= ch ?z)			; ZW -> euc-china
-		(progn
-		  (translate-region (point) end hz-set-msb-table)
-		  (goto-char end))
-	      (if (search-forward hz-ascii-designation
-				  (if decode-hz-line-continuation nil end)
-				  t)
-		  (delete-char -2))
-	      (setq end (point))
-	      (translate-region pos (point) hz-set-msb-table))))
-	(if beg
-	    (decode-coding-region beg end 'euc-china)))
       (- (point-max) (point-min)))))
 
 ;;;###autoload
@@ -133,33 +124,57 @@ decode-hz-buffer
   (interactive)
   (decode-hz-region (point-min) (point-max)))
 
+(defvar hz-category-table nil)
+
 ;;;###autoload
 (defun encode-hz-region (beg end)
   "Encode the text in the current region to HZ.
 Return the length of resulting text."
   (interactive "r")
+  (unless hz-category-table
+    (setq hz-category-table (make-category-table))
+    (with-category-table hz-category-table
+      (define-category ?c "hz encodable")
+      (map-charset-chars #'modify-category-entry 'ascii ?c)
+      (map-charset-chars #'modify-category-entry 'chinese-gb2312 ?c)))
   (save-excursion
     (save-restriction
       (narrow-to-region beg end)
+      (with-category-table hz-category-table
+	;; ~ -> ~~
+	(goto-char (point-min))
+	(while (search-forward "~" nil t) (insert ?~))
+
+	;; ESC -> ESC ESC
+	(goto-char (point-min))
+	(while (search-forward "\e" nil t) (insert ?\e))
 
-      ;; "~" -> "~~"
-      (goto-char (point-min))
-      (while (search-forward "~" nil t)	(insert ?~))
-
-      ;; Chinese GB2312 -> "~{...~}"
-      (goto-char (point-min))
-      (if (re-search-forward "\\cc" nil t)
-	  (let (pos)
-	    (goto-char (setq pos (match-beginning 0)))
-	    (encode-coding-region pos (point-max) 'iso-2022-7bit)
-	    (goto-char pos)
-	    (while (search-forward iso2022-gb-designation nil t)
-	      (delete-char -3)
-	      (insert hz-gb-designation))
-	    (goto-char pos)
-	    (while (search-forward iso2022-ascii-designation nil t)
-	      (delete-char -3)
-	      (insert hz-ascii-designation))))
+	;; Non-ASCII-GB2312 -> \uXXXX
+	(goto-char (point-min))
+	(while (re-search-forward "\\Cc" nil t)
+	  (let ((ch (preceding-char)))
+	    (delete-char -1)
+	    (insert (format (if (< ch #x10000) "\\u%04X" "\\U%08X") ch))))
+
+	;; Prefer chinese-gb2312 for Chinese characters.
+	(put-text-property (point-min) (point-max) 'charset 'chinese-gb2312)
+	(encode-coding-region (point-min) (point-max) 'iso-2022-7bit)
+
+	;; ESC $ B ... ESC ( B  -> ~{ ... ~}
+	;; ESC ESC -> ESC
+	(goto-char (point-min))
+	(while (search-forward "\e" nil t)
+	  (if (= (following-char) ?\e)
+	      ;; ESC ESC -> ESC
+	      (delete-char 1)
+	    (forward-char -1)
+	    (if (looking-at iso2022-gb-designation)
+		(progn
+		  (delete-region (match-beginning 0) (match-end 0))
+		  (insert hz-gb-designation)
+		  (search-forward iso2022-ascii-designation nil 'move)
+		  (delete-region (match-beginning 0) (match-end 0))
+		  (insert hz-ascii-designation))))))
       (- (point-max) (point-min)))))
 
 ;;;###autoload
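
To illustrate why the pair-wise scan in the new decode-hz-region copes with a GB2312 character whose second byte is `~', here is a small sketch that applies the same regexp to a string. The helper hz-demo-gb-body-end is hypothetical and not part of the patch; its argument is the text that follows a "~{" designation.

```elisp
;; Hypothetical helper, for illustration only.
(defun hz-demo-gb-body-end (text)
  "Return the index in TEXT where the run of two-byte GB2312 pairs ends.
TEXT is what follows a \"~{\" designation."
  (let ((case-fold-search nil))
    ;; The first byte of a pair may not be ?~, so scanning stops at "~}".
    (string-match "\\`\\(?:[!-}][!-~]\\)*" text)
    (match-end 0)))

;; "7~" is consumed as one pair, so the remaining "~}" is the terminator.
(hz-demo-gb-body-end "7~~}")  ; → 2
(hz-demo-gb-body-end "R;~}")  ; → 2
```

Because bytes are consumed two at a time, the `~' in "7~" is never mistaken for the start of an escape sequence.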




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23814; Package emacs. (Wed, 17 Aug 2016 15:29:01 GMT)

Message #50 received at 23814 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: handa <handa <at> gnu.org>
Cc: ynyaaa <at> gmail.com, 23814 <at> debbugs.gnu.org
Subject: Re: bug#23814: 24.5; bug of hz coding-system
Date: Wed, 17 Aug 2016 18:28:06 +0300
> From: handa <handa <at> gnu.org>
> Cc: eliz <at> gnu.org,  23814 <at> debbugs.gnu.org
> Date: Wed, 17 Aug 2016 23:43:13 +0900
> 
> In article <87oa4rdhvq.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
> 
> > Hi, I tried new china-util.el. It works very well.
> 
> Thank you for testing it.
> 
> > I prefer 7bit encoding to use only 7bit data, too.
> > As for elisp, "\u12345" is treated as "\u1234\ 5".
> 
> Ah, OK, I changed it to encode characters outside the BMP as \UXXXXXXXX.
> 
> I've just committed the attached change.

Thanks.  Please close the bug if satisfied with the solution.




bug marked as fixed in version 26.1; send any further explanations to 23814 <at> debbugs.gnu.org and ynyaaa <at> gmail.com. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 01 Mar 2017 20:37:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 30 Mar 2017 11:24:04 GMT)

This bug report was last modified 8 years and 85 days ago.
