GNU bug report logs - #23814
24.5; bug of hz coding-system
Reported by: ynyaaa <at> gmail.com
Date: Tue, 21 Jun 2016 12:23:02 UTC
Severity: normal
Found in version 24.5
Fixed in version 26.1
Done: Glenn Morris <rgm <at> gnu.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 23814 in the body.
You can then email your comments to 23814 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Tue, 21 Jun 2016 12:23:02 GMT) Full text and rfc822 format available.
Acknowledgement sent to ynyaaa <at> gmail.com: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Tue, 21 Jun 2016 12:23:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
The hz coding-system should encode chinese-gb2312 characters,
but it may fail to encode text that has no charset property.
current-language-environment
=>"Japanese"
;; wrong
(encode-coding-string "\x4E00" 'hz)
=>"\e$B0l~}"
;; correct
(encode-coding-string (propertize "\x4E00" 'charset 'chinese-gb2312) 'hz)
=>"~{R;~}"
When the second byte of a chinese-gb2312 character equals ?~,
the hz coding-system may fail to decode it.
(encode-coding-string (propertize "\x670D" 'charset 'chinese-gb2312) 'hz)
=>"~{7~~}"
;; wrong
(decode-coding-string "~{7~~}" 'hz)
=>"\300\267"
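The second problem follows from the byte layout of HZ: each GB2312 character is written as two 7-bit bytes, and the trailing byte may be any value up to 0x7E, so ?~ is a legitimate second byte and can sit directly in front of the closing "~}". A minimal sketch of the mapping between a 7-bit pair and its EUC-CN bytes (the helper name `my-hz-pair-to-euc' is hypothetical and not part of Emacs):

```elisp
(defun my-hz-pair-to-euc (pair)
  "Return the EUC-CN byte values for the two-character 7-bit string PAIR."
  ;; Setting the high bit of each 7-bit byte gives the 8-bit EUC-CN byte.
  (mapcar (lambda (byte) (logior byte #x80))
          (string-to-list pair)))

;; "R;" is the pair from the first example, "7~" the pair from this one.
(my-hz-pair-to-euc "R;")   ; => (210 187), i.e. #xD2 #xBB
(my-hz-pair-to-euc "7~")   ; => (183 254), i.e. #xB7 #xFE
```

Because the pair "7~" ends in a tilde, the encoded text "~{7~~}" contains the two-character sequence "~~", which a decoder must not mistake for an escaped tilde.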
In GNU Emacs 24.5.1 (i686-pc-mingw32)
of 2015-04-11 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.0.6002
Configured using:
`configure --prefix=/c/usr --host=i686-pc-mingw32'
Important settings:
value of $LANG: JPN
locale-coding-system: cp932
Major mode: Lisp Interaction
Minor modes in effect:
tooltip-mode: t
electric-indent-mode: t
mouse-wheel-mode: t
tool-bar-mode: t
menu-bar-mode: t
file-name-shadow-mode: t
global-font-lock-mode: t
font-lock-mode: t
blink-cursor-mode: t
auto-composition-mode: t
auto-encryption-mode: t
auto-compression-mode: t
line-number-mode: t
transient-mark-mode: t
Recent messages:
Load-path shadows:
None found.
Features:
(network-stream starttls tls mailalias smtpmail auth-source eieio
byte-opt bytecomp byte-compile cl-extra cl-loaddefs cl-lib cconv
eieio-core password-cache rect warnings china-util misearch
multi-isearch pp shadow sort gnus-util mail-extr emacsbug message
format-spec rfc822 mml mml-sec mm-decode mm-bodies mm-encode mail-parse
rfc2231 mailabbrev gmm-utils mailheader sendmail rfc2047 rfc2045
ietf-drums mm-util mail-prsvr mail-utils help-mode easymenu advice
help-fns time-date japan-util tooltip electric uniquify ediff-hook
vc-hooks lisp-float-type mwheel dos-w32 ls-lisp w32-common-fns
disp-table w32-win w32-vars tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment lisp-mode prog-mode register page menu-bar
rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax
facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak
czech european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer nadvice loaddefs button
faces cus-face macroexp files text-properties overlay sha1 md5 base64
format env code-pages mule custom widget hashtable-print-readable
backquote make-network-process w32notify w32 multi-tty emacs)
Memory information:
((conses 8 94845 27098)
(symbols 32 19573 0)
(miscs 32 77 279)
(strings 16 16482 13821)
(string-bytes 1 462365)
(vectors 8 12746)
(vector-slots 4 519456 11240)
(floats 8 62 556)
(intervals 28 606 13)
(buffers 508 18))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Tue, 21 Jun 2016 13:00:02 GMT)
Message #8 received at 23814 <at> debbugs.gnu.org (full text, mbox):
> From: ynyaaa <at> gmail.com
> Date: Tue, 21 Jun 2016 21:22:32 +0900
>
> hz coding-system should encode chinese-gb2312 characters,
> it may fail to encode text without charset property.
This is by design, and mentioned in the doc string of that
coding-system. Since Emacs is Unicode based, the _only_ way of having
"chinese-gb2312 characters" is by using that text property.
IOW, I don't think this is a bug.
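To illustrate the point: a character in an Emacs string is stored as a Unicode code point, and the `charset' text property is only a hint attached to it. A small sketch using standard functions (the wrapper name `my-charset-hint-demo' is hypothetical):

```elisp
(defun my-charset-hint-demo ()
  "Compare a plain string with a copy carrying a `charset' text property."
  (let* ((plain  "\x4E00")
         (tagged (propertize plain 'charset 'chinese-gb2312)))
    (list (string= plain tagged)                  ; same characters
          (get-text-property 0 'charset plain)    ; no hint on the original
          (get-text-property 0 'charset tagged)))) ; hint only on the copy

(my-charset-hint-demo)   ; => (t nil chinese-gb2312)
```

The two strings contain the same character; only the property differs, and that property is what an ISO-2022 based encoder can consult when several charsets contain the character.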
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 22 Jun 2016 13:48:01 GMT)
Message #11 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Eli Zaretskii <eliz <at> gnu.org> writes:
> This is by design, and mentioned in the doc string of that
> coding-system. Since Emacs is Unicode based, the _only_ way of having
> "chinese-gb2312 characters" is by using that text property.
`encode-hz-region' uses `iso-2022-7bit' coding-system internally,
replacing it with the coding-system below will work.
(define-coding-system 'iso-2022-cn-gb
  "ISO 2022 based 7bit encoding only for Chinese GB2312."
  :coding-type 'iso-2022
  :mnemonic ?C
  :charset-list '(ascii chinese-gb2312)
  :designation [(ascii chinese-gb2312) nil nil nil]
  :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 22 Jun 2016 15:30:02 GMT)
Message #14 received at 23814 <at> debbugs.gnu.org (full text, mbox):
> From: ynyaaa <at> gmail.com
> Cc: 23814 <at> debbugs.gnu.org
> Date: Wed, 22 Jun 2016 22:47:00 +0900
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > This is by design, and mentioned in the doc string of that
> > coding-system. Since Emacs is Unicode based, the _only_ way of having
> > "chinese-gb2312 characters" is by using that text property.
>
> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> replacing it with the coding-system below will work.
>
> (define-coding-system 'iso-2022-cn-gb
> "ISO 2022 based 7bit encoding only for Chinese GB2312."
> :coding-type 'iso-2022
> :mnemonic ?C
> :charset-list '(ascii chinese-gb2312)
> :designation [(ascii chinese-gb2312) nil nil nil]
> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
> )
What advantages does this change have?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 22 Jun 2016 17:05:01 GMT)
Message #17 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Eli Zaretskii <eliz <at> gnu.org> writes:
>> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
>> replacing it with the coding-system below will work.
>>
>> (define-coding-system 'iso-2022-cn-gb
>> "ISO 2022 based 7bit encoding only for Chinese GB2312."
>> :coding-type 'iso-2022
>> :mnemonic ?C
>> :charset-list '(ascii chinese-gb2312)
>> :designation [(ascii chinese-gb2312) nil nil nil]
>> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
>> )
>
> What advantages does this change have?
`iso-2022-7bit' may encode the same character to different strings,
while `iso-2022-cn-gb' always encodes the same character to the same string.
(mapcar (lambda (cs) (encode-coding-string
                      (propertize "\x4e00" 'charset cs)
                      'iso-2022-7bit))
        '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
          chinese-cns11643-1))
=>("\e$AR;\e(B"
   "\e$B0l\e(B"
   "\e$(Cli\e(B"
   "\e$(GD!\e(B")
(mapcar (lambda (cs) (encode-coding-string
                      (propertize "\x4e00" 'charset cs)
                      'iso-2022-cn-gb))
        '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
          chinese-cns11643-1))
=>("\e$AR;\e(B"
   "\e$AR;\e(B"
   "\e$AR;\e(B"
   "\e$AR;\e(B")
`encode-hz-region' expects `chinese-gb2312' characters to be encoded
with "\e$A" sequences, and replaces those sequences with "~{".
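The replacement step can be sketched on plain strings. This is an illustration only, assuming the encoder swaps the GB2312 designation "\e$A" for "~{" and the ASCII designation "\e(B" for "~}" independently of each other; the function name `my-hz-swap-designations' is hypothetical:

```elisp
(defun my-hz-swap-designations (iso2022-string)
  "Turn ISO-2022 designations in ISO2022-STRING into HZ shift sequences.
Only \"\\e$A\" and \"\\e(B\" are rewritten; other designations are kept."
  (let ((case-fold-search nil))
    (replace-regexp-in-string
     (regexp-quote "\e(B") "~}"
     (replace-regexp-in-string
      (regexp-quote "\e$A") "~{" iso2022-string t t)
     t t)))

;; GB2312 designation: both ends are rewritten.
(my-hz-swap-designations "\e$AR;\e(B")   ; => "~{R;~}"
;; JIS X 0208 designation: only the trailing "\e(B" is rewritten.
(my-hz-swap-designations "\e$B0l\e(B")   ; => "\e$B0l~}"
```

The second call gives the same malformed text as the first example of the original report, which is why the encoder needs the preceding iso-2022 step to choose the GB2312 designation every time.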
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 22 Jun 2016 17:28:02 GMT)
Message #20 received at 23814 <at> debbugs.gnu.org (full text, mbox):
> From: ynyaaa <at> gmail.com
> Cc: 23814 <at> debbugs.gnu.org
> Date: Thu, 23 Jun 2016 02:04:18 +0900
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> >> replacing it with the coding-system below will work.
> >>
> >> (define-coding-system 'iso-2022-cn-gb
> >> "ISO 2022 based 7bit encoding only for Chinese GB2312."
> >> :coding-type 'iso-2022
> >> :mnemonic ?C
> >> :charset-list '(ascii chinese-gb2312)
> >> :designation [(ascii chinese-gb2312) nil nil nil]
> >> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
> >> )
> >
> > What advantages does this change have?
>
> `iso-2022-7bit' may encode the same character to different strings,
> while `iso-2022-cn-gb' always encodes the same character to the same string.
>
> (mapcar (lambda (cs) (encode-coding-string
> (propertize "\x4e00" 'charset cs)
> 'iso-2022-7bit))
> '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
> chinese-cns11643-1))
> =>("\e$AR;\e(B"
> "\e$B0l\e(B"
> "\e$(Cli\e(B"
> "\e$(GD!\e(B")
>
> (mapcar (lambda (cs) (encode-coding-string
> (propertize "\x4e00" 'charset cs)
> 'iso-2022-cn-gb))
> '(chinese-gb2312 japanese-jisx0208 korean-ksc5601
> chinese-cns11643-1))
> =>("\e$AR;\e(B"
> "\e$AR;\e(B"
> "\e$AR;\e(B"
> "\e$AR;\e(B")
>
> `encode-hz-region' expects `chinese-gb2312' characters to be encoded
> with "\e$A" sequences, and replaces those sequences with "~{".
I understand, but as I said, I think this is by design, and should not
be changed. However, maybe I'm missing something, so I'll CC
Handa-san and ask him to comment on this proposal and the issue in
general.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Sat, 09 Jul 2016 11:21:01 GMT)
Message #23 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Ping! Could you please comment on this issue?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 13 Jul 2016 14:14:01 GMT)
Message #26 received at 23814 <at> debbugs.gnu.org (full text, mbox):
In article <83d1mngirw.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
> Ping! Could you please comment on this issue?
Sorry, I've overlooked that mail.
> > > >> `encode-hz-region' uses `iso-2022-7bit' coding-system internally,
> > > >> replacing it with the coding-system below will work.
> > > >>
> > > >> (define-coding-system 'iso-2022-cn-gb
> > > >> "ISO 2022 based 7bit encoding only for Chinese GB2312."
> > > >> :coding-type 'iso-2022
> > > >> :mnemonic ?C
> > > >> :charset-list '(ascii chinese-gb2312)
> > > >> :designation [(ascii chinese-gb2312) nil nil nil]
> > > >> :flags '(ascii-at-eol ascii-at-cntl designation 7-bit safe)
> > > >> )
Right. But, as there are already so many iso-2022 based coding systems,
I'd like to avoid adding a new one just for encode-hz-region. I think
the attached patch is sufficient. Could you please try it? It also
fixes the problem of incorrect decoding of "~{7~~}".
---
K. Handa
handa <at> gnu.org
diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..9735bd6 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -95,7 +95,9 @@ decode-hz-region
(goto-char (point-min))
(while (search-forward "~" nil t)
(setq ch (following-char))
- (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+ (if (= ch ?{)
+ (search-forward "~}" nil 'move)
+ (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))))
;; "^zW...\n" -> Chinese GB2312
;; "~{...~}" -> Chinese GB2312
@@ -141,7 +143,7 @@ encode-hz-region
(save-excursion
(save-restriction
(narrow-to-region beg end)
-
+ (put-text-property beg end 'charset 'chinese-gb2312)
;; "~" -> "~~"
(goto-char (point-min))
(while (search-forward "~" nil t) (insert ?~))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Sat, 23 Jul 2016 17:48:02 GMT)
Message #29 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Ping! Could you please try this patch and see if it solves the
problem?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Sun, 24 Jul 2016 08:22:02 GMT)
Message #32 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Eli Zaretskii <eliz <at> gnu.org> writes:
> Ping! Could you please try this patch and see if it solves the
> problem?
The patch seems to give better results.
But I found other bugs in the decoding of "~" escapes.
"~~" and "~{!!~}" should be encoded and decoded as below.
"~~" -> "~~~~" -> "~~"
"~{!!~}" -> "~~{!!~~}" -> "~{!!~}"
In reality they are encoded properly, but decoded incorrectly.
(decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
=> "~"
(decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
=> #("\x3000" 0 1 (charset chinese-gb2312))
These behaviors are not affected by the patch.
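Both results can be reproduced in isolation by replaying the tilde loop of the unpatched `decode-hz-region' (the line the patch removes) on a string. The sketch below copies that loop into a hypothetical helper, `my-hz-old-unescape-pass'; it is a simplified illustration, not the actual function:

```elisp
(defun my-hz-old-unescape-pass (string)
  "Apply the pre-fix tilde-unescaping loop to STRING and return the result."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (while (search-forward "~" nil t)
      (let ((ch (following-char)))
        ;; Delete the tilde just scanned when newline or "~" follows.
        ;; Point then sits before the surviving tilde, so the next
        ;; search finds it again and may treat it as a new escape.
        (if (or (= ch ?\n) (= ch ?~)) (delete-char -1))))
    (buffer-string)))

(my-hz-old-unescape-pass "~~~~")       ; => "~"       (expected "~~")
(my-hz-old-unescape-pass "~~{!!~~}")   ; => "~{!!~}"  (now looks like a GB2312 run)
(my-hz-old-unescape-pass "~{7~~}")     ; => "~{7~}"   (trailing byte of the pair lost)
```

In each case the loop rescans a tilde it has already produced, or unescapes inside a "~{...~}" run, so a correct decoder has to step past each unescaped character and skip over the runs.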
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Tue, 26 Jul 2016 15:10:02 GMT)
Message #35 received at 23814 <at> debbugs.gnu.org (full text, mbox):
In article <87twffigzv.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
> But I found other bugs in the decoding of "~" escapes.
> "~~" and "~{!!~}" should be encoded and decoded as below.
> "~~" -> "~~~~" -> "~~"
> "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"
> In reality they are encoded properly, but decoded incorrectly.
> (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
> => "~"
> (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
> => #("\x3000" 0 1 (charset chinese-gb2312))
Thank you for finding those bugs. Could you please try the attached
patch instead?
---
K. Handa
handa <at> gnu.org
diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..9abdae1 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -95,7 +95,12 @@ decode-hz-region
(goto-char (point-min))
(while (search-forward "~" nil t)
(setq ch (following-char))
- (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+ (if (= ch ?{)
+ (search-forward "~}" nil 'move)
+ (when (or (= ch ?\n) (= ch ?~))
+ (delete-char -1)
+ (put-text-property (point) (1+ (point)) 'hz-decoded t)
+ (forward-char 1))))
;; "^zW...\n" -> Chinese GB2312
;; "~{...~}" -> Chinese GB2312
@@ -104,6 +109,8 @@ decode-hz-region
(while (re-search-forward hz/zw-start-gb nil t)
(setq pos (match-beginning 0)
ch (char-after pos))
+ (if (and (= ch ?~) (get-text-property pos 'hz-decoded))
+ (forward-char 1)
;; Record the first position to start conversion.
(or beg (setq beg pos))
(end-of-line)
@@ -122,9 +129,10 @@ decode-hz-region
t)
(delete-char -2))
(setq end (point))
- (translate-region pos (point) hz-set-msb-table))))
+ (translate-region pos (point) hz-set-msb-table)))))
(if beg
(decode-coding-region beg end 'euc-china)))
+ (remove-text-properties (point-min) (point-max) '(hz-decoded nil))
(- (point-max) (point-min)))))
;;;###autoload
@@ -142,6 +150,7 @@ encode-hz-region
(save-restriction
(narrow-to-region beg end)
+ (put-text-property beg end 'charset 'chinese-gb2312)
;; "~" -> "~~"
(goto-char (point-min))
(while (search-forward "~" nil t) (insert ?~))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Fri, 29 Jul 2016 01:06:01 GMT)
Message #38 received at 23814 <at> debbugs.gnu.org (full text, mbox):
handa <handa <at> gnu.org> writes:
> In article <87twffigzv.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
>
>> But I found other bugs in the decoding of "~" escapes.
>> "~~" and "~{!!~}" should be encoded and decoded as below.
>> "~~" -> "~~~~" -> "~~"
>> "~{!!~}" -> "~~{!!~~}" -> "~{!!~}"
>
>> In reality they are encoded properly, but decoded incorrectly.
>> (decode-coding-string (encode-coding-string "~~" 'hz) 'hz)
>> => "~"
>> (decode-coding-string (encode-coding-string "~{!!~}" 'hz) 'hz)
>> => #("\x3000" 0 1 (charset chinese-gb2312))
>
> Thank you for finding those bugs. Could you please try the attached
> patch instead?
>
> ---
> K. Handa
> handa <at> gnu.org
If there are unencodable characters, encodable characters may be broken.
In this example, the second ?\x4E00 character disappears.
(set-language-environment 'Chinese-GB)
(decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
=> "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"
To avoid this behavior, there are some solutions.
(a) While decoding, replace "~{...~}" with "\e$A...\e(B"
    and decode with iso-2022-7bit.
(b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding
    and insert "\e$)A" at the beginning of the temp buffer
    and decode with iso-2022-8bit-ss2.
    (8bit data are decoded as euc-cn.)
(c) While encoding, use euc-cn instead of iso-2022-7bit
    and translate each consecutive 8bit data to 7bit data
    prefixed by "~{" and postfixed by "~}".
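Option (a) can be sketched on a string as follows. This is a minimal illustration, assuming runs are consumed two bytes at a time so that a trailing byte of ?~ is never mistaken for the start of "~}"; the function name `my-hz-runs-to-iso2022' is hypothetical, and "~~" escapes in ASCII mode are deliberately not handled here:

```elisp
(defun my-hz-runs-to-iso2022 (hz-string)
  "Rewrite each HZ GB2312 run in HZ-STRING with ISO-2022 designations.
The result is meant to be decoded with `iso-2022-7bit'."
  (let ((case-fold-search nil))
    (replace-regexp-in-string
     ;; "~{", then zero or more two-byte pairs, then "~}".  The first
     ;; byte of a pair excludes "~", so "~}" cannot start a pair.
     "~{\\(\\(?:[!-}][!-~]\\)*\\)~}"
     "\e$A\\1\e(B"
     hz-string t)))

(my-hz-runs-to-iso2022 "~{7~~}")     ; => "\e$A7~\e(B"
(my-hz-runs-to-iso2022 "A~{R;~}B")   ; => "A\e$AR;\e(BB"
```

The rewritten text can then be passed to `decode-coding-string' with `iso-2022-7bit', which leaves any pre-existing escape sequences and non-GB2312 text to the ISO-2022 decoder instead of forcing everything through euc-cn.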
By the way, RFC1843 describes:
The escape sequence '~\n' is a line-continuation marker to be
consumed with no output produced.
This form should return "AB".
(decode-coding-string "A~\nB" 'hz)
=> "A\nB"
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Sun, 14 Aug 2016 11:23:02 GMT)
Message #41 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Hi, sorry for the late response. I've just noticed that my reply mail
didn't go out successfully. I'm trying to re-send it.
I wrote:
> In article <871t2dz22d.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
> > If there are unencodable characters, encodable characters may be broken.
> > In this example, the second ?\x4E00 character disappears.
> > (set-language-environment 'Chinese-GB)
> > (decode-coding-string (encode-coding-string "\x4E00\x00B7\x4E00" 'hz) 'hz)
> > => "\x4E00\e\x3048\x6070\x70B3\x11213D\300\273"
>
> How to treat unencodable characters when encoding is a difficult problem.
> As HZ is designed for 7-bit environments, I think it's important to keep
> the output 7-bit when encoding. So, the new code uses \uXXXX for those
> characters. Another way is to use UTF-8 sequences for them, so that we can
> decode them back. Which do you think is better?
>
> > To avoid this behavior, there are some solutions.
> > (a) While decoding, replace "~{...~}" with "\e$A...\e(B"
> > and decode with iso-2022-7bit.
> > (b) Like (a), replace "~{...~}" with "\e$A...\e(B" while decoding
> > and insert "\e$)A" at the beginning of the temp buffer
> > and decode with iso-2022-8bit-ss2.
> > (8bit data are decoded as euc-cn.)
> > (c) While encoding, use euc-cn instead of iso-2022-7bit
> > and translate each consecutive 8bit data to 7bit data
> > prefixed by "~{" and postfixed by "~}".
>
> I adopted method (a) for decoding, and fixed bugs in the encoding code.
>
> > By the way, RFC1843 describes:
> > The escape sequence '~\n' is a line-continuation marker to be
> > consumed with no output produced.
>
> The variable decode-hz-line-continuation controls this feature. I don't
> remember why the default is nil (i.e. do not decode ~\n), perhaps some
> Chinese people I was discussing with on implementing HZ support
> suggested that.
>
> Attached is the full china-util.el (not a diff).
>
> ---
> K. Handa
> handa <at> gnu.org
[china-util.el (application/emacs-lisp, attachment)]
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 17 Aug 2016 06:34:01 GMT)
Message #44 received at 23814 <at> debbugs.gnu.org (full text, mbox):
Hi, I tried new china-util.el. It works very well.
handa <handa <at> gnu.org> writes:
> Hi, sorry for the late response. I've just noticed that my reply mail
> didn't go out successfully. I'm trying to re-send it.
>> How to treat unencodable characters when encoding is a difficult problem.
>> As HZ is designed for 7-bit environments, I think it's important to keep
>> the output 7-bit when encoding. So, the new code uses \uXXXX for those
>> characters. Another way is to use UTF-8 sequences for them, so that we can
>> decode them back. Which do you think is better?
I prefer 7bit encoding to use only 7bit data, too.
As for elisp, "\u12345" is treated as "\u1234\ 5".
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 17 Aug 2016 14:44:01 GMT)
Message #47 received at 23814 <at> debbugs.gnu.org (full text, mbox):
In article <87oa4rdhvq.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
> Hi, I tried new china-util.el. It works very well.
Thank you for testing it.
> I prefer 7bit encoding to use only 7bit data, too.
> As for elisp, "\u12345" is treated as "\u1234\ 5".
Ah, ok, I changed it to encode characters outside the BMP as \UXXXXXXXX.
I've just committed the attached change.
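The escape format can be tried on its own. The sketch below wraps the `format' expression from the committed change in a hypothetical helper, `my-hz-escape-unencodable'; four hex digits are used for BMP characters and eight for anything above, so a following digit can never be read as part of the escape:

```elisp
(defun my-hz-escape-unencodable (ch)
  "Return a 7-bit \\u or \\U escape for character CH."
  (format (if (< ch #x10000) "\\u%04X" "\\U%08X") ch))

(my-hz-escape-unencodable #x00B7)    ; => "\\u00B7"
(my-hz-escape-unencodable #x1F600)   ; => "\\U0001F600"
```

The results are shown in Lisp read syntax, so each "\\" stands for a single backslash in the encoded text.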
---
K. Handa
handa <at> gnu.org
2016-08-17 handa <handa <at> gnu.org>
* lisp/language/china-util.el (decode-hz-region): Pay
attention to "~~}" sequence at the end of Chinese character
range.
(hz-category-table): New variable.
(encode-hz-region): Convert non-encodable characters to
\u... and \U... Preserve ESC on encoding. Put
`chinese-gb2312' `charset' text property in advance to force
iso-2022-encoding to select chinese-gb2312 designation.
diff --git a/lisp/language/china-util.el b/lisp/language/china-util.el
index e531640..6505fb8 100644
--- a/lisp/language/china-util.el
+++ b/lisp/language/china-util.el
@@ -88,43 +88,34 @@ decode-hz-region
(let (pos ch)
(narrow-to-region beg end)
- ;; We, at first, convert HZ/ZW to `euc-china',
+ ;; We, at first, convert HZ/ZW to `iso-2022-7bit',
;; then decode it.
- ;; "~\n" -> "\n", "~~" -> "~"
+ ;; "~\n" -> "", "~~" -> "~"
(goto-char (point-min))
(while (search-forward "~" nil t)
(setq ch (following-char))
- (if (or (= ch ?\n) (= ch ?~)) (delete-char -1)))
+ (cond ((= ch ?{)
+ (delete-region (1- (point)) (1+ (point)))
+ (setq pos (point))
+ (insert iso2022-gb-designation)
+ (if (looking-at "\\([!-}][!-~]\\)*")
+ (goto-char (match-end 0)))
+ (if (looking-at hz-ascii-designation)
+ (delete-region (match-beginning 0) (match-end 0)))
+ (insert iso2022-ascii-designation)
+ (decode-coding-region pos (point) 'iso-2022-7bit))
+
+ ((= ch ?~)
+ (delete-char 1))
+
+ ((and (= ch ?\n)
+ decode-hz-line-continuation)
+ (delete-region (1- (point)) (1+ (point))))
+
+ (t
+ (forward-char 1)))))
- ;; "^zW...\n" -> Chinese GB2312
- ;; "~{...~}" -> Chinese GB2312
- (goto-char (point-min))
- (setq beg nil)
- (while (re-search-forward hz/zw-start-gb nil t)
- (setq pos (match-beginning 0)
- ch (char-after pos))
- ;; Record the first position to start conversion.
- (or beg (setq beg pos))
- (end-of-line)
- (setq end (point))
- (if (>= ch 128) ; 8bit GB2312
- nil
- (goto-char pos)
- (delete-char 2)
- (setq end (- end 2))
- (if (= ch ?z) ; ZW -> euc-china
- (progn
- (translate-region (point) end hz-set-msb-table)
- (goto-char end))
- (if (search-forward hz-ascii-designation
- (if decode-hz-line-continuation nil end)
- t)
- (delete-char -2))
- (setq end (point))
- (translate-region pos (point) hz-set-msb-table))))
- (if beg
- (decode-coding-region beg end 'euc-china)))
(- (point-max) (point-min)))))
;;;###autoload
@@ -133,33 +124,57 @@ decode-hz-buffer
(interactive)
(decode-hz-region (point-min) (point-max)))
+(defvar hz-category-table nil)
+
;;;###autoload
(defun encode-hz-region (beg end)
"Encode the text in the current region to HZ.
Return the length of resulting text."
(interactive "r")
+ (unless hz-category-table
+ (setq hz-category-table (make-category-table))
+ (with-category-table hz-category-table
+ (define-category ?c "hz encodable")
+ (map-charset-chars #'modify-category-entry 'ascii ?c)
+ (map-charset-chars #'modify-category-entry 'chinese-gb2312 ?c)))
(save-excursion
(save-restriction
(narrow-to-region beg end)
+ (with-category-table hz-category-table
+ ;; ~ -> ~~
+ (goto-char (point-min))
+ (while (search-forward "~" nil t) (insert ?~))
+
+ ;; ESC -> ESC ESC
+ (goto-char (point-min))
+ (while (search-forward "\e" nil t) (insert ?\e))
- ;; "~" -> "~~"
- (goto-char (point-min))
- (while (search-forward "~" nil t) (insert ?~))
-
- ;; Chinese GB2312 -> "~{...~}"
- (goto-char (point-min))
- (if (re-search-forward "\\cc" nil t)
- (let (pos)
- (goto-char (setq pos (match-beginning 0)))
- (encode-coding-region pos (point-max) 'iso-2022-7bit)
- (goto-char pos)
- (while (search-forward iso2022-gb-designation nil t)
- (delete-char -3)
- (insert hz-gb-designation))
- (goto-char pos)
- (while (search-forward iso2022-ascii-designation nil t)
- (delete-char -3)
- (insert hz-ascii-designation))))
+ ;; Non-ASCII-GB2312 -> \uXXXX
+ (goto-char (point-min))
+ (while (re-search-forward "\\Cc" nil t)
+ (let ((ch (preceding-char)))
+ (delete-char -1)
+ (insert (format (if (< ch #x10000) "\\u%04X" "\\U%08X") ch))))
+
+ ;; Prefer chinese-gb2312 for Chinese characters.
+ (put-text-property (point-min) (point-max) 'charset 'chinese-gb2312)
+ (encode-coding-region (point-min) (point-max) 'iso-2022-7bit)
+
+ ;; ESC $ B ... ESC ( B -> ~{ ... ~}
+ ;; ESC ESC -> ESC
+ (goto-char (point-min))
+ (while (search-forward "\e" nil t)
+ (if (= (following-char) ?\e)
+ ;; ESC ESC -> ESC
+ (delete-char 1)
+ (forward-char -1)
+ (if (looking-at iso2022-gb-designation)
+ (progn
+ (delete-region (match-beginning 0) (match-end 0))
+ (insert hz-gb-designation)
+ (search-forward iso2022-ascii-designation nil 'move)
+ (delete-region (match-beginning 0) (match-end 0))
+ (insert hz-ascii-designation))))))
(- (point-max) (point-min)))))
;;;###autoload
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#23814; Package emacs. (Wed, 17 Aug 2016 15:29:01 GMT)
Message #50 received at 23814 <at> debbugs.gnu.org (full text, mbox):
> From: handa <handa <at> gnu.org>
> Cc: eliz <at> gnu.org, 23814 <at> debbugs.gnu.org
> Date: Wed, 17 Aug 2016 23:43:13 +0900
>
> In article <87oa4rdhvq.fsf <at> gmail.com>, ynyaaa <at> gmail.com writes:
>
> > Hi, I tried new china-util.el. It works very well.
>
> Thank you for testing it.
>
> > I prefer 7bit encoding to use only 7bit data, too.
> > As for elisp, "\u12345" is treated as "\u1234\ 5".
>
> Ah, ok, I changed it to encode characters outside the BMP as \UXXXXXXXX.
>
> I've just committed the attached change.
Thanks. Please close the bug if satisfied with the solution.
bug marked as fixed in version 26.1, send any further explanations to 23814 <at> debbugs.gnu.org and ynyaaa <at> gmail.com. Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 01 Mar 2017 20:37:02 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 30 Mar 2017 11:24:04 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.