GNU bug report logs - #59341
29.0.50; Lisp files with other encoding than UTF-8?
Reported by: Stefan Kangas <stefankangas <at> gmail.com>
Date: Thu, 17 Nov 2022 19:39:02 UTC
Severity: normal
Found in version 29.0.50
Done: Eli Zaretskii <eliz <at> gnu.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 59341 in the body.
You can then email your comments to 59341 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Thu, 17 Nov 2022 19:39:02 GMT)
Acknowledgement sent to Stefan Kangas <stefankangas <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 17 Nov 2022 19:39:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
I noticed that codespell was having trouble with one of our files
(ethio-util.el), refusing to read it as UTF-8. This led to some
unexpected and unwanted behavior.
So I took a closer look and found that several files in lisp/ appear to
use some encoding other than UTF-8:
$ cd lisp ; for f in $(git ls-files | grep -E '\.el$'); \
do file "$f" | grep -v UTF-8 | grep -v " ASCII" ; \
done ; cd -
international/titdic-cnv.el: Lisp/Scheme program, Non-ISO
extended-ASCII text, with LF, NEL line terminators
language/ethio-util.el: Lisp/Scheme program, Non-ISO extended-ASCII
text, with LF, NEL line terminators
language/ethiopic.el: Lisp/Scheme program, Non-ISO extended-ASCII text
language/ind-util.el: Lisp/Scheme program, Non-ISO extended-ASCII
text, with LF, NEL line terminators
language/tibet-util.el: Lisp/Scheme program, Non-ISO extended-ASCII
text, with LF, NEL line terminators
language/tibetan.el: Non-ISO extended-ASCII text, with LF, NEL line terminators
leim/quail/ethiopic.el: Non-ISO extended-ASCII text, with LF, NEL line
terminators
leim/quail/tibetan.el: Lisp/Scheme program, Non-ISO extended-ASCII
text, with LF, NEL line terminators
Should these files be converted to UTF-8?
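The check in the report above can also be approximated without `file`, by asking a strict UTF-8 decoder whether each file's bytes are valid. The following is a minimal sketch in Python, used here purely for illustration; the function name `non_utf8_files` and its parameters are hypothetical, and unlike the original command it walks the directory tree rather than consulting `git ls-files`, so untracked files are included too.

```python
from pathlib import Path


def non_utf8_files(root, suffix=".el"):
    """Return sorted relative paths under ROOT whose bytes are not strict UTF-8.

    Pure-ASCII files pass, because ASCII is a subset of UTF-8.
    """
    root = Path(root)
    bad = []
    for path in sorted(root.rglob("*" + suffix)):
        if not path.is_file():
            continue
        try:
            # A strict decoder rejects any byte sequence outside standard UTF-8.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.relative_to(root).as_posix())
    return bad
```

Calling `non_utf8_files("lisp")` from the top of a checkout would list candidates to inspect; it only says that a file is not strict UTF-8, not which encoding it actually uses.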
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Thu, 17 Nov 2022 19:55:02 GMT)
Message #8 received at 59341 <at> debbugs.gnu.org (full text, mbox):
> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Thu, 17 Nov 2022 11:38:46 -0800
>
> I noticed that codespell was having trouble with one of our files
> (ethio-util.el), refusing to read it as UTF-8. This led to some
> unexpected and unwanted behavior.
>
> So I took a closer look and found that several files in lisp/ appear to
> use some encoding other than UTF-8:
>
> $ cd lisp ; for f in $(git ls-files | grep -E '\.el$'); \
> do file "$f" | grep -v UTF-8 | grep -v " ASCII" ; \
> done ; cd -
> international/titdic-cnv.el: Lisp/Scheme program, Non-ISO
> extended-ASCII text, with LF, NEL line terminators
> language/ethio-util.el: Lisp/Scheme program, Non-ISO extended-ASCII
> text, with LF, NEL line terminators
> language/ethiopic.el: Lisp/Scheme program, Non-ISO extended-ASCII text
> language/ind-util.el: Lisp/Scheme program, Non-ISO extended-ASCII
> text, with LF, NEL line terminators
> language/tibet-util.el: Lisp/Scheme program, Non-ISO extended-ASCII
> text, with LF, NEL line terminators
> language/tibetan.el: Non-ISO extended-ASCII text, with LF, NEL line terminators
> leim/quail/ethiopic.el: Non-ISO extended-ASCII text, with LF, NEL line
> terminators
> leim/quail/tibetan.el: Lisp/Scheme program, Non-ISO extended-ASCII
> text, with LF, NEL line terminators
>
> Should these files be converted to UTF-8?
No. AFAIR, they are in utf-8-emacs because they include characters
beyond the Unicode range, which UTF-8 cannot encode. See, for
example, the codepoints that start around line 645 in ind-util.el,
which are used for converting between IS 13194 (ISCII) and Unicode.
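To make the point about characters beyond the Unicode range concrete, here is a small sketch in Python (used only for illustration; it is not Emacs code). It assumes that utf-8-emacs keeps the ordinary UTF-8 bit layout and simply drops the U+10FFFF ceiling, which is how the encoding is described later in this thread. The code point 0x180000 is an arbitrary example above that ceiling, and both helper names are hypothetical.

```python
def encode_extended_utf8(cp):
    """Encode code point CP with the UTF-8 bit layout, without the U+10FFFF cap.

    Assumption: forms of up to five bytes are allowed, as in the original
    (pre-restriction) UTF-8 design.  Surrogates are not treated specially.
    """
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        n, lead = 1, 0xC0  # n = number of continuation bytes
    elif cp < 0x10000:
        n, lead = 2, 0xE0
    elif cp < 0x200000:
        n, lead = 3, 0xF0
    elif cp < 0x4000000:
        n, lead = 4, 0xF8
    else:
        raise ValueError("code point too large for this sketch")
    out = [lead | (cp >> (6 * n))]
    for i in range(n - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)


def is_strict_utf8(data):
    """Return True if DATA is accepted by Python's standard UTF-8 decoder."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For a code point inside Unicode, such as U+0901, the sketch produces the same bytes as a standard encoder. For 0x180000 it produces a four-byte sequence whose lead byte is 0xF6, which a strict decoder rejects, and that is consistent with external tools refusing to treat such files as UTF-8.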
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 04:15:02 GMT)
Message #11 received at 59341 <at> debbugs.gnu.org (full text, mbox):
Eli Zaretskii <eliz <at> gnu.org> writes:
> No. AFAIR, they are in utf-8-emacs because they include characters
> beyond the Unicode range, which UTF-8 cannot encode. See, for
> example, the codepoints that start around line 645 in ind-util.el,
> which are used for converting between IS 13194 (ISCII) and Unicode.
I see, thanks.
Do we need these characters to be raw bytes in the source code though?
I was thinking of a change similar to the below, which would
incidentally make it a bit easier to read the code.
diff --git a/lisp/language/ind-util.el b/lisp/language/ind-util.el
index e2a21820f4..16161319ef 100644
--- a/lisp/language/ind-util.el
+++ b/lisp/language/ind-util.el
@@ -644,9 +644,9 @@ indian-dev-aiba-decode-region
;;Unicode vs IS13194 ;; only Devanagari is supported now.
((ucs-devanagari-to-is13194-alist
'((?\x0900 . "[U+0900]")
- (?\x0901 . " ")
- (?\x0902 . " ")
- (?\x0903 . " ")
+ (?\x0901 . "?\x180000")
+ (?\x0902 . "?\x180001")
+ (?\x0903 . "?\x180002")
(?\x0904 . "[U+0904]")
[and so on]
This change would also avoid confusing external tools. For example, the
code is completely unreadable in many external viewers, such as:
https://github.com/emacs-mirror/emacs/blob/master/lisp/language/ind-util.el#L647
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 07:58:01 GMT)
Message #14 received at 59341 <at> debbugs.gnu.org (full text, mbox):
> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Thu, 17 Nov 2022 20:14:09 -0800
> Cc: 59341 <at> debbugs.gnu.org
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > No. AFAIR, they are in utf-8-emacs because they include characters
> > beyond the Unicode range, which UTF-8 cannot encode. See, for
> > example, the codepoints that start around line 645 in ind-util.el,
> > which are used for converting between IS 13194 (ISCII) and Unicode.
>
> I see, thanks.
>
> Do we need these characters to be raw bytes in the source code though?
(They are not raw bytes, they are UTF-8 encoded sequences, except that
the encoding uses more bytes than the "official" UTF-8 allows. IOW,
we have there encoded codepoints beyond u+0010FFFF, not raw bytes.)
Yes, we need them to appear as characters, because then they will be
displayed as glyphs if you have a suitable font, and in a context
such as this one it is very important to _see_ the character, if only
to understand the mapping and to detect mistakes in it.
However, I think we should add a note in the Commentary section about
these subtleties, so that whoever next is bothered about this will be
able to have their questions answered.
> This change would also avoid confusing external tools. For example, the
> code is completely unreadable in many external viewers, such as:
>
> https://github.com/emacs-mirror/emacs/blob/master/lisp/language/ind-util.el#L647
I'm not too bothered about this. Emacs sources are best viewed with
Emacs, and there's no way around this.
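The distinction drawn in this message, between raw bytes and UTF-8-style sequences that encode code points above U+0010FFFF, can be illustrated by decoding such sequences back into integers. This is a hedged sketch in Python (illustration only, not how Emacs itself reads files); it assumes the plain UTF-8 bit layout extended to five-byte forms, does not reject overlong encodings, and uses the hypothetical names `decode_extended_utf8` and `beyond_unicode`.

```python
UNICODE_MAX = 0x10FFFF


def decode_extended_utf8(data):
    """Decode DATA into a list of code points, accepting forms up to 5 bytes.

    Sketch only: overlong encodings and surrogates are not rejected.
    """
    cps = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            n, cp = 0, b
        elif 0xC0 <= b < 0xE0:
            n, cp = 1, b & 0x1F
        elif 0xE0 <= b < 0xF0:
            n, cp = 2, b & 0x0F
        elif 0xF0 <= b < 0xF8:
            n, cp = 3, b & 0x07
        elif 0xF8 <= b < 0xFC:
            n, cp = 4, b & 0x03
        else:
            raise ValueError(f"unexpected byte 0x{b:02X} at offset {i}")
        tail = data[i + 1 : i + 1 + n]
        if len(tail) != n or any((t & 0xC0) != 0x80 for t in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        cps.append(cp)
        i += 1 + n
    return cps


def beyond_unicode(data):
    """Return the decoded code points in DATA that lie above U+10FFFF."""
    return [cp for cp in decode_extended_utf8(data) if cp > UNICODE_MAX]
```

Given the bytes of a line such as `(?\x0901 . "...")` in which the string holds the sequence F6 80 80 80, `beyond_unicode` would report the single code point 0x180000: a well-formed encoded character that merely lies outside the range standard UTF-8 permits.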
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 11:12:02 GMT)
Message #17 received at 59341 <at> debbugs.gnu.org (full text, mbox):
Eli Zaretskii <eliz <at> gnu.org> writes:
> However, I think we should add a note in the Commentary section about
> these subtleties, so that whoever next is bothered about this will be
> able to have their questions answered.
How about adding something like this to the Commentary section of the
relevant files (i.e. the list I posted originally):
;; Note that this file contains non-Unicode characters, which are
;; needed for certain conversions. See the discussion in Bug#59341.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 12:00:02 GMT)
Message #20 received at 59341 <at> debbugs.gnu.org (full text, mbox):
> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Fri, 18 Nov 2022 03:11:15 -0800
> Cc: 59341 <at> debbugs.gnu.org
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > However, I think we should add a note in the Commentary section about
> > these subtleties, so that whoever next is bothered about this will be
> > able to have their questions answered.
>
> How about adding something like this to the Commentary section of the
> relevant files (i.e. the list I posted originally):
>
> ;; Note that this file contains non-Unicode characters, which are
> ;; needed for certain conversions. See the discussion in Bug#59341.
I'd rather describe the reason in more detail there.
I can do this myself if you prefer.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 16:50:01 GMT)
Message #23 received at 59341 <at> debbugs.gnu.org (full text, mbox):
> However, I think we should add a note in the Commentary section about
> these subtleties, so that whoever next is bothered about this will be
> able to have their questions answered.
+1
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 16:54:02 GMT)
Message #26 received at 59341 <at> debbugs.gnu.org (full text, mbox):
> How about adding something like this to the Commentary section of the
> relevant files (i.e. the list I posted originally):
>
> ;; Note that this file contains non-Unicode characters, which are
> ;; needed for certain conversions. See the discussion in Bug#59341.
Why send readers off to a bug thread? Wouldn't it
be better to summarize the important info here?
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 17:16:02 GMT)
Message #29 received at 59341 <at> debbugs.gnu.org (full text, mbox):
Eli Zaretskii <eliz <at> gnu.org> writes:
> I'd rather describe the reason in more detail there.
>
> I can do this myself if you prefer.
Yes, that would be much appreciated. Thank you.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#59341; Package emacs. (Fri, 18 Nov 2022 17:17:02 GMT)
Message #32 received at 59341 <at> debbugs.gnu.org (full text, mbox):
Drew Adams <drew.adams <at> oracle.com> writes:
> > ;; Note that this file contains non-Unicode characters, which are
> > ;; needed for certain conversions. See the discussion in Bug#59341.
>
> Why send readers off to a bug thread? Wouldn't it
> be better to summarize the important info here?
Yes, but unlike Eli, I don't have many expert things to say here.
Luckily, Eli offered to write up something better.
Reply sent to Eli Zaretskii <eliz <at> gnu.org>: You have taken responsibility. (Sat, 19 Nov 2022 09:27:02 GMT)
Notification sent to Stefan Kangas <stefankangas <at> gmail.com>: bug acknowledged by developer. (Sat, 19 Nov 2022 09:27:02 GMT)
Message #37 received at 59341-done <at> debbugs.gnu.org (full text, mbox):
> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Fri, 18 Nov 2022 18:14:50 +0100
> Cc: 59341 <at> debbugs.gnu.org
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > I'd rather describe the reason in more detail there.
> >
> > I can do this myself if you prefer.
>
> Yes, that would be much appreciated. Thank you.
Done, and closing the bug.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 17 Dec 2022 12:24:05 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.