GNU bug report logs - #59341
29.0.50; Lisp files with other encoding than UTF-8?

Package: emacs

Reported by: Stefan Kangas <stefankangas <at> gmail.com>

Date: Thu, 17 Nov 2022 19:39:02 UTC

Severity: normal

Found in version 29.0.50

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Thu, 17 Nov 2022 19:39:02 GMT)

Acknowledgement sent to Stefan Kangas <stefankangas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 17 Nov 2022 19:39:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stefan Kangas <stefankangas <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Thu, 17 Nov 2022 11:38:46 -0800

I noticed that codespell was having trouble with one of our files
(ethio-util.el), refusing to read it as UTF-8.  This led to some
unexpected and unwanted behavior.

So I took a closer look and found that several files in lisp/ appear to
use some encoding other than UTF-8:

$ cd lisp ; for f in $(git ls-files | grep -E '\.el$'); \
    do file "$f" | grep -v UTF-8 | grep -v " ASCII" ; \
  done ; cd -
international/titdic-cnv.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
language/ethio-util.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
language/ethiopic.el: Lisp/Scheme program, Non-ISO extended-ASCII text
language/ind-util.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
language/tibet-util.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
language/tibetan.el: Non-ISO extended-ASCII text, with LF, NEL line terminators
leim/quail/ethiopic.el: Non-ISO extended-ASCII text, with LF, NEL line terminators
leim/quail/tibetan.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators

Should these files be converted to UTF-8?
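
For anyone who wants to repeat this check without relying on the
heuristics of file(1), here is a minimal sketch in Python that attempts a
strict UTF-8 decode of each file.  The helper name find_non_utf8 is
hypothetical, and the sketch walks a directory tree rather than asking
git for tracked files, so it is an approximation of the command above
and not an equivalent.

```python
import os


def find_non_utf8(root, suffix=".el"):
    """Return sorted relative paths under ROOT that fail strict UTF-8 decoding.

    Hypothetical helper for illustration only.  Pure-ASCII files pass,
    because ASCII is a subset of UTF-8, which mirrors the two grep -v
    filters in the shell command.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()
            try:
                # Strict mode raises on any byte sequence that is not
                # valid standard UTF-8.
                data.decode("utf-8")
            except UnicodeDecodeError:
                bad.append(os.path.relpath(path, root))
    return sorted(bad)
```

Calling find_non_utf8("lisp") from the top of a checkout returns the
paths that a strict decoder rejects; whether that list matches the
output of file(1) exactly is not something this sketch guarantees.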

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Thu, 17 Nov 2022 19:55:02 GMT)

Message #8 received at 59341 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 59341 <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Thu, 17 Nov 2022 21:54:04 +0200

> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Thu, 17 Nov 2022 11:38:46 -0800
> 
> I noticed that codespell was having trouble with one of our files
> (ethio-util.el), refusing to read it as UTF-8.  This led to some
> unexpected and unwanted behavior.
> 
> So I took a closer look and found that several files in lisp/ appear to
> use some encoding other than UTF-8:
> 
> $ cd lisp ; for f in $(git ls-files | grep -E '\.el$'); \
>     do file "$f" | grep -v UTF-8 | grep -v " ASCII" ; \
>   done ; cd -
> international/titdic-cnv.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
> language/ethio-util.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
> language/ethiopic.el: Lisp/Scheme program, Non-ISO extended-ASCII text
> language/ind-util.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
> language/tibet-util.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
> language/tibetan.el: Non-ISO extended-ASCII text, with LF, NEL line terminators
> leim/quail/ethiopic.el: Non-ISO extended-ASCII text, with LF, NEL line terminators
> leim/quail/tibetan.el: Lisp/Scheme program, Non-ISO extended-ASCII text, with LF, NEL line terminators
> 
> Should these files be converted to UTF-8?

No.  AFAIR, they are in utf-8-emacs because they include characters
beyond the Unicode range, which UTF-8 cannot encode.  See, for
example, the codepoints that start around line 645 in ind-util.el,
which are used for converting between IS 13194 (ISCII) and Unicode.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 04:15:02 GMT)

Message #11 received at 59341 <at> debbugs.gnu.org:

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 59341 <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Thu, 17 Nov 2022 20:14:09 -0800

Eli Zaretskii <eliz <at> gnu.org> writes:

> No.  AFAIR, they are in utf-8-emacs because they include characters
> beyond the Unicode range, which UTF-8 cannot encode.  See, for
> example, the codepoints that start around line 645 in ind-util.el,
> which are used for converting between IS 13194 (ISCII) and Unicode.

I see, thanks.

Do we need these characters to be raw bytes in the source code though?
I was thinking of a change similar to the below, which would
incidentally make it a bit easier to read the code.

diff --git a/lisp/language/ind-util.el b/lisp/language/ind-util.el
index e2a21820f4..16161319ef 100644
--- a/lisp/language/ind-util.el
+++ b/lisp/language/ind-util.el
@@ -644,9 +644,9 @@ indian-dev-aiba-decode-region
     ;;Unicode vs IS13194  ;; only Devanagari is supported now.
     ((ucs-devanagari-to-is13194-alist
       '((?\x0900 . "[U+0900]")
-	(?\x0901 . " ")
-	(?\x0902 . " ")
-	(?\x0903 . " ")
+        (?\x0901 . "?\x180000")
+        (?\x0902 . "?\x180001")
+        (?\x0903 . "?\x180002")
 	(?\x0904 . "[U+0904]")

[and so on]

This change would also avoid confusing external tools.  For example, the
code is completely unreadable in many external viewers, such as:

https://github.com/emacs-mirror/emacs/blob/master/lisp/language/ind-util.el#L647

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 07:58:01 GMT)

Message #14 received at 59341 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 59341 <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Fri, 18 Nov 2022 09:57:11 +0200

> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Thu, 17 Nov 2022 20:14:09 -0800
> Cc: 59341 <at> debbugs.gnu.org
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > No.  AFAIR, they are in utf-8-emacs because they include characters
> > beyond the Unicode range, which UTF-8 cannot encode.  See, for
> > example, the codepoints that start around line 645 in ind-util.el,
> > which are used for converting between IS 13194 (ISCII) and Unicode.
> 
> I see, thanks.
> 
> Do we need these characters to be raw bytes in the source code though?

(They are not raw bytes, they are UTF-8 encoded sequences, except that
the encoding uses more bytes than the "official" UTF-8 allows.  IOW,
what we have there are encoded codepoints beyond U+10FFFF, not raw bytes.)
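
To make the byte-level point concrete, here is a minimal Python sketch
that applies the four-byte UTF-8 bit layout to a code point above
U+10FFFF.  The helper name encode_4byte is hypothetical, and the code
point 0x180000 is borrowed from the diff earlier in this thread.  That
Emacs's internal representation uses exactly this four-byte form for
such characters is an assumption of the sketch, not something stated in
this thread.

```python
def encode_4byte(cp):
    """Apply the 4-byte UTF-8 bit layout (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).

    Illustrative helper only.  The layout carries 21 payload bits, so it
    can represent values up to 0x1FFFFF, although standard UTF-8 stops
    at 0x10FFFF.
    """
    if not 0x10000 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not fit the 4-byte layout")
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Within the Unicode range the result matches Python's own encoder.
assert encode_4byte(0x10FFFF) == chr(0x10FFFF).encode("utf-8")

# Beyond the range the same layout still yields a byte sequence...
extended = encode_4byte(0x180000)
print(extended.hex())  # f6808080

# ...but a strict UTF-8 decoder refuses it.
try:
    extended.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```

This would also explain why strict tools such as codespell stumble on
these files: the sequences follow the UTF-8 pattern but fall outside
what a standard decoder accepts.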

Yes, we need them to appear as characters, because then they will be
displayed as glyphs if you have a suitable font, and in a context
such as this one it is very important to _see_ the character, if only
to understand the mapping and to detect mistakes in it.

However, I think we should add a note in the Commentary section about
these subtleties, so that whoever next is bothered about this will be
able to have their questions answered.

> This change would also avoid confusing external tools.  For example, the
> code is completely unreadable in many external viewers, such as:
> 
> https://github.com/emacs-mirror/emacs/blob/master/lisp/language/ind-util.el#L647

I'm not too bothered about this.  Emacs sources are best viewed with
Emacs, and there's no way around this.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 11:12:02 GMT)

Message #17 received at 59341 <at> debbugs.gnu.org:

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 59341 <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Fri, 18 Nov 2022 03:11:15 -0800

Eli Zaretskii <eliz <at> gnu.org> writes:

> However, I think we should add a note in the Commentary section about
> these subtleties, so that whoever next is bothered about this will be
> able to have their questions answered.

How about adding something like this to the Commentary section of the
relevant files (i.e. the list I posted originally):

;; Note that this file contains non-Unicode characters, which are
;; needed for certain conversions.  See the discussion in Bug#59341.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 12:00:02 GMT)

Message #20 received at 59341 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 59341 <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Fri, 18 Nov 2022 13:59:18 +0200

> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Fri, 18 Nov 2022 03:11:15 -0800
> Cc: 59341 <at> debbugs.gnu.org
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > However, I think we should add a note in the Commentary section about
> > these subtleties, so that whoever next is bothered about this will be
> > able to have their questions answered.
> 
> How about adding something like this to the Commentary section of the
> relevant files (i.e. the list I posted originally):
> 
> ;; Note that this file contains non-Unicode characters, which are
> ;; needed for certain conversions.  See the discussion in Bug#59341.

I'd rather describe the reason in more detail there.

I can do this myself if you prefer.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 16:50:01 GMT)

Message #23 received at 59341 <at> debbugs.gnu.org:

From: Drew Adams <drew.adams <at> oracle.com>
To: Eli Zaretskii <eliz <at> gnu.org>, Stefan Kangas <stefankangas <at> gmail.com>
Cc: "59341 <at> debbugs.gnu.org" <59341 <at> debbugs.gnu.org>
Subject: RE: [External] : bug#59341: 29.0.50; Lisp files with other encoding
 than UTF-8?
Date: Fri, 18 Nov 2022 16:49:42 +0000

> However, I think we should add a note in the Commentary section about
> these subtleties, so that whoever next is bothered about this will be
> able to have their questions answered.

+1

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 16:54:02 GMT)

Message #26 received at 59341 <at> debbugs.gnu.org:

From: Drew Adams <drew.adams <at> oracle.com>
To: Stefan Kangas <stefankangas <at> gmail.com>, Eli Zaretskii <eliz <at> gnu.org>
Cc: "59341 <at> debbugs.gnu.org" <59341 <at> debbugs.gnu.org>
Subject: RE: [External] : bug#59341: 29.0.50; Lisp files with other encoding
 than UTF-8?
Date: Fri, 18 Nov 2022 16:53:26 +0000

> How about adding something like this to the Commentary section of the
> relevant files (i.e. the list I posted originally):
> 
> ;; Note that this file contains non-Unicode characters, which are
> ;; needed for certain conversions.  See the discussion in Bug#59341.

Why send readers off to a bug thread?  Wouldn't it
be better to summarize the important info here?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 17:16:02 GMT)

Message #29 received at 59341 <at> debbugs.gnu.org:

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 59341 <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Fri, 18 Nov 2022 18:14:50 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

> I'd rather describe the reason in more detail there.
>
> I can do this myself if you prefer.

Yes, that would be much appreciated.  Thank you.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#59341; Package emacs. (Fri, 18 Nov 2022 17:17:02 GMT)

Message #32 received at 59341 <at> debbugs.gnu.org:

From: Stefan Kangas <stefankangas <at> gmail.com>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>,
 "59341 <at> debbugs.gnu.org" <59341 <at> debbugs.gnu.org>
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Fri, 18 Nov 2022 18:16:12 +0100

Drew Adams <drew.adams <at> oracle.com> writes:

> > ;; Note that this file contains non-Unicode characters, which are
> > ;; needed for certain conversions.  See the discussion in Bug#59341.
>
> Why send readers off to a bug thread?  Wouldn't it
> be better to summarize the important info here?

Yes, but unlike Eli, I don't have many expert things to say here.
Luckily, Eli offered to write up something better.

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 19 Nov 2022 09:27:02 GMT)

Notification sent to Stefan Kangas <stefankangas <at> gmail.com>:
bug acknowledged by developer. (Sat, 19 Nov 2022 09:27:02 GMT)

Message #37 received at 59341-done <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefankangas <at> gmail.com>
Cc: 59341-done <at> debbugs.gnu.org
Subject: Re: bug#59341: 29.0.50; Lisp files with other encoding than UTF-8?
Date: Sat, 19 Nov 2022 11:26:26 +0200

> From: Stefan Kangas <stefankangas <at> gmail.com>
> Date: Fri, 18 Nov 2022 18:14:50 +0100
> Cc: 59341 <at> debbugs.gnu.org
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > I'd rather describe the reason in more detail there.
> >
> > I can do this myself if you prefer.
> 
> Yes, that would be much appreciated.  Thank you.

Done, and closing the bug.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 17 Dec 2022 12:24:05 GMT)
