GNU bug report logs - #8308
23.3; Use utf-8 for writing abbrev file

Package: emacs

Reported by: Leo <sdl.web <at> gmail.com>

Date: Mon, 21 Mar 2011 06:23:01 UTC

Severity: minor

Found in version 23.3

Fixed in version 24.1.

Done: Leo <sdl.web <at> gmail.com>

Bug is archived. No further changes may be made.


Report forwarded to owner <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 06:23:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Leo <sdl.web <at> gmail.com>:
New bug report received and forwarded. Copy sent to monnier <at> iro.umontreal.ca, bug-gnu-emacs <at> gnu.org. (Mon, 21 Mar 2011 06:23:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 14:22:24 +0800
Is it OK to change the encoding for abbrev file to utf-8?

=== modified file 'lisp/abbrev.el'
--- a/lisp/abbrev.el	2011-03-21 05:49:12 +0000
+++ b/lisp/abbrev.el	2011-03-21 06:20:36 +0000
@@ -225,9 +225,9 @@
 		    abbrev-file-name)))
   (or (and file (> (length file) 0))
       (setq file abbrev-file-name))
-  (let ((coding-system-for-write 'emacs-mule))
+  (let ((coding-system-for-write 'utf-8))
     (with-temp-file file
-      (insert ";;-*-coding: emacs-mule;-*-\n")
+      (insert ";;-*-coding: utf-8;-*-\n")
       (dolist (table
                ;; We sort the table in order to ease the automatic
                ;; merging of different versions of the user's abbrevs


Leo




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 09:01:02 GMT) Full text and rfc822 format available.

Message #8 received at 8308 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Leo <sdl.web <at> gmail.com>
Cc: 8308 <at> debbugs.gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 05:00:51 -0400
> From: Leo <sdl.web <at> gmail.com>
> Date: Mon, 21 Mar 2011 14:22:24 +0800
> Cc: 
> 
> Is it OK to change the encoding for abbrev file to utf-8?

What will that do to characters that are not unified into the range of
valid Unicode code points?

Can you tell what is the purpose of this change?





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 10:02:01 GMT) Full text and rfc822 format available.

Message #11 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 18:01:17 +0800
On 2011-03-21 17:00 +0800, Eli Zaretskii wrote:
>> From: Leo <sdl.web <at> gmail.com>
>> Date: Mon, 21 Mar 2011 14:22:24 +0800
>> Cc: 
>> 
>> Is it OK to change the encoding for abbrev file to utf-8?
>
> What will that do to characters that are not unified into the range of
> valid Unicode code points?

That's a valid concern. But

,----
| M -- emacs-mule
| 
| Emacs 21 internal format used in buffer and string.
| Type: emacs-mule (Emacs 21 internal encoding)
| EOL type: Automatic selection from:
| 	[emacs-mule-unix emacs-mule-dos emacs-mule-mac]
| This coding system can encode all emacs-mule charsets.
| 
| [back]
`----

,----[ (info "(elisp)Text Representations") ]
|    (1) This internal representation is based on one of the encodings
| defined by the Unicode Standard, called "UTF-8", for representing any
| Unicode codepoint, but Emacs extends UTF-8 to represent the additional
| codepoints it uses for raw 8-bit bytes and characters not unified with
| Unicode.
`----

Would you agree to use utf-8-emacs instead, which covers all characters?

>
> Can you tell what is the purpose of this change?

Make abbrev file editable to other editors.

Leo





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 10:55:01 GMT) Full text and rfc822 format available.

Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Leo <sdl.web <at> gmail.com>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 06:54:18 -0400
> From: Leo <sdl.web <at> gmail.com>
> Date: Mon, 21 Mar 2011 18:01:17 +0800
> Cc: 
> 
> Would you agree to use utf-8-emacs instead, which covers all characters?

That's better, but the characters outside Unicode are still going to
do bad things to any software except Emacs.  AFAIK, emacs-mule is a
superset of iso-2022 in the same way as utf-8-emacs is a superset of
utf-8.

> > Can you tell what is the purpose of this change?
> 
> Make abbrev file editable to other editors.

If we are really keen on making the abbrev files editable to other
editors, we should make sure they are encoded in some encoding that
these other editors will understand.  That probably calls for using
utf-8 for everything that's covered by Unicode, and using other
appropriate encodings for characters outside Unicode.




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 11:21:01 GMT) Full text and rfc822 format available.

Message #17 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 12:26:15 +0100
On 21.03.2011 11:54, Eli Zaretskii wrote:
>> From: Leo<sdl.web <at> gmail.com>
>> Date: Mon, 21 Mar 2011 18:01:17 +0800
>> Cc:
>>
>> Would you agree to use utf-8-emacs instead, which covers all characters?
>
> That's better, but the characters outside Unicode are still going to
> do bad things to any software except Emacs.  AFAIK, emacs-mule is a
> superset of iso-2022 in the same way as utf-8-emacs is a superset of
> utf-8.
>
>>> Can you tell what is the purpose of this change?
>>
>> Make abbrev file editable to other editors.
>
> If we are really keen on making the abbrev files editable to other
> editors, we should make sure they are encoded in some encoding that
> these other editors will understand.  That probably calls for using
> utf-8 for everything that's covered by Unicode, and using other
> appropriate encodings for characters outside Unicode.
>
>
>
>

Hi,

This sounds interesting to me, since, as far as I understand, not just
other editors are at stake but also abbrevs auto-generated by programs.

These might be subject-specific, covering terms from medicine, law, etc.
One could offer modes with preloaded abbrevs matching the subject being
written about.

Regards,

Andreas









Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 14:51:02 GMT) Full text and rfc822 format available.

Message #20 received at 8308 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Leo <sdl.web <at> gmail.com>
Cc: 8308 <at> debbugs.gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 10:50:21 -0400
> Is it OK to change the encoding for abbrev file to utf-8?
> === modified file 'lisp/abbrev.el'
> --- a/lisp/abbrev.el	2011-03-21 05:49:12 +0000
> +++ b/lisp/abbrev.el	2011-03-21 06:20:36 +0000
> @@ -225,9 +225,9 @@
>  		    abbrev-file-name)))
>    (or (and file (> (length file) 0))
>        (setq file abbrev-file-name))
> -  (let ((coding-system-for-write 'emacs-mule))
> +  (let ((coding-system-for-write 'utf-8))
>      (with-temp-file file
> -      (insert ";;-*-coding: emacs-mule;-*-\n")
> +      (insert ";;-*-coding: utf-8;-*-\n")
>        (dolist (table
>                 ;; We sort the table in order to ease the automatic
>                 ;; merging of different versions of the user's abbrevs

Sounds good in general, but I'm wondering whether we should worry about
the presence of abbrevs which include bytes (aka eight-bit-chars).
Using `utf-8-emacs' should fix those issues, but would then bump into
the problem that such abbrev files wouldn't be compatible with Emacs-22.


        Stefan




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 15:39:01 GMT) Full text and rfc822 format available.

Message #23 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 23:37:41 +0800
On 2011-03-21 22:50 +0800, Stefan Monnier wrote:
> Sounds good in general, but I'm wondering whether we should worry about
> the presence of abbrevs which include bytes (aka eight-bit-chars).
> Using `utf-8-emacs' should fix those issues, but would then bump into
> the problem that such abbrev files wouldn't be compatible with Emacs-22.

I think we should just use utf-8-emacs. What do other people think?

Leo





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 18:19:02 GMT) Full text and rfc822 format available.

Message #26 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 19:24:16 +0100
On 21.03.2011 15:50, Stefan Monnier wrote:
>> Is it OK to change the encoding for abbrev file to utf-8?
>> === modified file 'lisp/abbrev.el'
>> --- a/lisp/abbrev.el	2011-03-21 05:49:12 +0000
>> +++ b/lisp/abbrev.el	2011-03-21 06:20:36 +0000
>> @@ -225,9 +225,9 @@
>>   		    abbrev-file-name)))
>>     (or (and file (>  (length file) 0))
>>         (setq file abbrev-file-name))
>> -  (let ((coding-system-for-write 'emacs-mule))
>> +  (let ((coding-system-for-write 'utf-8))
>>       (with-temp-file file
>> -      (insert ";;-*-coding: emacs-mule;-*-\n")
>> +      (insert ";;-*-coding: utf-8;-*-\n")
>>         (dolist (table
>>                  ;; We sort the table in order to ease the automatic
>>                  ;; merging of different versions of the user's abbrevs
>
> Sounds good in general, but I'm wondering whether we should worry about
> the presence of abbrevs which include bytes (aka eight-bit-chars).
> Using `utf-8-emacs' should fix those issues, but would then bump into
> the problem that such abbrev files wouldn't be compatible with Emacs-22.
>
>
>          Stefan
>

Hi,

so maybe not hard-code it, rather have a variable?

Andreas




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 18:46:02 GMT) Full text and rfc822 format available.

Message #29 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Leo <sdl.web <at> gmail.com>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 20:45:33 +0200
> From: Leo <sdl.web <at> gmail.com>
> Date: Mon, 21 Mar 2011 23:37:41 +0800
> Cc: 
> 
> I think we should just use utf-8-emacs.

Why do you think so?




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Mon, 21 Mar 2011 18:54:02 GMT) Full text and rfc822 format available.

Message #32 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 20:53:41 +0200
> Date: Mon, 21 Mar 2011 19:24:16 +0100
> From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
> Cc: 
> 
> so maybe not hard-code it, rather have a variable?

A constant encoding will never DTRT in all cases.





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Tue, 22 Mar 2011 01:02:02 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Tue, 22 Mar 2011 09:00:51 +0800
On 2011-03-22 02:45 +0800, Eli Zaretskii wrote:
>> I think we should just use utf-8-emacs.
>
> Why do you think so?

By the time 24.1 is released, it will be 1-2 years from now and there
will be two major stable releases that work with utf-8-emacs, which are
backward-compatible enough. But I don't know so I'll forget about this
bug and let the gurus figure it out.

Leo




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Tue, 22 Mar 2011 02:49:02 GMT) Full text and rfc822 format available.

Message #38 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Leo <sdl.web <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Mon, 21 Mar 2011 22:48:40 -0400
>>> I think we should just use utf-8-emacs.
>> Why do you think so?
> By the time 24.1 is released, it will be 1-2 years from now and there
> will be two major stable releases that work with utf-8-emacs, which are
> backward-compatible enough. But I don't know so I'll forget about this
> bug and let the gurus figure it out.

I think it might be OK to do it for Emacs-25, but since Emacs-22 can't
handle utf-8-emacs, I think it's a bit early to switch to it in
Emacs-24.  If utf-8 is sufficient, OTOH it's the best choice.  So maybe
we should check the buffer first to see if utf-8 is safe, and only fall
back to emacs-mule if utf-8 is not safe.


        Stefan
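
The fallback suggested above (prefer utf-8, use emacs-mule only when utf-8
is not safe) can be sketched as a small standalone helper. This is an
illustration only: `my-abbrev-choose-coding-system' is a hypothetical name
that does not exist in abbrev.el, and it assumes the text to be written is
available as a string. `unencodable-char-position' is an existing Emacs
function.

```elisp
;; Hypothetical helper, not part of abbrev.el.
(defun my-abbrev-choose-coding-system (text)
  "Return `utf-8' if TEXT can be encoded in UTF-8, else `emacs-mule'."
  (with-temp-buffer
    (insert text)
    ;; `unencodable-char-position' returns nil when every character
    ;; in the region can be encoded by the given coding system.
    (if (unencodable-char-position (point-min) (point-max) 'utf-8)
        'emacs-mule
      'utf-8)))

;; Example call with plain Unicode text.
(my-abbrev-choose-coding-system "caf\u00e9")
```

The helper checks a temporary buffer rather than the string directly so
that it mirrors how an abbrev file is assembled in a buffer before being
written.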





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Tue, 22 Mar 2011 03:48:02 GMT) Full text and rfc822 format available.

Message #41 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Tue, 22 Mar 2011 11:47:21 +0800
On 2011-03-22 10:48 +0800, Stefan Monnier wrote:
> I think it might be OK to do it for Emacs-25, but since Emacs-22 can't
> handle utf-8-emacs, I think it's a bit early to switch to it in
> Emacs-24.  If utf-8 is sufficient, OTOH it's the best choice.  So maybe
> we should check the buffer first to see if utf-8 is safe, and only fall
> back to emacs-mule if utf-8 is not safe.

I think defaulting to utf-8 is good; it is sufficient for most people.
Any comments on the following patch? I don't know how to introduce a
char unencodable with utf-8 into the abbrevs, so it is only partially
tested.


=== modified file 'lisp/abbrev.el'
--- lisp/abbrev.el	2011-01-25 04:08:28 +0000
+++ lisp/abbrev.el	2011-03-22 03:30:52 +0000
@@ -225,21 +225,29 @@
 		    abbrev-file-name)))
   (or (and file (> (length file) 0))
       (setq file abbrev-file-name))
-  (let ((coding-system-for-write 'emacs-mule))
-    (with-temp-file file
-      (insert ";;-*-coding: emacs-mule;-*-\n")
+  (let ((coding-system-for-write 'utf-8))
+    (with-temp-buffer
       (dolist (table
-               ;; We sort the table in order to ease the automatic
-               ;; merging of different versions of the user's abbrevs
-               ;; file.  This is useful, for example, for when the
-               ;; user keeps their home directory in a revision
-               ;; control system, and is therefore keeping multiple
-               ;; slightly-differing copies loosely synchronized.
-               (sort (copy-sequence abbrev-table-name-list)
-                     (lambda (s1 s2)
-                       (string< (symbol-name s1)
-                                (symbol-name s2)))))
-	(insert-abbrev-table-description table nil)))))
+	       ;; We sort the table in order to ease the automatic
+	       ;; merging of different versions of the user's abbrevs
+	       ;; file.  This is useful, for example, for when the
+	       ;; user keeps their home directory in a revision
+	       ;; control system, and is therefore keeping multiple
+	       ;; slightly-differing copies loosely synchronized.
+	       (sort (copy-sequence abbrev-table-name-list)
+		     (lambda (s1 s2)
+		       (string< (symbol-name s1)
+				(symbol-name s2)))))
+	(insert-abbrev-table-description table nil))
+      (when (unencodable-char-position (point-min) (point-max) 'utf-8)
+	(setq coding-system-for-write
+	      (if (> emacs-major-version 24)
+		  'utf-8-emacs
+		;; For compatibility with Emacs 22
+		'emacs-mule)))
+      (goto-char (point-min))
+      (insert (format ";;-*-coding: %s;-*-\n" coding-system-for-write))
+      (write-region nil nil file nil 0))))
 
 (defun add-mode-abbrev (arg)
   "Define mode-specific abbrev for last word(s) before point.





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Tue, 22 Mar 2011 05:25:02 GMT) Full text and rfc822 format available.

Message #44 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Leo <sdl.web <at> gmail.com>
Cc: Eli Zaretskii <eliz <at> gnu.org>, bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Tue, 22 Mar 2011 01:24:28 -0400
> I think defaulting to utf-8 is good; it is sufficient for most people.
> Any comments on the following patch? I don't know how to introduce a
> char unencodable with utf-8 into the abbrevs, so it is only partially
> tested.

(unibyte-string 129) returns a string containing an unencodable char.
So you can test with it.
The patch looks good,


        Stefan
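
The test input suggested above can be tried in isolation, outside the
abbrev machinery. The following is a minimal sketch, assuming a multibyte
temporary buffer; `my-raw-byte-position' is a hypothetical variable name
used only for this illustration.

```elisp
;; Sketch only: `my-raw-byte-position' is a hypothetical variable name.
(defvar my-raw-byte-position
  (with-temp-buffer
    ;; `string-to-multibyte' keeps byte 129 as a raw eight-bit character
    ;; instead of mapping it to a Unicode code point.
    (insert "abc" (string-to-multibyte (unibyte-string 129)))
    ;; Ask where the first character that utf-8 cannot encode is.
    (unencodable-char-position (point-min) (point-max) 'utf-8))
  "Buffer position of the first character utf-8 cannot encode, or nil.")
```

A non-nil value here means the utf-8 safety check in the patch would
trigger for such text.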




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Tue, 22 Mar 2011 10:43:03 GMT) Full text and rfc822 format available.

Message #47 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Tue, 22 Mar 2011 18:41:39 +0800
On 2011-03-22 13:24 +0800, Stefan Monnier wrote:
> (unibyte-string 129) returns a string containing an unencodable char.
> So you can test with it.

I still cannot get any byte into the abbrevs. For example,
(unibyte-string 129) returns byte \201, but when it is written to the
abbrev file by write-abbrev-file, it is changed to the four characters
\ 2 0 1, so utf-8 appears sufficient even for bytes.

Leo





Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#8308; Package emacs. (Tue, 22 Mar 2011 18:28:01 GMT) Full text and rfc822 format available.

Message #50 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Leo <sdl.web <at> gmail.com>
Cc: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Tue, 22 Mar 2011 14:27:04 -0400
>> (unibyte-string 129) returns a string containing an unencodable char.
>> So you can test with it.
> I still cannot get any byte into the abbrevs. For example,
> (unibyte-string 129) returns byte \201, but when it is written to the
> abbrev file by write-abbrev-file, it is changed to the four characters
> \ 2 0 1, so utf-8 appears sufficient even for bytes.

Good.  In any case your unencodable-foo test would trigger if there were
eight-bit-chars in there, so it works correctly in this respect.
Please install your patch.


        Stefan




Reply sent to Leo <sdl.web <at> gmail.com>:
You have taken responsibility. (Wed, 23 Mar 2011 00:43:02 GMT) Full text and rfc822 format available.

Notification sent to Leo <sdl.web <at> gmail.com>:
bug acknowledged by developer. (Wed, 23 Mar 2011 00:43:02 GMT) Full text and rfc822 format available.

Message #55 received at 8308-done <at> debbugs.gnu.org (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: 8308-done <at> debbugs.gnu.org
Subject: Re: bug#8308: 23.3; Use utf-8 for writing abbrev file
Date: Wed, 23 Mar 2011 08:42:08 +0800
Version: 24.1.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 20 Apr 2011 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 14 years and 125 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.