GNU bug report logs - #6971
24.0.50.1: non-ascii chars appear as numbers

Package: emacs

Reported by: Andreas Röhler <andreas.roehler <at> easy-emacs.de>

Date: Thu, 2 Sep 2010 10:15:05 UTC

Severity: normal

Found in version 24.0.50.1

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6971 in the body.
You can then email your comments to 6971 AT debbugs.gnu.org in the normal way.

Report forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#6971; Package emacs. (Thu, 02 Sep 2010 10:15:06 GMT)

Acknowledgement sent to Andreas Röhler <andreas.roehler <at> easy-emacs.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 02 Sep 2010 10:15:07 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.0.50.1: non-ascii chars appear as numbers
Date: Thu, 02 Sep 2010 12:15:34 +0200

Hi,

I've encountered an encoding bug which I saw and reported years
ago, but which didn't occur with Emacs 23:

When opening a file containing non-ASCII chars, German
umlauts etc., these aren't shown as glyphs but
as numbers.

(define-abbrev-table
  'global-abbrev-table
  '(("Infinity" "∞" nil 0)
    ("alpha" "α" nil 2)
    ("beta" "β" nil 1)
    ("gamma" "γ" nil 1)
    ("theta" "θ" nil 0)))


I see  ("alpha" "\316\261" nil 2)

for example.
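
For reference, the escapes in "\316\261" are the octal values of the two bytes that encode α in UTF-8, which is what gets displayed when the bytes are not decoded as UTF-8. A minimal sketch in Python, outside Emacs, assuming UTF-8 is the file's intended encoding (the helper name `octal_escapes` is hypothetical and not an Emacs function):

```python
def octal_escapes(text: str, encoding: str = "utf-8") -> str:
    """Render each encoded byte of `text` as a backslash-octal escape."""
    return "".join(f"\\{byte:03o}" for byte in text.encode(encoding))


# The two UTF-8 bytes of alpha, written the way the report shows them.
print(octal_escapes("α"))

# Decoding the same two bytes as UTF-8 gives the glyph back.
print(b"\316\261".decode("utf-8"))
```

So the file content itself can be intact even when the display shows numbers; what matters is which decoding was applied on opening.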

I can send a screenshot if useful.

Curious: if the chars-as-numbers code is pasted here in this mail,
glyphs are displayed correctly.

The only thing I remember is editing the file with

GNU Emacs 24.0.50.1 (i686-pc-linux-gnu, GTK+ Version 2.12.0) of 2010-08-28

so I assume the bug comes from there.

Sorry, I'm not able to track the issue down any further.


Andreas

--
https://code.launchpad.net/~a-roehler/python-mode
https://code.launchpad.net/s-x-emacs-werkstatt/

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 04 Sep 2010 08:13:02 GMT)

Notification sent to Andreas Röhler <andreas.roehler <at> easy-emacs.de>:
bug acknowledged by developer. (Sat, 04 Sep 2010 08:13:02 GMT)

Message #10 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 11:16:32 +0300

> Date: Sat, 04 Sep 2010 09:30:55 +0200
> From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
> CC: bug-gnu-emacs <at> gnu.org
> 
> > Please post the file as an attachment.
> >
> 
> Attached.

Thanks.  Here's your culprit:

> > \240 (autoload 'muse-mode "muse-mode" "" t)

You have literal \240 characters in the file, which are invalid UTF-8
sequences.
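
To illustrate why a lone \240 is rejected: octal 240 is the byte 0xA0, which in UTF-8 can only appear as a continuation byte after a lead byte. The following Python sketch, run outside Emacs, shows the strict decoder refusing it. The Latin-1 interpretation in the second half is an assumption for illustration only; the thread never establishes where the byte came from.

```python
# The line quoted above, with the stray byte written as an octal escape.
raw = b"\240 (autoload 'muse-mode \"muse-mode\" \"\" t)"

try:
    raw.decode("utf-8")
    valid = True
except UnicodeDecodeError as err:
    valid = False
    # err.start is the offset of the first byte the decoder rejected.
    print(f"invalid byte {raw[err.start]:#o} at offset {err.start}")

# Assumption: if the byte came from a Latin-1 source, 0xA0 would be a
# NO-BREAK SPACE, whose valid UTF-8 form is the two bytes 0xC2 0xA0.
as_latin1 = raw.decode("latin-1")
print(repr(as_latin1[0]))
print(as_latin1[0].encode("utf-8"))
```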

This file also has other similar problems, like this one:

  Du kannst es nat\365\202\211\205\365\200\210\246\357\275\357\275\274rlich auch unter Linux ausprobieren, z.B.:

I believe the 4th word should have been "natürlich", and the invalid
long byte sequence instead of ü (which Emacs decodes into some
Japanese Kanji character that cannot be encoded by UTF-8) is the
result of saving this file multiple times with an incorrect encoding.
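
As a generic illustration of how repeated saving with the wrong encoding makes a short sequence grow, here is a hedged Python sketch. It assumes the wrong codec is Latin-1 and uses a hypothetical helper `missave`; it does not claim to reproduce the exact bytes found in this file.

```python
def missave(data: bytes, wrong: str = "latin-1") -> bytes:
    """Simulate one bad round trip: decode with the wrong codec, save as UTF-8."""
    return data.decode(wrong).encode("utf-8")


data = "ü".encode("utf-8")  # starts as the two bytes 0xC3 0xBC
for generation in range(3):
    # Each bad round trip doubles the length of the non-ASCII run.
    print(generation, len(data), data)
    data = missave(data)
```

Each pass turns every non-ASCII byte into a two-byte sequence, so one character can become a long run of bytes after a few saves.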

To fix all this corruption, I suggest the following steps:

  1) C-x RET c utf-8 RET C-x C-f befehle.txt RET

  2) M-: (unencodable-char-position (point) (point-max) 'utf-8) RET

  3) Go to the position shown by the previous command, and edit the
     file to replace invalid bytes with valid characters.

  4) Move point past the corrected portion.

  5) Go back to 2.  When unencodable-char-position returns nil, you
     are done; save the file.
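
The same scan-fix-repeat loop can be sketched outside Emacs. The Python function below is only an analogue of the steps above, not a replacement for `unencodable-char-position`: it reports byte offsets in the file rather than buffer positions, and the name `invalid_utf8_offsets` is hypothetical.

```python
def invalid_utf8_offsets(data: bytes) -> list[int]:
    """Return the byte offset of every span that fails strict UTF-8 decoding."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder decodes cleanly, so we are done
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:].
            offsets.append(pos + err.start)
            pos += err.end  # move past the bad span and keep scanning
    return offsets


sample = b"ab\240cd\377"
print(invalid_utf8_offsets(sample))
```

An empty list plays the role of the nil return in step 5: nothing is left to repair.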

I'm closing bug #6971 with this message, since there's no Emacs bug
here.

Message #11 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 10:29:09 +0200

On 04.09.2010 10:16, Eli Zaretskii wrote:
>> Date: Sat, 04 Sep 2010 09:30:55 +0200
>> From: Andreas Röhler<andreas.roehler <at> easy-emacs.de>
>> CC: bug-gnu-emacs <at> gnu.org
>>
>>> Please post the file as an attachment.
>>>
>>
>> Attached.
>
> Thanks.  Here's your culprit:
>
>>> \240 (autoload 'muse-mode "muse-mode" "" t)
>
> You have literal \240 characters in the file, which are invalid UTF-8
> sequences.
>

Thanks a lot for your efforts.
The question remains how that could happen.

Why couldn't Emacs prevent that?

I see two possible causes:

- chars from an auto-saved file
- something pasted from the net, which had some MS encoding etc.

If that's right, both cases are not that uncommon, so I think Emacs
should find a way to deal with them.


> This file also has other similar problems, like this one:
>
>    Du kannst es nat\365\202\211\205\365\200\210\246\357\275\357\275\274rlich auch unter Linux ausprobieren, z.B.:
>
> I believe the 4th word should have been "natürlich", and the invalid
> long byte sequence instead of ü (which Emacs decodes into some
> Japanese Kanji character that cannot be encoded by UTF-8) is the
> result of saving this file multiple times with an incorrect encoding.
>
> To fix all this corruption, I suggest the following steps:
>
>    1) C-x RET c utf-8 RET C-x C-f befehle.txt RET
>
>    2) M-: (unencodable-char-position (point) (point-max) 'utf-8) RET
>
>    3) Go to the position shown by the previous command, and edit the
>       file to replace invalid bytes with valid characters.
>
>    4) Move point past the corrected portion.
>
>    5) Go back to 2.  When unencodable-char-position returns nil, you
>       are done; save the file.
>
> I'm closing bug #6971 with this message, since there's no Emacs bug
> here.
>

Hm,

since I haven't seen that error for a long time, I still suspect Emacs 24 is
doing something less clever than 23 does.

Andreas

Message #12 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 12:27:09 +0300

> Date: Sat, 04 Sep 2010 10:29:09 +0200
> From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
> CC: 6971-done <at> debbugs.gnu.org
> 
> >>> \240 (autoload 'muse-mode "muse-mode" "" t)
> >
> > You have literal \240 characters in the file, which are invalid UTF-8
> > sequences.
> >
> 
> Thanks a lot for your efforts.
> The question remains how that could happen.
> 
> Why couldn't Emacs prevent that?

I have no idea, but you should watch very closely the first time
Emacs says it cannot save the file and asks you to select an
encoding.  At that moment, the contents of the buffer should be
analyzed to find the unencodable characters and to figure out
where they came from.  If you just accept one of the encodings
offered by Emacs and save the file, it's too late.

> I see two possible causes:
> 
> - chars from an auto-saved file

Cannot be true: Emacs 24 uses UTF-8 for auto-save files.

> - something pasted from the net, which had some MS encoding etc.

Cannot be, if your selection encoding is set up correctly.

> Since I haven't seen that error for a long time, I still suspect Emacs 24 is
> doing something less clever than 23 does.

Maybe, but it's impossible to say unless you show a reproducible
recipe where a valid UTF-8 buffer gets a raw byte \240 inserted into
it.

Message #13 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 11:50:01 +0200

On 04.09.2010 11:27, Eli Zaretskii wrote:
>> Date: Sat, 04 Sep 2010 10:29:09 +0200
>> From: Andreas Röhler<andreas.roehler <at> easy-emacs.de>
>> CC: 6971-done <at> debbugs.gnu.org
>>
>>>>> \240 (autoload 'muse-mode "muse-mode" "" t)
>>>
>>> You have literal \240 characters in the file, which are invalid UTF-8
>>> sequences.
>>>
>>
>> Thanks a lot for your efforts.
>> The question remains how that could happen.
>>
>> Why couldn't Emacs prevent that?
>
> I have no idea, but you should watch very closely the first time
> Emacs says it cannot save the file and asks you to select an
> encoding.  At that moment, the contents of the buffer should be
> analyzed to find the unencodable characters and to figure out
> where they came from.  If you just accept one of the encodings
> offered by Emacs and save the file, it's too late.
>
>> I see two possible causes:
>>
>> - chars from an auto-saved file
>
> Cannot be true: Emacs 24 uses UTF-8 for auto-save files.
>
>> - something pasted from the net, which had some MS encoding etc.
>
> Cannot be, if your selection encoding is set up correctly.
>
>> Since I haven't seen that error for a long time, I still suspect Emacs 24 is
>> doing something less clever than 23 does.
>
> Maybe, but it's impossible to say unless you show a reproducible
> recipe where a valid UTF-8 buffer gets a raw byte \240 inserted into
> it.
>

Thanks,

since this file just contains some saved notes, maybe the last two edits
at the top give some information.

I'm pretty sure everything was fine with the next-to-last entry below,
(setq mon (list ...

Maybe this entry, and/or executing it inside the file, caused the corruption:

(insert (list (read-from-string (format "%s" α))


(setq mon (list "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli" 
"August" "September" "Oktober"   "November" "Dezember"))
Mai
\,(pop mon)

Message #14 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 13:36:53 +0300

> Date: Sat, 04 Sep 2010 11:50:01 +0200
> From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
> CC: 6971-done <at> debbugs.gnu.org
> 
> I'm pretty sure everything was fine with the next-to-last entry below,
> (setq mon (list ...
> 
> Maybe this entry, and/or executing it inside the file, caused the corruption:
> 
> (insert (list (read-from-string (format "%s" α))

This one errors out.

> (setq mon (list "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli" 
> "August" "September" "Oktober"   "November" "Dezember"))
> Mai
> \,(pop mon)

I don't see anything here that could have caused corruption.  And note
that these parts of your file are perfectly fine.

Message #15 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 13:21:09 +0200

On 04.09.2010 12:36, Eli Zaretskii wrote:
>> Date: Sat, 04 Sep 2010 11:50:01 +0200
>> From: Andreas Röhler<andreas.roehler <at> easy-emacs.de>
>> CC: 6971-done <at> debbugs.gnu.org
>>
>> I'm pretty sure everything was fine with the next-to-last entry below,
>> (setq mon (list ...
>>
>> Maybe this entry, and/or executing it inside the file, caused the corruption:
>>
>> (insert (list (read-from-string (format "%s" α))
>
> This one errors out.
>
>> (setq mon (list "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli"
>> "August" "September" "Oktober"   "November" "Dezember"))
>> Mai
>> \,(pop mon)
>
> I don't see anything here that could have caused corruption.  And note
> that these parts of your file are perfectly fine.
>

Only in Thunderbird; see the attached screenshot.

If Thunderbird displays the pasted code correctly,
doesn't that indicate a bug in Emacs?

BTW, pasting from Thunderbird might well be a possible trigger.

[tb.png (image/png, attachment)]

Message #16 received at 6971-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
Cc: 6971-done <at> debbugs.gnu.org
Subject: Re: bug#6974: Emacs doesn't like Swedish ä (on w32)
Date: Sat, 04 Sep 2010 16:15:50 +0300

> Date: Sat, 04 Sep 2010 13:21:09 +0200
> From: Andreas Röhler <andreas.roehler <at> easy-emacs.de>
> CC: 6971-done <at> debbugs.gnu.org
> 
> >> (insert (list (read-from-string (format "%s" α))
> >
> > This one errors out.
> >
> >> (setq mon (list "Januar" "Februar" "März" "April" "Mai" "Juni" "Juli"
> >> "August" "September" "Oktober"   "November" "Dezember"))
> >> Mai
> >> \,(pop mon)
> >
> > I don't see anything here that could have caused corruption.  And note
> > that these parts of your file are perfectly fine.
> >
> 
> Only in Thunderbird; see the attached screenshot.

Not only in Thunderbird, also in Emacs, if you force UTF-8 decoding
with "C-x RET c".

> If Thunderbird displays the pasted code correctly,
> doesn't that indicate a bug in Emacs?

No, because we don't know what exactly happens to the text during
copy-paste.  Thunderbird could perform some "corrective" action, for
example.  Or, if you copy just a portion of the file, that portion
could be valid UTF-8.
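
The last point, that a portion of a corrupted file can still be valid, is easy to check outside Emacs. A small Python sketch under stated assumptions: the byte string below is hypothetical sample data, not the reporter's file, and `is_valid_utf8` is an illustrative helper.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical content: valid text, one stray \240 byte, more valid text.
whole = "März ".encode("utf-8") + b"\240" + "Mai".encode("utf-8")
portion = whole[:6]  # only the six bytes that encode "März "

print(is_valid_utf8(whole))    # the stray byte makes the whole invalid
print(is_valid_utf8(portion))  # the copied portion alone is fine
```

This is why text that displays correctly after being pasted elsewhere says little about the validity of the file it was copied from.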

> BTW, pasting from Thunderbird might well be a possible trigger.

Maybe.  Look, I don't see any reason to continue this guesswork.  If
you can find a recipe to reproduce the corruption starting with a
valid UTF-8 file, please post it.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 03 Oct 2010 11:24:03 GMT)

This bug report was last modified 14 years and 315 days ago.
