GNU bug report logs - #2497
23.0.91; Fails to read UTF-8 on Win2k

Package: emacs;

Date: Fri, 27 Feb 2009 14:20:02 UTC

Severity: normal

Merged with 2354

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 2497 in the body.
You can then email your comments to 2497 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Fri, 27 Feb 2009 14:20:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to uwe.siart <at> tum.de:
New bug report received and forwarded. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 27 Feb 2009 14:20:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: emacs-pretest-bug <at> gnu.org
Subject: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 15:10:19 +0100

I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
fails to read utf-8 encoded files correctly. When visiting a file in
utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
indicates iso-latin1-dos for saving the file. This has not been an
issue in 23.0.90.

-- 
Uwe


In GNU Emacs 23.0.91.1 (i386-mingw-nt5.0.2195)
 of 2009-02-27 on SOFT-MJASON
Windowing system distributor `Microsoft Corp.', version 5.0.2195
configured using `configure --with-gcc (3.4)'

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: DEU
  value of $XMODIFIERS: nil
  locale-coding-system: cp1252
  default-enable-multibyte-characters: t

Major mode: Lisp Interaction

Minor modes in effect:
  iswitchb-mode: t
  display-time-mode: t
  auto-insert-mode: t
  diff-auto-refine-mode: t
  delete-selection-mode: t
  pc-selection-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  column-number-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent input:
M-x r e <tab> p o <tab> r t <tab> <return>

Recent messages:
Loading time...done
Loading iswitchb...done
For information about GNU Emacs and the GNU system, type C-h C-a.
Making completion list... [2 times]

Message #10 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: uwe.siart <at> tum.de, 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 18:03:16 +0200

> Date: Fri, 27 Feb 2009 15:10:19 +0100
> From: Uwe Siart <uwe.siart <at> tum.de>
> Cc: 
> 
> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> fails to read utf-8 encoded files correctly. When visiting a file in
> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> indicates iso-latin1-dos for saving the file.

Does it work with "C-x RET c utf-8 RET" immediately prior to
"C-x C-f"?  If it does, then the problem is with guessing the
encoding, not with decoding it.

Also, what is the default value of buffer-file-coding-system, and was
it the same in 23.0.90?

Message #15 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:11:38 +0100

On Fri, Feb 27, 2009 at 15:10, Uwe Siart <uwe.siart <at> tum.de> wrote:

> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> fails to read utf-8 encoded files correctly. When visiting a file in
> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> indicates iso-latin1-dos for saving the file. This has not been an
> issue in 23.0.90.

Do you have a specific example of a UTF-8 coded file that was detected
as UTF-8 in 23.0.90 and it is detected as Latin-1 in 23.0.91?

For example, I create a UTF-8 file (without UTF-8 byte-order-mark
"signature") with just the following contents:

cañón

And 23.0.90 also thinks it is Latin-1.

That said, if you need UTF-8 to be given more priority than Latin-1,
etc, you can use `set-coding-system-priority' in your .emacs.

    Juanma

Message #20 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:16:51 +0100

On Fri, Feb 27, 2009 at 17:11, Juanma Barranquero <lekktu <at> gmail.com> wrote:

> cañón
>
> And 23.0.90 also thinks it is Latin-1.

Just to be clear: of course "cañón" is Latin-1. What I mean is that
emacs 23.0.90 also reads the byte representation of "cañón" in UTF-8,
that is:

  0000000 63 61 c3 b1 c3 b3 6e

and interprets it as Latin-1: caÃ±Ã³n

    Juanma
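
The misreading described above can be reproduced outside Emacs. The following is a minimal sketch in C, not Emacs code: it takes the seven bytes from the dump, treats each byte as one Latin-1 character, and re-encodes the result as UTF-8 so it can be printed on a UTF-8 terminal. The names `CANON_UTF8` and `latin1_to_utf8` are made up for this illustration.

```c
#include <stddef.h>

/* The UTF-8 encoding of "cañón", as shown in the hex dump above. */
static const unsigned char CANON_UTF8[] = {
    0x63, 0x61, 0xc3, 0xb1, 0xc3, 0xb3, 0x6e
};

/* Treat each of the n input bytes as one Latin-1 character and
   re-encode it as UTF-8.  `out` must have room for 2*n + 1 bytes.
   Returns the number of bytes written, not counting the final NUL. */
size_t latin1_to_utf8(const unsigned char *in, size_t n, char *out)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = in[i];
        if (c < 0x80) {
            out[j++] = (char)c;                   /* ASCII passes through */
        } else {
            out[j++] = (char)(0xC0 | (c >> 6));   /* lead byte 0xC2 or 0xC3 */
            out[j++] = (char)(0x80 | (c & 0x3F)); /* continuation byte */
        }
    }
    out[j] = '\0';
    return j;
}
```

Calling `latin1_to_utf8(CANON_UTF8, sizeof CANON_UTF8, buf)` and printing `buf` shows each two-byte UTF-8 sequence split into two separate Latin-1 characters, which is the garbled form quoted in the message.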

Message #25 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: Juanma Barranquero <lekktu <at> gmail.com>
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:23:43 +0100

Juanma Barranquero <lekktu <at> gmail.com> writes:

> On Fri, Feb 27, 2009 at 15:10, Uwe Siart <uwe.siart <at> tum.de> wrote:
>
>> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
>> fails to read utf-8 encoded files correctly. When visiting a file in
>> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
>> indicates iso-latin1-dos for saving the file. This has not been an
>> issue in 23.0.90.
>
> Do you have a specific example of a UTF-8 coded file that was detected
> as UTF-8 in 23.0.90 and it is detected as Latin-1 in 23.0.91?

Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>

I hope, the webserver delivers it in utf-8 encoding.

> For example, I create a UTF-8 file (without UTF-8 byte-order-mark
> "signature") with just the following contents:
>
> cañón
>
> And 23.0.90 also thinks it is Latin-1.

Maybe because it can be encoded in latin-1. That would be ok for me. But
my .gnus.el contains symbols (arrows for the summary buffer) that are
definitely not included in latin-1 but 23.0.91 recognises latin-1.

-- 
Uwe

Message #30 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: Juanma Barranquero <lekktu <at> gmail.com>
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:27:56 +0100

Juanma Barranquero <lekktu <at> gmail.com> writes:

> Just to be clear: of course "cañón" is Latin-1. What I mean is that
> emacs 23.0.90 also reads the byte representation of "cañón" in UTF-8,
> that is:
>
>   0000000 63 61 c3 b1 c3 b3 6e
>
> and interprets it as Latin-1: caÃ±Ã³n

I tried this out in 23.0.90 in the following way:

- mark "cañón" from your mail
- create empty file with 'touch t.txt'
- visit t.txt and yank cañón
- save t.txt
- visit t.txt

and get correct result (cañón not caÃ±Ã³n)

-- 
Uwe

Message #35 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:32:31 +0100

On Fri, Feb 27, 2009 at 17:27, Uwe Siart <uwe.siart <at> tum.de> wrote:

> I tried this out in 23.0.90 in the following way:
>
> - mark "cañón" from your mail
> - create empty file with 'touch t.txt'
> - visit t.txt and yank cañón
> - save t.txt
> - visit t.txt
>
> and get correct result (cañón not caÃ±Ã³n)

Of course: you've created a file t.txt encoded in Latin-1, not UTF-8.

    Juanma

Message #40 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:38:37 +0100

On Fri, Feb 27, 2009 at 17:23, Uwe Siart <uwe.siart <at> tum.de> wrote:

> Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>

Aha, yes, the bug is reproducible.

> I hope, the webserver delivers it in utf-8 encoding.

Yes. Emacs 23.0.90 opens it as utf-8, as does Notepad2.

> But
> my .gnus.el contains symbols (arrows for the summary buffer) that are
> definitely not included in latin-1 but 23.0.91 recognises latin-1.


    Juanma

Message #45 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:48:15 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

>> Date: Fri, 27 Feb 2009 15:10:19 +0100
>> From: Uwe Siart <uwe.siart <at> tum.de>
>> Cc: 
>> 
>> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
>> fails to read utf-8 encoded files correctly. When visiting a file in
>> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
>> indicates iso-latin1-dos for saving the file.
>
> Does it work with "C-x RET c utf-8 RET" immediately prior to
> "C-x C-f"?

It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".

> If it does, then the problem is with guessing the encoding, not with
> decoding it.

That's also my impression.

> Also, what is the default value of buffer-file-coding-system, and was
> it the same in 23.0.90?

iso-latin-1-dos in 23.0.90 and in 23.0.91.

-- 
Uwe

Message #50 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Leo <sdl.web <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 17:02:19 +0000

On 2009-02-27 16:11 +0000, Juanma Barranquero wrote:
> On Fri, Feb 27, 2009 at 15:10, Uwe Siart <uwe.siart <at> tum.de> wrote:
>
>> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
>> fails to read utf-8 encoded files correctly. When visiting a file in
>> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
>> indicates iso-latin1-dos for saving the file. This has not been an
>> issue in 23.0.90.
>
> Do you have a specific example of a UTF-8 coded file that was detected
> as UTF-8 in 23.0.90 and it is detected as Latin-1 in 23.0.91?
>
> For example, I create a UTF-8 file (without UTF-8 byte-order-mark
> "signature") with just the following contents:
>
> cañón
>
> And 23.0.90 also thinks it is Latin-1.
>
> That said, if you need UTF-8 to be given more priority than Latin-1,
> etc, you can use `set-coding-system-priority' in your .emacs.
>
>     Juanma

I have had the following code in my .emacs since I switched to w32 last
June, so the problem may have existed for longer.

;;; FIXME: find out why GNU/Linux does not need this
(prefer-coding-system 'utf-8)

I just tested some Chinese files. Without that line, all of them are
being opened in latin-1 encoding and are unreadable.

Tested in GNU Emacs 23.0.91.1 (i386-mingw-nt5.1.2600) of 2009-02-26

-- 
.:  Leo  :.  [ sdl.web AT gmail.com ]  .: I use Emacs :.

Message #55 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: David Engster <deng <at> randomsample.de>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org, emacs-pretest-bug <at> gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 18:46:12 +0100

Uwe Siart <uwe.siart <at> tum.de> writes:
> I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> fails to read utf-8 encoded files correctly. When visiting a file in
> utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> indicates iso-latin1-dos for saving the file. This has not been an
> issue in 23.0.90.

Maybe this is a duplicate of what I reported in

http://debbugs.gnu.org/cgi/bugreport.cgi?bug=2354

As I write later in that bug report, I think I could track down this
issue to the change in revision 1.413 of src/coding.c. Maybe you could
try if the same applies to your problem.

-David

Message #65 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 20:19:04 +0200

> From: Uwe Siart <uwe.siart <at> tum.de>
> Cc: 2497 <at> emacsbugs.donarmstrong.com
> Date: Fri, 27 Feb 2009 17:48:15 +0100
> 
> It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".
> 
> > If it does, then the problem is with guessing the encoding, not with
> > decoding it.
> 
> That's also my impression.
> 
> > Also, what is the default value of buffer-file-coding-system, and was
> > it the same in 23.0.90?
> 
> iso-latin-1-dos in 23.0.90 and in 23.0.91.

Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
every single instance.  Distinguishing between UTF-8 and Latin-1 is
generally impossible with the current state of the art of coded
character sets support in Emacs.  It might work in certain cases, but
that's sheer luck.

One way to work around that in your specific case, without changing
your global defaults, is to add a `coding:' cookie to your .gnus.el
file.

Message #70 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juanma Barranquero <lekktu <at> gmail.com>, 2497 <at> debbugs.gnu.org
Cc: uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 20:19:47 +0200

> Date: Fri, 27 Feb 2009 17:38:37 +0100
> From: Juanma Barranquero <lekktu <at> gmail.com>
> Cc: 2497 <at> emacsbugs.donarmstrong.com
> 
> On Fri, Feb 27, 2009 at 17:23, Uwe Siart <uwe.siart <at> tum.de> wrote:
> 
> > Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>
> 
> Aha, yes, the bug is reproducible.

Which bug?

Message #75 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 21:35:08 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Uwe Siart <uwe.siart <at> tum.de>
>> iso-latin-1-dos in 23.0.90 and in 23.0.91.
>
> Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
> every single instance. Distinguishing between UTF-8 and Latin-1 is
> generally impossible with the current state of the art of coded
> character sets support in Emacs. It might work in certain cases, but
> that's sheer luck.

I do not have the background knowledge to join in this conversation but
I just observed that it worked correctly for years now (even with CVS
Emacsen prior to the 22.1 release) and that it stopped working in
23.0.91. If it appears that this is not a bug then I will take the
measures you suggested and set a utf-8 cookie in all files concerned.

-- 
Uwe

Message #80 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Juanma Barranquero <lekktu <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 21:38:13 +0100

On Fri, Feb 27, 2009 at 19:19, Eli Zaretskii <eliz <at> gnu.org> wrote:

>> Aha, yes, the bug is reproducible.
>
> Which bug?

I mean, the fact that the given .gnus.el file was read as utf-8-dos in
23.0.90 and as iso-latin1-dos in 23.0.91 (with characters that are not
latin-1).

    Juanma

Message #85 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: 2497 <at> debbugs.gnu.org
Cc: emacs-pretest-bug <at> gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 22:15:36 +0100

David Engster <deng <at> randomsample.de> writes:

> Maybe this is a duplicate of what I reported in
>
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=2354
>
> As I write later in that bug report, I think I could track down this
> issue to the change in revision 1.413 of src/coding.c. Maybe you could
> try if the same applies to your problem.

At least I can reproduce it and it seems to be the very same thing that
I stumbled across. But due to lack of detailed knowledge about coding
recognition I'm unable to join the discussion whether this is a bug or
not. It's just that I felt more comfortable about the previous state.

So far I got things back to work with

;; -*- coding:utf-8-dos; -*-

as the first line of my .gnus.el :-)

-- 
Uwe

Message #95 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Richard M Stallman <rms <at> gnu.org>
To: uwe.siart <at> tum.de, 2497 <at> debbugs.gnu.org
Cc: emacs-pretest-bug <at> gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k
Date: Fri, 27 Feb 2009 18:34:08 -0500

Please don't call that system "Win"--that name implies praise.

Message #105 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Jason Rumney <jasonr <at> f2s.com>
To: Eli Zaretskii <eliz <at> gnu.org>, 2497 <at> debbugs.gnu.org
Cc: Juanma Barranquero <lekktu <at> gmail.com>, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 09:29:05 +0800

Eli Zaretskii wrote:
>> Date: Fri, 27 Feb 2009 17:38:37 +0100
>> From: Juanma Barranquero <lekktu <at> gmail.com>
>> Cc: 2497 <at> emacsbugs.donarmstrong.com
>>
>> On Fri, Feb 27, 2009 at 17:23, Uwe Siart <uwe.siart <at> tum.de> wrote:
>>
>>     
>>> Yes. My .gnus.el: <http://www.siart.de/etc/.gnus.el>
>>>       
>> Aha, yes, the bug is reproducible.
>>     
>
> Which bug?
>   

The one where the OP's .gnus.el contains characters which were correctly 
detected as UTF-8 in 23.0.90, but now appear as \200\224 octal escapes, 
as the file is incorrectly detected as Latin-1.

Message #110 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Jason Rumney <jasonr <at> gnu.org>
To: David Engster <deng <at> randomsample.de>, 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 09:32:51 +0800

merge 2354 2497

David Engster wrote:
> Maybe this is a duplicate of what I reported in
>
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=2354
>   

It seems so, yes.

Merged 2354 2497. Request was from Jason Rumney <jasonr <at> gnu.org> to control <at> emacsbugs.donarmstrong.com. (Sat, 28 Feb 2009 01:35:07 GMT) Full text and rfc822 format available.

Message #117 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Fri, 27 Feb 2009 23:40:01 -0500

>> It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".
>> > If it does, then the problem is with guessing the encoding, not with
>> > decoding it.
>> That's also my impression.
>> > Also, what is the default value of buffer-file-coding-system, and was
>> > it the same in 23.0.90?
>> iso-latin-1-dos in 23.0.90 and in 23.0.91.
> Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
> every single instance.  Distinguishing between UTF-8 and Latin-1 is

The guessing shouldn't give priority to buffer-file-coding-system.
Instead, we have set-coding-system-priority for that.
And IIUC utf-8 should always have a pretty high priority since false
positives are fairly rare.  So this still looks like a real bug.


        Stefan
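
One common argument for why UTF-8 false positives are rare is that Latin-1 text containing non-ASCII letters is usually not well-formed UTF-8. The following is a minimal sketch in C of a well-formedness check under that assumption. It is not the detector in coding.c, and the name `is_valid_utf8` is made up for this illustration.

```c
#include <stddef.h>

/* Return 1 if the n bytes at s are well-formed UTF-8, otherwise 0. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* code point, smallest legal value */

        if (c < 0x80) { i++; continue; }  /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;          /* stray continuation or invalid lead byte */

        if (n - i - 1 < need) return 0;        /* truncated sequence */
        for (size_t k = 1; k <= need; k++) {
            unsigned char t = s[i + k];
            if ((t & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (t & 0x3F);
        }
        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate range */
        i += need + 1;
    }
    return 1;
}
```

With this check, the Latin-1 bytes of "cañón" (63 61 f1 f3 6e) are rejected, while the UTF-8 bytes (63 61 c3 b1 c3 b3 6e) are accepted. A file that passes such a check and contains non-ASCII bytes is therefore a plausible UTF-8 candidate, although the check alone cannot prove the author's intent.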

Message #122 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 09:17:35 +0100

Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> The guessing shouldn't give priority to buffer-file-coding-system.
> Instead, we have set-coding-system-priority for that. And IIUC utf-8
> should always have a pretty high priority since false positives are
> fairly rare. So this still looks like a real bug.

Here I would like to note that I never had false positives in the past
(before 23.0.91) but I do have false positives now. Therefore I'm
inclined to call it a bug.

-- 
Uwe

Message #127 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: rms <at> gnu.org
Cc: 2497 <at> debbugs.gnu.org, emacs-pretest-bug <at> gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k
Date: Sat, 28 Feb 2009 10:47:44 +0100

Richard M Stallman <rms <at> gnu.org> writes:

> Please don't call that system "Win"--that name implies praise.

How right you are. Forgive me my trespasses. In my own defence I have to
say that I never thought of W2k as the "system". My system is Emacs and
I'm very comfortable with it. W2k is its boot loader. The boot loader
does not become noticeable too much. I never understood, however, why
this boot loader takes up a whole CD.

-- 
Uwe

Message #137 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: David Engster <deng <at> randomsample.de>
To: uwe.siart <at> tum.de
Cc: 2497 <at> debbugs.gnu.org, Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 11:14:16 +0100

Uwe Siart <uwe.siart <at> tum.de> writes:
> Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
>
>> The guessing shouldn't give priority to buffer-file-coding-system.
>> Instead, we have set-coding-system-priority for that. And IIUC utf-8
>> should always have a pretty high priority since false positives are
>> fairly rare. So this still looks like a real bug.
>
> Here I would like to note that I never had false positives in the past
> (before 23.0.91) but I do have false positives now. Therefore I'm
> inclined to call it a bug.

I second this - this has worked for years without problems, and suddenly
it fails to detect UTF-8 with a Latin-1 environment.

I once again confirmed that this behaviour can be tracked down to this
change in detect_coding_charset in coding.c (revision 1.413):

--- coding.c    7 Feb 2009 10:49:39 -0000       1.412
+++ coding.c    9 Feb 2009 00:42:37 -0000       1.413
@@ -5101,7 +5101,7 @@
   valids = AREF (attrs, coding_attr_charset_valids);
   name = CODING_ID_NAME (coding->id);
   if (VECTORP (Vlatin_extra_code_table)
-      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-"))
+      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-") == 0)
     check_latin_extra = 1;
   if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs)))
     src += head_ascii;

I'm inclined to say that this change is wrong, since strcmp will only
return 0 if two strings are exactly equal. In this case though, the
string "iso-8859-" is compared to "iso-8859-1" (in my case), so it
returns a nonzero value and therefore check_latin_extra is not set.

-David
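
The difference between an exact comparison and a prefix comparison can be shown in isolation. The following is a minimal sketch in C with plain strings instead of Lisp symbols. The helper names are made up for this illustration, and the `strncmp` variant is only one possible way to express a prefix test, not necessarily the change that was eventually committed to coding.c.

```c
#include <string.h>

/* Exact comparison: true only when name is literally "iso-8859-". */
int is_iso8859_exact(const char *name)
{
    return strcmp(name, "iso-8859-") == 0;
}

/* Prefix comparison: true when name starts with "iso-8859-". */
int has_iso8859_prefix(const char *name)
{
    static const char prefix[] = "iso-8859-";
    /* sizeof counts the trailing NUL, so subtract 1 for the length. */
    return strncmp(name, prefix, sizeof prefix - 1) == 0;
}
```

For the name "iso-8859-1", `is_iso8859_exact` is false and `has_iso8859_prefix` is true. Revision 1.412 tested `strcmp (...)` without `== 0`, which is true for any name other than the bare prefix, so the 1.413 change effectively turned an almost-always-true condition into an almost-never-true one.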

Message #142 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>, Kenichi Handa <handa <at> m17n.org>
Cc: 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 12:49:58 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 2497 <at> emacsbugs.donarmstrong.com,  uwe.siart <at> tum.de
> Date: Fri, 27 Feb 2009 23:40:01 -0500
> 
> >> It works with "C-x RET c utf-8 RET" immediately prior to "C-x C-f".
> >> > If it does, then the problem is with guessing the encoding, not with
> >> > decoding it.
> >> That's also my impression.
> >> > Also, what is the default value of buffer-file-coding-system, and was
> >> > it the same in 23.0.90?
> >> iso-latin-1-dos in 23.0.90 and in 23.0.91.
> > Then you shouldn't expect Emacs to guess UTF-8 encoding correctly in
> > every single instance.  Distinguishing between UTF-8 and Latin-1 is
> 
> The guessing shouldn't give priority to buffer-file-coding-system.
> Instead we have set-coding-system-priority.

Please give me some credit: I said ``the _default_value_ of
buffer-file-coding-system''.  That default tells volumes about the
coding-system priorities.

> And IIUC utf-8 should always have a pretty high priority

With today's CVS on a Windows XP machine I get this:

  M-: (coding-system-priority-list) RET
  =>  (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

So UTF-8 is indeed ``pretty high'', but lower than the locale's
default.

> So this still looks like a real bug.

Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
between Latin-1 and UTF-8, even when UTF-8 sequences are present in
the text.  Can we do that reliably?  Perhaps Handa-san can shed some
light on this.

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 12:15:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Eli Zaretskii <eliz <at> gnu.org>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 12:15:04 GMT) Full text and rfc822 format available.

Message #147 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: David Engster <deng <at> randomsample.de>, 2497 <at> debbugs.gnu.org
Cc: uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 14:09:04 +0200

> From: David Engster <deng <at> randomsample.de>
> Date: Sat, 28 Feb 2009 11:14:16 +0100
> Cc: 2497 <at> emacsbugs.donarmstrong.com
> 
> I once again confirmed that this behaviour can be tracked down to this
> change in detect_coding_charset in coding.c (revision 1.413):
> 
> --- coding.c    7 Feb 2009 10:49:39 -0000       1.412
> +++ coding.c    9 Feb 2009 00:42:37 -0000       1.413
> @@ -5101,7 +5101,7 @@
>    valids = AREF (attrs, coding_attr_charset_valids);
>    name = CODING_ID_NAME (coding->id);
>    if (VECTORP (Vlatin_extra_code_table)
> -      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-"))
> +      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-") == 0)
>      check_latin_extra = 1;
>    if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs)))
>      src += head_ascii;
> 
> I'm inclined to say that this change is wrong, since strcmp will only
> return 0 if two strings are exactly equal. In this case though, the
> string "iso-8859-" is compared to "iso-8859-1" (in my case), so it
> returns a nonzero value and therefore check_latin_extra is not set.

You are right.  But in my case, it was not enough to test for
"iso-8859-", as the symbol's name was "iso-latin-1", not "iso-8859-1".

I installed the patch below, that does seem to fix the problem with
the OP's .gnus.el, although I don't know how general that problem is,
nor whether Emacs is capable of distinguishing UTF-8 from Latin-N in
general.


2009-02-28  Eli Zaretskii  <eliz <at> gnu.org>

	* coding.c (detect_coding_charset): Fix change from 2008-10-21.
	Also, check iso-latin-*, not only iso-8859-*.

Index: src/coding.c
===================================================================
RCS file: /cvsroot/emacs/emacs/src/coding.c,v
retrieving revision 1.419
diff -u -r1.419 coding.c
--- src/coding.c	22 Feb 2009 15:48:03 -0000	1.419
+++ src/coding.c	28 Feb 2009 12:01:18 -0000
@@ -5103,7 +5103,10 @@
   valids = AREF (attrs, coding_attr_charset_valids);
   name = CODING_ID_NAME (coding->id);
   if (VECTORP (Vlatin_extra_code_table)
-      && strcmp ((char *) SDATA (SYMBOL_NAME (name)), "iso-8859-") == 0)
+      && (strncmp ((char *) SDATA (SYMBOL_NAME (name)),
+		   "iso-8859-", sizeof ("iso-8859-") - 1) == 0
+	  || strncmp ((char *) SDATA (SYMBOL_NAME (name)),
+		      "iso-latin-", sizeof ("iso-latin-") - 1) == 0))
     check_latin_extra = 1;
   if (! NILP (CODING_ATTR_ASCII_COMPAT (attrs)))
     src += head_ascii;

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 12:25:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to uwe.siart <at> tum.de:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 12:25:04 GMT) Full text and rfc822 format available.

Message #152 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Uwe Siart <uwe.siart <at> tum.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>, Kenichi Handa <handa <at> m17n.org>,
        2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 13:16:08 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
>> So this still looks like a real bug.
>
> Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
> between Latin-1 and UTF-8, even when UTF-8 sequences are present in
> the text. Can we do that reliably? Perhaps Handa-san can shed some
> light on this.

Finding a solution to do it reliably would of course be the best.

Assuming this is not possible right now, we should distinguish between
»high reliability« and »poor reliability«. From my perception it has
been much more reliable earlier so (as a user with limited viewpoint)
I vote for reverting the change.

-- 
Uwe

Message #153 received at 2497-done <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: 2497-done <at> debbugs.gnu.org, 2354-done <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 14:21:08 +0200

> From: David Engster <deng <at> randomsample.de>
> Date: Fri, 27 Feb 2009 18:46:12 +0100
> Cc: emacs-pretest-bug <at> gnu.org, 2497 <at> emacsbugs.donarmstrong.com
> 
> Uwe Siart <uwe.siart <at> tum.de> writes:
> > I'm using the windows port of 23.0.91 on Win2k SP4 and I found that it
> > fails to read utf-8 encoded files correctly. When visiting a file in
> > utf-8 encoding all characters above 255 are screwed up and "C-h C RET"
> > indicates iso-latin1-dos for saving the file. This has not been an
> > issue in 23.0.90.
> 
> Maybe this is a duplicate of what I reported in
> 
> http://debbugs.gnu.org/cgi/bugreport.cgi?bug=2354
> 
> As I write later in that bug report, I think I could track down this
> issue to the change in revision 1.413 of src/coding.c. Maybe you could
> try if the same applies to your problem.

Should be fixed by this change:

2009-02-28  Eli Zaretskii  <eliz <at> gnu.org>

	* coding.c (detect_coding_charset): Fix change from 2008-10-21.
	Also, check iso-latin-*, not only iso-8859-*.

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 14:25:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jason Rumney <jasonr <at> f2s.com>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 14:25:07 GMT) Full text and rfc822 format available.

Message #158 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Jason Rumney <jasonr <at> f2s.com>
To: Eli Zaretskii <eliz <at> gnu.org>, 2497 <at> debbugs.gnu.org
Cc: David Engster <deng <at> randomsample.de>, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 22:16:22 +0800

Eli Zaretskii wrote:
> You are right.  But in my case, it was not enough to test for
> "iso-8859-", as the symbol's name was "iso-latin-1", not "iso-8859-1".
>
> I installed the patch below, that does seem to fix the problem with
> the OP's .gnus.el, although I don't know how general that problem is,
> nor whether Emacs is capable of distinguishing UTF-8 from Latin-N in
> general.
>   

I installed a further change for the case where latin-extra-code-table 
is not a vector. But I don't understand why we have this table, and why 
the default value allows the 6 C1 control codes PU1, PU2, STS, CCH, MW 
and SPA to appear in latin text without breaking the auto detection. Are 
these control characters really that common?

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 14:40:07 GMT) Full text and rfc822 format available.

Acknowledgement sent to David Engster <deng <at> randomsample.de>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 14:40:07 GMT) Full text and rfc822 format available.

Message #163 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: David Engster <deng <at> randomsample.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 15:31:47 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:
>> From: David Engster <deng <at> randomsample.de>
>> I'm inclined to say that this change is wrong, since strcmp will only
>> return 0 if two strings are exactly equal. In this case though, the
>> string "iso-8859-" is compared to "iso-8859-1" (in my case), so it
>> returns a nonzero value and therefore check_latin_extra is not set.
>
> You are right.  But in my case, it was not enough to test for
> "iso-8859-", as the symbol's name was "iso-latin-1", not "iso-8859-1".
>
> I installed the patch below, that does seem to fix the problem with
> the OP's .gnus.el, although I don't know how general that problem is,
> nor whether Emacs is capable of distinguishing UTF-8 from Latin-N in
> general.

I can confirm this patch fixes my original bug report (#2354). Thanks!

-David

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 18:15:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to rms <at> gnu.org:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 18:15:05 GMT) Full text and rfc822 format available.

Message #168 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Richard M Stallman <rms <at> gnu.org>
To: uwe.siart <at> tum.de, 2497 <at> debbugs.gnu.org
Cc: emacs-pretest-bug <at> gnu.org, 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Windows2k
Date: Sat, 28 Feb 2009 13:08:06 -0500

    How right you are. Forgive me my trespasses.

Only Emacs can forgive you, but I am confident that it will.

						 In my own defence I have to
    say that I never thought of W2k as the "system". My system is Emacs and
    I'm very comfortable with it. W2k is its boot loader.

Why not switch to a free boot loader then?

Acknowledgement sent to rms <at> gnu.org:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 18:15:07 GMT) Full text and rfc822 format available.

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 22:05:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 22:05:06 GMT) Full text and rfc822 format available.

Message #178 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: uwe.siart <at> tum.de
Cc: Eli Zaretskii <eliz <at> gnu.org>, 2497 <at> debbugs.gnu.org
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 17:00:43 -0500

>> The guessing shouldn't give priority to buffer-file-coding-system.
>> Instead we have set-coding-system-priority. And IIUC utf-8
>> should always have a pretty high priority since false positives are
>> fairly rare. So this still looks like a real bug.

> Here I would like to note that I never had false positives in the past
> (before 23.0.91) but I do have false positives now. Therefore I'm
> inclined to call it a bug.

To clear things up: by "false positives" I meant text that Emacs thinks
is valid utf-8 whereas it's really using some other coding system.


        Stefan

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Sat, 28 Feb 2009 22:10:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 28 Feb 2009 22:10:04 GMT) Full text and rfc822 format available.

Message #183 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Kenichi Handa <handa <at> m17n.org>, 2497 <at> debbugs.gnu.org,
        uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Sat, 28 Feb 2009 17:04:35 -0500

>> The guessing shouldn't give priority to buffer-file-coding-system.
>> Instead we have set-coding-system-priority.

> Please give me some credit: I said ``the _default_value_ of
> buffer-file-coding-system''.  That default tells volumes about the
> coding-system priorities.

I'm sorry for my bad wording: what I wrote was only meant to describe
the way the code is currently expected to work (AFAIK).

>   M-: (coding-system-priority-list) RET
>   =>  (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

> So UTF-8 is indeed ``pretty high'', but lower than the locale's
> default.

That seems to be the source of the problem.  utf-8 should always come
before latin-1 in that list, since utf-8 streams that are valid latin-1
streams are not uncommon, whereas latin-1 streams that are valid utf-8
streams are extremely rare.


        Stefan

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Mon, 02 Mar 2009 11:50:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Kenichi Handa <handa <at> m17n.org>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 02 Mar 2009 11:50:02 GMT) Full text and rfc822 format available.

Message #188 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: monnier <at> iro.umontreal.ca, 2497 <at> debbugs.gnu.org,
        uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Mon, 02 Mar 2009 20:43:58 +0900

In article <uab86q1ih.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

>   M-: (coding-system-priority-list) RET
>   =>  (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp in-is13194-devanagari chinese-iso-8bit utf-8-auto utf-8-with-signature utf-16 utf-16be-with-signature utf-16le-with-signature utf-16be utf-16le japanese-shift-jis undecided)

> So UTF-8 is indeed ``pretty high'', but lower than the locale's
> default.

> > So this still looks like a real bug.

> Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
> between Latin-1 and UTF-8, even when UTF-8 sequences are present in
> the text.  Can we do that reliably?  Perhaps Handa-san can shed some
> light on this.

The coding system iso-latin-1 is for the character set
iso-8859-1, and the code-space of iso-8859-1 is 0x00..0xFF
(without gap, i.e. including 0x80..0x9F) (see
/usr/share/i18n/charmaps/ISO-8859-1.gz).  So, if we follow
it strictly, any byte sequence can be a correct iso-8859-1
stream, and it means that when iso-latin-1 has the highest
priority, all files are detected as iso-latin-1.

So, as long as we strictly follow the definition of
iso-8859-1...

In article <jwv7i3az0fc.fsf-monnier+emacsbugreports <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> That seems to be the source of the problem.  utf-8 should always come
> before latin-1 in that list, since utf-8 streams that are valid latin-1
> streams are not uncommon, whereas latin-1 streams that are valid utf-8
> streams are extremely rare.

I think that is the only solution.
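
Why the ordering matters can be shown with a toy model of
priority-ordered detection.  This is only a sketch under stated
assumptions, not Emacs's detection code: struct coding, detect,
accepts_latin1 and accepts_utf8 are hypothetical names, and the model
simply returns the first coding system in the list whose validator
accepts the bytes.

```c
#include <stddef.h>

typedef int (*detector_fn) (const unsigned char *, size_t);
struct coding { const char *name; detector_fn accepts; };

/* Strict ISO-8859-1 covers 0x00..0xFF, so every byte string is valid.  */
static int
accepts_latin1 (const unsigned char *buf, size_t len)
{
  (void) buf; (void) len;
  return 1;
}

/* Return 1 if BUF[0..LEN) is well-formed UTF-8, 0 otherwise.  */
static int
accepts_utf8 (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = buf[i];
      size_t n;               /* continuation bytes expected */
      unsigned long cp, min;  /* code point and smallest legal value */

      if (c < 0x80) { i++; continue; }
      else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
      else return 0;          /* stray continuation or invalid lead byte */

      if (len - i <= n) return 0;          /* truncated sequence */
      for (size_t k = 1; k <= n; k++)
        {
          if ((buf[i + k] & 0xC0) != 0x80) return 0;
          cp = (cp << 6) | (buf[i + k] & 0x3F);
        }
      /* Reject overlong forms, surrogates, and values past U+10FFFF.  */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
      i += n + 1;
    }
  return 1;
}

/* Return the name of the first entry in PRIORITY that accepts BUF.  */
static const char *
detect (const struct coding *priority, size_t n,
        const unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < n; i++)
    if (priority[i].accepts (buf, len))
      return priority[i].name;
  return "undecided";
}

static const struct coding latin1_first[] =
  { { "iso-latin-1", accepts_latin1 }, { "utf-8", accepts_utf8 } };
static const struct coding utf8_first[] =
  { { "utf-8", accepts_utf8 }, { "iso-latin-1", accepts_latin1 } };
```

In this model, detect with latin1_first reports "iso-latin-1" for
every input, including valid UTF-8, because the first validator never
refuses.  With utf8_first, UTF-8 text is reported as "utf-8", and
Latin-1 text containing a byte such as 0xFC fails the UTF-8 check and
falls through to "iso-latin-1".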

In article <87ab86ah9z.fsf <at> tum.de>, Uwe Siart <uwe.siart <at> tum.de> writes:

> Assuming this is not possible right now, we should distinguish between
> »high reliability« and »poor reliability«. From my perception it has
> been much more reliable earlier so (as a user with limited viewpoint)
> I vote for reverting the change.

In Emacs 22, the coding system iso-latin-1 was defined as a
variant of iso-2022-based coding system, and thus 0x80..0x9F
were not valid bytes (except for 0x91 etc. in
latin-extra-code-table).  So, some UTF-8 texts were not
detected as iso-latin-1.

To recover that behaviour, we can define iso-latin-1 as
before by doing this:

(define-coding-system 'iso-latin-1
  "Emacs 22 iso-latin-1."
  :mnemonic ?1
  :coding-type 'iso-2022
  :charset-list '(ascii latin-iso8859-1)
  :ascii-compatible-p t
  :mime-charset 'iso-8859-1
  :designation [ascii latin-iso8859-1 nil nil])

But even with that, some valid UTF-8 texts will still be
detected as iso-latin-1.  So I don't think this is a
"high reliability" solution.
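
A small sketch shows why rejecting 0x80..0x9F helps only sometimes.
It is an illustration, not the Emacs 22 decoder: latin1_without_c1 is
a hypothetical name, and the exceptions allowed by
latin-extra-code-table are deliberately left out.

```c
#include <stddef.h>

/* Return 1 if no byte of BUF[0..LEN) lies in the C1 range
   0x80..0x9F, 0 otherwise.  Hypothetical helper that ignores
   latin-extra-code-table.  */
static int
latin1_without_c1 (const unsigned char *buf, size_t len)
{
  for (size_t i = 0; i < len; i++)
    if (buf[i] >= 0x80 && buf[i] <= 0x9F)
      return 0;
  return 1;
}
```

The UTF-8 encoding of U+2013 (EN DASH) is E2 80 93, which contains C1
bytes and is therefore rejected.  The UTF-8 encoding of U+00E9 (e with
acute) is C3 A9, which contains none, so such text would still pass
this check and be taken for Latin-1.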

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Mon, 02 Mar 2009 15:35:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 02 Mar 2009 15:35:03 GMT) Full text and rfc822 format available.

Message #193 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Kenichi Handa <handa <at> m17n.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 2497 <at> debbugs.gnu.org,
        uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Mon, 02 Mar 2009 10:25:45 -0500

>> That seems to be the source of the problem.  utf-8 should always come
>> before latin-1 in that list, since utf-8 streams that are valid latin-1
>> streams are not uncommon, whereas latin-1 streams that are valid utf-8
>> streams are extremely rare.
> I think that is the only solution.

Not only is it the only solution, but it's a solution on which we agreed
already several years ago.  So, again, the bug is in the ordering, and
we have to figure out which code ends up putting latin-1 before utf-8 in
the coding system priority.


        Stefan

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Mon, 02 Mar 2009 19:35:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Eli Zaretskii <eliz <at> gnu.org>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 02 Mar 2009 19:35:03 GMT) Full text and rfc822 format available.

Message #198 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: handa <at> m17n.org, 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Mon, 02 Mar 2009 21:25:58 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  2497 <at> emacsbugs.donarmstrong.com,  uwe.siart <at> tum.de
> Date: Mon, 02 Mar 2009 10:25:45 -0500
> 
> So, again, the bug is in the ordering

Actually, the OP was complaining that, even with this ordering, Emacs
23 did TRT for him, and that a recent change broke that.  That bug is
fixed now, I believe, so you are talking about a more general problem.

> we have to figure out which code ends up putting latin-1 before utf-8 in
> the coding system priority.

Well, I think this is fairly easy: set-locale-environment does it.
Observe:

  (defun set-locale-environment (&optional locale-name frame)
    "Set up multi-lingual environment for using LOCALE-NAME.
  This sets the language environment, the coding system priority,
  the default input method and sometimes other things.
	...
	(let ((language-name
	       (locale-name-match locale locale-language-names))
	      (charset-language-name
	       (locale-name-match locale locale-charset-language-names))
	      (default-eol-type (coding-system-eol-type
				 default-buffer-file-coding-system))
	      (coding-system
	       (or (locale-name-match locale locale-preferred-coding-systems)
		   (when locale
		     (if (string-match "\\.\\([^@]+\\)" locale)
			 (locale-charset-to-coding-system
			  (match-string 1 locale)))))))
	...
	  (when (and (not frame)
		     coding-system
		     (not (coding-system-equal coding-system
					       locale-coding-system)))
    >>>>>	  (prefer-coding-system coding-system)
	    ;; Fixme: perhaps prefer-coding-system should set this too.
	    ;; But it's not the time to do such a fundamental change.
	    (setq default-sendmail-coding-system coding-system)
	    (setq locale-coding-system coding-system))))

Even the doc string says that the coding system priority is set
according to the locale's native encoding.
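
The effect of that prefer-coding-system call on the priority list can
be modelled as a move-to-front operation.  The sketch below is a toy
model in C, not Emacs code: prefer is a hypothetical helper that only
handles names already present in the list, and any starting order
passed to it is an assumption made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Move NAME to the front of LIST[0..N), keeping the relative order
   of the other entries.  Return 1 if NAME was found, 0 otherwise.
   Hypothetical model of what preferring a coding system does.  */
static int
prefer (const char **list, size_t n, const char *name)
{
  for (size_t i = 0; i < n; i++)
    if (strcmp (list[i], name) == 0)
      {
        const char *hit = list[i];
        /* Shift the I entries before the hit one slot to the right.  */
        memmove (&list[1], &list[0], i * sizeof list[0]);
        list[0] = hit;
        return 1;
      }
  return 0;
}
```

Starting from a list of "utf-8", "iso-latin-1", "iso-2022-7bit",
calling prefer with "iso-latin-1" yields "iso-latin-1", "utf-8",
"iso-2022-7bit", which has the same leading order as the
coding-system-priority-list output shown earlier in this thread.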

Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#2497; Package emacs. (Tue, 03 Mar 2009 16:40:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> IRO.UMontreal.CA>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Tue, 03 Mar 2009 16:40:05 GMT) Full text and rfc822 format available.

Message #203 received at 2497 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: handa <at> m17n.org, 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Tue, 03 Mar 2009 11:34:45 -0500

>> So, again, the bug is in the ordering
> Actually, the OP was complaining that, even with this ordering, Emacs
> 23 did TRT for him, and that a recent change broke that.  That bug is
> fixed now, I believe, so you are talking about a more general problem.

Yes.  I didn't realize that the reason it worked before is that
we were lucky.

>> we have to figure out which code ends up putting latin-1 before utf-8 in
>> the coding system priority.

> Well, I think this is fairly easy: set-locale-environment does it.
> Observe:

>   (defun set-locale-environment (&optional locale-name frame)
[...]
>>>>>> (prefer-coding-system coding-system)
[...]
> Even the doc string says that the coding system priority is set
> according to the locale's native encoding.

Indeed, thanks for spotting it.  Can someone change this code so it
doesn't move utf-8 from first to second place?


        Stefan

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> emacsbugs.donarmstrong.com. (Wed, 01 Apr 2009 14:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 16 years and 138 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.