#2497 - 23.0.91; Fails to read UTF-8 on Win2k

GNU bug report logs - #2497
23.0.91; Fails to read UTF-8 on Win2k

Package: emacs

Reported by: uwe.siart <at> tum.de

Date: Fri, 27 Feb 2009 14:20:02 UTC

Severity: normal

Merged with 2354

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

Message #188 received at 2497 <at> emacsbugs.donarmstrong.com:

From: Kenichi Handa <handa <at> m17n.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: monnier <at> iro.umontreal.ca, 2497 <at> debbugs.gnu.org, uwe.siart <at> tum.de
Subject: Re: bug#2497: 23.0.91; Fails to read UTF-8 on Win2k
Date: Mon, 02 Mar 2009 20:43:58 +0900

In article <uab86q1ih.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> M-: (coding-system-priority-list) RET
>>> (iso-latin-1 utf-8 iso-2022-7bit iso-2022-7bit-lock
>>> iso-2022-8bit-ss2 emacs-mule raw-text iso-2022-jp
>>> in-is13194-devanagari chinese-iso-8bit utf-8-auto
>>> utf-8-with-signature utf-16 utf-16be-with-signature
>>> utf-16le-with-signature utf-16be utf-16le japanese-shift-jis
>>> undecided)
> So UTF-8 is indeed ``pretty high'', but lower than the locale's
> default.
>
> So this still looks like a real bug.
> Perhaps it is, but I didn't know Emacs 23 can reliably distinguish
> between Latin-1 and UTF-8, even when UTF-8 sequences are present in
> the text. Can we do that reliably? Perhaps Handa-san can shed some
> light on this.

The coding system iso-latin-1 is for the character set iso-8859-1, and
the code-space of iso-8859-1 is 0x00..0xFF (without gap, i.e. including
0x80..0x9F) (see /usr/share/i18n/charmaps/ISO-8859-1.gz). So, if we
follow it strictly, any byte sequence can be a correct iso-8859-1
stream, which means that when iso-latin-1 has the highest priority,
all files are detected as iso-latin-1.

So, as long as we strictly follow the definition of iso-8859-1...

In article <jwv7i3az0fc.fsf-monnier+emacsbugreports <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> That seems to be the source of the problem. utf-8 should always come
> before latin-1 in that list, since utf-8 streams that are valid latin-1
> streams are not uncommon, whereas latin-1 streams that are valid utf-8
> streams are extremely rare.

I think that is the only solution.

In article <87ab86ah9z.fsf <at> tum.de>, Uwe Siart <uwe.siart <at> tum.de> writes:

> Assumed this is not possible right now we should distinguish between
> »high reliability« and »poor reliability«. From my perception it has
> been much more reliable earlier so (as a user with limited viewpoint)
> I vote for reverting the change.

In Emacs 22, the coding system iso-latin-1 was defined as a variant of
an iso-2022-based coding system, and thus 0x80..0x9F were not valid
bytes (except for 0x91 etc. in latin-extra-code-table). So, some UTF-8
texts were not detected as iso-latin-1. To recover that behaviour, we
can define iso-latin-1 as before by doing this:

    (define-coding-system 'iso-latin-1
      "Emacs 22 iso-latin-1."
      :mnemonic ?1
      :coding-type 'iso-2022
      :charset-list '(ascii latin-iso8859-1)
      :ascii-compatible-p t
      :mime-charset 'iso-8859-1
      :designation [ascii latin-iso8859-1 nil nil])

But, even with that, some valid UTF-8 texts will still be detected as
iso-latin-1. So I don't think this is a "high reliability" solution.

---
Kenichi Handa
handa <at> m17n.org
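
For illustration only, the reordering Stefan suggests in the message above could be tried from an init file along these lines. This is a minimal sketch, not the change that was installed for this bug. It assumes that `set-coding-system-priority` and `coding-system-priority-list` behave as documented in Emacs 23, and that putting utf-8 first suits the local environment.

```elisp
;; Sketch only: raise utf-8 above iso-latin-1 in the priority list used
;; for automatic coding-system detection, as suggested in the thread.
;; Assumption: this ordering is acceptable for the user's locale.
(set-coding-system-priority 'utf-8 'iso-latin-1)

;; Inspect the result (e.g. with M-: or C-x C-e); utf-8 should now be
;; listed before iso-latin-1.
(coding-system-priority-list)
```

`(prefer-coding-system 'utf-8)` also moves utf-8 to the front of the list, but it additionally changes the defaults for new buffers, file names and subprocess I/O, which is why the narrower call is used in this sketch.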

This bug report was last modified 16 years and 87 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
