Package: emacs
Reported by: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Mon, 4 May 2015 01:15:03 UTC
Severity: wishlist
Tags: patch
Merged with 16082
Done: Lars Ingebrigtsen <larsi <at> gnus.org>
Bug is archived. No further changes may be made.
Message #108 received at 20499 <at> debbugs.gnu.org (full text, mbox):
From: Ivan Shmakov <ivan <at> siamics.net>
To: 20499 <at> debbugs.gnu.org
Subject: Re: bug#20499: C-x 8 shorthands for curved quotes, Euro, etc.
Date: Thu, 07 May 2015 10:00:38 +0000

>>>>> Paul Eggert <eggert <at> cs.ucla.edu> writes:

[…]

>> … Also, did you consider generating this list automatically, based
>> on the codepoint properties already known to Emacs?  Something along
>> the lines of the function MIMEd, which readily produces a list of
>> entries for the following 133 characters.  (Three spaces added for
>> symmetry purposes.)

>> À Á Â Ã Ä È É Ê Ë Ì Í Î Ï Ñ Ò Ó Ô Õ Ö Ù Ú Û Ü Ý
>> à á â ã ä è é ê ë ì í î ï ñ ò ó ô õ ö ù ú û ü ý
>> ÿ Ā ā Ć ć Ĉ ĉ Č č Ď ď Ē ē Ě ě Ĝ ĝ Ĥ ĥ Ĩ ĩ Ī ī Ĵ ĵ Ĺ ĺ
>> Ľ ľ Ń ń Ň ň Ō ō Ŕ ŕ Ř ř Ś ś Ŝ ŝ Š š Ť ť Ũ ũ Ū ū Ŵ ŵ Ŷ ŷ
>> Ÿ Ź ź Ž ž Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǧ ǧ Ǩ ǩ ǰ Ǵ ǵ Ǹ ǹ Ș ș Ț ț
>> Ȟ ȟ Ȳ ȳ

> Sorry, I don't really follow the code that you attached.

    Which part, specifically?  It just iterates over the range given
    (or U+00A8 through U+02AF by default) and maps “LATIN + COMBINING”
    decompositions to 'iso-transl entries.  For example, it maps the
    (?g #x327) decomposition (U+0327 being COMBINING CEDILLA) for
    U+0123 into an (",g" . ģ) entry.

    Or, rather, it /should/, for my code has an obvious typo:

           (`(,c #x30c) (string ?v c))
           (`(,c #x326) (string 59 c))
    -      (`(,c #x326) (string ?, c)))))
    +      (`(,c #x327) (string ?, c)))))

    Other possible additions (assuming we’ll agree on C-x 8 u,
    C-x 8 .) are:

           (`(,c #x304) (string ?= c))
    +      (`(,c #x306) (string ?u c))
    +      (`(,c #x307) (string ?. c))
           (`(,c #x308) (string 34 c))
    +      (`(,c #x30b) (string ?2 c))
           (`(,c #x30c) (string ?v c))

> Although I suppose it comes from a decomposition table, I don't know
> what the table was designed for, and it's not clear to me how it's
> relevant.

    I hope someone more knowledgeable could comment on this.  Still,
    this (ab)use of the data seems to work well in practice.

> Anyway, most of those letters are either in iso-transl.el now,

    The point is to /remove/ them from 'iso-transl, as these entries
    duplicate, in a way, a part of the decomposition table already
    present in Emacs.

[…]

>> Ǎ ǎ Ǐ ǐ Ǒ ǒ Ǔ ǔ Ǹ ǹ

> These are for toned Pinyin but this list is incomplete.  If we wanted
> to cover toned Pinyin, we'd also need Ǖ ǖ Ǘ ǘ Ǚ ǚ Ǜ ǜ.  Coming up
> with two-character abbreviations for all these might be tricky.

    But are we actually limited to two-character abbreviations only?
    Why not allow for, say, C-x 8 " ' u?

[…]

>> ǰ

> What language uses this?  I couldn't find one.

    To quote NamesList.txt:

    01F0    LATIN SMALL LETTER J WITH CARON
            * IPA and many languages

>> Ǵ ǵ

> Good catch.  These are used for transliteration from Serbian and
> Macedonian.  We should also include Ḱ ḱ as they are also needed.
> Included in the attached patch.

    The code I’ve suggested could be used to scan the U+1Exx range
    just as well, thus resulting in the following set.

    Ḑ ḑ Ḡ ḡ Ḧ ḧ Ḩ ḩ Ḱ ḱ Ḿ ḿ Ṕ ṕ Ṽ ṽ Ẁ ẁ Ẃ ẃ Ẅ ẅ Ẍ ẍ Ẑ ẑ ẗ Ẽ ẽ Ỳ ỳ Ỹ ỹ

[…]

> Anyway, part of what's going on here is that the proposed list
> doesn't cover every Latin character in the ISO 10646 repertoire
> (that'd be a large set), but instead is limited to what appear to be
> reasonably common letters.  Admittedly this is not universal but
> one must cut things off somewhere, and it would be odd to add only
> partial coverage for toned Pinyin, Livonian, etc.

    When it comes to the LATIN … LETTER WITH … letters, my proposal
    for such a cutoff would be to satisfy /both/ of the following
    criteria:

    • only cover specific Unicode ranges; such as, for instance,
      U+00A8 through U+02AF, U+1E00 … U+1EFF, perhaps U+2C60 … U+2C7F;

    • only cover the letters which can be represented with a
      sufficiently general C-x 8 ⟨diacritic⟩+ ⟨ASCII-latin⟩ pattern.

    Other characters deemed common may be added to the list.

>>> --------------090904020002020306060104
>>> Content-Type: text/x-patch;
>>> name="0001-C-x-8-shorthands-for-curved-quotes-Euro-etc.patch"

>> This MIME part sure wants ‘; charset=UTF-8’.  Otherwise, Gnus does
>> no decoding, and Emacs shows the contents with the likes of
>> \304\260.

> Hmm, it works for me.  I use Thunderbird to read the top level
> message, and it spins off an Emacs to display the attachment with no
> problem.

    I can “spin off” cat(1) to read the offending MIME part, too:
    Emacs will feed it raw-text, and interpret the result as UTF-8
    (the default.)  It still does /not/ comply with the MIME
    specification.  Consider section 4.1.2 of RFC 2046:

RFC> […] The default character set, which must be assumed in the
RFC> absence of a charset parameter, is US-ASCII.

    RFC 6657 updates this as follows:

RFC> Each subtype of the "text" media type that uses the "charset"
RFC> parameter can define its own default value for the "charset"
RFC> parameter, including the absence of any default.

    However, given that ‘text/x-patch’ is not a /registered/ MIME
    type, I believe the above does not apply.

> The web-site archive at <http://bugs.gnu.org/20499#60> also works for
> me with Firefox.

> It's common for people to send the output of "git send-email" as
> attachments;

    If Thunderbird /knows/ the encoding (“character set”) of the
    contents of the MIME part, it /should/ specify it in the MIME
    part header.  If the said contents are strictly 7-bit, it /could/
    omit that (given that it’s more than likely to be US-ASCII.)
    Otherwise, I guess Thunderbird should either ask the user for the
    encoding /or/ send the part as application/octet-stream.

[…]

-- 
FSF associate member #7257  np. Satellite one — Purple Motion  B6A0 230E 334A
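
    For readers who do not have the attached function at hand, the
    idea described in the message (walk a code point range, look at
    each character's decomposition, and turn “ASCII letter +
    combining mark” pairs into C-x 8 style entries) can be
    illustrated roughly as follows.  This is a Python sketch and
    /not/ the Emacs Lisp function attached to the message; the names
    PREFIX_BY_MARK and transl_entries are made up for the
    illustration, and the table holds only the clauses quoted above
    (with the U+0327 fix and the proposed additions applied), so it
    will not reproduce the full 133-character list.

```python
import unicodedata

# Combining mark -> prefix character, copied from the pcase clauses
# quoted in the message.  Assumption: the real function has further
# clauses (acute, grave, ...) that are not visible in the message.
PREFIX_BY_MARK = {
    0x0304: "=",   # COMBINING MACRON
    0x0306: "u",   # COMBINING BREVE (proposed addition)
    0x0307: ".",   # COMBINING DOT ABOVE (proposed addition)
    0x0308: '"',   # COMBINING DIAERESIS (character code 34)
    0x030B: "2",   # COMBINING DOUBLE ACUTE ACCENT (proposed addition)
    0x030C: "v",   # COMBINING CARON
    0x0326: ";",   # COMBINING COMMA BELOW (character code 59)
    0x0327: ",",   # COMBINING CEDILLA (the corrected clause)
}


def transl_entries(first=0x00A8, last=0x02AF):
    """Return (key-sequence, character) pairs for the given range."""
    entries = []
    for cp in range(first, last + 1):
        fields = unicodedata.decomposition(chr(cp)).split()
        # Keep canonical two-part decompositions only; compatibility
        # decompositions start with a "<tag>" field.
        if len(fields) != 2 or fields[0].startswith("<"):
            continue
        base, mark = (int(field, 16) for field in fields)
        prefix = PREFIX_BY_MARK.get(mark)
        letter = chr(base)
        # Second criterion from the message: the base must be an
        # ASCII Latin letter, so ǖ (ü + macron) is skipped.
        if prefix is None or not (letter.isascii() and letter.isalpha()):
            continue
        entries.append((prefix + letter, chr(cp)))
    return entries


if __name__ == "__main__":
    for keys, char in transl_entries():
        print(f"C-x 8 {keys[0]} {keys[1]}  ->  {char}")
```

    Calling transl_entries(0x1E00, 0x1EFF) scans the U+1Exx range in
    the same way.  The check on the base letter is what keeps the
    toned Pinyin letters such as ǖ out of the result: their
    decomposition starts with ü, not with an ASCII letter, so they
    would need a longer sequence such as the C-x 8 " ' u form
    suggested above.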