#20789 - auto-generate more Unicode data from sources

GNU bug report logs - #20789
auto-generate more Unicode data from sources

Package: emacs;

Reported by: Glenn Morris <rgm <at> gnu.org>

Date: Thu, 11 Jun 2015 22:06:02 UTC

Severity: wishlist

Found in version 25.0.50

Message #29 received at 20789 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Glenn Morris <rgm <at> gnu.org>, Kenichi Handa <handa <at> gnu.org>
Cc: 20789 <at> debbugs.gnu.org
Subject: Re: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Sun, 21 Jun 2015 18:00:20 +0300

> From: Glenn Morris <rgm <at> gnu.org>
> Cc: 20789 <at> debbugs.gnu.org
> Date: Sat, 20 Jun 2015 19:34:01 -0400
>
> I spent some time looking at some of these.
> In no case could I see a clear path from the inputs to the outputs.

Thanks for looking into this.

Let me first make a general comment: we can always convert only
certain parts of the setup to an automated procedure, and leave the
rest in its present form, more or less.  That's especially true where
Emacs has specialized needs or defines properties not in Unicode.

> >  . characters.el:
> >
> >    . The modify-category-entry calls -- they basically can be derived
> >      from Blocks.txt
>
> I looked at it briefly. I can see that they are somewhat related, but
> not precisely how. Eg:
>
> Emacs: 2E80:312F and 3190:33FF are "line breakable".
> Which means that "Hangul Compatibility Jamo" isn't. I have no idea why.
>
> Emacs: 3400:4DBF and 4E00:9FAF are "2-byte han".
> Which means that "Yijing Hexagram Symbols" aren't. Again, I have no idea why.
>
> I didn't look any further.

When I said "derived from Blocks.txt", I meant the categories that are
related to script names, like ASCII, Latin, Arabic, Chinese, etc.
Sorry for not saying that explicitly.

Other categories need other sources.  Here's my attempt to decipher
some of them:

  . ?| -- "line breakable"
    The data seems to be in LineBreak.txt, described in detail in
    UAX#14 (http://unicode.org/reports/tr14/).  It looks like
    characters with the ?| category are those whose line-break
    properties are ID or CJ or NS.  Therefore, the exclusion of Hangul
    Compatibility Jamo is a mistake (or maybe an omission, since the
    comment says "Chinese"); in particular, UAX#14 explicitly says, in
    section 5.1 under "ID", that the characters in the range
    3130..318F are treated as class ID.

    This category is currently used only by kinsoku.el, which has its
    own data (and sets the ?< and ?> categories).
    So this will only become important if we ever implement in Emacs
    something more general, like the algorithm described in UAX#14.

  . "2-byte han" -- I think this is related to their legacy encoding;
    I don't see this used anywhere.  Likewise with other 2-byte
    categories.  Perhaps Handa-san (CC'ed) could comment on their
    necessity.  If this is still needed, we should probably leave
    these alone.

  . ?0 - ?9 -- I don't see how to get this data from the UCD or any
    other source.  Some of it seems to be in
    IndicSyllabicCategory.txt, FWIW.

  . ?R and ?L -- already set up using the Unicode data, so no change
    is needed.

  . ?^ -- should be set for any character whose general-category is
    Mn.  Since we already do this, the manual setting around line 820
    is redundant and should be deleted.

  . ?. -- already set using Unicode data, no change needed.

> >    . The setup of char-width-table -- I think the information is in
> >      EastAsianWidth.txt, with background information described in
> >      UAX#11 (http://www.unicode.org/reports/tr11/)
>
> Looks somewhat promising, but could you be more specific?
> There's nothing in that file that defines "zero width" characters, so I
> don't see where Emacs's width 0 characters come from.

The following rules regarding zero-width characters are due to Markus
Kuhn, and are excerpted from the description in comments to his
implementation of 'wcwidth'
(http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c):

  . The null character (U+0000) has a column width of 0.

  . Non-spacing and enclosing combining characters (general category
    code Mn or Me in the Unicode database) have a column width of 0.

  . ZERO WIDTH SPACE (U+200B) and format characters (general category
    code Cf in the Unicode database), except SOFT HYPHEN (U+00AD),
    have a column width of 0.

  . Hangul Jamo medial vowels and final consonants (U+1160-U+11FF)
    have a column width of 0.

> The width 2 characters look like they might be the "W" and "F" characters,

Yes.
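
The zero-width rules quoted above, together with the "W"/"F" rule for
width 2, can be sketched roughly as follows.  This is only an
illustration in Python, assuming the Unicode tables bundled with the
interpreter's unicodedata module rather than the actual UCD files;
column_width is a hypothetical name, and neither control characters
nor unassigned CJK codepoints get any special treatment here.

```python
import unicodedata


def column_width(ch: str) -> int:
    """Approximate display width of one character (hypothetical helper).

    Follows the zero-width rules attributed to Markus Kuhn's wcwidth,
    then treats East Asian Wide/Fullwidth characters as width 2.
    """
    cp = ord(ch)

    # The null character has a column width of 0.
    if cp == 0:
        return 0

    category = unicodedata.category(ch)

    # Non-spacing and enclosing combining characters.
    if category in ("Mn", "Me"):
        return 0

    # ZERO WIDTH SPACE and format characters, except SOFT HYPHEN.
    if cp == 0x200B or (category == "Cf" and cp != 0x00AD):
        return 0

    # Hangul Jamo medial vowels and final consonants.
    if 0x1160 <= cp <= 0x11FF:
        return 0

    # East Asian Wide ("W") and Fullwidth ("F") characters.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2

    # Assumption: everything else occupies a single column.
    return 1
```

The order of the checks matters: the Hangul Jamo range is tested
before the East Asian width lookup so that the zero-width rule wins
regardless of what the width property says for those codepoints.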

> but just doing that gives a list that has many differences to the list
> Emacs uses.

I don't see any significant differences, except perhaps in unassigned
codepoints (see paragraph 6.1 of UAX#11 for the treatment of
unassigned CJK codepoints).  I think any differences beyond that
should be treated as errors in Emacs in this case.

> >    . The setup of char-acronym-table: at least some of the data is in
> >      NameAliases.txt and NameList.txt
>
> Looks somewhat promising.
> I can see how most of this comes from NameAliases.txt.
> But there are many oddities:
>
> Why does Emacs not have anything for 0009 (HT or TAB) or 000A (LF, NL,
> or EOF)?

This table is set for the 'acronym' method of glyphless-char-display,
so I guess these omissions are for characters for which no one
envisioned them to be ever displayed as glyphless.  I'd include them
in the table anyway, just in case, and also to keep our exceptions vs
the UCD to the bare minimum.

> 0019 is EOM in the source but EM in Emacs.

Typo, I think.

> 0080 is PAD in the source but XXX in Emacs.
> 0081 is HOP in the source but XXX in Emacs.
> 008F is SS3 in the source but SS1 in Emacs.
> 0099 is SGC in the source but XXX in Emacs.

I think these are typos and perhaps acronyms that whoever wrote this
didn't know.

> How does Emacs choose which entries to list? There are many more in the
> source. Could it do any harm to add more?

As long as you take only "abbreviations", i.e. they are short, I think
we should use all of them, given their use in Emacs.

> Where does "KIVAQ" come from? That appears nowhere in the source AFAICS.

AFAIK, that's the official name of that character.  At least that's
what I glean from Google; I know nothing about the Khmer script.

> Why does Emacs list two Khmer entries, and nothing else? There are loads
> more of them.

These are the only 2 that have such abbreviations; see
https://en.wikipedia.org/wiki/Khmer_alphabet (assuming by "loads more"
you meant the Khmer letters).

> >  . fontset.el:
> >
> >    . The setup of script-representative-chars
>
> I don't see how. It seems to be "for some of, but not all, the entries
> in char-script-table, choose a single character somewhere in the range."

We should have a representative character for each entry in
char-script-table.  They are used with some font back-ends (xfont and
x?ftfont, AFAIR) to probe candidate fonts for coverage of the required
script, so we should have the full information about that.  I think
the only reason for the partial information we have now is that it is
maintained manually, so it includes whatever the people who worked on
that bothered to add.

> There seems to be no pattern to how the character is chosen within the
> range. Often the first one, but by no means always.

I think the rule is to choose the first one that is a letter, i.e. its
general-category is either one of Lu, Ll, Lt, Lo, or Lm.

> >  . mule-cmds.el:
> >
> >    . The setting of locale-language-names -- the data is available in
> >      IANA's Language Subtag Registry
> >
> >      (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry)
> >      and in ISO 639-2 (http://www.loc.gov/standards/iso639-2/,
> >      http://www.loc.gov/standards/iso639-2/php/English_list.php)
>
> Again, I don't see how. Eg nowhere in those source files do I see Welsh
> associated with iso-8859-14, and the comment in mule-cmds says that the
> last part is "implementation dependent".

The bulk of the data is the correspondence between the ISO 639
2-letter names and the country/culture name.  The few cases where we
also have the encoding could be set up with a very small database once
the main data is set, by adding the encoding to those few that need
it.  If by "last part" you mean IPA and "Nonstandard or obsolete
language codes", then these are very few and can be added manually.

> > P.S. It would be good to add to somewhere (admin/make-tarball.txt?) a
> > reminder to fetch all those reference files and regenerate their
> > dependencies, before we prepare a release.
>
> admin/FOR-RELEASE contains that kind of thing.

Right, I will add the information there.

Thanks.
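
The rule suggested above for script-representative-chars (take the
first character in a script's range whose general-category is Lu, Ll,
Lt, Lo, or Lm) could be sketched like this.  It is only an
illustration in Python, assuming the Unicode tables bundled with the
interpreter's unicodedata module; first_letter_in_range is a
hypothetical helper, and the ranges mentioned below are Unicode block
ranges, not necessarily the ranges recorded in char-script-table.

```python
import unicodedata

# General categories that count as "a letter" under the suggested rule.
LETTER_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lo", "Lm"})


def first_letter_in_range(start: int, end: int):
    """Return the first codepoint in [start, end] that is a letter.

    Returns None when the range contains no letter at all, in which
    case a representative character would have to be picked by hand.
    """
    for cp in range(start, end + 1):
        if unicodedata.category(chr(cp)) in LETTER_CATEGORIES:
            return cp
    return None
```

For the Hebrew block, first_letter_in_range(0x0590, 0x05FF) skips the
points and punctuation at the start of the block and yields U+05D0
(HEBREW LETTER ALEF).  A range made up entirely of combining marks,
such as 0300..036F, yields None.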

This bug report was last modified 10 years and 86 days ago.

