GNU bug report logs - #20789
auto-generate more Unicode data from sources


Package: emacs;

Reported by: Glenn Morris <rgm <at> gnu.org>

Date: Thu, 11 Jun 2015 22:06:02 UTC

Severity: wishlist

Found in version 25.0.50


From: Eli Zaretskii <eliz <at> gnu.org>
To: Glenn Morris <rgm <at> gnu.org>
Cc: 20789 <at> debbugs.gnu.org
Subject: bug#20789: Invalid script or charset name: cuneiform-numbers-and-punctuation
Date: Tue, 16 Jun 2015 17:41:36 +0300
> From: Glenn Morris <rgm <at> gnu.org>
> Cc: 20789 <at> debbugs.gnu.org
> Date: Mon, 15 Jun 2015 20:22:07 -0400
> 
> Eli Zaretskii wrote:
> 
> >> I don't suppose that big list can be auto-generated from the inputs?
> >
> > It's not trivial.  I describe below some of the issues, in the hope
> > that Someone™ will volunteer:
> 
> Thanks. Script that processes Blocks.txt attached. Some questions:
> 
> 1. In Blocks.txt:
> 
>   FF00..FFEF; Halfwidth and Fullwidth Forms
> 
> In Emacs:
> 
>   (#xFF00 #xFF5F cjk-misc)
>   (#xFF61 #xFF9F kana)
>   (#xFFE0 #xFFEF cjk-misc)
> 
> Is ff60 (FULLWIDTH RIGHT WHITE PARENTHESIS) intentionally omitted?

AFAICT, there's a small mess around there.  Based on the names of the
pertinent characters, I think we should have this instead of the above
3 ranges:

  (#xFF00 #xFF60 cjk-misc)
  (#xFF61 #xFF9F kana)
  (#xFFA0 #xFFDF hangul)
  (#xFFE0 #xFFEF cjk-misc)
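
For illustration only, here is a minimal Python sketch of how the coverage
of a block by a list of ranges could be checked.  The helper name
`find_gaps` and the tuple layout are assumptions made for this example;
this is not the attached script and not how Emacs itself validates the
table.

```python
def find_gaps(block_start, block_end, ranges):
    """Return the inclusive sub-ranges of the block that no range covers.

    `ranges` is a list of (start, end, script) tuples with inclusive ends.
    """
    gaps = []
    cursor = block_start  # first code point not yet known to be covered
    for start, end, _script in sorted(ranges):
        if start > cursor:
            gaps.append((cursor, min(start - 1, block_end)))
        cursor = max(cursor, end + 1)
    if cursor <= block_end:
        gaps.append((cursor, block_end))
    return gaps


def fmt(gaps):
    """Format gaps in the START..END style used by Blocks.txt."""
    return [f"{a:04X}..{b:04X}" for a, b in gaps]


# The three ranges currently in Emacs, as quoted above.
OLD = [
    (0xFF00, 0xFF5F, "cjk-misc"),
    (0xFF61, 0xFF9F, "kana"),
    (0xFFE0, 0xFFEF, "cjk-misc"),
]

# The four ranges proposed above.
NEW = [
    (0xFF00, 0xFF60, "cjk-misc"),
    (0xFF61, 0xFF9F, "kana"),
    (0xFFA0, 0xFFDF, "hangul"),
    (0xFFE0, 0xFFEF, "cjk-misc"),
]

print(fmt(find_gaps(0xFF00, 0xFFEF, OLD)))  # ['FF60..FF60', 'FFA0..FFDF']
print(fmt(find_gaps(0xFF00, 0xFFEF, NEW)))  # []
```

With the old ranges the sketch reports FF60 and FFA0..FFDF as uncovered;
with the proposed ranges the whole FF00..FFEF block is covered.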

> 2. In Emacs "olt-italic" looks like a typo ("old-italic"). Can it be renamed?

Yes, please.

> 3. In Blocks.txt, Anatolian Hieroglyphs ends at 1467F.
> In Emacs, it ends at 1457F. Typo?

Yes.

> 4. In Blocks.txt:
> 
>   20000..2A6DF; CJK Unified Ideographs Extension B
>   2A700..2B73F; CJK Unified Ideographs Extension C
>   2B740..2B81F; CJK Unified Ideographs Extension D
>   2B820..2CEAF; CJK Unified Ideographs Extension E
>   2F800..2FA1F; CJK Compatibility Ideographs Supplement
> 
> In Emacs:
> 
>   (#x20000 #x2CEAF han)
>   (#x2F800 #x2FFFF han)
> 
> Emacs adds the ranges 2a6e0:2a6ff and 2fa20:2ffff, which Blocks.txt does
> not cover. Intentional?

I don't know, but probably not intentional.  I think we had better
make it consistent with the UCD.
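
As a sanity check of the two extra ranges mentioned in the question, here
is a small Python sketch that subtracts the Blocks.txt ranges from the
Emacs ranges.  The function name `subtract` is made up for this example,
and the data is just the lines quoted above, not the full files.

```python
def subtract(ranges, covered):
    """Return the parts of `ranges` that `covered` does not include.

    Both arguments are lists of inclusive (start, end) pairs.
    """
    extra = []
    for start, end in ranges:
        cursor = start  # first code point of this range not yet accounted for
        for c_start, c_end in sorted(covered):
            if c_end < cursor or c_start > end:
                continue  # no overlap with the remaining part of this range
            if c_start > cursor:
                extra.append((cursor, c_start - 1))
            cursor = max(cursor, c_end + 1)
        if cursor <= end:
            extra.append((cursor, end))
    return extra


# Ranges from Blocks.txt, as quoted above.
BLOCKS = [
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
    (0x2A700, 0x2B73F),  # CJK Unified Ideographs Extension C
    (0x2B740, 0x2B81F),  # CJK Unified Ideographs Extension D
    (0x2B820, 0x2CEAF),  # CJK Unified Ideographs Extension E
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

# The "han" ranges currently in Emacs, as quoted above.
EMACS_HAN = [(0x20000, 0x2CEAF), (0x2F800, 0x2FFFF)]

for a, b in subtract(EMACS_HAN, BLOCKS):
    print(f"{a:X}..{b:X}")
```

This prints 2A6E0..2A6FF and 2FA20..2FFFF, the same two ranges named in
the question.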

> 5. Newly added "sutton-sign-writing" - should be "sutton-signwriting"?
> (The case-insensitive source says "Sutton SignWriting".)

Well, "signwriting" is not a word, AFAIK, it's 2 words (and the funny
camel-case seems to agree with me).  AFAIU, they used "SignWriting"
because it's the commercial name.  But if you insist, I won't...
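
To make the two candidate spellings concrete, here is a hedged Python
sketch of deriving a symbol name from a Blocks.txt block name.  The
function `script_name` and its `split_camel_case` flag are hypothetical;
I am not claiming the attached script works this way.

```python
import re


def script_name(block_name, split_camel_case=False):
    """Turn a Blocks.txt block name into a lower-case, hyphenated symbol name.

    With split_camel_case=True, a hyphen is also inserted wherever a
    lower-case letter is immediately followed by an upper-case one.
    """
    name = block_name.strip()
    if split_camel_case:
        name = re.sub(r"(?<=[a-z])(?=[A-Z])", "-", name)
    return re.sub(r"\s+", "-", name).lower()


print(script_name("Sutton SignWriting"))                         # sutton-signwriting
print(script_name("Sutton SignWriting", split_camel_case=True))  # sutton-sign-writing
```

A purely mechanical lower-casing yields "sutton-signwriting"; getting
"sutton-sign-writing" needs either the camel-case split or a manual
override.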

Thank you for doing this.

P.S. Does the script work with mawk?  (Some systems have it as their
default Awk, I think.)




This bug report was last modified 9 years and 356 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.