GNU bug report logs - #34862
27.0.50; Trying to update pinyin.map


Package: emacs;

Reported by: Eric Abrahamsen <eric <at> ericabrahamsen.net>

Date: Thu, 14 Mar 2019 21:52:01 UTC

Severity: wishlist

Found in version 27.0.50



Message #23 received at 34862 <at> debbugs.gnu.org (full text, mbox):

From: Eric Abrahamsen <eric <at> ericabrahamsen.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 34862 <at> debbugs.gnu.org, Richard Stallman <rms <at> gnu.org>
Subject: Re: bug#34862: 27.0.50; Trying to update pinyin.map
Date: Wed, 20 Mar 2019 12:30:22 -0700

On 03/20/19 11:45 AM, Eli Zaretskii wrote:

[...]

>> > Btw, I understand that the Google pinyin method is Apache licensed,
>> > but does this mean we can freely use its data for updating pinyin.map?
>> > IANAL. Could you perhaps describe how you intend to extract the data
>> > from the Google input method for the purpose of updating our file? I
>> > think someone will have to audit that process for being legal and
>> > compatible with both the Apache license and the GPL.
>> 
>> This[2] is the source file I used. I chopped off all the
>> multiple-character dictionary entries, and munged the remaining data
>> into the format we need. I.e., lines like this:
>> 
>> 八 6677.54934466 0 ba
>> 把 165484.231697 0 ba
>> 吧 385205.434615 0 ba
>> 
>> Became this:
>> 
>> ba 吧把八
>> 
>> A straight rearrangement, with frequency of use translated into simple
>> ordering of the characters. While this is obviously pretty manual, and a
>> bit of work, a file like this really only needs to be updated every five
>> years or so -- if that. Whenever someone thinks of it.
>
> I think this should be done with a script, and that script should be
> in our repository.  The easiest kind of a script is a Lisp program, of
> course, but we can also use other kinds, such as Awk scripts.

Awk seems just right for the problem, but I haven't written much in it;
I did the original munging in elisp. Would this be a script written for
use with -batch and a custom make target? Or something to be loaded into
a running Emacs and called interactively? In either case, should it also
be responsible for downloading a recent copy of the source file, or
should that be done first, and the function pointed at the file?
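
A minimal sketch of the rearrangement described above, in Python purely for illustration (the thread itself only considers Elisp or Awk). It assumes each source line holds whitespace-separated fields in the order shown in the sample (character, frequency, a third field that is ignored, pinyin) and that the source file is already on disk. The name build_pinyin_map is hypothetical, and any header lines the real pinyin.map needs are not handled.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Group single characters by pinyin, most frequent character first.

    Assumed input layout per line: WORD FREQUENCY FLAG PINYIN...
    (inferred from the sample lines quoted in the thread).
    """
    by_pinyin = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        word, freq_text, syllables = fields[0], fields[1], fields[3:]
        # Drop multiple-character dictionary entries.
        if len(word) != 1 or len(syllables) != 1:
            continue
        try:
            freq = float(freq_text)
        except ValueError:
            continue  # frequency field is not a number
        by_pinyin[syllables[0]].append((freq, word))

    result = []
    for pinyin in sorted(by_pinyin):
        # Higher frequency sorts earlier; the numbers themselves are discarded.
        entries = sorted(by_pinyin[pinyin], key=lambda entry: -entry[0])
        result.append(pinyin + " " + "".join(ch for _, ch in entries))
    return result


sample = [
    "八 6677.54934466 0 ba",
    "把 165484.231697 0 ba",
    "吧 385205.434615 0 ba",
]
for out_line in build_pinyin_map(sample):
    print(out_line)  # prints: ba 吧把八
```

Taking an iterable of lines rather than a URL keeps downloading a separate step, which is one of the two options raised above; reading from an open file object works the same way.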

>> Regarding the license, I'm even less of a lawyer than you, but these[3]
>> are the terms that cover this data.
>
> Richard, could you please look at that license and tell if we can use
> this data file?
>
>> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
>> > characters you want to add.  Or did you not try using it?)
>> 
>> I did not try using it! Mostly because the error message suggested
>> gb18030 first. gbk also works. I don't have any opinion about encoding,
>> apart from assuming utf8 unless there's a good reason not to.
>
> I see no good reason to use anything other than UTF-8.

Excellent. I will think about the script, and look forward to word from
Richard.

Eric




This bug report was last modified 123 days ago.
