GNU bug report logs - #34862: 27.0.50; Trying to update pinyin.map
> From: Eric Abrahamsen <eric <at> ericabrahamsen.net>
> Date: Fri, 15 Mar 2019 11:31:40 -0700
>
> > That file is imported from an external source, isn't it? Are you
> > saying we should stop synchronizing it with that source, and instead
> > fork it, maintain our own separate copy, and never resync with that
> > source again? If so, then I see no reason not to recode it in UTF-8.
>
> Near as I can tell that file was imported into Emacs in 2001 and not
> touched since (apart from copyright and encoding stuff). The Debian
> package from which it comes seems to have been orphaned in 2003[1]. So
> there's not much to either synchronize or fork!
OK, sounds reasonable.
> > Btw, I understand that the Google pinyin method is Apache licensed,
> > but does this mean we can freely use its data for updating pinyin.map?
> > IANAL. Could you perhaps describe how you intend to extract the data
> > from the Google input method for the purpose of updating our file? I
> > think someone will have to audit that process for being legal and
> > compatible with both the Apache license and the GPL.
>
> This[2] is the source file I used. I chopped off all the
> multiple-character dictionary entries, and munged the remaining data
> into the format we need. I.e., lines like this:
>
> 八 6677.54934466 0 ba
> 把 165484.231697 0 ba
> 吧 385205.434615 0 ba
>
> Became this:
>
> ba 吧把八
>
> A straight rearrangement, with frequency of use translated into simple
> ordering of the characters. While this is obviously pretty manual, and a
> bit of work, a file like this really only needs to be updated every five
> years or so -- if that. Whenever someone thinks of it.
I think this should be done with a script, and that script should be
in our repository. The easiest kind of a script is a Lisp program, of
course, but we can also use other kinds, such as Awk scripts.
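For illustration only, here is a minimal sketch of the rearrangement described above, written in Python rather than Lisp or Awk. It is not the script used for this bug report. It assumes the raw dictionary has already been decoded to text, and that each line has the four whitespace-separated columns seen in the sample (character, frequency, a flag, pinyin); that layout is inferred from the three quoted lines, not from any documentation. The function name build_pinyin_map is hypothetical.

```python
from collections import defaultdict


def build_pinyin_map(lines):
    """Group single-character entries by pinyin, most frequent first.

    Assumed input layout (inferred from the sample lines in this thread):
        <character> <frequency> <flag> <pinyin>
    """
    by_pinyin = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Keep only lines with exactly four columns; blank lines and
        # entries with more than one pinyin syllable are skipped.
        if len(fields) != 4:
            continue
        char, freq, _flag, pinyin = fields
        # Drop multiple-character dictionary entries.
        if len(char) != 1:
            continue
        by_pinyin[pinyin].append((float(freq), char))

    result = []
    for pinyin in sorted(by_pinyin):
        # Higher frequency sorts first; the numbers themselves are discarded.
        entries = sorted(by_pinyin[pinyin], key=lambda entry: -entry[0])
        result.append(pinyin + " " + "".join(char for _freq, char in entries))
    return result


if __name__ == "__main__":
    sample = [
        "八 6677.54934466 0 ba",
        "把 165484.231697 0 ba",
        "吧 385205.434615 0 ba",
    ]
    print("\n".join(build_pinyin_map(sample)))
```

Run on the three sample lines, this prints "ba 吧把八", matching the example above. Any header lines or other conventions that pinyin.map requires are outside the scope of this sketch.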
> Regarding the license, I'm even less of a lawyer than you, but these[3]
> are the terms that cover this data.
Richard, could you please look at that license and tell if we can use
this data file?
> > (Also, I'm somewhat surprised that gbk isn't capable of covering the
> > characters you want to add. Or did you not try using it?)
>
> I did not try using it! Mostly because the error message suggested
> gb18030 first. gbk also works. I don't have any opinion about encoding,
> apart from assuming utf8 unless there's a good reason not to.
I see no good reason to use anything other than UTF-8.
> [1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=189523;msg=18
>
> [2] https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/jni/data/rawdict_utf16_65105_freq.txt
>
> [3] https://android.googlesource.com/platform/packages/inputmethods/PinyinIME/+/refs/heads/master/NOTICE
Thanks.
This bug report was last modified 123 days ago.