#20140 - 24.4; M17n shaper output rejected

GNU bug report logs - #20140
24.4; M17n shaper output rejected

Package: emacs;

Reported by: Richard Wordingham <richard.wordingham <at> ntlworld.com>

Date: Wed, 18 Mar 2015 22:21:02 UTC

Severity: normal

Tags: moreinfo

Found in version 24.4

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

Message #67 received at 20140 <at> debbugs.gnu.org (full text, mbox):

From: Richard Wordingham <richard.wordingham <at> ntlworld.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20140 <at> debbugs.gnu.org, larsi <at> gnus.org
Subject: Re: bug#20140: 24.4; M17n shaper output rejected
Date: Sun, 13 Feb 2022 20:53:10 +0000

On Sun, 13 Feb 2022 18:04:11 +0200
Eli Zaretskii <eliz <at> gnu.org> wrote:

> > Date: Sat, 5 Feb 2022 22:52:51 +0000
> > From: Richard Wordingham <richard.wordingham <at> ntlworld.com>
> > Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 20140 <at> debbugs.gnu.org
> >
> > You're welcome to include my composition rules.
>
> Thanks. I started with your code:
>
> > (defvar tai-tham-composable-pattern
> >   (let ((table
> >          ;; C is letters, independent vowels, digits, punctuation and symbols.
> >          '(("C" . "[\u1A20-\u1A54\u1A80-\u1A89\u1A90-\u1A99\u1AA0-\u1AAD]")
> >            ("M" . "[\u1A55-\u1A57\u1A59-\u1A5E\u1A61-\u1A7C\u1A7F]"); Mark
> >            ("H" . "\u1A60") ; sakot
> >            ("S" . "[\u1A75-\u1A7C]") ; Marks commuting with sakot
> >            ("N" . "\u1A58"))) ; mai kang lai
> >         (basic_syllable "C\\(N*\\(M\\|HS*C\\)\\)*")
> >         (regexp "X\\(N\\(X\\)?\\)*H?")) ; X is basic syllable
> >     (let ((case-fold-search nil))
> >       (setq regexp (replace-regexp-in-string "X" basic_syllable regexp t t))
> >       (dolist (elt table)
> >         (setq regexp (replace-regexp-in-string (car elt) (cdr elt) regexp t t))))
> >     regexp))
> >
> > (let ((elt (list (vector tai-tham-composable-pattern 0 'font-shape-gstring)
> >                  (vector "." 0 'font-shape-gstring)
> >                  )))
> >   (set-char-table-range composition-function-table '(#x1A20 . #x1AAD) elt))
>
> But that didn't seem to work well enough: e.g., some marks in your
> "sample text" didn't combine with letters, as I think they should.

Which ones? Are you sure they didn't combine at the Emacs level?

I did suspect the problem was writing '\u1A7C' instead of '\u1a7c', but
I'm no longer so sure. (The 'C' might get expanded, but I'm beginning
to think not.)

> Then I tried this simplistic setting:
>
>   (set-char-table-range composition-function-table
>                         '(#x1a20 . #x1aaf)
>                         (list (vector "[\u1a20-\u1aaf]+" 0
>                                       'font-shape-gstring)))
>
> and it worked much better, including passing a small number of the
> tests from your renderer test page that I threw on Emacs. This is on
> MS-Windows with Emacs 29 and HarfBuzz 2.4.0 (which is not even the
> latest release of HarfBuzz), and with the A Tai Tham KH New V3 font.

> Any reason not to use the above simple setup for Tai Tham text
> composition?

Mostly only that you would have to edit the text with "autocomposition
at point disabled" or mark word boundaries, e.g. with U+200B ZERO WIDTH
SPACE. The Tai languages that use Tai Tham use scriptio continua.
While modern Pali does separate words with visible white space, its
words tend to be polysyllabic; with discerning composition, it would be
about as tolerable as editing Hindi in Devanagari with autocomposition
enabled. (Quite a few people edit Devanagari in transliteration to
Latin!)

You should also add CGJ and ZWNJ, and some people may appreciate ZWJ -
the Khottabun font has ligatures involving ZWJ, though it may just be an
experimental feature - and ultimately WJ, for when someone writes a Tai
Tham word breaker.

Oh, and Thai and Lao mai t(r)i and mai chat(t)awa and U+0324 COMBINING
DIAERESIS BELOW turn up occasionally - U+0324 is supported in Thep's
Khottabun font, and my Da Lekh series supports Thai mai tri and mai
chattawa. These characters seem to work with HarfBuzz.

If using the native Windows renderer is an option with Emacs, then 'A
Tai Tham KH New' works better than 'A Tai Tham KH New V3'.

I've created https://wrdingham.co.uk/lanna/font_test.htm to do _font_
comparisons. I'd delayed because I've only recently satisfied myself
that it is lawful, at least under English law. (The qualms were with
the samples taken from books.) It's still very much a work in
progress.

> I needed a couple more additions to Emacs to make Tai Tham support
> work OOTB: for example, script-representative-chars lacked an entry
> for Tai Tham, and the default fontset needed an addition. (And on
> MS-Windows, one needs to run the w32-find-non-USB-fonts magic once, to
> notice the newly installed Tai Tham font.)

> Other than that, assuming the above setting of
> composition-function-table is okay, we are ready to officially add Tai
> Tham to scripts supported by Emacs.

> Btw, is there a way to get all the examples from your
> https://wrdingham.co.uk/lanna/renderer_test.htm as a UTF-8 encoded
> text file? I'd like to test the Emacs rendering with all of the
> examples, but copy-pasting each example separately from the browser is
> not my idea of useful time investment. So if you could provide the
> examples as a downloadable text file, I'd appreciate.

As buried (you're not the only one to have overlooked it) in the
penultimate paragraph of 'Content and Layout' section, "The test words
may, in principle, be extracted quite simply from this web page. Each
test 'word' is the content of the first cell in each row whose class is
tst1. For convenience*, I have extracted the first two cells in such
rows, along with titles, to a CSV file." The file is rt.csv in the
same directory. I included the meaning and pronunciation as those who
don't know the script may find it easier to refer to the words by
translation or transcription. You may prefer to use the file more or
less as it is, but one can easily knock up an Emacs macro sequence to
delete the first comma and the rest of the line. I left the section
titles in for easier navigation to the renderer test file.

*Some people claim to find XML files easy to use, they should then be
able to analyse a file conforming to HTML4 syntax.

Dodgy spellings go in pink rows whose class is 'tst2'. The alternative
encodings demanded by the USE go in orange rows whose class is 'tst3'.
I have not extracted these.

Richard.
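
The clean-up described above for rt.csv (delete the first comma and the
rest of each line) could be sketched in Emacs Lisp roughly as follows.
This is only an illustration: the command name my-keep-first-csv-field
is hypothetical, and it assumes that no test word itself contains a
comma or relies on CSV quoting, which has not been checked against the
actual file. Lines without a comma are left untouched.

```elisp
;; Hypothetical helper; the name is illustrative and not part of Emacs.
(defun my-keep-first-csv-field ()
  "On each line of the current buffer, delete the first comma and all text after it.
Assumes the first field contains no commas and no CSV quoting."
  (interactive)
  (save-excursion
    (goto-char (point-min))
    ;; `.' does not match a newline, so each match runs from the first
    ;; comma on a line to the end of that same line.
    (while (re-search-forward ",.*$" nil t)
      (replace-match "" t t))))
```

After visiting a copy of rt.csv, M-x my-keep-first-csv-field would
leave one test word (or title) per line; saving the buffer as UTF-8
then gives a plain text file of the examples.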

This bug report was last modified 3 years and 155 days ago.

