#36923 - Combining Diacritical Marks are not Latin only

GNU bug report logs - #36923
Combining Diacritical Marks are not Latin only

Package: emacs;

Reported by: Juri Linkov <juri <at> linkov.net>

Date: Sun, 4 Aug 2019 20:50:02 UTC

Severity: normal

Done: Juri Linkov <juri <at> linkov.net>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> linkov.net>
Cc: 36923 <at> debbugs.gnu.org
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300

> From: Juri Linkov <juri <at> linkov.net>
> Cc: 36923 <at> debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
>
> >>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >>   (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >>   latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
>
> The docstring of char-script-table says:
>
>   Char table of script symbols.
>   It has one extra slot whose value is a list of script symbols.
>
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks.  This char-table
     has a single extra slot whose value is the list of all script
     symbols.

> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word.  But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
> the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents.  The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
>
> Could you suggest a proper function to strip all combining characters
> from the string?

Each base character has its canonical combining class attribute as
zero, so you could use

  (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the ?.
category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
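
Editorial sketch (not part of the original message): the canonical-combining-class suggestion above could be turned into a small filter along the following lines. The names `my-base-char-p' and `my-strip-combining-chars' are hypothetical and chosen for illustration only; the sketch assumes an Emacs recent enough to provide seq.el and the "\N{...}" string syntax.

```elisp
(require 'seq)

;; Hypothetical helper, not an Emacs built-in: a character counts as
;; "base" here when its canonical combining class is zero.  A nil
;; property value is treated as zero (an assumption of this sketch).
(defun my-base-char-p (ch)
  "Return non-nil if CH has canonical combining class zero."
  (zerop (or (get-char-code-property ch 'canonical-combining-class) 0)))

;; Hypothetical helper: keep only the characters accepted by
;; `my-base-char-p'.  `seq-filter' returns a list of characters,
;; which `concat' turns back into a string.
(defun my-strip-combining-chars (string)
  "Return a copy of STRING without characters of non-zero combining class."
  (concat (seq-filter #'my-base-char-p string)))

;; Usage: the accent is dropped, the base letter is kept.
(my-strip-combining-chars "e\N{COMBINING ACUTE ACCENT}")
```

Note that this only removes characters with a non-zero combining class. Precomposed letters such as U+00E9 are left as they are, and some marks that Unicode assigns class zero (certain Indic vowel signs, for example) would also be kept, so a caller might want to decompose the text first or combine this test with the category check mentioned above.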

This bug report was last modified 5 years and 348 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
