GNU bug report logs - #36923
Combining Diacritical Marks are not Latin only

Package: emacs;

Reported by: Juri Linkov <juri <at> linkov.net>

Date: Sun, 4 Aug 2019 20:50:02 UTC

Severity: normal

Done: Juri Linkov <juri <at> linkov.net>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Juri Linkov <juri <at> linkov.net>
Subject: bug#36923: closed (Re: bug#36923: Combining Diacritical Marks are
 not Latin only)
Date: Wed, 07 Aug 2019 22:03:03 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#36923: Combining Diacritical Marks are not Latin only

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 36923 <at> debbugs.gnu.org.

-- 
36923: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=36923
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Juri Linkov <juri <at> linkov.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 36923-done <at> debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Thu, 08 Aug 2019 00:44:49 +0300
> Each base character has its canonical combining class attribute as
> zero, so you could use
>
>    (get-char-code-property CHAR 'canonical-combining-class)
>
> to filter out those CHARs for which the value is non-zero.
>
> Alternatively, you could go by categories: base characters have the
> ?. category set, combining characters have the ?^ category set.
>
> My recommendation is to use the canonical-combining-class property, as
> it is a more direct way of doing this.

Thanks, I fixed markchars-mode by using canonical-combining-class.
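The recommendation quoted above can be sketched roughly as follows. This is only an illustration of filtering on `canonical-combining-class`, not the actual change made in markchars-mode; the function names `my-combining-char-p` and `my-base-chars` are hypothetical and do not come from the thread.

```elisp
;; Hypothetical helpers, for illustration only (not markchars-mode code).

(defun my-combining-char-p (char)
  "Return non-nil if CHAR has a non-zero canonical combining class."
  (let ((ccc (get-char-code-property char 'canonical-combining-class)))
    ;; Guard against a nil property value before comparing.
    (and (integerp ccc) (> ccc 0))))

(defun my-base-chars (string)
  "Return a list of the characters in STRING whose combining class is zero."
  (delq nil
        (mapcar (lambda (c)
                  (unless (my-combining-char-p c) c))
                string)))

;; GREEK SMALL LETTER IOTA followed by COMBINING GREEK DIALYTIKA TONOS:
;; only the iota should remain after filtering.
(my-base-chars (string ?\N{GREEK SMALL LETTER IOTA}
                       ?\N{COMBINING GREEK DIALYTIKA TONOS}))
```

A script-mixing check could then compare the scripts of only the characters returned by such a filter, so that a combining mark attached to a Greek or Cyrillic base letter is not counted as a second script.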

[Message part 3 (message/rfc822, inline)]
From: Juri Linkov <juri <at> linkov.net>
To: bug-gnu-emacs <at> gnu.org
Subject: Combining Diacritical Marks are not Latin only
Date: Sun, 04 Aug 2019 23:40:38 +0300
The generated file lisp/international/charscript.el
assigns the block “Combining Diacritical Marks” to the ‘latin’ script
on the assumption that these characters are used only in Latin.

But in fact, according to e.g. https://en.wikipedia.org/wiki/Acute_accent,
the acute accent marks the stressed vowel of a word in several languages
with alphabets based on the Latin, Cyrillic, and Greek scripts.
In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
mentions how characters from other blocks are used in Cyrillic script.
Moreover, the Combining Diacritical Marks block also
contains several characters from the Greek script:
COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS,
COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI.

I noticed this problem recently while helping to develop char-fold where
GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.

Of course, it's possible to add exceptions for characters in this block
in markchars-mode.  But before doing this, I'd like to ask for confirmation
on whether the Unicode data should be fixed in ‘char-script-table’, so e.g.

  (aref char-script-table ?\N{COMBINING ACUTE ACCENT})

could return

  (latin greek cyrillic)

instead of the current

  latin
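To see what ‘char-script-table’ currently reports for the characters named in this report, something like the following could be evaluated. It is a minimal sketch; the function name `my-script-report` is hypothetical, and the result depends on the Emacs version and its Unicode data, so no output is shown.

```elisp
;; Hypothetical helper, for illustration only.
(defun my-script-report (chars)
  "Return an alist of (NAME . SCRIPT) for each character in CHARS.
NAME is the Unicode character name and SCRIPT is the entry that
`char-script-table' holds for that character."
  (mapcar (lambda (c)
            (cons (get-char-code-property c 'name)
                  (aref char-script-table c)))
          chars))

(my-script-report
 (list ?\N{COMBINING ACUTE ACCENT}
       ?\N{COMBINING GREEK PERISPOMENI}
       ?\N{COMBINING GREEK KORONIS}
       ?\N{COMBINING GREEK DIALYTIKA TONOS}
       ?\N{COMBINING GREEK YPOGEGRAMMENI}))
```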



This bug report was last modified 5 years and 347 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.