From unknown Sat Aug 16 16:18:37 2025
From: Juri Linkov
To: 36923@debbugs.gnu.org
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Sun, 04 Aug 2019 23:40:38 +0300
Message-ID: <87lfw8r744.fsf@mail.linkov.net>

The generated file lisp/international/charscript.el
assigns the block “Combining Diacritical Marks” to the ‘latin’ script
on the assumption that these characters are used only in Latin.

But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
the acute accent marks the stressed vowel of a word in several languages
with alphabets based on the Latin, Cyrillic, and Greek scripts.
In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
mentions how characters from other blocks are used in Cyrillic script.

Moreover, the Combining Diacritical Marks block also
contains several characters from the Greek script:
COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS
COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI

I noticed this problem recently while helping to develop char-fold where
GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.

Of course, it's possible to add exceptions for characters in this block
in markchars-mode. But before doing this, I'm asking a confirmation
whether Unicode data should be fixed in ‘char-script-table’, so e.g.

  (aref char-script-table ?\N{COMBINING ACUTE ACCENT})

could return

  (latin greek cyrillic)

instead of the current

  latin

From unknown Sat Aug 16 16:18:37 2025
From: Eli Zaretskii
To: Juri Linkov
Cc: 36923@debbugs.gnu.org
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Mon, 05 Aug 2019 19:08:21 +0300
Message-Id: <83k1brd28a.fsf@gnu.org>
In-reply-to: <87lfw8r744.fsf@mail.linkov.net> (message from Juri Linkov on Sun, 04 Aug 2019 23:40:38 +0300)
References: <87lfw8r744.fsf@mail.linkov.net>

> From: Juri Linkov
> Date: Sun, 04 Aug 2019 23:40:38 +0300
>
> The generated file lisp/international/charscript.el
> assigns the block “Combining Diacritical Marks” to the ‘latin’ script
> on the assumption that these characters are used only in Latin.
>
> But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
> the acute accent marks the stressed vowel of a word in several languages
> with alphabets based on the Latin, Cyrillic, and Greek scripts.
> In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
> mentions how characters from other blocks are used in Cyrillic script.
>
> Moreover, the Combining Diacritical Marks block also
> contains several characters from the Greek script:
> COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS
> COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI
>
> I noticed this problem recently while helping to develop char-fold where
> GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
> alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.
>
> Of course, it's possible to add exceptions for characters in this block
> in markchars-mode. But before doing this, I'm asking a confirmation
> whether Unicode data should be fixed in ‘char-script-table’, so e.g.
>
> (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
>
> could return
>
> (latin greek cyrillic)
>
> instead of the current
>
> latin

char-script-table is documented to yield a single symbol, so returning
a list would be an incompatible change, which we should avoid.

More generally, I think what you describe is a clear conceptual bug in
markchars-mode: it should only pay attention to the script of the base
characters, not to the script of combining accents. The latter is
mostly irrelevant, certainly so for the purpose of detecting
confusables.

So I think this should be fixed in markchars-mode, and the fact that
we somewhat arbitrarily assign those diacritics to the latin script is
not a serious problem, if at all.
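
The mismatch discussed above can be reproduced with a per-character lookup. The following is only a minimal sketch: the helper name my-char-scripts is hypothetical and is not part of Emacs or markchars-mode. It returns one script symbol per character, which is the single-symbol behaviour described for char-script-table; the exact symbol returned for the combining mark may depend on the Emacs version, so no expected output is shown.

```elisp
;; Hypothetical helper, for illustration only (not from markchars-mode).
(defun my-char-scripts (string)
  "Return the script symbol of every character in STRING, in order."
  ;; `mapcar' accepts any sequence, so it can walk a string directly.
  (mapcar (lambda (ch) (aref char-script-table ch)) string))

;; The sequence from the report: a Greek base letter followed by a mark
;; from the Combining Diacritical Marks block.
(my-char-scripts (string ?\N{GREEK SMALL LETTER IOTA}
                         ?\N{COMBINING GREEK DIALYTIKA TONOS}))
```

A checker that compares every element of this list would see two different scripts whenever the mark's block is assigned to a script other than that of the base letter.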
From unknown Sat Aug 16 16:18:37 2025
From: Juri Linkov
To: Eli Zaretskii
Cc: 36923@debbugs.gnu.org
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Mon, 05 Aug 2019 22:41:59 +0300
Message-ID: <87zhknzc7c.fsf@mail.linkov.net>
In-Reply-To: <83k1brd28a.fsf@gnu.org> (Eli Zaretskii's message of "Mon, 05 Aug 2019 19:08:21 +0300")
References: <87lfw8r744.fsf@mail.linkov.net> <83k1brd28a.fsf@gnu.org>

>> (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
>>
>> could return
>>
>> (latin greek cyrillic)
>>
>> instead of the current
>>
>> latin
>
> char-script-table is documented to yield a single symbol, so returning
> a list would be an incompatible change, which we should avoid.

The docstring of char-script-table says:

  Char table of script symbols.
  It has one extra slot whose value is a list of script symbols.

So it seems char-script-table should yield a list of script symbols?

I searched more for char-script-table in the documentation, and one
place where it's used is forward-word. But I don't understand why
forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
the Latin script) and non-Latin letters. This is good that it doesn't
stop here, and I'm just trying to understand why - so the same logic
could be used in markchars-mode.

Maybe it doesn't stop because of special script handling in
‘find-word-boundary-function-table’? Or because it ignores all
combining characters?

BTW, while looking at forward-word and right-word I noticed inconsistency:
there are left-word and right-word commands, but no left-sexp and right-sexp
to accompany forward-sexp.

> More generally, I think what you describe is a clear conceptual bug in
> markchars-mode: it should only pay attention to the script of the base
> characters, not to the script of combining accents. The latter is
> mostly irrelevant, certainly so for the purpose of detecting
> confusables.

Could you suggest a proper function to strip all combining characters
from the string?

From unknown Sat Aug 16 16:18:37 2025
From: Eli Zaretskii
To: Juri Linkov
Cc: 36923@debbugs.gnu.org
Subject: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300
Message-Id: <83a7cmcqke.fsf@gnu.org>
In-reply-to: <87zhknzc7c.fsf@mail.linkov.net> (message from Juri Linkov on Mon, 05 Aug 2019 22:41:59 +0300)
References: <87lfw8r744.fsf@mail.linkov.net> <83k1brd28a.fsf@gnu.org> <87zhknzc7c.fsf@mail.linkov.net>

> From: Juri Linkov
> Cc: 36923@debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
>
> >> (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >> (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >> latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
>
> The docstring of char-script-table says:
>
> Char table of script symbols.
> It has one extra slot whose value is a list of script symbols.
>
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot. The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks. This char-table
     has a single extra slot whose value is the list of all script
     symbols.

> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word. But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
> the Latin script) and non-Latin letters.

See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents. The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
>
> Could you suggest a proper function to strip all combining characters
> from the string?

Each base character has its canonical combining class attribute as
zero, so you could use

  (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the
?. category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
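
The filter suggested above could be sketched roughly as follows. The function names my-strip-combining-chars and my-combining-char-p are hypothetical and used only for illustration. The sketch also assumes that a nil property value, if one is ever returned, should be treated as class zero.

```elisp
;; Hypothetical helper: keep only characters whose canonical combining
;; class is zero, i.e. the base characters.
(defun my-strip-combining-chars (string)
  "Return a copy of STRING without combining characters."
  (let (kept)
    (dolist (ch (string-to-list string))
      ;; Assumption: treat a nil property value as class 0.
      (when (zerop (or (get-char-code-property ch 'canonical-combining-class)
                       0))
        (push ch kept)))
    ;; `concat' accepts a list of characters and returns a string.
    (concat (nreverse kept))))

;; Hypothetical helper for the category-based alternative: test whether
;; CH has the ?^ category in its category set.
(defun my-combining-char-p (ch)
  "Return non-nil if CH has the ?^ (combining) category."
  (aref (char-category-set ch) ?^))

(my-strip-combining-chars (string ?e ?\N{COMBINING ACUTE ACCENT}))
```

Precomposed characters are left untouched by this filter because their combining class is zero, which matches the goal of looking only at base characters.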
From unknown Sat Aug 16 16:18:37 2025
MIME-Version: 1.0
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Juri Linkov
Subject: bug#36923: closed (Re: bug#36923: Combining Diacritical Marks are not Latin only)
References: <87a7ckps4u.fsf@mail.linkov.net> <87lfw8r744.fsf@mail.linkov.net>
Reply-To: 36923@debbugs.gnu.org
Date: Wed, 07 Aug 2019 22:03:03 +0000
Content-Type: multipart/mixed; boundary="----------=_1565215384-14969-1"

This is a multi-part message in MIME format...

------------=_1565215384-14969-1
Content-Disposition: inline
Content-Type: text/plain; charset="utf-8"

Your bug report

#36923: Combining Diacritical Marks are not Latin only

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 36923@debbugs.gnu.org.

-- 
36923: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=36923
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1565215384-14969-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

From: Juri Linkov
To: Eli Zaretskii
Cc: 36923-done@debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Thu, 08 Aug 2019 00:44:49 +0300
Message-ID: <87a7ckps4u.fsf@mail.linkov.net>
In-Reply-To: <83a7cmcqke.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 06 Aug 2019 17:32:33 +0300")
References: <87lfw8r744.fsf@mail.linkov.net> <83k1brd28a.fsf@gnu.org> <87zhknzc7c.fsf@mail.linkov.net> <83a7cmcqke.fsf@gnu.org>

> Each base character has its canonical combining class attribute as
> zero, so you could use
>
> (get-char-code-property CHAR 'canonical-combining-class)
>
> to filter out those CHARs for which the value is non-zero.
>
> Alternatively, you could go by categories: base characters have the
> ?. category set, combining characters have the ?^ category set.
>
> My recommendation is to use the canonical-combining-class property, as
> it is a more direct way of doing this.

Thanks, I fixed markchars-mode by using canonical-combining-class.
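
The thread does not show the actual change made to markchars-mode. As a rough illustration of the idea only, a script check restricted to base characters might look like the sketch below. The function name my-base-char-scripts is hypothetical, and this is not the code that was committed.

```elisp
;; Hypothetical sketch, not the real markchars-mode fix.
(defun my-base-char-scripts (string)
  "Return the distinct scripts of the base characters in STRING.
Characters with a non-zero canonical combining class are skipped."
  (let (scripts)
    (dolist (ch (string-to-list string))
      ;; Assumption: treat a nil property value as class 0.
      (when (zerop (or (get-char-code-property ch 'canonical-combining-class)
                       0))
        (let ((script (aref char-script-table ch)))
          ;; Record each script once, in order of first appearance.
          (when (and script (not (memq script scripts)))
            (push script scripts)))))
    (nreverse scripts)))

;; The sequence from the original report: only the Greek base letter
;; contributes a script, so the combining mark cannot cause a mismatch.
(my-base-char-scripts (string ?\N{GREEK SMALL LETTER IOTA}
                              ?\N{COMBINING GREEK DIALYTIKA TONOS}))
```

A caller could then treat a result with more than one element as a mixed-script sequence.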
------------=_1565215384-14969-1--