From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 04 16:49:36 2019
From: Juri Linkov
To: bug-gnu-emacs@gnu.org
Subject: Combining Diacritical Marks are not Latin only
Date: Sun, 04 Aug 2019 23:40:38 +0300
Message-ID: <87lfw8r744.fsf@mail.linkov.net>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.0.50 (x86_64-pc-linux-gnu)

The generated file lisp/international/charscript.el
assigns the block “Combining Diacritical Marks” to the ‘latin’ script
on the assumption that these characters are used only in Latin.

But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
the acute accent marks the stressed vowel of a word in several languages
with alphabets based on the Latin, Cyrillic, and Greek scripts.
In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
mentions how characters from other blocks are used in Cyrillic script.
Moreover, the Combining Diacritical Marks block also
contains several characters from the Greek script:
COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS
COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI

I noticed this problem recently while helping to develop char-fold where
GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.

Of course, it's possible to add exceptions for characters in this block
in markchars-mode.  But before doing this, I'm asking a confirmation
whether Unicode data should be fixed in ‘char-script-table’, so e.g.

  (aref char-script-table ?\N{COMBINING ACUTE ACCENT})

could return

  (latin greek cyrillic)

instead of the current

  latin

From debbugs-submit-bounces@debbugs.gnu.org Mon Aug 05 12:08:42 2019
From: Eli Zaretskii
To: Juri Linkov
Cc: 36923@debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Mon, 05 Aug 2019 19:08:21 +0300
Message-Id: <83k1brd28a.fsf@gnu.org>
In-reply-to: <87lfw8r744.fsf@mail.linkov.net>

> From: Juri Linkov
> Date: Sun, 04 Aug 2019 23:40:38 +0300
>
> The generated file lisp/international/charscript.el
> assigns the block “Combining Diacritical Marks” to the ‘latin’ script
> on the assumption that these characters are used only in Latin.
>
> But in fact according to e.g. https://en.wikipedia.org/wiki/Acute_accent
> the acute accent marks the stressed vowel of a word in several languages
> with alphabets based on the Latin, Cyrillic, and Greek scripts.
> In particular https://en.wikipedia.org/wiki/Cyrillic_script_in_Unicode
> mentions how characters from other blocks are used in Cyrillic script.
> Moreover, the Combining Diacritical Marks block also
> contains several characters from the Greek script:
> COMBINING GREEK PERISPOMENI, COMBINING GREEK KORONIS
> COMBINING GREEK DIALYTIKA TONOS, COMBINING GREEK YPOGEGRAMMENI
>
> I noticed this problem recently while helping to develop char-fold where
> GREEK SMALL LETTER IOTA combined with COMBINING GREEK DIALYTIKA TONOS was
> alarmingly highlighted as “mixed scripts” by markchars-mode from GNU ELPA.
>
> Of course, it's possible to add exceptions for characters in this block
> in markchars-mode.  But before doing this, I'm asking a confirmation
> whether Unicode data should be fixed in ‘char-script-table’, so e.g.
>
>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
>
> could return
>
>   (latin greek cyrillic)
>
> instead of the current
>
>   latin

char-script-table is documented to yield a single symbol, so returning
a list would be an incompatible change, which we should avoid.

More generally, I think what you describe is a clear conceptual bug in
markchars-mode: it should only pay attention to the script of the base
characters, not to the script of combining accents.  The latter is
mostly irrelevant, certainly so for the purpose of detecting
confusables.

So I think this should be fixed in markchars-mode, and the fact that
we somewhat arbitrarily assign those diacritics to the latin script is
not a serious problem, if at all.

From debbugs-submit-bounces@debbugs.gnu.org Mon Aug 05 15:58:53 2019
From: Juri Linkov
To: Eli Zaretskii
Cc: 36923@debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Mon, 05 Aug 2019 22:41:59 +0300
Message-ID: <87zhknzc7c.fsf@mail.linkov.net>
In-Reply-To: <83k1brd28a.fsf@gnu.org>

>>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
>>
>> could return
>>
>>   (latin greek cyrillic)
>>
>> instead of the current
>>
>>   latin
>
> char-script-table is documented to yield a single symbol, so returning
> a list would be an incompatible change, which we should avoid.

The docstring of char-script-table says:

  Char table of script symbols.
  It has one extra slot whose value is a list of script symbols.

So it seems char-script-table should yield a list of script symbols?

I searched more for char-script-table in the documentation, and one
place where it's used is forward-word.  But I don't understand why
forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
the Latin script) and non-Latin letters.  This is good that it doesn't
stop here, and I'm just trying to understand why - so the same logic
could be used in markchars-mode.
Maybe it doesn't stop because of special script handling in
‘find-word-boundary-function-table’?  Or because it ignores all
combining characters?

BTW, while looking at forward-word and right-word I noticed inconsistency:
there are left-word and right-word commands, but no left-sexp and right-sexp
to accompany forward-sexp.

> More generally, I think what you describe is a clear conceptual bug in
> markchars-mode: it should only pay attention to the script of the base
> characters, not to the script of combining accents.  The latter is
> mostly irrelevant, certainly so for the purpose of detecting
> confusables.

Could you suggest a proper function to strip all combining characters
from the string?

From debbugs-submit-bounces@debbugs.gnu.org Tue Aug 06 10:32:54 2019
From: Eli Zaretskii
To: Juri Linkov
Cc: 36923@debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Tue, 06 Aug 2019 17:32:33 +0300
Message-Id: <83a7cmcqke.fsf@gnu.org>
In-reply-to: <87zhknzc7c.fsf@mail.linkov.net>

> From: Juri Linkov
> Cc: 36923@debbugs.gnu.org
> Date: Mon, 05 Aug 2019 22:41:59 +0300
>
> >>   (aref char-script-table ?\N{COMBINING ACUTE ACCENT})
> >>
> >> could return
> >>
> >>   (latin greek cyrillic)
> >>
> >> instead of the current
> >>
> >>   latin
> >
> > char-script-table is documented to yield a single symbol, so returning
> > a list would be an incompatible change, which we should avoid.
>
> The docstring of char-script-table says:
>
>   Char table of script symbols.
>   It has one extra slot whose value is a list of script symbols.
>
> So it seems char-script-table should yield a list of script symbols?

No, that's only in the extra slot.  The ELisp manual says:

 -- Variable: char-script-table
     The value of this variable is a char-table that specifies, for each
     character, a symbol whose name is the script to which the character
     belongs, according to the Unicode Standard classification of the
     Unicode code space into script-specific blocks.  This char-table
     has a single extra slot whose value is the list of all script
     symbols.

> I searched more for char-script-table in the documentation, and one
> place where it's used is forward-word.  But I don't understand why
> forward-word doesn't stop between “COMBINING ACUTE ACCENT” (that is
> the Latin script) and non-Latin letters.
See word-combining-categories: it causes word-movement commands to
ignore any script boundaries with characters whose category is
combining diacritic or mark.

> Maybe it doesn't stop because of special script handling in
> ‘find-word-boundary-function-table’?

Not by default, because find-word-boundary-function-table's entry for
any character is nil by default.

> BTW, while looking at forward-word and right-word I noticed inconsistency:
> there are left-word and right-word commands, but no left-sexp and right-sexp
> to accompany forward-sexp.

Programming languages are all L2R, so there's no need to move by sexps
in R2L direction.

> > More generally, I think what you describe is a clear conceptual bug in
> > markchars-mode: it should only pay attention to the script of the base
> > characters, not to the script of combining accents.  The latter is
> > mostly irrelevant, certainly so for the purpose of detecting
> > confusables.
>
> Could you suggest a proper function to strip all combining characters
> from the string?

Each base character has its canonical combining class attribute as
zero, so you could use

  (get-char-code-property CHAR 'canonical-combining-class)

to filter out those CHARs for which the value is non-zero.

Alternatively, you could go by categories: base characters have the
?. category set, combining characters have the ?^ category set.

My recommendation is to use the canonical-combining-class property, as
it is a more direct way of doing this.
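
The filtering suggested above could be sketched roughly as follows.
This is only an illustration, not code from the thread: the function
name my-strip-combining-chars is hypothetical, and the sketch assumes
that treating a nil property value like zero is acceptable.

```elisp
;;; -*- lexical-binding: t -*-
(require 'seq)

;; Hypothetical helper, not part of Emacs or markchars-mode.
(defun my-strip-combining-chars (string)
  "Return a copy of STRING without combining characters.
A character is dropped when its canonical combining class is
non-zero; a nil class is treated like zero (an assumption)."
  (concat
   (seq-filter
    (lambda (char)
      (let ((ccc (get-char-code-property char 'canonical-combining-class)))
        ;; Keep base characters: class zero, or no class recorded.
        (or (null ccc) (zerop ccc))))
    string)))

;; Example: GREEK SMALL LETTER IOTA followed by
;; COMBINING GREEK DIALYTIKA TONOS (U+0344).
(my-strip-combining-chars (string #x3B9 #x344))
```

Note that a few combining marks, such as COMBINING GRAPHEME JOINER,
have combining class zero, so a filter like this would keep them; the
category-based test mentioned above may behave differently for those.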
From debbugs-submit-bounces@debbugs.gnu.org Wed Aug 07 18:03:00 2019
From: Juri Linkov
To: Eli Zaretskii
Cc: 36923-done@debbugs.gnu.org
Subject: Re: bug#36923: Combining Diacritical Marks are not Latin only
Date: Thu, 08 Aug 2019 00:44:49 +0300
Message-ID: <87a7ckps4u.fsf@mail.linkov.net>
In-Reply-To: <83a7cmcqke.fsf@gnu.org>

> Each base character has its canonical combining class attribute as
> zero, so you could use
>
>   (get-char-code-property CHAR 'canonical-combining-class)
>
> to filter out those CHARs for which the value is non-zero.
>
> Alternatively, you could go by categories: base characters have the
> ?. category set, combining characters have the ?^ category set.
>
> My recommendation is to use the canonical-combining-class property, as
> it is a more direct way of doing this.

Thanks, I fixed markchars-mode by using canonical-combining-class.

From unknown Sat Aug 16 18:31:39 2025
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Thu, 05 Sep 2019 11:24:08 +0000

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
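
The thread does not show the actual change made to markchars-mode.
The following is only a sketch of the idea discussed above, under the
assumption that a mixed-script check looks at char-script-table for
base characters alone; the function name my-base-char-scripts is
hypothetical.

```elisp
;;; -*- lexical-binding: t -*-

;; Hypothetical helper, not the real markchars-mode code.
(defun my-base-char-scripts (string)
  "Return the distinct scripts of the base characters in STRING.
Characters with a non-zero canonical combining class are skipped,
so combining accents never contribute a script of their own."
  (let (scripts)
    (dolist (char (string-to-list string))
      (when (eql 0 (get-char-code-property char 'canonical-combining-class))
        (let ((script (aref char-script-table char)))
          ;; Record each script once, in order of first appearance.
          (when (and script (not (memq script scripts)))
            (push script scripts)))))
    (nreverse scripts)))

;; GREEK SMALL LETTER IOTA + COMBINING GREEK DIALYTIKA TONOS:
;; only the iota is consulted, so a single script is reported.
(my-base-char-scripts (string #x3B9 #x344))
```

A string would then count as "mixed scripts" only when this list has
more than one element.  A real checker would still have to decide how
to treat whitespace, digits and punctuation, which this sketch ignores.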