From unknown Sat Sep 06 09:27:54 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#13041 <13041@debbugs.gnu.org> To: bug#13041 <13041@debbugs.gnu.org> Subject: Status: 24.2; diacritic-fold-search Reply-To: bug#13041 <13041@debbugs.gnu.org> Date: Sat, 06 Sep 2025 16:27:54 +0000 retitle 13041 24.2; diacritic-fold-search reassign 13041 emacs submitter 13041 perin@acm.org severity 13041 wishlist thanks From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 13:30:14 2012 Received: (at submit) by debbugs.gnu.org; 30 Nov 2012 18:30:14 +0000 Received: from localhost ([127.0.0.1]:47305 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeVLp-0002pg-8G for submit@debbugs.gnu.org; Fri, 30 Nov 2012 13:30:14 -0500 Received: from eggs.gnu.org ([208.118.235.92]:36646) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeVG5-0002fm-FG for submit@debbugs.gnu.org; Fri, 30 Nov 2012 13:24:18 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1TeVE1-0002vg-Fh for submit@debbugs.gnu.org; Fri, 30 Nov 2012 13:22:10 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.2 Received: from lists.gnu.org ([208.118.235.17]:47829) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TeVE1-0002vc-Cr for submit@debbugs.gnu.org; Fri, 30 Nov 2012 13:22:09 -0500 Received: from eggs.gnu.org ([208.118.235.92]:51116) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TeVE0-0002oX-7e for bug-gnu-emacs@gnu.org; Fri, 30 Nov 2012 13:22:09 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1TeVDy-0002vQ-TB for bug-gnu-emacs@gnu.org; Fri, 30 Nov 2012 13:22:08 -0500 Received: 
from mailbackend.panix.com ([166.84.1.89]:59022) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TeVDy-0002vJ-Pz for bug-gnu-emacs@gnu.org; Fri, 30 Nov 2012 13:22:06 -0500 Received: from panix1.panix.com (panix1.panix.com [166.84.1.1]) by mailbackend.panix.com (Postfix) with ESMTP id F0D092E623 for ; Fri, 30 Nov 2012 13:22:05 -0500 (EST) Received: by panix1.panix.com (Postfix, from userid 13816) id C722F14B8D; Fri, 30 Nov 2012 13:22:05 -0500 (EST) From: Lewis Perin To: bug-gnu-emacs@gnu.org Subject: 24.2; diacritic-fold-search MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Message-Id: <20121130182205.C722F14B8D@panix1.panix.com> Date: Fri, 30 Nov 2012 13:22:05 -0500 (EST) Content-Transfer-Encoding: quoted-printable X-detected-operating-system: by eggs.gnu.org: Solaris 10 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 208.118.235.17 X-Spam-Score: -3.4 (---) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Fri, 30 Nov 2012 13:30:12 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: perin@acm.org List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -3.4 (---)

This is not a bug report but a feature request, so I am omitting diagnostic information.

Emacs search has long been able to toggle between (a) ignoring the distinction between upper- and lower-case characters (case-fold-search) and (b) searching for only one of the pair. One could say Emacs offers the choice between (a) searching for all members of a (2-member) equivalence class and (b) searching for only one member.

There are larger equivalence classes of characters with practical use which Emacs is currently unaware of: the groups of characters consisting of an unadorned (ASCII) character plus all its diacritic-adorned versions.
Currently, if I want to search for both “apres” and “après”, I need an additive regular expression. I would like to do this as easily as I can search for “apres” and “Apres”. I would be delighted if Emacs implemented the equivalence classes spelled out here:

http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

I might add that diacritics folding is the default in web search engines. It is also a feature of at least one Web browser in searching the text of a displayed page (Chrome.)

I’m sure that maintaining the core of Emacs is a big job, and I’m grateful for the skill and effort that go into that task, including your consideration of this request!

/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org

From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 13:55:07 2012 Received: (at 13041) by debbugs.gnu.org; 30 Nov 2012 18:55:08 +0000 Received: from localhost ([127.0.0.1]:47343 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeVjv-0003Vx-K9 for submit@debbugs.gnu.org; Fri, 30 Nov 2012 13:55:07 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:37663 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeVjt-0003Vq-F7 for 13041@debbugs.gnu.org; Fri, 30 Nov 2012 13:55:06 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id D732D451E17E; Fri, 30 Nov 2012 10:52:56 -0800 (PST) From: Juri Linkov To: Lewis Perin Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> Date: Fri, 30 Nov 2012 20:51:44 +0200 In-Reply-To: <20121130182205.C722F14B8D@panix1.panix.com> (Lewis Perin's message of "Fri, 30 Nov 2012 13:22:05 -0500 (EST)") Message-ID: <87hao69b5r.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50
(x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> Currently, if I want to search for both “apres” and “après”,
> I need an additive regular expression. I would like to do this as
> easily as I can search for “apres” and “Apres”. I would be delighted
> if Emacs implemented the equivalence classes spelled out here:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

This could be implemented in isearch using a recipe from

http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959

Instead of hard-coding a list of equivalent characters I guess it should be possible to do this automatically using Unicode information about characters.
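[Editorial illustration, not part of the original message.] The idea of deriving the equivalence classes from Unicode data rather than from a hard-coded list can be sketched roughly as follows. This is a minimal Python sketch, not the Emacs recipe from the linked thread (which is not reproduced here). It assumes Python's standard `unicodedata` module; the names `build_classes`, `CLASSES` and `fold_regex`, and the scan limit of U+024F, are hypothetical choices made only for this example. Each base letter in the search string is expanded into a character class holding the letter plus every precomposed character that canonically decomposes to that letter followed by combining marks.

```python
import re
import unicodedata
from collections import defaultdict


def build_classes(limit=0x250):
    """Map each base character to the precomposed characters built on it.

    Only code points below `limit` are scanned (an arbitrary cut-off for
    this sketch, covering Latin-1 and Latin Extended-A/B).
    """
    classes = defaultdict(set)
    for cp in range(limit):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        base = decomposed[0]
        # Keep characters of the form "base + combining marks" only.
        if base != ch and all(unicodedata.combining(c) for c in decomposed[1:]):
            classes[base].add(ch)
    return classes


CLASSES = build_classes()


def fold_regex(query):
    """Compile a regex in which every base letter also matches its accented forms."""
    parts = []
    for ch in query:
        variants = "".join(sorted(CLASSES.get(ch, ())))
        if variants:
            parts.append("[" + re.escape(ch) + variants + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = fold_regex("apres")
    print(bool(pattern.search("après")))   # True
    print(bool(pattern.search("apres")))   # True
    print(bool(pattern.search("apris")))   # False
```

In this sketch an accented character typed in the query matches only itself, which mirrors the way an upper-case letter in a search string can switch off case folding; whether diacritic folding should behave that way is a design question the thread leaves open.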
From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 14:33:22 2012 Received: (at 13041) by debbugs.gnu.org; 30 Nov 2012 19:33:22 +0000 Received: from localhost ([127.0.0.1]:47419 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeWKw-0004QC-AN for submit@debbugs.gnu.org; Fri, 30 Nov 2012 14:33:22 -0500 Received: from relais.videotron.ca ([24.201.245.36]:34910) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeWKv-0004Q3-8A; Fri, 30 Nov 2012 14:33:21 -0500 MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Received: from ceviche.home ([24.201.208.110]) by VL-VM-MR005.ip.videotron.ca (Oracle Communications Messaging Exchange Server 7u4-22.01 64bit (built Apr 21 2011)) with ESMTP id <0MEB00176FJWF5G1@VL-VM-MR005.ip.videotron.ca>; Fri, 30 Nov 2012 14:31:09 -0500 (EST) Received: by ceviche.home (Postfix, from userid 20848) id B3A7266109; Fri, 30 Nov 2012 14:31:08 -0500 (EST) From: Stefan Monnier To: Lewis Perin Subject: Re: bug#13041: 24.2; diacritic-fold-search Message-id: References: <20121130182205.C722F14B8D@panix1.panix.com> Date: Fri, 30 Nov 2012 14:31:08 -0500 In-reply-to: <20121130182205.C722F14B8D@panix1.panix.com> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (gnu/linux) Content-transfer-encoding: quoted-printable X-Spam-Score: 1.6 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: severity 13041 wishlist thanks > diacritic-adorned versions. Currently, if I want to search for both > “apres” and “après”, I need an additive regular expression. I would > like to do this as easily as I can search for “apres” and “Apres”. [...] 
Content analysis details: (1.6 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [24.201.245.36 listed in list.dnswl.org] 0.7 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] 0.1 HDRS_LCASE Odd capitalization of message header X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 1.5 (+)

severity 13041 wishlist
thanks

> diacritic-adorned versions.
Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression. I would
> like to do this as easily as I can search for “apres” and “Apres”.

That would be a very welcome feature, indeed.

Stefan

From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 16:09:57 2012 Received: (at 13041) by debbugs.gnu.org; 30 Nov 2012 21:09:57 +0000 Received: from localhost ([127.0.0.1]:47507 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeXqO-0007Zq-Hn for submit@debbugs.gnu.org; Fri, 30 Nov 2012 16:09:57 -0500 Received: from mailbackend.panix.com ([166.84.1.89]:33476) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeXqM-0007Zj-8J for 13041@debbugs.gnu.org; Fri, 30 Nov 2012 16:09:55 -0500 Received: from panix5.panix.com (panix5.panix.com [166.84.1.5]) by mailbackend.panix.com (Postfix) with ESMTP id C7C292E294; Fri, 30 Nov 2012 16:07:45 -0500 (EST) Received: by panix5.panix.com (Postfix, from userid 13816) id 9BBA3241ED; Fri, 30 Nov 2012 16:07:45 -0500 (EST) From: Lewis Perin MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Message-ID: <20665.8224.844876.619203@panix5.panix.com> Date: Fri, 30 Nov 2012 16:07:44 -0500 To: Juri Linkov Subject: Re: bug#13041: 24.2; diacritic-fold-search In-Reply-To: <87hao69b5r.fsf@mail.jurta.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> X-Mailer: VM 8.1.2 under 24.2.1 (i386-unknown-netbsdelf5.1) X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: perin@acm.org List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.7 (--)

Juri Linkov
writes:
> > Currently, if I want to search for both “apres” and “après”,
> > I need an additive regular expression. I would like to do this as
> > easily as I can search for “apres” and “Apres”. I would be delighted
> > if Emacs implemented the equivalence classes spelled out here:
> >
> > http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> This could be implemented in isearch using a recipe from
>
> http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959
>
> Instead of hard-coding a list of equivalent characters
> I guess it should be possible to do this automatically
> using Unicode information about characters.

I never thought I was the first to wonder about this!

In the last message of that thread, you say “Provided it doesn’t make the search slow, it would be nice to add it to Emacs activating on some user settings.” Do you remember if that technique turned out to be tolerably speedy?

/Lew
---
Lew Perin | perin@acm.org | http://babelcarp.org

From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 19:41:39 2012 Received: (at 13041) by debbugs.gnu.org; 1 Dec 2012 00:41:39 +0000 Received: from localhost ([127.0.0.1]:47599 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Teb9H-0005Jo-29 for submit@debbugs.gnu.org; Fri, 30 Nov 2012 19:41:39 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:38317 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Teb9E-0005Je-Dy for 13041@debbugs.gnu.org; Fri, 30 Nov 2012 19:41:37 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id 3B714451C1DE; Fri, 30 Nov 2012 16:39:26 -0800 (PST) From: Juri Linkov To: Lewis Perin Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References:
<20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> Date: Sat, 01 Dec 2012 02:27:40 +0200 In-Reply-To: <20665.8224.844876.619203@panix5.panix.com> (Lewis Perin's message of "Fri, 30 Nov 2012 16:07:44 -0500") Message-ID: <87hao6zko4.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> In the last message of that thread, you say “Provided it doesn’t make
> the search slow, it would be nice to add it to Emacs activating on
> some user settings.” Do you remember if that technique turned out to
> be tolerably speedy?

Yes, I have no problems with the speed. The problem is how to disable this feature when it is active. We need a special key to toggle it in Isearch. One variant is M-s ~ where the easy-to-type TILDE character represents diacritics. Also it's unclear whether the Isearch prompt should indicate its active state as e.g.
Diacritic I-search: From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 19:49:46 2012 Received: (at 13041) by debbugs.gnu.org; 1 Dec 2012 00:49:46 +0000 Received: from localhost ([127.0.0.1]:47606 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TebH8-0005Vl-8t for submit@debbugs.gnu.org; Fri, 30 Nov 2012 19:49:46 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:32225) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TebH6-0005Vd-9S for 13041@debbugs.gnu.org; Fri, 30 Nov 2012 19:49:45 -0500 Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB10lWxh027362 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sat, 1 Dec 2012 00:47:33 GMT Received: from acsmt356.oracle.com (acsmt356.oracle.com [141.146.40.156]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB10lW7a023162 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 1 Dec 2012 00:47:32 GMT Received: from abhmt108.oracle.com (abhmt108.oracle.com [141.146.116.60]) by acsmt356.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB10lVKY005673; Fri, 30 Nov 2012 18:47:32 -0600 Received: from dradamslap1 (/10.159.169.89) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 30 Nov 2012 16:47:31 -0800 From: "Drew Adams" To: "'Juri Linkov'" , "'Lewis Perin'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Fri, 30 Nov 2012 16:47:26 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <87hao6zko4.fsf@mail.jurta.org> Thread-Index: Ac3PXGaqPk8QBc9qQdaIP7TD0oAiOQAAB9KQ X-MimeOLE: Produced By Microsoft MimeOLE 
V6.00.2900.6157 X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-)

> it's unclear whether the Isearch prompt should indicate
> its active state

Ǐsearch

(But perhaps that suggests recognizing, rather than ignoring, diacritics.)

From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 19:51:44 2012 Received: (at 13041) by debbugs.gnu.org; 1 Dec 2012 00:51:44 +0000 Received: from localhost ([127.0.0.1]:47610 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TebJ1-0005Ye-NX for submit@debbugs.gnu.org; Fri, 30 Nov 2012 19:51:44 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:32548) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TebIz-0005YX-Sp for 13041@debbugs.gnu.org; Fri, 30 Nov 2012 19:51:42 -0500 Received: from ucsinet21.oracle.com (ucsinet21.oracle.com [156.151.31.93]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB10nVXa028477 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sat, 1 Dec 2012 00:49:31 GMT Received: from acsmt356.oracle.com (acsmt356.oracle.com [141.146.40.156]) by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB10nUfE008690 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 1 Dec 2012 00:49:30 GMT Received: from abhmt105.oracle.com (abhmt105.oracle.com [141.146.116.57]) by acsmt356.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB10nUav006426; Fri, 30 Nov 2012 18:49:30 -0600 Received: from dradamslap1 (/10.159.169.89) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 30 Nov 2012 16:49:29 -0800 From: "Drew
Adams" To: "'Juri Linkov'" , "'Lewis Perin'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Fri, 30 Nov 2012 16:49:24 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: Thread-Index: Ac3PXGaqPk8QBc9qQdaIP7TD0oAiOQAAB9KQAABEmEA= X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: ucsinet21.oracle.com [156.151.31.93] X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) > > it's unclear whether the Isearch prompt should indicate > > its active state > > Isearch > > (But perhaps that suggests recognizing, rather than ignoring, > diacritics.) Hm. That was a capital I with caron when I sent it... 
From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 30 20:23:12 2012 Received: (at 13041) by debbugs.gnu.org; 1 Dec 2012 01:23:13 +0000 Received: from localhost ([127.0.0.1]:47651 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TebnU-0006Hu-DU for submit@debbugs.gnu.org; Fri, 30 Nov 2012 20:23:12 -0500 Received: from mailbackend.panix.com ([166.84.1.89]:38574) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TebnS-0006Hi-A3 for 13041@debbugs.gnu.org; Fri, 30 Nov 2012 20:23:11 -0500 Received: from [192.168.1.2] (pool-71-167-144-242.nycmny.fios.verizon.net [71.167.144.242]) by mailbackend.panix.com (Postfix) with ESMTP id 5F1952EE9D; Fri, 30 Nov 2012 20:21:01 -0500 (EST) References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> Mime-Version: 1.0 (1.0) In-Reply-To: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-Id: X-Mailer: iPhone Mail (10A523) From: Lew Perin Subject: Re: bug#13041: 24.2; diacritic-fold-search Date: Fri, 30 Nov 2012 20:20:59 -0500 To: Drew Adams X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 13041 Cc: Juri Linkov , "<13041@debbugs.gnu.org>" <13041@debbugs.gnu.org>, "" X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) On Nov 30, 2012, at 7:49 PM, "Drew Adams" wrote: >>> it's unclear whether the Isearch prompt should indicate >>> its active state >> >> Isearch >> >> (But perhaps that suggests recognizing, rather than ignoring, >> diacritics.) > > Hm. That was a capital I with caron when I sent it... A caron-topped capital I is exactly what I got (on my iPhone.) 
/Lew --- Lew Perin | perin@acm.org | http://babelcarp.org From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 01 01:53:12 2012 Received: (at 13041) by debbugs.gnu.org; 1 Dec 2012 06:53:12 +0000 Received: from localhost ([127.0.0.1]:47852 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tegwo-0008C1-WA for submit@debbugs.gnu.org; Sat, 01 Dec 2012 01:53:12 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:26718) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tegwn-0008Bt-2w for 13041@debbugs.gnu.org; Sat, 01 Dec 2012 01:53:09 -0500 Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237]) by aserp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB16otfn002879 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sat, 1 Dec 2012 06:50:56 GMT Received: from acsmt358.oracle.com (acsmt358.oracle.com [141.146.40.158]) by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB16ost1000993 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sat, 1 Dec 2012 06:50:55 GMT Received: from abhmt118.oracle.com (abhmt118.oracle.com [141.146.116.70]) by acsmt358.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB16osfo009796; Sat, 1 Dec 2012 00:50:54 -0600 Received: from dradamslap1 (/71.202.147.44) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Fri, 30 Nov 2012 22:50:53 -0800 From: "Drew Adams" To: "'Lew Perin'" References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Fri, 30 Nov 2012 22:50:48 -0800 Message-ID: <8FBC16BAC2C64112BAD6EC91546DD4F5@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: Thread-Index: Ac3PYiGJARxuMQFBSXiR0Gq7Urq+JgALew/w X-MimeOLE: Produced By 
Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet21.oracle.com [141.146.126.237] X-Spam-Score: -1.9 (-) X-Debbugs-Envelope-To: 13041 Cc: 'Juri Linkov' , 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-) > >>> it's unclear whether the Isearch prompt should indicate > >>> its active state > >> > >> Isearch > >> > >> (But perhaps that suggests recognizing, rather than ignoring, > >> diacritics.) > > > > Hm. That was a capital I with caron when I sent it... > > A caron-topped capital I is exactly what I got (on my iPhone.) Great. I guess it's the encoding used in my mail client that's showing it with no marks. From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 01 03:35:20 2012 Received: (at 13041) by debbugs.gnu.org; 1 Dec 2012 08:35:20 +0000 Received: from localhost ([127.0.0.1]:47916 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeiXf-000265-Tp for submit@debbugs.gnu.org; Sat, 01 Dec 2012 03:35:20 -0500 Received: from mtaout20.012.net.il ([80.179.55.166]:47693) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TeiXc-00025v-Ui for 13041@debbugs.gnu.org; Sat, 01 Dec 2012 03:35:18 -0500 Received: from conversion-daemon.a-mtaout20.012.net.il by a-mtaout20.012.net.il (HyperSendmail v2007.08) id <0MEC00400FJUIZ00@a-mtaout20.012.net.il> for 13041@debbugs.gnu.org; Sat, 01 Dec 2012 10:32:47 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEC004K2FQMAK50@a-mtaout20.012.net.il>; Sat, 01 Dec 2012 10:32:47 +0200 (IST) Date: Sat, 01 Dec 2012 10:32:35 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <87hao6zko4.fsf@mail.jurta.org> To: 
Juri Linkov Message-id: <83fw3qtboc.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > From: Juri Linkov > Date: Sat, 01 Dec 2012 02:27:40 +0200 > Cc: 13041@debbugs.gnu.org, perin@acm.org > > > In the last message of that thread, you say “Provided it doesn’t make > > the search slow, it would be nice to add it to Emacs activating on > > some user settings.” Do you remember if that technique turned out to > > be tolerably speedy? > > Yes, I have no problems with the speed. The problem is how to > disable this feature when it is active. We need a special key > to toggle it in Isearch. One variant is M-s ~ where the easy-to-type > TILDE character represents diacritics. Also it's unclear whether the > Isearch prompt should indicate its active state as e.g. [...] 
Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [80.179.55.166 listed in list.dnswl.org] 0.7 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 1.5 (+)
> From: Juri Linkov
> Date: Sat, 01 Dec 2012 02:27:40 +0200
> Cc: 13041@debbugs.gnu.org, perin@acm.org
>
> > In the last message of that thread, you say “Provided it doesn’t make
> > the search slow, it would be nice to add it to Emacs activating on
> > some user settings.” Do you remember if that technique turned out to
> > be tolerably speedy?
>
> Yes, I have no problems with the speed. The problem is how to
> disable this feature when it is active. We need a special key
> to toggle it in Isearch. One variant is M-s ~ where the easy-to-type
> TILDE character represents diacritics. Also it's unclear whether the
> Isearch prompt should indicate its active state as e.g.

I don't understand why this thread is talking only about Latin characters with diacritics. That is a special case of what Unicode calls "compatibility equivalence" (q.v.). For example, even in the Latin environments, don't you want to find "sniﬀ" when searching for "sniff", and vice versa? And there are similar issues in many non-Latin scripts.

The decomposition of a character such as 'ﬀ' is given by the Unicode database, for example:

    FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
                                        ^^^^^^^^^^^^^^^^^^

(66 hex, or 102 decimal, is the codepoint of 'f'). Emacs already supports these decomposition properties.
E.g.:

  (get-char-code-property ?ﬀ 'decomposition) => (compat 102 102)

Another example, closer to the issue that triggered this thread:

  (get-char-code-property ?è 'decomposition) => (101 768)

(If you want to understand why the previous example included "compat"
in the result, while this one doesn't, read more about Unicode
normalization forms. The distinction is irrelevant for the current
discussion.)

Using these properties, every search string can be converted to a
sequence of non-decomposable characters (this process is recursive,
because the 'decomposition' property can use characters that
themselves are decomposable). If the user wants to ignore diacritics,
then the diacritics should be dropped from the decomposition sequence
before starting the search. E.g., for the decomposition of è above,
we will drop the 768 and will be left with 101, which is 'e'. Then
searching for that string should apply the same decomposition
transformation to the text being searched, when comparing them.

This would be the most general way of solving this issue, a way that
is not limited to diacritics nor to Latin scripts. And doing that
will move Emacs closer to the goal of being Unicode compatible, since
support for this is required by the Unicode Standard.

By contrast, building and using custom data bases of equivalences that
are limited to diacritics in Latin scripts is not moving Emacs towards
that goal. It's just a hack, IMO.
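The procedure described above (decompose recursively, then drop the
diacritics) can be sketched in a few lines of Emacs Lisp. This is only
an illustration, not code posted in this thread: the `my/' function
names are hypothetical, and it assumes that "diacritics" means exactly
the characters whose general-category is Mn (nonspacing mark).

```elisp
;; Hypothetical helpers; the `my/' names are not part of Emacs.

(defun my/char-full-decomposition (char)
  "Return the list of characters CHAR decomposes to, applied recursively.
A leading tag symbol such as `compat' is skipped."
  (let ((dec (get-char-code-property char 'decomposition)))
    ;; Skip a tag like `compat' at the head of the list, if any.
    (unless (characterp (car dec))
      (setq dec (cdr dec)))
    (if (or (null dec) (equal dec (list char)))
        ;; CHAR does not decompose any further.
        (list char)
      (apply #'append
             (mapcar (lambda (c)
                       ;; Guard against a decomposition that contains CHAR.
                       (if (eq c char)
                           (list c)
                         (my/char-full-decomposition c)))
                     dec)))))

(defun my/fold-diacritics (string)
  "Return STRING fully decomposed, with nonspacing marks (Mn) removed."
  (let (out)
    (dolist (char (string-to-list string))
      (dolist (c (my/char-full-decomposition char))
        ;; Assumption: only general-category Mn counts as a diacritic.
        (unless (eq (get-char-code-property c 'general-category) 'Mn)
          (push c out))))
    (concat (nreverse out))))
```

Given the property values quoted above, (my/fold-diacritics "è")
should give "e" and (my/fold-diacritics "sniﬀ") should give "sniff".
A search would apply the same function to both the search string and
the text being compared.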

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 01 04:11:48 2012
Date: Sat, 01 Dec 2012 11:09:20 +0200
From: Eli Zaretskii
Subject: Re: bug#13041: 24.2; diacritic-fold-search
In-reply-to: <83fw3qtboc.fsf@gnu.org>
To: juri@jurta.org, perin@panix.com
Cc: 13041@debbugs.gnu.org, perin@acm.org
Message-id: <83ehjat9z3.fsf@gnu.org>
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
Reply-To: Eli Zaretskii

> Date: Sat, 01 Dec 2012 10:32:35 +0200
> From: Eli Zaretskii
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
>
> I don't understand why this thread is talking only about Latin
> characters with diacritics. That is a special case of what Unicode
> calls "compatibility equivalence" (q.e.).
                                     ^^^^
I meant "q.v.", of course. Sorry.

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 01 11:41:11 2012
From: "Drew Adams"
To: "'Eli Zaretskii'", "'Juri Linkov'"
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 1 Dec 2012 08:38:45 -0800
Message-ID: <7DD994F3BDA241E19AFF870D8115AE51@us.oracle.com>
In-Reply-To: <83fw3qtboc.fsf@gnu.org>
Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com

> I don't understand why this thread is talking only about Latin
> characters with diacritics. That is a special case of what Unicode
> calls "compatibility equivalence" (q.e.). For example, even in the
> Latin environments, don't you want to find "sniﬀ" when searching for
> "sniff", and vice versa? And there are similar issues in many
> non-Latin scripts.

Actually, in the original thread I made the same point.
Please see that discussion for this and other points.
http://lists.gnu.org/archive/html/help-gnu-emacs/2012-11/msg00429.html

> The decomposition of a character such as 'ﬀ' is given by
> the Unicode database... Emacs already supports these
> decomposition properties.

That's good news (new to me). So it sounds like even the most hopeful
wanna-haves of the discussion could perhaps be realized without too
much trouble.

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable). If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search. E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'. Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.
>
> This would be the most general way of solving this issue, a way that
> is not limited to diacritics nor to Latin scripts. And doing that
> will move Emacs closer to the goal of being Unicode compatible, since
> support for this is required by the Unicode Standard.

This sounds great. I really hope someone with the time and knowledge
adds such a feature soon (even though, to be clear, I personally do not
have much need for it). I think it would be very handy for many users -
most welcome.

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 01 19:49:53 2012
From: Juri Linkov
To: Eli Zaretskii
Subject: Re: bug#13041: 24.2; diacritic-fold-search
In-Reply-To: <83fw3qtboc.fsf@gnu.org> (Eli Zaretskii's message of "Sat, 01 Dec 2012 10:32:35 +0200")
Organization: JURTA
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
Date: Sun, 02 Dec 2012 02:27:32 +0200
Message-ID: <87hao5jqu3.fsf@mail.jurta.org>
Cc: perin@acm.org, 13041@debbugs.gnu.org,
 perin@panix.com

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable). If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search. E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'. Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.

Yes, using the `decomposition' property would be better than
hard-coding these decomposition mappings. Though I'm surprised
to see case mappings hard-coded in lisp/international/characters.el
instead of using the properties `uppercase' and `lowercase'
during creation of case tables.

But nevertheless the `decomposition' property should be used to find
all decomposable characters. The question is how to use them in the
search.

One solution is to use the case tables.
I tried to build the case table with the decomposed characters
retrieved using the `decomposition' property recursively:

(defvar decomposition-table nil)

(defun make-decomposition-table ()
  (let ((table (standard-case-table))
        canon)
    (setq canon (copy-sequence table))
    (let ((c #x0000) d)
      (while (<= c #xFFFD)
        (make-decomposition-table-1 canon c c)
        (setq c (1+ c))))
    (set-char-table-extra-slot table 1 canon)
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))

(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))

(make-decomposition-table)

Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using

(set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters
in every row of this example:

http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But it seems using the case table for decomposition has one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing. These characters
have the general-category "Mn (Mark, Nonspacing)", so they should be
ignored in the search.

An alternative would be to build a regexp from the search string
like building a regexp for word-search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
                       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           (string c0)))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above but instead of
activating it, it's necessary to "shuffle" the equivalences table with
the following code that prepares the table but doesn't enable it in the
current buffer:

(with-temp-buffer (set-case-table decomposition-table))

The advantage of the regexp-based approach is making combining accents
optional in the search string. But there is another problem: how to
ignore combining accents in the buffer when the search string doesn't
contain them. With regexps this means adding a group of all possible
combining accents after every character in the search string like
turning a search string like "abc" into
"a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?". This would make
the search slow, and I have no better idea.
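The last idea (allowing combining accents after every character of the
search string) can be sketched with a character range instead of an
explicit list of accents. This is only an illustration, not code posted
in this thread: the `my/' names are hypothetical, and the sketch assumes
that the Combining Diacritical Marks block U+0300..U+036F is a good
enough stand-in for "all possible combining accents". It matches
decomposed text only; a precomposed character such as è in the buffer is
not handled here, and the effect on search speed has not been measured.

```elisp
;; Hypothetical helpers; the `my/' names are not part of Emacs.

(defconst my/combining-marks-regexp
  ;; Assumption: U+0300..U+036F covers the accents we care about.
  (concat "[" (string #x300) "-" (string #x36F) "]*")
  "Regexp matching any run of characters from U+0300 to U+036F.")

(defun my/mark-insensitive-regexp (string)
  "Return a regexp matching STRING with optional marks after each character."
  (mapconcat (lambda (c)
               ;; Quote each character so that e.g. \".\" stays literal.
               (concat (regexp-quote (string c)) my/combining-marks-regexp))
             string ""))
```

For example, the regexp built from "abc" matches both "abc" and the
sequence a, U+0301 (COMBINING ACUTE ACCENT), b, c.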

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 02 12:48:11 2012
Message-ID: <50BB93C2.1050007@gmx.at>
Date: Sun, 02 Dec 2012 18:45:38 +0100
From: martin rudalics
To: Juri Linkov
Subject: Re: bug#13041: 24.2; diacritic-fold-search
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
 <87hao5jqu3.fsf@mail.jurta.org>
In-Reply-To: <87hao5jqu3.fsf@mail.jurta.org>
Cc: perin@acm.org, Eli Zaretskii, perin@panix.com, 13041@debbugs.gnu.org

> But nevertheless the `decomposition' property should be used to find
> all decomposable characters. The question is how to use them in the
> search.
Whatever solution you find most suitable here, it would be nice to come
up with a similar solution for sorting. I've been playing around with a
function like

(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
         (length2 (length string2))
         (min-length (min length1 length2))
         (index 0)
         type1 type2)
    (catch 'found
      (while (< index min-length)
        (setq type1 (car (get-char-code-property
                          (elt string1 index) 'decomposition)))
        (setq type2 (car (get-char-code-property
                          (elt string2 index) 'decomposition)))
        (cond
         ((< type1 type2)
          (throw 'found t))
         ((> type1 type2)
          (throw 'found nil)))
        ;; Continue.
        (setq index (1+ index)))
      ;; Shorter is less.
      (< length1 length2))))

but am not sure whether I'm missing something wrt the return value of
`get-char-code-property'.

martin

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 02 13:05:32 2012
Date: Sun, 02 Dec 2012 20:02:59 +0200
From: Eli Zaretskii
Subject: Re: bug#13041: 24.2; diacritic-fold-search
In-reply-to: <50BB93C2.1050007@gmx.at>
To: martin rudalics
Message-id: <83y5hgs564.fsf@gnu.org>
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
 <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at>
Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com
Reply-To: Eli Zaretskii

> Date: Sun, 02 Dec 2012 18:45:38 +0100
> From: martin rudalics
> CC: Eli Zaretskii, perin@panix.com, 13041@debbugs.gnu.org,
>  perin@acm.org
>
>         (setq type1 (car (get-char-code-property
>                           (elt string1 index) 'decomposition)))
>         (setq type2 (car (get-char-code-property
>                           (elt string2 index) 'decomposition)))
>         (cond
>          ((< type1 type2)
>           (throw 'found t))
>          ((> type1 type2)
>           (throw 'found nil)))
>         ;; Continue.
>         (setq index (1+ index)))
>       ;; Shorter is less.
>       (< length1 length2))))
>
> but am not sure whether I'm missing something wrt the return value of
> `get-char-code-property'.

Maybe only the fact that it can return a list whose car is 'compat',
see the examples I posted.
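The point about `compat' can be illustrated with a small variant of the
comparison function that skips a non-character car before comparing.
This is only a sketch, not code posted in this thread: the `my/' names
are hypothetical, and like the function quoted above it compares one
base character per position, so it does not account for a single
character that decomposes into several.

```elisp
;; Hypothetical helpers; the `my/' names are not part of Emacs.

(defun my/decomposition-base (char)
  "Return the first character of CHAR's decomposition.
A leading tag symbol such as `compat' is skipped; if there is no
decomposition, return CHAR itself."
  (let ((dec (get-char-code-property char 'decomposition)))
    (while (and dec (not (characterp (car dec))))
      (setq dec (cdr dec)))
    (or (car dec) char)))

(defun my/decomposed-string-lessp (string1 string2)
  "Return non-nil if STRING1 sorts before STRING2 by base characters."
  (let ((n (min (length string1) (length string2)))
        (i 0)
        (result 'equal))
    (while (and (eq result 'equal) (< i n))
      (let ((b1 (my/decomposition-base (aref string1 i)))
            (b2 (my/decomposition-base (aref string2 i))))
        (cond ((< b1 b2) (setq result t))
              ((> b1 b2) (setq result nil))))
      (setq i (1+ i)))
    ;; All compared positions were equal: the shorter string is less.
    (if (eq result 'equal)
        (< (length string1) (length string2))
      result)))
```

With the property values shown earlier in the thread,
(my/decomposition-base ?ﬀ) skips `compat' and returns 102, while
(my/decomposition-base ?è) returns 101.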

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 02 13:19:21 2012
Date: Sun, 02 Dec 2012 20:16:17 +0200
From: Eli Zaretskii
Subject: Re: bug#13041: 24.2; diacritic-fold-search
In-reply-to: <87hao5jqu3.fsf@mail.jurta.org>
To: Juri Linkov
Message-id: <83wqx0s4jy.fsf@gnu.org>
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
 <87hao5jqu3.fsf@mail.jurta.org>
Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com
Reply-To: Eli Zaretskii

> From: Juri Linkov
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
> Date: Sun, 02 Dec 2012 02:27:32 +0200
>
> I'm surprised to see case mappings hard-coded in
> lisp/international/characters.el instead of using the properties
> `uppercase' and `lowercase' during creation of case tables.

My guess is that this is because the code in characters.el was written
long before we had access to Unicode character properties in Emacs,
and in fact before Emacs was switched to character representation
based on Unicode codepoints. And no one bothered to rewrite that code
since then; volunteers are welcome.

> (defvar decomposition-table nil)
>
> (defun make-decomposition-table ()
>   (let ((table (standard-case-table))
>         canon)
>     (setq canon (copy-sequence table))
>     (let ((c #x0000) d)
>       (while (<= c #xFFFD)
>         (make-decomposition-table-1 canon c c)
>         (setq c (1+ c))))
>     (set-char-table-extra-slot table 1 canon)
>     (set-char-table-extra-slot table 2 nil)
>     (setq decomposition-table table)))
>
> (defun make-decomposition-table-1 (canon c0 c1)
>   (let ((d (get-char-code-property c1 'decomposition)))
>     (when d
>       (unless (characterp (car d)) (pop d))
>       (if (eq c1 (car d))
>           (aset canon c0 (car d))
>         (make-decomposition-table-1 canon c0 (car d))))))
>
> (make-decomposition-table)
>
> Then a new Isearch command (the existing `isearch-toggle-case-fold'
> can't be used because it enables/disables the standard case table)
> could toggle between the current case table and the decomposition
> case table using
>
> (set-case-table decomposition-table)
>
> After evaluating this, Isearch correctly finds all related characters
> in every row of this example:
>
> http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> But it seems using the case table for decomposition has one limitation.
> I see no way to ignore combining accent characters in the case table,
> i.e. to map combining accent characters to nothing. These characters
> have the general-category "Mn (Mark, Nonspacing)", so they should be
> ignored in the search.

IMO, using case tables for this is evil. If I want to "fold"
diacritics in search, that doesn't necessarily mean I want to fold the
letter-case as well. I might want doing that, or I might not; these
are two orthogonal features.

So we need a separate kind of char-table, one that could be installed
in addition to the case table, and one that will interpret nil as
an indication to ignore the character during search.

Then we will be able to ignore combining accents, as we indeed should.
We also need to modify the searching primitives to consult this new
table, in addition to case table.

IOW, I don't think we can implement this feature entirely in Lisp.
Some changes are needed on the C level as well.

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 02 17:10:16 2012
From: Juri Linkov
To: Eli Zaretskii
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Organization: JURTA
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
 <87hao5jqu3.fsf@mail.jurta.org> <83wqx0s4jy.fsf@gnu.org>
Date: Sun, 02 Dec 2012 23:31:20 +0200
In-Reply-To: <83wqx0s4jy.fsf@gnu.org> (Eli Zaretskii's message of "Sun, 02 Dec 2012 20:16:17 +0200")
Message-ID: <87txs4jgkv.fsf@mail.jurta.org>
Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com

> IMO, using case tables for this is evil. If I want to "fold"
> diacritics in search, that doesn't necessarily mean I want to fold the
> letter-case as well. I might want doing that, or I might not; these
> are two orthogonal features.

`decomposition-table' is a separate char-table that has the subtype
`case-table'. It should not conflict with the standard case table,
so using `isearch-toggle-case-fold' should still toggle the usage of
the standard case table.

To toggle folding in the diacritics search perhaps requires having two
decomposition tables: one where upper and lower case letters belong to
one equivalence set, and another where they are in different sets,
so `isearch-toggle-decomposition' could toggle between them.

Or should the standard case table and the decomposition table be
combined some other way? Maybe like the existing variable
`case-fold-search' to add a new variable `decomposition-search'
to enable/disable diacritics in search.

> So we need a separate kind of char-table, one that could be installed
> in addition to the case table, and one that will interpret nil as
> an indication to ignore the character during search.

I believe this kind of char-table should be based on the existing
subtype `case-table' because it provides the features necessary for
decomposition search such as extra table EQUIVALENCES (that permutes
each equivalence class) and the extra table CANONICALIZE (where the
canonical character is the final character in the recursion that
traverses the `decomposition' property).

> Then we will be able to ignore combining accents, as we indeed should.
> We also need to modify the searching primitives to consult this new
> table, in addition to case table.

Yes, it seems the feature of ignoring combining accents (i.e.
mapping some characters to nil) can't be added to existing case tables
because for the case table this would mean that converting a string
to upper case might delete some characters (like combining accents)
and converting a string to lower case might add combining accents to
the string that of course makes no sense.

> IOW, I don't think we can implement this feature entirely in Lisp.
> Some changes are needed on the C level as well.

A hack that abuses the standard case table is already possible in Lisp.
A complete implementation requires changes on the C level.

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 02 17:10:19 2012
From: Juri Linkov
To: martin rudalics
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Organization: JURTA
References: <20121130182205.C722F14B8D@panix1.panix.com>
 <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com>
 <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org>
 <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at>
Date: Sun, 02 Dec 2012 23:39:17 +0200
In-Reply-To: <50BB93C2.1050007@gmx.at> (martin rudalics's message of "Sun, 02 Dec 2012 18:45:38 +0100")
Message-ID: <87zk1wgloq.fsf@mail.jurta.org>
X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, Eli Zaretskii , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/) > Whatever solution you find most suitable here, it would be nice to come > up with a similar solution for sorting. I've been playing around with a > function like Did you try to build the case table with the diacritics mappings? It should affect the sorting as well without requiring any changes in sorting functions. From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 05:18:58 2012 Received: (at 13041) by debbugs.gnu.org; 3 Dec 2012 10:18:58 +0000 Received: from localhost ([127.0.0.1]:50839 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfT73-0000rl-Ox for submit@debbugs.gnu.org; Mon, 03 Dec 2012 05:18:57 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:59059) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TfT71-0000rd-4H for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 05:18:55 -0500 Received: (qmail invoked by alias); 03 Dec 2012 10:16:32 -0000 Received: from 62-47-53-55.adsl.highway.telekom.at (EHLO [62.47.53.55]) [62.47.53.55] by mail.gmx.net (mp034) with SMTP; 03 Dec 2012 11:16:32 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX19XC49g39NXs6gBtQA6jrY4x+8cvL33wDyDnmOUVs IVgkNRXdvpC2Sk Message-ID: <50BC7BF5.2020400@gmx.at> Date: Mon, 03 Dec 2012 11:16:21 +0100 From: martin rudalics MIME-Version: 1.0 To: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> 
<50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> In-Reply-To: <83y5hgs564.fsf@gnu.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > Maybe only the fact that it can return a list whose car is 'compat', > see the examples I posted. So I need two indices for looping. But what are the guidelines to interpet `compat'? Does every list starting with a `compat' mean that the remaining entries of that list represent the constituents of that composite? [...] Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.22 listed in list.dnswl.org] 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.53.55 listed in combined.njabl.org] -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming 
email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > Maybe only the fact that it can return a list whose car is 'compat', > see the examples I posted. So I need two indices for looping. But what are the guidelines to interpet `compat'? Does every list starting with a `compat' mean that the remaining entries of that list represent the constituents of that composite? [...] Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.53.55 listed in combined.njabl.org] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.22 listed in list.dnswl.org] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.4930]

> Maybe only the fact that it can return a list whose car is 'compat',
> see the examples I posted.

So I need two indices for looping. But what are the guidelines to interpret `compat'? Does every list starting with a `compat' mean that the remaining entries of that list represent the constituents of that composite?

And how do I now call `put-char-code-property' to make the German sharp "s" ("ß") equivalent to "ss"?
Or am I not supposed to do such a thing?

martin

From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 05:19:18 2012 Received: (at 13041) by debbugs.gnu.org; 3 Dec 2012 10:19:18 +0000 Received: from localhost ([127.0.0.1]:50843 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfT7N-0000sl-WA for submit@debbugs.gnu.org; Mon, 03 Dec 2012 05:19:18 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:39151) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TfT7M-0000se-Ca for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 05:19:16 -0500 Received: (qmail invoked by alias); 03 Dec 2012 10:16:53 -0000 Received: from 62-47-53-55.adsl.highway.telekom.at (EHLO [62.47.53.55]) [62.47.53.55] by mail.gmx.net (mp017) with SMTP; 03 Dec 2012 11:16:53 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX1+oZdw0uE6oPW+sF+OpfOHDSI2k7GESsb1CsesRT/ MTxs+nYeU/WNQl Message-ID: <50BC7C0A.4040401@gmx.at> Date: Mon, 03 Dec 2012 11:16:42 +0100 From: martin rudalics MIME-Version: 1.0 To: Juri Linkov Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <87zk1wgloq.fsf@mail.jurta.org> In-Reply-To: <87zk1wgloq.fsf@mail.jurta.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > Did you try to build the case table with the diacritics mappings? 
It should > affect the sorting as well without requiring any changes in sorting functions. I tried but it didn't work out. I have to understand your code first before I can tell what happens. In any case, doing your [...] Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.53.55 listed in combined.njabl.org] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.23 listed in list.dnswl.org] X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, Eli Zaretskii , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > Did you try to build the case table with the diacritics mappings? It should > affect the sorting as well without requiring any changes in sorting functions. I tried but it didn't work out. I have to understand your code first before I can tell what happens. In any case, doing your [...] 
Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.53.55 listed in combined.njabl.org] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.23 listed in list.dnswl.org] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.4991] > Did you try to build the case table with the diacritics mappings? It should > affect the sorting as well without requiring any changes in sorting functions. I tried but it didn't work out. I have to understand your code first before I can tell what happens. In any case, doing your (set-case-table decomposition-table) permanently for a buffer crashed Emacs here. martin From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 11:50:30 2012 Received: (at 13041) by debbugs.gnu.org; 3 Dec 2012 16:50:30 +0000 Received: from localhost ([127.0.0.1]:51674 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfZDx-0003eL-QS for submit@debbugs.gnu.org; Mon, 03 Dec 2012 11:50:30 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:53304) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfZDv-0003eD-ID for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 11:50:28 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEG00K00RXY0H00@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 18:47:36 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEG00KHVRZB0G00@a-mtaout22.012.net.il>; Mon, 03 Dec 2012 18:47:36 +0200 (IST) Date: Mon, 03 Dec 2012 18:47:30 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; 
diacritic-fold-search In-reply-to: <50BC7BF5.2020400@gmx.at> To: martin rudalics Message-id: <83hao3rskd.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=iso-8859-15 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > Date: Mon, 03 Dec 2012 11:16:21 +0100 > From: martin rudalics > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > But what are the guidelines to interpet `compat'? [...] 
Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [80.179.55.172 listed in list.dnswl.org] 0.7 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.4650] X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.7 (/)

> Date: Mon, 03 Dec 2012 11:16:21 +0100
> From: martin rudalics
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
>  perin@acm.org
>
> But what are the guidelines to interpret `compat'?

For the purposes of comparing strings, both 'compatibility' and 'canonical' decompositions should be treated the same, AFAIU. You can find the details here:

  http://unicode.org/reports/tr15/

> Does every list starting with a `compat' mean that the remaining
> entries of that list represent the constituents of that composite?

Yes. This comes directly from UnicodeData.txt, e.g.:

  0132;LATIN CAPITAL LIGATURE IJ;Lu;0;L;<compat> 0049 004A;;;;N;LATIN CAPITAL LETTER I J;;;0133;
                                        ^^^^^^^^^^^^^^^^^^

> And how do I now call `put-char-code-property' to make the German sharp
> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?

That's already set up in the appropriate case table, I think. But it is not a compatibility decomposition, AFAIK.
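
To see from Lisp what such a property value looks like, here is a minimal sketch. It relies only on `get-char-code-property'; the helper name `my-decomposition-chars' is hypothetical and not part of Emacs. It drops a leading tag symbol, so that compatibility and canonical decompositions can be treated the same way when comparing strings.

```elisp
;; Sketch only: `my-decomposition-chars' is a hypothetical helper name.
(defun my-decomposition-chars (char)
  "Return the decomposition of CHAR as a list of characters.
A leading formatting tag such as `compat' is dropped."
  (let ((decomp (get-char-code-property char 'decomposition)))
    (if (and decomp (symbolp (car decomp)))
        (cdr decomp)    ; tagged, e.g. (compat 73 74)
      decomp)))         ; canonical or no decomposition

(get-char-code-property #x0132 'decomposition) ; (compat 73 74)
(my-decomposition-chars #x0132)                ; (73 74), that is "IJ"
(my-decomposition-chars #xE9)                  ; (101 769), "e" + combining acute
(my-decomposition-chars ?a)                    ; (97), the character itself
```

A character without any decomposition yields a one-element list holding the character itself, so the helper always returns a list of characters.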
From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 12:45:31 2012 Received: (at 13041) by debbugs.gnu.org; 3 Dec 2012 17:45:31 +0000 Received: from localhost ([127.0.0.1]:51715 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tfa5C-0004vu-W7 for submit@debbugs.gnu.org; Mon, 03 Dec 2012 12:45:31 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:40615) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1Tfa5B-0004vm-0l for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 12:45:29 -0500 Received: (qmail invoked by alias); 03 Dec 2012 17:43:05 -0000 Received: from 62-47-53-55.adsl.highway.telekom.at (EHLO [62.47.53.55]) [62.47.53.55] by mail.gmx.net (mp041) with SMTP; 03 Dec 2012 18:43:05 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18CV/8cXO8Cm4CYmUrKJ3CtNKSSvxjsnLfz08S92T Xn5Uiqy0bRO0dm Message-ID: <50BCE49D.6010001@gmx.at> Date: Mon, 03 Dec 2012 18:42:53 +0100 From: martin rudalics MIME-Version: 1.0 To: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> In-Reply-To: <83hao3rskd.fsf@gnu.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. 
Content preview: >> And how do I now call `put-char-code-property' to make the German sharp >> "s" ("") equivalent to "ss"? Or am I not supposed to do such a thing? > > That's already set up in the appropriate case table, I think. [...] Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.53.55 listed in combined.njabl.org] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.23 listed in list.dnswl.org] -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: >> And how do I now call `put-char-code-property' to make the German sharp >> "s" ("") equivalent to "ss"? Or am I not supposed to do such a thing? > > That's already set up in the appropriate case table, I think. [...] 
Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.53.55 listed in combined.njabl.org] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.23 listed in list.dnswl.org] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.4932]

>> And how do I now call `put-char-code-property' to make the German sharp
>> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?
>
> That's already set up in the appropriate case table, I think.

Why in a case table? Both "ß" and "ss" are lower case.

> But it
> is not a compatibility decomposition, AFAIK.

But I can make it one?

martin

From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 13:02:23 2012 Received: (at 13041) by debbugs.gnu.org; 3 Dec 2012 18:02:24 +0000 Received: from localhost ([127.0.0.1]:51736 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfaLX-0005KE-3a for submit@debbugs.gnu.org; Mon, 03 Dec 2012 13:02:23 -0500 Received: from mtaout23.012.net.il ([80.179.55.175]:61176) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfaLV-0005K6-6K for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 13:02:22 -0500 Received: from conversion-daemon.a-mtaout23.012.net.il by a-mtaout23.012.net.il (HyperSendmail v2007.08) id <0MEG00J00UOS2L00@a-mtaout23.012.net.il> for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 19:59:36 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout23.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEG00IYNVB9VX50@a-mtaout23.012.net.il>; Mon, 03 Dec 2012 19:59:34 +0200 (IST) Date: Mon, 03 Dec 2012 19:59:28 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; 
diacritic-fold-search In-reply-to: <50BCE49D.6010001@gmx.at> To: martin rudalics Message-id: <837gozrp8f.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=iso-8859-15 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > Date: Mon, 03 Dec 2012 18:42:53 +0100 > From: martin rudalics > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > >> And how do I now call `put-char-code-property' to make the German sharp > >> "s" ("") equivalent to "ss"? Or am I not supposed to do such a thing? > > > > That's already set up in the appropriate case table, I think. > > Why in a case table? Both "" and "ss" are lower case. [...] 
Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [80.179.55.175 listed in list.dnswl.org] 0.7 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.4494] X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.7 (/)

> Date: Mon, 03 Dec 2012 18:42:53 +0100
> From: martin rudalics
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
>  perin@acm.org
>
> >> And how do I now call `put-char-code-property' to make the German sharp
> >> "s" ("ß") equivalent to "ss"? Or am I not supposed to do such a thing?
> >
> > That's already set up in the appropriate case table, I think.
>
> Why in a case table? Both "ß" and "ss" are lower case.

I meant the relation "ß" => "SS".

> > But it
> > is not a compatibility decomposition, AFAIK.
>
> But I can make it one?

Yes, you can modify the table set up by uni-decomposition.el. I think.
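
A minimal sketch of such a modification, using only `put-char-code-property' and `get-char-code-property'. The value shape (a `compat' tag followed by the constituent characters) mirrors what the table returns for characters that already have a compatibility decomposition; that the table accepts it for U+00DF is an assumption here, not something the Unicode data provides.

```elisp
;; Sketch only: give U+00DF (sharp s) a `compat'-tagged decomposition
;; to "ss".  This changes the property for the whole Emacs session.
(put-char-code-property #xDF 'decomposition (list 'compat ?s ?s))

;; Read the property back to check what the table now holds.
(get-char-code-property #xDF 'decomposition)
```

Code that looks for a `compat' tag in the `decomposition' property would then see the sharp s as two constituents.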
From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 19:20:43 2012 Received: (at 13041) by debbugs.gnu.org; 4 Dec 2012 00:20:43 +0000 Received: from localhost ([127.0.0.1]:52157 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfgFe-0006X3-W3 for submit@debbugs.gnu.org; Mon, 03 Dec 2012 19:20:43 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:56792 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfgFc-0006Ww-P1 for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 19:20:41 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id E70DCAAB4B9C; Mon, 3 Dec 2012 16:18:13 -0800 (PST) From: Juri Linkov To: martin rudalics Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <87zk1wgloq.fsf@mail.jurta.org> <50BC7C0A.4040401@gmx.at> Date: Tue, 04 Dec 2012 02:17:04 +0200 In-Reply-To: <50BC7C0A.4040401@gmx.at> (martin rudalics's message of "Mon, 03 Dec 2012 11:16:42 +0100") Message-ID: <87sj7miscf.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, Eli Zaretskii , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/) > In any case, doing your > > (set-case-table decomposition-table) > > permanently for a buffer crashed Emacs here. 
With more use I see crashes too. The backtrace says that crashes are in boyer_moore. From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 03 22:43:44 2012 Received: (at 13041) by debbugs.gnu.org; 4 Dec 2012 03:43:44 +0000 Received: from localhost ([127.0.0.1]:52283 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfjQ6-0003b8-BC for submit@debbugs.gnu.org; Mon, 03 Dec 2012 22:43:43 -0500 Received: from mtaout20.012.net.il ([80.179.55.166]:49298) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfjQ0-0003au-3j for 13041@debbugs.gnu.org; Mon, 03 Dec 2012 22:43:37 -0500 Received: from conversion-daemon.a-mtaout20.012.net.il by a-mtaout20.012.net.il (HyperSendmail v2007.08) id <0MEH00000LQ4KS00@a-mtaout20.012.net.il> for 13041@debbugs.gnu.org; Tue, 04 Dec 2012 05:41:08 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEH00MM2M8JDTB0@a-mtaout20.012.net.il>; Tue, 04 Dec 2012 05:41:08 +0200 (IST) Date: Tue, 04 Dec 2012 05:41:04 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <87sj7miscf.fsf@mail.jurta.org> X-012-Sender: halo1@inter.net.il To: Juri Linkov Message-id: <83txs2qyb3.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <87zk1wgloq.fsf@mail.jurta.org> <50BC7C0A.4040401@gmx.at> <87sj7miscf.fsf@mail.jurta.org> X-Spam-Score: 0.7 (/) X-Debbugs-Envelope-To: 13041 Cc: rudalics@gmx.at, 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: 
debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.2 (/) > From: Juri Linkov > Cc: Eli Zaretskii , perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > Date: Tue, 04 Dec 2012 02:17:04 +0200 > > > In any case, doing your > > > > (set-case-table decomposition-table) > > > > permanently for a buffer crashed Emacs here. > > With more use I see crashes too. The backtrace says that crashes are in > boyer_moore. Please file a bug report with a minimal reproducible recipe. From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 04 12:57:36 2012 Received: (at 13041) by debbugs.gnu.org; 4 Dec 2012 17:57:36 +0000 Received: from localhost ([127.0.0.1]:53514 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfwkR-0007f5-RC for submit@debbugs.gnu.org; Tue, 04 Dec 2012 12:57:36 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:39643) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TfwkP-0007ex-9P for 13041@debbugs.gnu.org; Tue, 04 Dec 2012 12:57:34 -0500 Received: (qmail invoked by alias); 04 Dec 2012 17:55:02 -0000 Received: from 62-47-48-231.adsl.highway.telekom.at (EHLO [62.47.48.231]) [62.47.48.231] by mail.gmx.net (mp029) with SMTP; 04 Dec 2012 18:55:02 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX191R98pzRV1Pw5iW9/HSl4nQZVrvSor7BIGFDdUlF Dmp31KSkJeueer Message-ID: <50BE38F3.3030907@gmx.at> Date: Tue, 04 Dec 2012 18:54:59 +0100 From: martin rudalics MIME-Version: 1.0 To: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> In-Reply-To: <837gozrp8f.fsf@gnu.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed 
Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/)

> Yes, you can modify the table set up by uni-decomposition.el. I
> think.

Seems to work well. The function I came up with goes as below.

Thanks for the hints, martin


(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
         (length2 (length string2))
         (index1 0)
         (index2 0)
         prop1 prop2 type1 type2 compat1 compat2)
    (catch 'found
      (while (and (< index1 length1)
                  (< index2 length2))
        ;; The car of the property is either a tag such as `compat' or
        ;; the first character of the decomposition.  Only the `compat'
        ;; tag is handled; other tags would reach the numeric
        ;; comparison below.
        (setq prop1 (get-char-code-property
                     (downcase (elt string1 index1)) 'decomposition))
        (setq type1 (car prop1))
        (setq prop2 (get-char-code-property
                     (downcase (elt string2 index2)) 'decomposition))
        (setq type2 (car prop2))
        (cond
         ;; Both characters have a compatibility decomposition.
         ((and (eq type1 'compat) (eq type2 'compat))
          (setq compat1 (concat (cdr prop1)))
          (setq compat2 (concat (cdr prop2)))
          (let ((value (compare-strings compat1 0 nil compat2 0 nil t)))
            (cond
             ((eq value t)
              (setq index1 (1+ index1))
              (setq index2 (1+ index2)))
             ((< value 0)
              (throw 'found t))
             ((> value 0)
              (throw 'found nil)))))
         ;; Only the character from STRING1 decomposes.
         ((eq type1 'compat)
          (setq compat1 (concat (cdr prop1)))
          (let ((value (compare-strings
                        compat1 0 nil
                        string2 index2
                        (min (+ index2 (length compat1)) length2)
                        t)))
            (cond
             ((eq value t)
              (setq index1 (1+ index1))
              (setq index2 (+ index2 (length compat1))))
             ((< value 0)
              (throw 'found t))
             ((> value 0)
              (throw 'found nil)))))
         ;; Only the character from STRING2 decomposes.
         ((eq type2 'compat)
          (setq compat2 (concat (cdr prop2)))
          (let ((value (compare-strings
                        string1 index1
                        (min (+ index1 (length compat2)) length1)
                        compat2 0 nil
                        t)))
            (cond
             ((eq value t)
              (setq index1 (+ index1 (length compat2)))
              (setq index2 (1+ index2)))
             ((< value 0)
              (throw 'found t))
             ((> value 0)
              (throw 'found nil)))))
         ((< type1 type2)
          (throw 'found t))
         ((> type1 type2)
          (throw 'found nil))
         (t
          (setq index1 (1+ index1))
          (setq index2 (1+ index2)))))
      ;; Shorter is less: STRING1 ran out while STRING2 has text left.
      (< index2 length2))))

From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 04 14:28:41 2012 Received: (at 13041) by debbugs.gnu.org; 4 Dec 2012 19:28:41 +0000 Received: from localhost ([127.0.0.1]:53626 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfyAa-0002LU-Uj for submit@debbugs.gnu.org; Tue, 04 Dec 2012 14:28:41 -0500 Received: from mtaout20.012.net.il ([80.179.55.166]:33392) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TfyAY-0002LK-G5 for 13041@debbugs.gnu.org; Tue, 04 Dec 2012 14:28:39 -0500 Received: from conversion-daemon.a-mtaout20.012.net.il by a-mtaout20.012.net.il (HyperSendmail v2007.08) id <0MEI00C00U1PBH00@a-mtaout20.012.net.il> for 13041@debbugs.gnu.org; Tue, 04 Dec 2012 21:28:34 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEI00CRFU3L5V10@a-mtaout20.012.net.il>; Tue, 04 Dec 2012 21:28:33 +0200 (IST) Date: Tue, 04 Dec 2012 21:28:32 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <50BE38F3.3030907@gmx.at> X-012-Sender: halo1@inter.net.il To: martin rudalics Message-id: <83hao1r50f.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> X-Spam-Score: 0.7 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 
13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.2 (/)

> Date: Tue, 04 Dec 2012 18:54:59 +0100
> From: martin rudalics
> CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org,
>  perin@acm.org
>
> > Yes, you can modify the table set up by uni-decomposition.el.  I
> > think.
>
> Seems to work well.  The function I came up with goes as below.

How about putting it in subr.el?

From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 04 15:13:12 2012 Received: (at 13041) by debbugs.gnu.org; 4 Dec 2012 20:13:12 +0000 Received: from localhost ([127.0.0.1]:53638 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tfyrg-0003Nf-AX for submit@debbugs.gnu.org; Tue, 04 Dec 2012 15:13:12 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:41973) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tfyre-0003NW-2B for 13041@debbugs.gnu.org; Tue, 04 Dec 2012 15:13:11 -0500 Received: from ucsinet21.oracle.com (ucsinet21.oracle.com [156.151.31.93]) by aserp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB4KD3dB031447 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 4 Dec 2012 20:13:04 GMT Received: from acsmt357.oracle.com (acsmt357.oracle.com [141.146.40.157]) by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB4KD2W6023000 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 4 Dec 2012 20:13:03 GMT Received: from abhmt107.oracle.com (abhmt107.oracle.com [141.146.116.59]) by acsmt357.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB4KD2w9023659; Tue, 4 Dec 2012 14:13:02 -0600 Received: from dradamslap1 (/10.159.232.171) by default (Oracle Beehive Gateway v4.0) with
ESMTP ; Tue, 04 Dec 2012 12:13:01 -0800 From: "Drew Adams" To: "'martin rudalics'" , "'Eli Zaretskii'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Tue, 4 Dec 2012 12:12:57 -0800 Message-ID: <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <50BE38F3.3030907@gmx.at> Thread-Index: Ac3SSJfOqNbJe0MsQy68+FDUM9xTrwAA6L5A X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: ucsinet21.oracle.com [156.151.31.93] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -4.2 (----)

> The function [Martin] came up with goes as below.
> (defun decomposed-string-lessp (string1 string2)
>   "Return t if STRING1 is decomposition-less than STRING2."
>   ...

I know nothing about character composition and have not tested this with
anything but a few western accents.  But this seems like good stuff.

1. Assuming this or similar is added to Emacs (please do).  Please consider
modifying it to respect `case-fold-search'.  These modified lines do that.

(setq prop1 (get-char-code-property
             (if case-fold-search
                 (downcase (elt string1 index1))
               (elt string1 index1))
             'decomposition))

[Same thing for prop2 with string2 and index2.]

(let ((value (compare-strings compat1 0 nil
                              compat2 0 nil case-fold-search)))


2. In addition, consider updating `string-lessp' to be sensitive to a variable
such as this:

(defvar ignore-diacritics nil
  "Non-nil means ignore diacritics for string comparisons.")

With that, an alternative to hard-coding a call to `decomposed-string-lessp' is
to bind `ignore-diacritics' and use `string-lessp'.

A similar change could be made for `compare-strings': reflect the value of
`ignore-diacritics'.  Or since that function has made the choice to pass
case-sensitivity as a parameter instead of respecting `case-fold-search', pass
another parameter for diacritic sensitivity.


3. More general than #2 would be a function like this, which is sensitive to
both `ignore-diacritics' and `case-fold-search' (this assumes the change
suggested above in #1 for `decomposed-string-lessp').

(defun my-string-lessp (s1 s2)
  "..."
  (if ignore-diacritics
      (decomposed-string-lessp s1 s2)
    (when case-fold-search (setq s1 (upcase s1)
                                 s2 (upcase s2)))
    (string-lessp s1 s2)))

Dunno a good name for this.  It's too late to let `string-lessp' itself act like
this - that would break stuff.


4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and
`decomposed-string-lessp' would be to have those functions be sensitive to a
variable such as this:

(defvar string-case-variable 'case-fold-search
  "Value is a case-sensitivity variable such as `case-fold-search'.
The values of that variable must be like those for `case-fold-search':
nil means case-sensitive, non-nil means case-insensitive.")

Code could then bind `string-case-variable' to, say, `(not
completion-ignore-case)' or to any other case-sensitivity controlling sexp, when
appropriate.

This would have the advantages offered by passing an explicit case-sensitivity
parameter, as in `compare-strings', but also the advantages of dynamic scope:
binding `string-case-variable' to affect all comparisons within scope.

Comparers such as `(my-)string-lessp' are often used as arguments to
higher-order functions that treat them as (only) binary predicates, i.e.,
predicates where any additional parameters specifying case or diacritic
sensitivity are ignored.

From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 04 18:16:12 2012 Received: (at 13041) by debbugs.gnu.org; 4 Dec 2012 23:16:12 +0000 Received: from localhost ([127.0.0.1]:53800 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tg1il-0007ez-Si for submit@debbugs.gnu.org; Tue, 04 Dec 2012 18:16:12 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:30198) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tg1ij-0007es-OQ for 13041@debbugs.gnu.org; Tue, 04 Dec 2012 18:16:10 -0500 Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB4NG1hY030887 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Tue, 4 Dec 2012 23:16:02 GMT Received: from acsmt357.oracle.com (acsmt357.oracle.com [141.146.40.157]) by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB4NG1he010768 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 4 Dec 2012 23:16:01 GMT Received: from abhmt110.oracle.com (abhmt110.oracle.com [141.146.116.62]) by acsmt357.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB4NG0ix025542; Tue, 4 Dec 2012 17:16:00 -0600 Received: from dradamslap1 (/10.159.232.171) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Tue, 04 Dec 2012 15:16:00 -0800 From: "Drew Adams" To: "'martin rudalics'" , "'Eli Zaretskii'" References:
<20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Tue, 4 Dec 2012 15:15:56 -0800 Message-ID: <30445B60073741078590414669EF9536@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> Thread-Index: Ac3SSJfOqNbJe0MsQy68+FDUM9xTrwAA6L5AAAohcSA= X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet21.oracle.com [141.146.126.237] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.5 (-)

BTW, there are a couple of minor things to check wrt the code you sent, Martin:

* `min-length' is not used.

* The `cond's all repeat condition (< value 0) twice, with different actions.

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 01:50:50 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 06:50:50 +0000 Received: from localhost ([127.0.0.1]:54138 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tg8oj-0004Sz-NG for submit@debbugs.gnu.org; Wed, 05 Dec 2012 01:50:50 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:34238) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tg8og-0004Sr-JO for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 01:50:47 -0500 Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB56obGP024159 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 06:50:38 GMT Received: from acsmt357.oracle.com (acsmt357.oracle.com [141.146.40.157]) by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB56oain005123 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 06:50:37 GMT Received: from abhmt115.oracle.com (abhmt115.oracle.com [141.146.116.67]) by acsmt357.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB56oapD003470; Wed, 5 Dec 2012 00:50:36 -0600 Received: from dradamslap1 (/10.159.232.122) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Tue, 04 Dec 2012 22:50:36 -0800 From: "Drew Adams" To: "'martin rudalics'" , "'Eli Zaretskii'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <30445B60073741078590414669EF9536@us.oracle.com> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Tue, 4 Dec 2012 22:50:30 
-0800 Message-ID: <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <30445B60073741078590414669EF9536@us.oracle.com> Thread-Index: Ac3SSJfOqNbJe0MsQy68+FDUM9xTrwAA6L5AAAohcSAADeMhcA== X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet21.oracle.com [141.146.126.237] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.8 (--)

This version of Martin's function (but respecting `case-fold-search') is maybe a
tiny bit simpler.  It could also be a bit slower because of `substring'
returning a copy (vs just incrementing an offset).  It should also be checked
for correctness - not really tested.  FWIW/HTH.

(It does correct the two double `(< value 0)' typos I mentioned earlier.  That
should be done in any case.)

(defun decomposed-string-lessp (string1 string2)
  "Return non-nil if decomposed STRING1 is less than decomposed STRING2.
Comparison respects `case-fold-search'."
  (let ((s1 string1)
        (s2 string2)
        prop1 prop2 type1 type2)
    (catch 'found
      (while (and (> (length s1) 0) (> (length s2) 0))
        (setq prop1 (get-char-code-property
                     (if case-fold-search (downcase (elt s1 0)) (elt s1 0))
                     'decomposition)
              prop2 (get-char-code-property
                     (if case-fold-search (downcase (elt s2 0)) (elt s2 0))
                     'decomposition)
              type1 (car prop1)
              type2 (car prop2))
        (when (eq type1 'compat) (setq s1 (concat (cdr prop1))))
        (when (eq type2 'compat) (setq s2 (concat (cdr prop2))))
        (cond ((eq type1 'compat)
               (let ((cs (compare-strings
                          s1 0 nil
                          s2 0 (and (not (eq type2 'compat))
                                    (min (length s1) (length s2)))
                          case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((eq type2 'compat)
               (let ((cs (compare-strings
                          s1 0 (min (length s2) (length s1))
                          s2 0 nil
                          case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((= type1 type2)
               (setq s1 (substring s1 1)
                     s2 (substring s2 1)))
              (t (throw 'found (< type1 type2)))))
      (< (length string1) (length string2)))))

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 04:41:55 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 09:41:55 +0000 Received: from localhost ([127.0.0.1]:54292 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgBUI-0008Ot-EA for submit@debbugs.gnu.org; Wed, 05 Dec 2012 04:41:54 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:55710) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgBUG-0008Ok-Fa for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 04:41:53 -0500 Received: (qmail invoked by alias); 05 Dec 2012 09:41:44 -0000 Received: from 62-47-60-75.adsl.highway.telekom.at (EHLO [62.47.60.75]) [62.47.60.75] by mail.gmx.net (mp024) with SMTP; 05 Dec 2012 10:41:44 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX19dqipWl+5dOp3AdH5JRQEFkoDO+iMmQXuApKaZed JrMzLSCbA7zu7H Message-ID: <50BF16D4.1070506@gmx.at> Date: Wed, 05 Dec 2012 10:41:40 +0100 From: martin rudalics MIME-Version: 1.0 To: Eli Zaretskii Subject: Re: bug#13041: 24.2;
diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <83hao1r50f.fsf@gnu.org> In-Reply-To: <83hao1r50f.fsf@gnu.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/)

> How about putting it in subr.el?

If I correctly understand Juri, I next have to deal with things like

(get-char-code-property #xff59 'decomposition)

and related issues we might unearth in the course of this.  Also, while
currently sorting is stable in the sense that with respect to diacritics
text remains unchanged from the original order, this is not nice for
sorting larger pieces of text.  So I'd rather have to use the second
list element returned by `get-char-code-property' to make sure that, for
example, "e" gets always sorted before "è" before "é".

martin

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 04:42:40 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 09:42:40 +0000 Received: from localhost ([127.0.0.1]:54296 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgBV2-0008QA-43 for submit@debbugs.gnu.org; Wed, 05 Dec 2012 04:42:40 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:60410) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgBUz-0008Q2-89 for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 04:42:38 -0500 Received: (qmail invoked by alias); 05 Dec 2012 09:42:29 -0000 Received: from 62-47-60-75.adsl.highway.telekom.at (EHLO [62.47.60.75]) [62.47.60.75] by mail.gmx.net (mp024) with SMTP; 05 Dec 2012 10:42:29 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX194/ORTaBp40CwiZBEJhPEPHD4uBN+QN1ilHcP9Ge ULVRPZY5Xcf+Hz Message-ID: <50BF1702.4020100@gmx.at> Date: Wed, 05 Dec 2012 10:42:26 +0100 From: martin rudalics MIME-Version: 1.0 To: Drew Adams Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> In-Reply-To: <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 'Eli Zaretskii' , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere:
debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-)

> 1. Assuming this or similar is added to Emacs (please do).  Please consider
> modifying it to respect `case-fold-search'.  These modified lines do that.
>
> (setq prop1 (get-char-code-property
>              (if case-fold-search
>                  (downcase (elt string1 index1))
>                (elt string1 index1))
>              'decomposition))
>
> [Same thing for prop2 with string2 and index2.]

This would have to be done, yes.

> (let ((value (compare-strings compat1 0 nil
>                               compat2 0 nil case-fold-search)))
>
>
> 2. In addition, consider updating `string-lessp' to be sensitive to a variable
> such as this:
>
> (defvar ignore-diacritics nil
>   "Non-nil means ignore diacritics for string comparisons.")
>
> With that, an alternative to hard-coding a call to `decomposed-string-lessp' is
> to bind `ignore-diacritics' and use `string-lessp'.

`ignore-diacritics' is misleading.  The variable would have to be called
`observe-decompositions' or something like that.

> A similar change could be made for `compare-strings': reflect the value of
> `ignore-diacritics'.  Or since that function has made the choice to pass
> case-sensitivity as a parameter instead of respecting `case-fold-search', pass
> another parameter for diacritic sensitivity.

Indeed, `string-lessp' is too weak - we'd need a function to tell
whether two strings are equal disregarding "certain" decomposition
properties.

> 3. More general than #2 would be a function like this, which is sensitive to
> both `ignore-diacritics' and `case-fold-search' (this assumes the change
> suggested above in #1 for `decomposed-string-lessp').
>
> (defun my-string-lessp (s1 s2)
>   "..."
>   (if ignore-diacritics
>       (decomposed-string-lessp s1 s2)
>     (when case-fold-search (setq s1 (upcase s1)
>                                  s2 (upcase s2)))
>     (string-lessp s1 s2)))
>
> Dunno a good name for this.  It's too late to let `string-lessp' itself act like
> this - that would break stuff.

`string-lessp' is in C.  I wouldn't touch it anyway.

> 4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and
> `decomposed-string-lessp' would be to have those functions be sensitive to a
> variable such as this:
>
> (defvar string-case-variable 'case-fold-search
>   "Value is a case-sensitivity variable such as `case-fold-search'.
> The values of that variable must be like those for `case-fold-search':
> nil means case-sensitive, non-nil means case-insensitive.")
>
> Code could then bind `string-case-variable' to, say, `(not
> completion-ignore-case)' or to any other case-sensitivity controlling sexp, when
> appropriate.
>
> This would have the advantages offered by passing an explicit case-sensitivity
> parameter, as in `compare-strings', but also the advantages of dynamic scope:
> binding `string-case-variable' to affect all comparisons within scope.
>
> Comparers such as `(my-)string-lessp' are often used as arguments to
> higher-order functions that treat them as (only) binary predicates, i.e.,
> predicates where any additional parameters specifying case or diacritic
> sensitivity are ignored.

I first have to solve the problems with the values returned by
`get-char-code-property'.  Then I will look into this.

martin From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 04:42:53 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 09:42:53 +0000 Received: from localhost ([127.0.0.1]:54299 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgBVF-0008QX-H9 for submit@debbugs.gnu.org; Wed, 05 Dec 2012 04:42:53 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:34930) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgBVE-0008QQ-6T for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 04:42:52 -0500 Received: (qmail invoked by alias); 05 Dec 2012 09:42:44 -0000 Received: from 62-47-60-75.adsl.highway.telekom.at (EHLO [62.47.60.75]) [62.47.60.75] by mail.gmx.net (mp017) with SMTP; 05 Dec 2012 10:42:44 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX1/WG9N9AynWV3hF5T6Zm02QZWEzI38bZIglMsWDLm TRLgPhU32vtrYi Message-ID: <50BF1711.1010803@gmx.at> Date: Wed, 05 Dec 2012 10:42:41 +0100 From: martin rudalics MIME-Version: 1.0 To: Drew Adams Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <30445B60073741078590414669EF9536@us.oracle.com> In-Reply-To: <30445B60073741078590414669EF9536@us.oracle.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 'Eli Zaretskii' , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> BTW, there are a couple of minor things to check wrt the code you sent, Martin:
>
> * `min-length' is not used.

Leftover from a previous version.

> * The `cond's all repeat condition (< value 0) twice, with different actions.

These are clearly silly, yes.  Funnily, they don't affect the result
since they are never taken and the return value is nil as intended.

martin

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 04:43:06 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 09:43:06 +0000 Received: from localhost ([127.0.0.1]:54304 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgBVR-0008RN-Ne for submit@debbugs.gnu.org; Wed, 05 Dec 2012 04:43:05 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:60629) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgBVQ-0008RG-6Y for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 04:43:04 -0500 Received: (qmail invoked by alias); 05 Dec 2012 09:42:56 -0000 Received: from 62-47-60-75.adsl.highway.telekom.at (EHLO [62.47.60.75]) [62.47.60.75] by mail.gmx.net (mp019) with SMTP; 05 Dec 2012 10:42:56 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX1+7lYFSfwG7XY3WJacEh/wSsoDKwCA9wWZ4F4OfUz EwU8XGkD1lkoqH Message-ID: <50BF171E.3090108@gmx.at> Date: Wed, 05 Dec 2012 10:42:54 +0100 From: martin rudalics MIME-Version: 1.0 To: Drew Adams Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com>
<30445B60073741078590414669EF9536@us.oracle.com> <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com> In-Reply-To: <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 'Eli Zaretskii' , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> This version of Martin's function (but respecting `case-fold-search') is maybe a
> tiny bit simpler.  It could also be a bit slower because of `substring'
> returning a copy (vs just incrementing an offset).  It should also be checked
> for correctness - not really tested.  FWIW/HTH.

The most important application I see for this is within `sort-subr'
where I want to compare buffer substrings in situ by passing their
boundaries.  Hence I plan to provide a version working in terms of
buffer positions.  For simple string checking your version might be
preferable.

martin From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 10:38:35 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 15:38:35 +0000 Received: from localhost ([127.0.0.1]:55134 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgH3T-0001gZ-6t for submit@debbugs.gnu.org; Wed, 05 Dec 2012 10:38:35 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:21660) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgH3O-0001gP-9s for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 10:38:31 -0500 Received: from ucsinet22.oracle.com (ucsinet22.oracle.com [156.151.31.94]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB5FcJiX029884 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 15:38:19 GMT Received: from acsmt357.oracle.com (acsmt357.oracle.com [141.146.40.157]) by ucsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB5FcIk5019189 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 15:38:18 GMT Received: from abhmt107.oracle.com (abhmt107.oracle.com [141.146.116.59]) by acsmt357.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB5FcHPW022064; Wed, 5 Dec 2012 09:38:17 -0600 Received: from dradamslap1 (/10.159.232.122) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 05 Dec 2012 07:38:17 -0800 From: "Drew Adams" To: "'martin rudalics'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Wed, 5 Dec 2012 07:38:10 -0800 Message-ID: 
<611DD154E83240D183A7B5B88691DC37@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <50BF1702.4020100@gmx.at> Thread-Index: Ac3SzNkIeOhfShQLQAypKA2c11bTAgAKwFpw X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: ucsinet22.oracle.com [156.151.31.94] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 'Eli Zaretskii' , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -4.2 (----)

> `ignore-diacritics' is misleading.  The variable would have
> to be called `observe-decompositions' or something like that.

1. "Observe decompositions" doesn't mean anything to me.  The verb should
probably be more active - what does it mean to observe the char decompositions
here?

BTW, if we use "decomposition" in the name and description then we should
probably also use "char" - this is not about decomposing strings in some way
(whatever that might mean); it involves decomposing Unicode characters.

2. But my confusion over the name/description is in fact wrt function
`decomposed-string-lessp': I guess it's not 100% clear to me what it does.

Your doc string said "STRING1 is decomposition-less than STRING2", which
confuses me.  And it is a bit ambiguous wrt "-less":

a. decomposition-less as in comparing the strings only after removing (some
   parts of) their decompositions (i.e., "-less" as in "sans")? or

b. -lessp as in `string<': a comparison ordering relation?

In the version of `decomposed-string-lessp' that I sent, I changed the doc
string to this: "decomposed STRING1 is less than decomposed STRING2".  But that
is no doubt incorrect (less correct than yours, if perhaps clearer).  In
particular, it says nothing about how we compare the two decompositions.

In practical (use) terms, this is typically about ignoring diacritics, keeping
only the "base" characters.  Something about that should at least be mentioned
in the doc, so that users know they can use this for that.

But IIUC this is not just about diacritics; it sometimes might not be about
diacritics at all; and diacritics present are sometimes not ignored.  E.g., the
ligature ffi gets treated the same as the 3 chars f f i.  There are no
diacritics present in that case.

IIUC, we convert the two strings to their Unicode decompositions and then use
the Unicode char compatibility specs to compare the decompositions.  IOW, we
treat equivalent chars, as defined by Unicode, as the same.

Perhaps the name/description should speak in terms of Unicode char compatibility
or equivalence.  Perhaps a name like `string-less-compat-p'?  Or
`Unicode-equivalent-p'?  Or `string-equivalent-p'?

How would you characterize what the function does?  No doubt Eli can help here.
It is important to try to get the function name and description right from the
outset, if we can.  If the Unicode standard has some terminology that applies
here then perhaps we can/should leverage that.

Beyond the name and an accurate description, the doc should, as I say, at least
mention that you can use this to ignore diacritics (such as accents), as that
will be a common use case.

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 10:38:43 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 15:38:43 +0000 Received: from localhost ([127.0.0.1]:55137 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgH3a-0001gs-MH for submit@debbugs.gnu.org; Wed, 05 Dec 2012 10:38:43 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:21786) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgH3X-0001gk-Jx for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 10:38:40 -0500 Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB5FcR8l030061 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 15:38:29 GMT Received: from acsmt357.oracle.com (acsmt357.oracle.com [141.146.40.157]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB5FcRnc018167 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 15:38:27 GMT Received: from abhmt120.oracle.com (abhmt120.oracle.com [141.146.116.72]) by acsmt357.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB5FcRHO022248; Wed, 5 Dec 2012 09:38:27 -0600 Received: from dradamslap1 (/10.159.232.122) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 05 Dec 2012 07:38:26 -0800 From: "Drew Adams" To: "'martin rudalics'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <30445B60073741078590414669EF9536@us.oracle.com> <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com> <50BF171E.3090108@gmx.at> Subject: RE: bug#13041: 24.2; 
diacritic-fold-search Date: Wed, 5 Dec 2012 07:38:20 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <50BF171E.3090108@gmx.at> Thread-Index: Ac3SzOm5QUyYA1ZZT+2Ky/w9vwL4lQAMXVLg X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 'Eli Zaretskii' , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.5 (-) > The most important application I see for this is within `sort-subr' > where I want to compare buffer substrings in situ by passing their > boundaries. Hence I plan to provide a version working in terms of > buffer positions. For simple string checking your version might be > preferable. Please do whatever is right - using positions as you intended. 
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 10:51:32 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 15:51:32 +0000 Received: from localhost ([127.0.0.1]:55143 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgHFv-0001z3-PW for submit@debbugs.gnu.org; Wed, 05 Dec 2012 10:51:32 -0500 Received: from mailbackend.panix.com ([166.84.1.89]:46464) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgHFp-0001yt-T5 for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 10:51:26 -0500 Received: from panix1.panix.com (panix1.panix.com [166.84.1.1]) by mailbackend.panix.com (Postfix) with ESMTP id 726A12E4DB; Wed, 5 Dec 2012 10:51:13 -0500 (EST) Received: by panix1.panix.com (Postfix, from userid 13816) id 5CCBE14B8D; Wed, 5 Dec 2012 10:51:13 -0500 (EST) From: Lewis Perin MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Message-ID: <20671.28016.555406.918606@panix1.panix.com> Date: Wed, 5 Dec 2012 10:51:12 -0500 To: "Drew Adams" Subject: RE: bug#13041: 24.2; diacritic-fold-search In-Reply-To: <611DD154E83240D183A7B5B88691DC37@us.oracle.com> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> <611DD154E83240D183A7B5B88691DC37@us.oracle.com> X-Mailer: VM 8.1.2 under 24.2.1 (i386-unknown-netbsdelf5.1) X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: 'martin rudalics' , 'Eli Zaretskii' , 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: perin@acm.org List-Id: List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.3 (--) Drew Adams writes: > > `ignore-diacritics' is misleading. The variable would have > > to be called `observe-decompositions' or something the like. > > > 1. "Observe decompositions" doesn't mean anything to me. The verb > should probably be more active - what does it mean to observe the > char decompositions here? What about “heed”? /Lew --- Lew Perin | perin@acm.org | http://babelcarp.org From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 11:21:02 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 16:21:02 +0000 Received: from localhost ([127.0.0.1]:55165 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgHiX-0002f6-Aw for submit@debbugs.gnu.org; Wed, 05 Dec 2012 11:21:01 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:41383) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgHiV-0002ez-Gl for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 11:21:00 -0500 Received: from acsinet21.oracle.com (acsinet21.oracle.com [141.146.126.237]) by aserp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB5GKluV016932 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 16:20:48 GMT Received: from acsmt358.oracle.com (acsmt358.oracle.com [141.146.40.158]) by acsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB5GKkxV006458 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 16:20:47 GMT Received: from abhmt120.oracle.com (abhmt120.oracle.com [141.146.116.72]) by acsmt358.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB5GKkw4026970; Wed, 5 Dec 2012 10:20:46 -0600 Received: from dradamslap1 (/10.159.232.122) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 05 Dec 2012 08:20:46 -0800 From: "Drew Adams" To: References: 
<20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com><50BF1702.4020100@gmx.at><611DD154E83240D183A7B5B88691DC37@us.oracle.com> <20671.28016.555406.918606@panix1.panix.com> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Wed, 5 Dec 2012 08:20:39 -0800 Message-ID: <7583513D67AD48FCB4686C2BB402587F@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <20671.28016.555406.918606@panix1.panix.com> Thread-Index: Ac3TAFyPqx8QHcdARVWBqEWcFDzxawAAzlFg X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet21.oracle.com [141.146.126.237] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: 'martin rudalics' , 'Eli Zaretskii' , 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.5 (-) > > > `ignore-diacritics' is misleading. The variable would have > > > to be called `observe-decompositions' or something the like. > > > > 1. "Observe decompositions" doesn't mean anything to me. The verb > > should probably be more active - what does it mean to observe the > > char decompositions here? > > What about "heed"? "Respect" is a more common term with that meaning. But the point (to me) is that we are not conveying much by that - too vague. "Heed" meaning what? Heed how? 
Those are terms, like "treat", "handle" and "process" (verb), that are generally signs, in computer science as elsewhere, of insufficient understanding or laziness in communication. They say essentially, "it does something". Sometimes (not here though) such words can even be signals that the function in question is a congeries of things that do not necessarily belong together. We should be able to do better here. If I understood better what the function does I might be able to offer better name suggestions. From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 11:37:31 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 16:37:31 +0000 Received: from localhost ([127.0.0.1]:55190 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgHyR-00034C-E0 for submit@debbugs.gnu.org; Wed, 05 Dec 2012 11:37:31 -0500 Received: from mtaout20.012.net.il ([80.179.55.166]:44479) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgHyL-000340-RM for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 11:37:25 -0500 Received: from conversion-daemon.a-mtaout20.012.net.il by a-mtaout20.012.net.il (HyperSendmail v2007.08) id <0MEK00300GPLBP00@a-mtaout20.012.net.il> for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 18:37:08 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout20.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEK003BSGTVCI00@a-mtaout20.012.net.il>; Wed, 05 Dec 2012 18:37:08 +0200 (IST) Date: Wed, 05 Dec 2012 18:37:09 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <50BF16D4.1070506@gmx.at> X-012-Sender: halo1@inter.net.il To: martin rudalics Message-id: <83zk1spia2.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> 
<83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <83hao1r50f.fsf@gnu.org> <50BF16D4.1070506@gmx.at> X-Spam-Score: 0.2 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.2 (-) > Date: Wed, 05 Dec 2012 10:41:40 +0100 > From: martin rudalics > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > > How about putting it in subr.el? > > If I correctly understand Juri, I next have to deal with things like > > (get-char-code-property #xff59 'decomposition) > > and related issues we might unearth in the course of this. My reading of the table in http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings is that you should ignore any car of the list returned by get-char-code-property if it does not pass the characterp test (or, put the other way, if it does pass the symbolp test). That is, the character #xff59 should sort exactly like lower-case y. 
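A minimal sketch of the rule described above, written in Python rather than Emacs Lisp so that it can be run standalone. It assumes Python's `unicodedata.decomposition`, which returns the raw decomposition field as a string such as `<wide> 0079`; the tag in angle brackets plays the role of the non-character car returned by `get-char-code-property`. The helper name `decomposition_chars` is hypothetical and not part of any patch in this thread.

```python
import unicodedata


def decomposition_chars(ch):
    """Return the characters CH decomposes to, ignoring any formatting tag.

    Hypothetical helper: mirrors the advice to skip list elements that are
    not characters (here, fields such as <wide> or <compat>).
    """
    fields = unicodedata.decomposition(ch).split()
    # Drop compatibility formatting tags; keep only hexadecimal code points.
    code_points = [f for f in fields if not f.startswith("<")]
    if not code_points:
        # No decomposition mapping: the character stands for itself.
        return ch
    return "".join(chr(int(cp, 16)) for cp in code_points)


# U+FF59 FULLWIDTH LATIN SMALL LETTER Y maps to a plain "y".
print(decomposition_chars("\uff59"))
```

With the tag dropped, U+FF59 and lower-case y yield the same key, so a comparison built on this helper would sort them alike.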
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 12:16:28 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 17:16:28 +0000 Received: from localhost ([127.0.0.1]:55211 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgIaC-00040M-7i for submit@debbugs.gnu.org; Wed, 05 Dec 2012 12:16:28 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:48886) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgIa9-00040F-Tu for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 12:16:26 -0500 Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by aserp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB5HGDFI029693 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 17:16:13 GMT Received: from acsmt358.oracle.com (acsmt358.oracle.com [141.146.40.158]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB5HGCYs012278 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 17:16:12 GMT Received: from abhmt102.oracle.com (abhmt102.oracle.com [141.146.116.54]) by acsmt358.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB5HGCZD011424; Wed, 5 Dec 2012 11:16:12 -0600 Received: from dradamslap1 (/130.35.178.8) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 05 Dec 2012 09:16:12 -0800 From: "Drew Adams" To: "'martin rudalics'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com><50BF1702.4020100@gmx.at> <611DD154E83240D183A7B5B88691DC37@us.oracle.com> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Wed, 5 Dec 2012 
09:16:11 -0800 Message-ID: <8164D22E74F94504B41247F314787E10@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <611DD154E83240D183A7B5B88691DC37@us.oracle.com> Thread-Index: Ac3SzNkIeOhfShQLQAypKA2c11bTAgAKwFpwAATZzQA= X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.5 (-) > Perhaps the name/description should speak in terms of Unicode > char compatibility or equivalence. Perhaps a name like > `string-less-compat-p'? Or `Unicode-equivalent-p'? Or > `string-equivalent-p'? In the last two suggestions I forgot about the "less" part. Taking a quick look at the Unicode specs, it seems that what we do involves (Unicode) "compatibility equivalence". But it also seemed that Eli was saying that for us this is not distinguished from (Unicode) "canonical equivalence". So perhaps `unicode-equivalence-less-p'? Or if there is a risk of confusion with char (not string) comparison, then perhaps `unicode-equiv-string-less-p'? Or just `equiv-string-less-p'? 
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 13:00:38 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 18:00:38 +0000 Received: from localhost ([127.0.0.1]:55236 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgJGt-00054c-VE for submit@debbugs.gnu.org; Wed, 05 Dec 2012 13:00:37 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:20993) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgJGq-00054T-FZ for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 13:00:33 -0500 Received: from ucsinet22.oracle.com (ucsinet22.oracle.com [156.151.31.94]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB5I0HC4022071 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 18:00:17 GMT Received: from acsmt356.oracle.com (acsmt356.oracle.com [141.146.40.156]) by ucsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB5I0Grm007615 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 18:00:16 GMT Received: from abhmt106.oracle.com (abhmt106.oracle.com [141.146.116.58]) by acsmt356.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB5I0FH7009706; Wed, 5 Dec 2012 12:00:15 -0600 Received: from dradamslap1 (/130.35.178.8) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 05 Dec 2012 10:00:15 -0800 From: "Drew Adams" To: "'martin rudalics'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com><50BF1702.4020100@gmx.at><611DD154E83240D183A7B5B88691DC37@us.oracle.com> <8164D22E74F94504B41247F314787E10@us.oracle.com> Subject: RE: bug#13041: 24.2; 
diacritic-fold-search Date: Wed, 5 Dec 2012 10:00:14 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <8164D22E74F94504B41247F314787E10@us.oracle.com> Thread-Index: Ac3SzNkIeOhfShQLQAypKA2c11bTAgAKwFpwAATZzQAAAWVdYA== X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: ucsinet22.oracle.com [156.151.31.94] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.3 (--) FWIW - Some more browsing on the topic tells me that what we are trying to come up with here is a predicate for the NFKD canonical ordering (as applied to a char sequence, not to a single char). IOW, a string-ordering predicate that uses the canonical ordering for a character's decomposed normal code point sequence. We are using compatibility normalization, not canonical normalization. So a search (or a string comparison test) for `f' will match the ligature `ffi' (whereas it would not match wrt canonical normalization). Someone please correct me if any of this is wrong. 
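The difference between the two normalizations can be checked with a short sketch. This uses Python's `unicodedata.normalize` as a stand-in for whatever Emacs ends up doing; the function name `contains_after` is hypothetical, and U+FB03 (LATIN SMALL LIGATURE FFI) is assumed to be the ligature meant above.

```python
import unicodedata

LIGATURE_FFI = "\ufb03"  # LATIN SMALL LIGATURE FFI


def contains_after(form, needle, haystack):
    """Return True if NEEDLE occurs in HAYSTACK once both are normalized to FORM."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


# Compatibility decomposition expands the ligature into "ffi", so "f" matches.
print(contains_after("NFKD", "f", LIGATURE_FFI))
# Canonical decomposition leaves the ligature alone, so "f" does not match.
print(contains_after("NFD", "f", LIGATURE_FFI))
```

This is only an illustration of the search behavior described in the message, not a claim about how isearch would be implemented.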
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 13:27:51 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 18:27:51 +0000 Received: from localhost ([127.0.0.1]:55241 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgJhG-0005gh-6M for submit@debbugs.gnu.org; Wed, 05 Dec 2012 13:27:51 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:64041) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgJhC-0005gY-Jf for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 13:27:48 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEK00600LNPY600@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 20:27:36 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEK0067NLXX9ND0@a-mtaout22.012.net.il>; Wed, 05 Dec 2012 20:27:36 +0200 (IST) Date: Wed, 05 Dec 2012 20:27:34 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: X-012-Sender: halo1@inter.net.il To: Drew Adams Message-id: <83vccgpd61.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> <611DD154E83240D183A7B5B88691DC37@us.oracle.com> <8164D22E74F94504B41247F314787E10@us.oracle.com> X-Spam-Score: 0.7 (/) X-Debbugs-Envelope-To: 13041 Cc: rudalics@gmx.at, 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.2 (-) > From: "Drew Adams" > Date: Wed, 5 Dec 2012 10:00:14 -0800 > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > We are using compatibility normalization, not canonical normalization. So a > search (or a string comparison test) for `f' will match the ligature `ffi' > (whereas it would not match wrt canonical normalization). > > Someone please correct me if any of this is wrong. I'm not sure who is wrong ;-), but I think when compatibility decomposition exists, it should be used; if not, the canonical decomposition should be used. From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 14:17:24 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 19:17:24 +0000 Received: from localhost ([127.0.0.1]:55272 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgKTC-0006sg-OQ for submit@debbugs.gnu.org; Wed, 05 Dec 2012 14:17:24 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:40380) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgKT9-0006sU-Uy for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 14:17:20 -0500 Received: from ucsinet21.oracle.com (ucsinet21.oracle.com [156.151.31.93]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB5JH8w3010488 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 5 Dec 2012 19:17:08 GMT Received: from acsmt356.oracle.com (acsmt356.oracle.com [141.146.40.156]) by ucsinet21.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB5JH6IF011843 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Wed, 5 Dec 2012 19:17:07 GMT Received: from abhmt104.oracle.com (abhmt104.oracle.com [141.146.116.56]) by acsmt356.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB5JH6f6005365; Wed, 5 Dec 2012 13:17:06 -0600 Received: from dradamslap1 (/130.35.178.8) by 
default (Oracle Beehive Gateway v4.0) with ESMTP ; Wed, 05 Dec 2012 11:17:06 -0800 From: "Drew Adams" To: "'Eli Zaretskii'" , "'Juri Linkov'" References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org> <83wqx0s4jy.fsf@gnu.org> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Wed, 5 Dec 2012 11:17:04 -0800 Message-ID: <331C8BE9164A4C228B0FBF0D579C708E@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-Reply-To: <83wqx0s4jy.fsf@gnu.org> Thread-Index: Ac3QuVnvUlEF6+92RqG0sqylG4ZsNwCYtixA X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: ucsinet21.oracle.com [156.151.31.93] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.5 (-) > > I'm surprised to see case mappings hard-coded in > > lisp/international/characters.el instead of using the properties > > `uppercase' and `lowercase' during creation of case tables. > > My guess is that this is because the code in characters.el was written > long before we had access to Unicode character properties in Emacs, > and in fact before Emacs was switched to character representation > based on Unicode codepoints. And no one bothered to rewrite that code > since then; volunteers are welcome. Doesn't file CaseFolding.txt contain all the info needed? If so, what about populating the case tables from the latest CaseFolding.txt file at Emacs build time? 
Or if no Internet access during build, populate from a copy of the file to be distributed with Emacs. And provide the same population code as a Lisp function, in case someone wants to refresh an old Emacs release to use a more recent CaseFolding.txt file. Would this make any sense? From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 16:19:46 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 21:19:46 +0000 Received: from localhost ([127.0.0.1]:55367 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgMNe-00038T-DC for submit@debbugs.gnu.org; Wed, 05 Dec 2012 16:19:46 -0500 Received: from mtaout23.012.net.il ([80.179.55.175]:50176) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgMNc-00038J-3r for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 16:19:45 -0500 Received: from conversion-daemon.a-mtaout23.012.net.il by a-mtaout23.012.net.il (HyperSendmail v2007.08) id <0MEK00C00TS3PN00@a-mtaout23.012.net.il> for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 23:19:33 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout23.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEK00CJ6TWLHX90@a-mtaout23.012.net.il>; Wed, 05 Dec 2012 23:19:33 +0200 (IST) Date: Wed, 05 Dec 2012 23:19:35 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <331C8BE9164A4C228B0FBF0D579C708E@us.oracle.com> X-012-Sender: halo1@inter.net.il To: Drew Adams Message-id: <83sj7kp57c.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <83wqx0s4jy.fsf@gnu.org> <331C8BE9164A4C228B0FBF0D579C708E@us.oracle.com> X-Spam-Score: 1.5 (+) 
X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.7 (/) > From: "Drew Adams" > Cc: , <13041@debbugs.gnu.org>, > Date: Wed, 5 Dec 2012 11:17:04 -0800 > > > > I'm surprised to see case mappings hard-coded in > > > lisp/international/characters.el instead of using the properties > > > `uppercase' and `lowercase' during 
creation of case tables. > > > > My guess is that this is because the code in characters.el was written > > long before we had access to Unicode character properties in Emacs, > > and in fact before Emacs was switched to character representation > > based on Unicode codepoints. And no one bothered to rewrite that code > > since then; volunteers are welcome. > > Doesn't file CaseFolding.txt contain all the info needed? You don't need CaseFolding.txt, because UnicodeData.txt includes the same information, and uni-lowercase.el, uni-uppercase.el, and uni-titlecase.el already read that information into char-tables. > If so, what about populating the case tables from the latest CaseFolding.txt > file at Emacs build time? Or if no Internet access during build, populate from > a copy of the file to be distributed with Emacs. > > And provide the same population code as a Lisp function, in case someone wants > to refresh an old Emacs release to use a more recent CaseFolding.txt file. > > Would this make any sense? It would make sense to load case tables from uni-*.el at Emacs build time. Volunteers are welcome. 
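To illustrate the idea of deriving a case table from character properties instead of hard-coding it, here is a small sketch in Python. It assumes that Python's `str.lower`, which draws on the Unicode data bundled with the interpreter, is an acceptable stand-in for the char-tables in uni-lowercase.el; the function name `build_downcase_table` and the chosen range are hypothetical.

```python
def build_downcase_table(first, last):
    """Map each character in [first, last] to its one-character lowercase form.

    Hypothetical sketch: characters without a lowercase mapping, or whose
    lowercase form is longer than one character, are left out, since a
    simple case table holds only one-to-one mappings.
    """
    table = {}
    for cp in range(first, last + 1):
        ch = chr(cp)
        lower = ch.lower()
        if lower != ch and len(lower) == 1:
            table[ch] = lower
    return table


# Latin Extended-A, populated from properties rather than by hand.
latin_extended_a = build_downcase_table(0x0100, 0x017F)
print(len(latin_extended_a))
```

The same loop could be rerun against newer Unicode data to refresh the table, which is the kind of regeneration step the message asks volunteers to provide.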
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 18:14:36 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 23:14:36 +0000 Received: from localhost ([127.0.0.1]:55439 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgOAm-0006Uc-1E for submit@debbugs.gnu.org; Wed, 05 Dec 2012 18:14:36 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:40949 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgOAj-0006UT-Qf for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 18:14:34 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id E95F246FA011; Wed, 5 Dec 2012 15:14:21 -0800 (PST) From: Juri Linkov To: martin rudalics Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> Date: Thu, 06 Dec 2012 01:04:02 +0200 In-Reply-To: <50BF1702.4020100@gmx.at> (martin rudalics's message of "Wed, 05 Dec 2012 10:42:26 +0100") Message-ID: <87y5hc6s05.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com, Drew Adams X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: 
debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/) > `ignore-diacritics' is misleading. The variable would have to be called > `observe-decompositions' or something the like. Since the existing variable that corresponds to the Unicode file CaseFolding.txt is `case-fold-search', its counterpart variable that corresponds to the Unicode file Decomposition.txt could be called `decomposition-search'. Also like the existing `sort-fold-case', its counterpart could be called `sort-decomposition'. From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 05 18:14:40 2012 Received: (at 13041) by debbugs.gnu.org; 5 Dec 2012 23:14:40 +0000 Received: from localhost ([127.0.0.1]:55442 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgOAp-0006Us-CP for submit@debbugs.gnu.org; Wed, 05 Dec 2012 18:14:40 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:40962 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgOAm-0006Uh-E0 for 13041@debbugs.gnu.org; Wed, 05 Dec 2012 18:14:37 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id D1DF4451E1A2; Wed, 5 Dec 2012 15:14:24 -0800 (PST) From: Juri Linkov To: martin rudalics Subject: Re: bug#13041: 24.2; diacritic-fold-search In-Reply-To: <50BF16D4.1070506@gmx.at> (martin rudalics's message of "Wed, 05 Dec 2012 10:41:40 +0100") Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <83hao1r50f.fsf@gnu.org> <50BF16D4.1070506@gmx.at> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) Date: Thu, 06 Dec 
2012 01:05:42 +0200 Message-ID: <87d2yo5cc9.fsf@mail.jurta.org> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, Eli Zaretskii , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> If I correctly understand Juri, I next have to deal with things like
>
>   (get-char-code-property #xff59 'decomposition)
>
> and related issues we might unearth in the course of this.

Only until bug#13084 is fixed; that is a separate problem.

> Also, while currently sorting is stable in the sense that with respect
> to diacritics text remains unchanged from the original order, this is
> not nice for sorting larger pieces of text.  So I'd rather have to use
> the second list element returned by `get-char-code-property' to make
> sure that, for example, "e" gets always sorted before "è" before "é".

In principle, you could do this by let-binding a new variable `sort-decomposition' to non-nil for stable sorting, and later let-binding `sort-decomposition' to nil for a last-resort comparison in which equal lines (equal according to non-nil `sort-decomposition') are sorted without regard to decomposition.
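
The two-step comparison suggested above can be sketched without the proposed `sort-decomposition' variable, which does not exist in Emacs. The sketch below is only an illustration under stated assumptions: `my-fold-string' and `my-decomposed-string-lessp' are hypothetical names, folding is done with `ucs-normalize-NFKD-string' from ucs-normalize.el, and no attention is paid to performance.

```elisp
(require 'ucs-normalize)

;; Hypothetical helpers for illustration; Emacs does not define them.
(defun my-fold-string (string)
  "Return STRING in NFKD form with non-spacing marks removed."
  (let (chars)
    (dolist (char (string-to-list (ucs-normalize-NFKD-string string)))
      ;; General category `Mn' covers combining accents and similar marks.
      (unless (eq (get-char-code-property char 'general-category) 'Mn)
        (push char chars)))
    (concat (nreverse chars))))

(defun my-decomposed-string-lessp (string1 string2)
  "Non-nil if STRING1 sorts before STRING2.
The first comparison ignores decompositions; strings that are equal
at that level fall back to plain `string<' as a last resort."
  (let ((folded1 (my-fold-string string1))
        (folded2 (my-fold-string string2)))
    (if (string= folded1 folded2)
        (string< string1 string2)
      (string< folded1 folded2))))

(sort (list "é" "f" "e" "è") #'my-decomposed-string-lessp)
;; => ("e" "è" "é" "f")
```

The tie-break keeps the result deterministic: "e", "è" and "é" compare equal at the first level, so their relative order is decided by the plain comparison only.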
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 04:27:19 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 09:27:19 +0000 Received: from localhost ([127.0.0.1]:55846 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgXji-0000Mu-Ab for submit@debbugs.gnu.org; Thu, 06 Dec 2012 04:27:19 -0500 Received: from fencepost.gnu.org ([208.118.235.10]:40314) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgXjg-0000Mn-9M for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 04:27:17 -0500 Received: from 253.240.accsnet.ne.jp ([202.220.240.253]:51286 helo=mongkok) by fencepost.gnu.org with esmtpsa (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1TgXjT-0002fO-0K; Thu, 06 Dec 2012 04:27:03 -0500 From: Kenichi Handa To: "Drew Adams" Subject: Re: bug#13041: 24.2; diacritic-fold-search In-Reply-To: <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com> (drew.adams@oracle.com) Date: Thu, 06 Dec 2012 18:25:12 +0900 Message-ID: <87ip8fjzwn.fsf@gnu.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -4.2 (----) X-Debbugs-Envelope-To: 13041 Cc: rudalics@gmx.at, eliz@gnu.org, perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -4.2 (----)

In article <707786B35E94470FB727BCF7F3DDA41A@us.oracle.com>, "Drew Adams" <drew.adams@oracle.com> writes:

> This version of Martin's function (but respecting `case-fold-search') is maybe a
> tiny bit simpler.  It could also be a bit slower because of `substring'
> returning a copy (vs just incrementing an offset).  It should also be checked
> for correctness - not really tested.  FWIW/HTH.

Emacs contains the ucs-normalize package, which provides various normalization functions. For instance,

  (require 'ucs-normalize)
  (ucs-normalize-NFKD-string "Äffin") => "Äffin"

Isn't it usable?

---
Kenichi Handa
handa@gnu.org

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 05:28:28 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 10:28:28 +0000 Received: from localhost ([127.0.0.1]:55893 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgYgt-0001t5-UG for submit@debbugs.gnu.org; Thu, 06 Dec 2012 05:28:28 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:46420) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgYgq-0001su-Pc for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 05:28:26 -0500 Received: (qmail invoked by alias); 06 Dec 2012 10:28:10 -0000 Received: from 62-47-51-163.adsl.highway.telekom.at (EHLO [62.47.51.163]) [62.47.51.163] by mail.gmx.net (mp037) with SMTP; 06 Dec 2012 11:28:10 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18GQ4te4TcGV9G1jO3d9QKkZFnoZv1/mSv6d9oy/3 3VWLQlwh9NBRvB Message-ID: <50C07335.2090602@gmx.at> Date: Thu, 06 Dec 2012 11:28:05 +0100 From: martin rudalics MIME-Version: 1.0 To: Drew Adams Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> <611DD154E83240D183A7B5B88691DC37@us.oracle.com> In-Reply-To: <611DD154E83240D183A7B5B88691DC37@us.oracle.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8
(/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 'Eli Zaretskii' , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.9 (-)

>> `ignore-diacritics' is misleading.  The variable would have
>> to be called `observe-decompositions' or something the like.
>
> 1. "Observe decompositions" doesn't mean anything to me.  The verb should
> probably be more active - what does it mean to observe the char decompositions
> here?
>
> BTW, if we use "decomposition" in the name and description then we should
> probably also use "char" - this is not about decomposing strings in some way
> (whatever that might mean); it involves decomposing Unicode characters.

`ignore-diacritics' is misleading because when we, for example, sort/match ligatures we already do more than ignore diacritics. A variable using the term `observe-decompositions' would express what the underlying algorithm does - observe the decomposition properties provided by `get-char-code-property'.

Bear in mind that a "correct" solution for searching and sorting would have to be based on a correct implementation of a collation table (see bug#12008) plus some options that make searching more convenient (aka "asymmetric searching" http://www.unicode.org/reports/tr10/#Searching). In that sense, Juri's approach for searching and my function can be considered only as poor man's variants of what should be eventually done. For example my Austrian locale sorts

  o < ö < p

while IIUC Swedish has

  o < p ... < z < ö

which IIUC can't be done via the decomposition table. I don't know whether this implies that searching for "o" in Swedish means to _not_ list results for "ö" either.

> 2. But my confusion over the name/description is in fact wrt function
> `decomposed-string-lessp': I guess it's not 100% clear to me what it does.
>
> Your doc string said "STRING1 is decomposition-less than STRING2", which
> confuses me.  And it is a bit ambiguous wrt "-less":
>
> a. decomposition-less as in comparing the strings only after
>    removing (some parts of) their decompositions (i.e., "-less"
>    as in "sans")?
>
> or
>
> b. -lessp as in `string<': a comparison ordering relation?

I didn't think much about the wording. But I can't, in general, talk about comparing characters because in the ligature case (or the "ß" vs "ss" case) I do compare substrings.

> In the version of `decomposed-string-lessp' that I sent, I changed the doc
> string to this: "decomposed STRING1 is less than decomposed STRING2".  But that
> is no doubt incorrect (less correct than yours, if perhaps clearer).  In
> particular, it says nothing about how we compare the two decompositions.
>
> In practical (use) terms, this is typically about ignoring diacritics, keeping
> only the "base" characters.  Something about that should at least be mentioned
> in the doc, so that users know they can use this for that.

Yes.

> But IIUC this is not just about diacritics; it sometimes might not be about
> diacritics at all; and diacritics present are sometimes not ignored.  E.g., the
> ligature ffi gets treated the same as the 3 chars f f i.  There are no
> diacritics present in that case.

That's why I want to just talk about decompositions for the moment.

> IIUC, we convert the two strings to their Unicode decompositions and then use
> the Unicode char compatibility specs to compare the decompositions.  IOW, we
> treat equivalent chars, as defined by Unicode, as the same.

Character sequences, IIUC.

> Perhaps the name/description should speak in terms of Unicode char compatibility
> or equivalence.  Perhaps a name like `string-less-compat-p'?  Or
> `Unicode-equivalent-p'?  Or `string-equivalent-p'?
>
> How would you characterize what the function does?  No doubt Eli can help here.
> It is important to try to get the function name and description right from the
> outset, if we can.  If the Unicode standard has some terminology that applies
> here then perhaps we can/should leverage that.

I'm not sure whether we can ever fully support Unicode here - the weights you find in http://www.unicode.org/Public/UCA/6.2.0/allkeys.txt appear hardly digestible for me (and my machine, presumably).

> Beyond the name and an accurate description, the doc should, as I say, at least
> mention that you can use this to ignore diacritics (such as accents), as that
> will be a common use case.

Sure.

martin

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 05:31:51 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 10:31:52 +0000 Received: from localhost ([127.0.0.1]:55897 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgYkB-0001yb-MR for submit@debbugs.gnu.org; Thu, 06 Dec 2012 05:31:51 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:37238) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgYk9-0001yT-1g for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 05:31:49 -0500 Received: (qmail invoked by alias); 06 Dec 2012 10:31:35 -0000 Received: from 62-47-51-163.adsl.highway.telekom.at (EHLO [62.47.51.163]) [62.47.51.163] by mail.gmx.net (mp072) with SMTP; 06 Dec 2012 11:31:35 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX1+2F1wb0lS2O7ncGpg/+cmh0tdtDML2iayTqcTCZM oIEugxOBD9yHeR Message-ID: <50C07403.8090005@gmx.at> Date: Thu, 06 Dec 2012 11:31:31 +0100 From: martin rudalics MIME-Version: 1.0 To: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org>
<50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <83hao1r50f.fsf@gnu.org> <50BF16D4.1070506@gmx.at> <83zk1spia2.fsf@gnu.org> In-Reply-To: <83zk1spia2.fsf@gnu.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.5 (/) > My reading of the table in > > http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings > > you should ignore any car of the list returned by > get-char-code-property if it does not pass the characterp test (or > those that do pass the symbolp test). That is, the character #xff59 > should sort exactly like lower-case y. That is, `wide' and `compat' are completely equivalent in this regard? 
martin From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 05:32:04 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 10:32:04 +0000 Received: from localhost ([127.0.0.1]:55901 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgYkN-0001zM-Tc for submit@debbugs.gnu.org; Thu, 06 Dec 2012 05:32:04 -0500 Received: from mailout-de.gmx.net ([213.165.64.23]:58450) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgYkL-0001ys-OL for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 05:32:02 -0500 Received: (qmail invoked by alias); 06 Dec 2012 10:31:48 -0000 Received: from 62-47-51-163.adsl.highway.telekom.at (EHLO [62.47.51.163]) [62.47.51.163] by mail.gmx.net (mp041) with SMTP; 06 Dec 2012 11:31:48 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18YHnUMA+oYwmsqciEmG8ByhHOIezgQYfjecmPfr5 BckGXmIfJGGFOh Message-ID: <50C07410.8060705@gmx.at> Date: Thu, 06 Dec 2012 11:31:44 +0100 From: martin rudalics MIME-Version: 1.0 To: Drew Adams Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com><50BF1702.4020100@gmx.at><611DD154E83240D183A7B5B88691DC37@us.oracle.com> <8164D22E74F94504B41247F314787E10@us.oracle.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/) > We are using compatibility normalization, not canonical normalization. So a > search (or a string comparison test) for `f' will match the ligature `ffi' > (whereas it would not match wrt canonical normalization). If it can be done, searching for "f" should match ligatures like "ff" and "fi". martin From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 05:32:14 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 10:32:14 +0000 Received: from localhost ([127.0.0.1]:55904 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgYkX-0001zg-8r for submit@debbugs.gnu.org; Thu, 06 Dec 2012 05:32:14 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:54586) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgYkV-0001zZ-Bz for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 05:32:11 -0500 Received: (qmail invoked by alias); 06 Dec 2012 10:31:58 -0000 Received: from 62-47-51-163.adsl.highway.telekom.at (EHLO [62.47.51.163]) [62.47.51.163] by mail.gmx.net (mp034) with SMTP; 06 Dec 2012 11:31:58 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18Ch24oFUkCxxyyop1lvWBT11WhEGT1M+iVdYeJyl b7uotQdL6dkSwV Message-ID: <50C07419.6060900@gmx.at> Date: Thu, 06 Dec 2012 11:31:53 +0100 From: martin rudalics MIME-Version: 1.0 To: Juri Linkov Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> <87y5hc6s05.fsf@mail.jurta.org> 
In-Reply-To: <87y5hc6s05.fsf@mail.jurta.org> Content-Type: text/plain; charset=ISO-8859-15; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com, Drew Adams X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/) > Since the existing variable that corresponds to the > Unicode file CaseFolding.txt is `case-fold-search', > its counterpart variable that corresponds to the Unicode file > Decomposition.txt Where is this file? > could be called `decomposition-search'. > > Also like the existing `sort-fold-case', its counterpart could be called > `sort-decomposition'. martin From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 05:32:57 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 10:32:57 +0000 Received: from localhost ([127.0.0.1]:55907 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgYlE-00020b-FL for submit@debbugs.gnu.org; Thu, 06 Dec 2012 05:32:56 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:59461) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgYlD-00020S-0l for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 05:32:55 -0500 Received: (qmail invoked by alias); 06 Dec 2012 10:32:41 -0000 Received: from 62-47-51-163.adsl.highway.telekom.at (EHLO [62.47.51.163]) [62.47.51.163] by mail.gmx.net (mp034) with SMTP; 06 Dec 2012 11:32:41 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18gxFNtfPsgXP7Nd9U7X8utCEiWhR7kFNBu3PxiVv bcGkcVSYxwCGgJ Message-ID: <50C07445.1020807@gmx.at> Date: Thu, 06 Dec 2012 11:32:37 +0100 From: martin rudalics MIME-Version: 1.0 To: Juri Linkov Subject: Re: bug#13041: 24.2; diacritic-fold-search References: 
<20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <83hao1r50f.fsf@gnu.org> <50BF16D4.1070506@gmx.at> <87d2yo5cc9.fsf@mail.jurta.org> In-Reply-To: <87d2yo5cc9.fsf@mail.jurta.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, Eli Zaretskii , perin@panix.com, 13041@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/) > And later to let-bind `sort-decomposition' to nil for > last-resort comparison where equal lines > (equal according to non-nil `sort-decomposition') > will be sorted without regard to decomposition. Indeed. 
In any case, equal lines shouldn't be the rule - especially with functions that remove duplicates ;-)

martin

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 05:34:46 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 10:34:46 +0000 Received: from localhost ([127.0.0.1]:55913 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgYn0-00023T-57 for submit@debbugs.gnu.org; Thu, 06 Dec 2012 05:34:46 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:35115) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgYmx-00023H-IQ for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 05:34:44 -0500 Received: (qmail invoked by alias); 06 Dec 2012 10:34:30 -0000 Received: from 62-47-51-163.adsl.highway.telekom.at (EHLO [62.47.51.163]) [62.47.51.163] by mail.gmx.net (mp028) with SMTP; 06 Dec 2012 11:34:30 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX1+qtVAEjVSqEXUZnP3oofy2aKZyW3uH8kMAGucwOF WCXXnwTtJ8B0CU Message-ID: <50C074B2.60808@gmx.at> Date: Thu, 06 Dec 2012 11:34:26 +0100 From: martin rudalics MIME-Version: 1.0 To: Kenichi Handa Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <87ip8fjzwn.fsf@gnu.org> In-Reply-To: <87ip8fjzwn.fsf@gnu.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, eliz@gnu.org, perin@panix.com, 13041@debbugs.gnu.org, Drew Adams X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> Emacs contains the ucs-normalize package, which provides various
> normalization functions.  For instance,
>
>   (require 'ucs-normalize)
>   (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>
> Isn't it usable?

Actually, the function should do what we need. But I have no idea how to integrate it into a searching algorithm. And when sorting, it seems expensive for comparing buffer substrings. Also, the use of a temporary buffer for normalizing every single string makes its weight quite heavy. In any case, I would probably steal the entire decomposition property handling part from it. So thanks a lot for this hint. martin From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 11:00:30 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 16:00:30 +0000 Received: from localhost ([127.0.0.1]:56988 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgdsD-0002Gv-FY for submit@debbugs.gnu.org; Thu, 06 Dec 2012 11:00:30 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:36754) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgds7-0002Gh-TM for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 11:00:26 -0500 Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB6G05kN030181 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Thu, 6 Dec 2012 16:00:06 GMT Received: from acsmt358.oracle.com (acsmt358.oracle.com [141.146.40.158]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB6G03uv011490 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 6 Dec 2012 16:00:03 GMT Received: from abhmt119.oracle.com (abhmt119.oracle.com [141.146.116.71]) by acsmt358.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB6G03qr018563; Thu, 6 Dec 2012 10:00:03 -0600 Received: from dradamslap1 (/10.159.236.61) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Thu, 06 Dec 2012 08:00:02 -0800 From: "Drew Adams" To: "'martin rudalics'" References: 
<20121130182205.C722F14B8D@panix1.panix.com><87hao69b5r.fsf@mail.jurta.org><20665.8224.844876.619203@panix5.panix.com><87hao6zko4.fsf@mail.jurta.org><83fw3qtboc.fsf@gnu.org><87hao5jqu3.fsf@mail.jurta.org><50BB93C2.1050007@gmx.at><83y5hgs564.fsf@gnu.org><50BC7BF5.2020400@gmx.at><83hao3rskd.fsf@gnu.org><50BCE49D.6010001@gmx.at><837gozrp8f.fsf@gnu.org><50BE38F3.3030907@gmx.at><3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com><50BF1702.4020100@gmx.at><611DD154E83240D183A7B5B88691DC37@us.oracle.com> <8164D22E74F94504B41247F314787E10@us.oracle.com> <50C07410.8060705@gmx.at> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Thu, 6 Dec 2012 07:59:59 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable X-Mailer: Microsoft Office Outlook 11 In-reply-to: <50C07410.8060705@gmx.at> Thread-Index: Ac3TnOcm/CdinjtMRa6+Y/CRcjdyywAK/bng X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.5 (-)

> > We are using compatibility normalization, not canonical
> > normalization.  So a search (or a string comparison test)
> > for `f' will match the ligature `ffi'
> > (whereas it would not match wrt canonical normalization).
>
> If it can be done, searching for "f" should match ligatures like "ff"
> and "fi".

That's what I thought you were planning/preparing to do.

On the other hand, as the Unicode spec points out (for level 2), sometimes someone wants to distinguish searching for f from searching for the ligature. Ideally (we might never get there), that would be possible as an alternative (choice).

The spec also points to hybrid situations regarding case conversion (see sect RL2.4) where, e.g., you might want to do full case matching on ß in a literal name such as Strauß but simple case folding on ß when used in a character class, such as [ß]. Dunno whether we would ever get there either.

There seems to be a lot in the Unicode regexp spec (http://www.unicode.org/reports/tr18/) that could be food for thought for Emacs. I imagine that some Emacs Dev folks have already taken a close look and given it some thought.

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 12:49:35 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 17:49:35 +0000 Received: from localhost ([127.0.0.1]:57054 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgfZn-0004kN-DB for submit@debbugs.gnu.org; Thu, 06 Dec 2012 12:49:35 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:59130) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgfZk-0004kE-Dl for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 12:49:34 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEM00M00EKNK100@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 19:48:44 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEM00MKPET6IA30@a-mtaout22.012.net.il>; Thu, 06 Dec 2012 19:48:44 +0200 (IST) Date: Thu, 06 Dec 2012 19:48:47 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <50C07403.8090005@gmx.at> X-012-Sender: halo1@inter.net.il To: martin rudalics Message-id: <83lidboyv4.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org>
<83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <83hao1r50f.fsf@gnu.org> <50BF16D4.1070506@gmx.at> <83zk1spia2.fsf@gnu.org> <50C07403.8090005@gmx.at> X-Spam-Score: 0.2 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.2 (-) > Date: Thu, 06 Dec 2012 11:31:31 +0100 > From: martin rudalics > CC: juri@jurta.org, perin@panix.com, 13041@debbugs.gnu.org, > perin@acm.org > > > My reading of the table in > > > > http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings > > > > you should ignore any car of the list returned by > > get-char-code-property if it does not pass the characterp test (or > > those that do pass the symbolp test). That is, the character #xff59 > > should sort exactly like lower-case y. > > That is, `wide' and `compat' are completely equivalent in this regard? Yes. They are all different forms of the same character, which should all compare equal in this context. 
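
The rule stated above (drop the car of the decomposition list when it is not a character) can be sketched as follows. This is only an illustration: `my-char-base-sequence' is a hypothetical name, not an Emacs function, and the sketch assumes an Emacs in which `get-char-code-property' returns the tag symbol as the first list element.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-char-base-sequence (char)
  "Return the decomposition of CHAR as a list of characters only.
A leading formatting tag such as `wide' or `compat' is dropped."
  (let ((decomposition (get-char-code-property char 'decomposition)))
    ;; The first element may be a tag symbol rather than a character.
    (if (characterp (car decomposition))
        decomposition
      (cdr decomposition))))

;; Fullwidth y (#xff59) and plain y should yield the same list,
;; so the two characters compare equal in this context.
(equal (my-char-base-sequence #xff59)
       (my-char-base-sequence ?y))
```

A comparison function built on this helper would not need to know which tag was present, only the characters that follow it.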
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 12:51:06 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 17:51:06 +0000 Received: from localhost ([127.0.0.1]:57058 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgfbF-0004ms-Tc for submit@debbugs.gnu.org; Thu, 06 Dec 2012 12:51:06 -0500 Received: from mtaout21.012.net.il ([80.179.55.169]:62647) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgfbD-0004mh-N5 for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 12:51:04 -0500 Received: from conversion-daemon.a-mtaout21.012.net.il by a-mtaout21.012.net.il (HyperSendmail v2007.08) id <0MEM00K00EU6CX00@a-mtaout21.012.net.il> for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 19:50:44 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout21.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEM00K1TEWJD000@a-mtaout21.012.net.il>; Thu, 06 Dec 2012 19:50:44 +0200 (IST) Date: Thu, 06 Dec 2012 19:50:48 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <50C074B2.60808@gmx.at> To: martin rudalics Message-id: <83k3svoyrr.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <87ip8fjzwn.fsf@gnu.org> <50C074B2.60808@gmx.at> X-Spam-Score: 1.5 (+) X-Debbugs-Envelope-To: 13041 Cc: handa@gnu.org, 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org, drew.adams@oracle.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.7 (/)

> Date: Thu, 06 Dec 2012 11:34:26 +0100
> From: martin rudalics
> CC: Drew Adams , eliz@gnu.org, perin@panix.com,
>  13041@debbugs.gnu.org, perin@acm.org
>
> > Emacs contains the ucs-normalize package, which provides various
> > normalization functions.  For instance,
> >
> >   (require 'ucs-normalize)
> >   (ucs-normalize-NFKD-string "Äffin") => "Äffin"
> >
> > Isn't it usable?
>
> Actually, the function should do what we need.  But I have no idea how
> to integrate it into a searching algorithm.  And when sorting, it seems
> expensive for comparing buffer substrings.  Also, the use of a temporary
> buffer for normalizing every single string makes its weight quite heavy.

Yes, I don't think this will be possible without changes on the C level. Those changes should use code very similar to what we currently do for case-insensitive search. From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 12:53:59 2012 Received: (at 13041) by debbugs.gnu.org; 6 Dec 2012 17:54:00 +0000 Received: from localhost ([127.0.0.1]:57064 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgfe3-0004qd-FP for submit@debbugs.gnu.org; Thu, 06 Dec 2012 12:53:59 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:60240) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgfe1-0004qS-7d for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 12:53:57 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEM00M00EFNJF00@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 19:53:16 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEM00LE8F0SL7O0@a-mtaout22.012.net.il>; Thu, 06 Dec 2012 19:53:16 +0200 (IST) Date: Thu, 06 Dec 2012 19:53:20 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <50C07335.2090602@gmx.at> X-012-Sender: halo1@inter.net.il To: martin rudalics Message-id: <83ip8foynj.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> <611DD154E83240D183A7B5B88691DC37@us.oracle.com> <50C07335.2090602@gmx.at> X-Spam-Score: 0.7 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, 
perin@panix.com, drew.adams@oracle.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.2 (-) > Date: Thu, 06 Dec 2012 11:28:05 +0100 > From: martin rudalics > CC: 'Eli Zaretskii' , perin@panix.com, > 13041@debbugs.gnu.org, perin@acm.org > > >> `ignore-diacritics' is misleading. The variable would have > >> to be called `observe-decompositions' or something the like. > > > > > > 1. "Observe decompositions" doesn't mean anything to me. The verb should > > probably be more active - what does it mean to observe the char decompositions > > here? > > > > BTW, if we use "decomposition" in the name and description then we should > > probably also use "char" - this is not about decomposing strings in some way > > (whatever that might mean); it involves decomposing Unicode characters. > > `ignore-diacritics' is misleading because when we, for example, > sort/match ligatures we already do more than ignore diacritics. A > variable using the term `observe-decompositions' would express what the > underlying algorithm does - observe the decomposition properties > provided by `get-char-code-property'. I would suggest something like equivalence-search or maybe loose-match-search. The latter is slightly less suitable, since loose matches include not just decompositions, see the Unicode Regular Expressions report. 
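
The decomposition data that `get-char-code-property' provides, which the quoted text above refers to, can be inspected directly. What follows is only a minimal sketch: `my-char-decomposition' is a hypothetical helper, not part of Emacs, and it assumes the `decomposition' property is a list of characters optionally preceded by a tag symbol such as `compat'.

```elisp
(defun my-char-decomposition (char)
  "Return the decomposition of CHAR as a string, dropping any tag symbol.
Hypothetical helper for illustration; not part of Emacs."
  (let ((decomp (get-char-code-property char 'decomposition)))
    ;; Keep only the character elements; a tag such as `compat' is a
    ;; symbol and is filtered out here.
    (concat (delq nil (mapcar (lambda (elt)
                                (and (characterp elt) elt))
                              decomp)))))

;; U+00C4 (A with diaeresis) yields "A" followed by U+0308.
(my-char-decomposition ?\u00c4)
;; U+FB00 (the ff ligature) has a compatibility decomposition.
(my-char-decomposition ?\ufb00)   ; => "ff"
```

The ligature case is the reason `ignore-diacritics' was called misleading above: the same property covers more than accented letters.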
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 20:32:37 2012 Received: (at 13041) by debbugs.gnu.org; 7 Dec 2012 01:32:37 +0000 Received: from localhost ([127.0.0.1]:57506 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgmnt-0007OD-62 for submit@debbugs.gnu.org; Thu, 06 Dec 2012 20:32:37 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:39204 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgmno-0007Nz-Pa for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 20:32:35 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id C21B846FA011; Thu, 6 Dec 2012 17:32:14 -0800 (PST) From: Juri Linkov To: martin rudalics Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87hao69b5r.fsf@mail.jurta.org> <20665.8224.844876.619203@panix5.panix.com> <87hao6zko4.fsf@mail.jurta.org> <83fw3qtboc.fsf@gnu.org> <87hao5jqu3.fsf@mail.jurta.org> <50BB93C2.1050007@gmx.at> <83y5hgs564.fsf@gnu.org> <50BC7BF5.2020400@gmx.at> <83hao3rskd.fsf@gnu.org> <50BCE49D.6010001@gmx.at> <837gozrp8f.fsf@gnu.org> <50BE38F3.3030907@gmx.at> <3E2D742BA0FC44B7A61665D85AAC3712@us.oracle.com> <50BF1702.4020100@gmx.at> <87y5hc6s05.fsf@mail.jurta.org> <50C07419.6060900@gmx.at> Date: Fri, 07 Dec 2012 02:52:12 +0200 In-Reply-To: <50C07419.6060900@gmx.at> (martin rudalics's message of "Thu, 06 Dec 2012 11:31:53 +0100") Message-ID: <87r4n264y7.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com, Drew Adams X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: 
debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/) >> Since the existing variable that corresponds to the >> Unicode file CaseFolding.txt is `case-fold-search', >> its counterpart variable that corresponds to the Unicode file >> Decomposition.txt > > Where is this file? There was a reference to http://www.unicode.org/Public/UNIDATA/extracted/DerivedDecompositionType.txt from http://www.unicode.org/faq/casemap_charprop.html but it seems this file is redundant since you can get the same information from admin/unidata/UnicodeData.txt using (get-char-code-property ?? 'decomposition) From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 06 20:32:40 2012 Received: (at 13041) by debbugs.gnu.org; 7 Dec 2012 01:32:40 +0000 Received: from localhost ([127.0.0.1]:57508 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgmnv-0007OM-Dm for submit@debbugs.gnu.org; Thu, 06 Dec 2012 20:32:40 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:39225 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Tgmnr-0007O2-3V for 13041@debbugs.gnu.org; Thu, 06 Dec 2012 20:32:35 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id 9041646FA012; Thu, 6 Dec 2012 17:32:17 -0800 (PST) From: Juri Linkov To: Kenichi Handa Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> Date: Fri, 07 Dec 2012 02:58:17 +0200 In-Reply-To: <87ip8fjzwn.fsf@gnu.org> (Kenichi Handa's message of "Thu, 06 Dec 2012 18:25:12 +0900") Message-ID: <871uf2647i.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: 
perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com, Drew Adams X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> Emacs contains ucs-normalize package which provides various
> normalization functions. For instance,
>
> (require 'ucs-normalize)
> (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>
> Isn't it usable?

This is usable to sort and compare strings, but I don't see
how ucs-normalize.el could help in the search. I suppose the
searched buffer can't be normalized before starting a search.

So the search function somehow should be able to skip combining
characters in the buffer. But to do this, the translation table needs
to contain additional information about certain characters to ignore.
Also the translation table should be able to map a sequence of
characters like "ss" to "ß".

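
For the sort-and-compare use mentioned above, a minimal sketch of folding a string could look like the following. `my-fold-diacritics' is a hypothetical name, not an Emacs function; the sketch assumes that dropping every character of general category `Mn' after NFKD decomposition is an acceptable approximation of "ignoring diacritics".

```elisp
(require 'ucs-normalize)

(defun my-fold-diacritics (string)
  "Return STRING in NFKD form with nonspacing marks removed.
Hypothetical helper for illustration; not part of Emacs."
  (concat
   (delq nil
         (mapcar (lambda (ch)
                   ;; General category `Mn' is a nonspacing (combining) mark.
                   (unless (eq (get-char-code-property ch 'general-category)
                               'Mn)
                     ch))
                 (ucs-normalize-NFKD-string string)))))

;; Two spellings compare equal once both are folded.
(string= (my-fold-diacritics "r\u00e9sum\u00e9")
         (my-fold-diacritics "resume"))   ; => t
```

This works on a copy of the text, which is exactly why it does not carry over to searching a buffer in place: the buffer itself is never normalized.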
From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 07 01:33:41 2012 Received: (at 13041) by debbugs.gnu.org; 7 Dec 2012 06:33:41 +0000 Received: from localhost ([127.0.0.1]:57699 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgrVF-0006jQ-79 for submit@debbugs.gnu.org; Fri, 07 Dec 2012 01:33:41 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:59911) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgrVB-0006jG-84 for 13041@debbugs.gnu.org; Fri, 07 Dec 2012 01:33:39 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEN00600E63FQ00@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Fri, 07 Dec 2012 08:33:19 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEN006TJE7I8Y30@a-mtaout22.012.net.il>; Fri, 07 Dec 2012 08:33:19 +0200 (IST) Date: Fri, 07 Dec 2012 08:33:04 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <871uf2647i.fsf@mail.jurta.org> To: Juri Linkov Message-id: <83pq2mnzhb.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > From: Juri Linkov > Date: Fri, 07 Dec 2012 02:58:17 +0200 > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > > Emacs contains ucs-normailze package which provides various > > normalization functions. 
For instance, > > > > (require 'ucs-normalize) > > (ucs-normalize-NFKD-string "Äffin") => "Äffin" > > > > Isn't it usable? > > This is usable to sort and compare strings, but I don't see > how ucs-normalize.el could help in the search. [...] Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [80.179.55.172 listed in list.dnswl.org] 0.7 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] X-Debbugs-Envelope-To: 13041 Cc: handa@gnu.org, 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 1.5 (+)

> From: Juri Linkov
> Date: Fri, 07 Dec 2012 02:58:17 +0200
> Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org
>
> > Emacs contains ucs-normalize package which provides various
> > normalization functions. For instance,
> >
> > (require 'ucs-normalize)
> > (ucs-normalize-NFKD-string "Äffin") => "Äffin"
> >
> > Isn't it usable?
>
> This is usable to sort and compare strings, but I don't see
> how ucs-normalize.el could help in the search.

I agree.

> I suppose the searched buffer can't be normalized before starting a
> search.

Yes, that's not acceptable.

> So the search function somehow should be able to skip combining
> characters in the buffer. But to do this, the translation table needs
> to contain additional information about certain characters to ignore.

Right. This is very similar to how the search primitives currently
use the case tables, except that they don't skip characters. But
adding such a skip operation should be easy.

> Also the translation table should be able to map a sequence of
> characters like "ss" to "ß".

I'd say the other way around: map ß to ss.

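
The direction suggested here (fold the single character into the sequence, not the reverse) can be sketched as a per-character expansion. Everything below is hypothetical and for illustration only: `my-extra-foldings' and `my-fold-extra' are not Emacs names, and the table holds just the one pair discussed in this thread.

```elisp
(defvar my-extra-foldings '((?\u00df . "ss"))
  "Hypothetical alist mapping a character to the string it folds to.
Only the sharp-s pair from this discussion is listed.")

(defun my-fold-extra (string)
  "Expand each character of STRING listed in `my-extra-foldings'.
Hypothetical helper for illustration; not part of Emacs."
  (mapconcat (lambda (ch)
               ;; Use the table entry if there is one, else keep CH.
               (or (cdr (assq ch my-extra-foldings))
                   (string ch)))
             string ""))

(my-fold-extra "Stra\u00dfe")   ; => "Strasse"
```

Mapping one character to a longer string keeps the lookup keyed on a single character, whereas the reverse direction would require recognizing a two-character sequence before any lookup could happen.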
From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 07 05:37:26 2012 Received: (at 13041) by debbugs.gnu.org; 7 Dec 2012 10:37:27 +0000 Received: from localhost ([127.0.0.1]:57968 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TgvJ8-0004lO-MK for submit@debbugs.gnu.org; Fri, 07 Dec 2012 05:37:26 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:35127) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1TgvJ7-0004lH-AY for 13041@debbugs.gnu.org; Fri, 07 Dec 2012 05:37:25 -0500 Received: (qmail invoked by alias); 07 Dec 2012 10:37:06 -0000 Received: from 62-47-58-231.adsl.highway.telekom.at (EHLO [62.47.58.231]) [62.47.58.231] by mail.gmx.net (mp072) with SMTP; 07 Dec 2012 11:37:06 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX1+RUhdYWfKRIoGDmxOX5fYkJzOvrcCsOpNifmwenH CivcEQsuqAQJCw Message-ID: <50C1C6CC.9020103@gmx.at> Date: Fri, 07 Dec 2012 11:37:00 +0100 From: martin rudalics MIME-Version: 1.0 To: Juri Linkov Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> In-Reply-To: <871uf2647i.fsf@mail.jurta.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 3.0 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > This is usable to sort and compare strings, but I don't see > how ucs-normalize.el could help in the search. I suppose the > searched buffer can't be normalized before starting a search. You can either temporarily [...] 
Content analysis details: (3.0 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 2.2 RCVD_IN_NJABL_PROXY RBL: NJABL: sender is an open proxy [62.47.58.231 listed in combined.njabl.org] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [213.165.64.22 listed in list.dnswl.org] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (rudalics[at]gmx.at) -0.0 SPF_PASS SPF: sender matches SPF record 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.5000] X-Debbugs-Envelope-To: 13041 Cc: Kenichi Handa , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 3.0 (+++)

> This is usable to sort and compare strings, but I don't see
> how ucs-normalize.el could help in the search. I suppose the
> searched buffer can't be normalized before starting a search.

You can either temporarily

- leave the text alone but give each string that should be handled
  specially a text property with the normalized form. In this case
  searching has to pay attention to these properties, if present.

- normalize the text and give each normalized string a text property
  with the original text. In this case searching will proceed as usual
  but you have to restore the original text when done.

I don't know how feasible these are for searching. But I used the
second approach for sorting without problems.

Also I don't know how to handle the return value and/or highlighting
when, for example, finding a match for "suf" within "suﬀer". For
example, replacing each occurrence of "suf" with the empty string should
leave us with "fer" here. So in this case, we have to deal with the
normalized string anyway. OTOH replacing a match for "res" in "résumé"
with the empty string should probably leave us with "umé".

> So the search function somehow should be able to skip combining
> characters in the buffer. But to do this, the translation table needs
> to contain additional information about certain characters to ignore.
> Also the translation table should be able to map a sequence of
> characters like "ss" to "ß".

I have no idea how many mappings like "ß" -> "ss" exist. The problem is
that we don't get them from UnicodeData.txt IIUC.

martin

From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 07 19:05:50 2012 Received: (at 13041) by debbugs.gnu.org; 8 Dec 2012 00:05:50 +0000 Received: from localhost ([127.0.0.1]:59545 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Th7vR-00008Q-GK for submit@debbugs.gnu.org; Fri, 07 Dec 2012 19:05:49 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:44947 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Th7vJ-000087-VT for 13041@debbugs.gnu.org; Fri, 07 Dec 2012 19:05:44 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id 4B18846FA011; Fri, 7 Dec 2012 16:05:17 -0800 (PST) From: Juri Linkov To: martin rudalics Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> Date: Sat, 08 Dec 2012 01:55:22 +0200 In-Reply-To: <50C1C6CC.9020103@gmx.at> (martin rudalics's message of "Fri, 07 Dec 2012 11:37:00 +0100") Message-ID: <87ehj18l9p.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: Kenichi Handa , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

> - leave the text alone but give each string that should be handled
>   specially a text property with the normalized form. In this case
>   searching has to pay attention to these properties, if present.
>
> - normalize the text and give each normalized string a text property
>   with the original text. In this case searching will proceed as usual
>   but you have to restore the original text when done.

This reminds me of the idea that searching should take into account the text
displayed with the `display' property and other display-related properties.
It seems this is more difficult to implement.

> Also I don't know how to handle the return value and/or highlighting
> when, for example, finding a match for "suf" within "suﬀer". For
> example, replacing each occurrence of "suf" with the empty string should
> leave us with "fer" here.

I believe such ligature characters should be handled as a whole,
i.e. "suf" doesn't match "suﬀer", only "suff" should match it.

> I have no idea how many mappings like "ß" -> "ss" exist. The problem is
> that we don't get them from UnicodeData.txt IIUC.

I can't find them in UnicodeData.txt either.
Looking at the files in
http://www.unicode.org/Public/UNIDATA/ I can find them in the file

http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt

that is derived from

http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 03:21:02 2012 Received: (at 13041) by debbugs.gnu.org; 8 Dec 2012 08:21:02 +0000 Received: from localhost ([127.0.0.1]:59817 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThFef-0008Lo-RZ for submit@debbugs.gnu.org; Sat, 08 Dec 2012 03:21:02 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:45134) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThFec-0008LY-Pq for 13041@debbugs.gnu.org; Sat, 08 Dec 2012 03:21:00 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEP00J00DSVXS00@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Sat, 08 Dec 2012 10:20:34 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEP00J4HDU9UP50@a-mtaout22.012.net.il>; Sat, 08 Dec 2012 10:20:34 +0200 (IST) Date: Sat, 08 Dec 2012 10:20:18 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <87ehj18l9p.fsf@mail.jurta.org> To: Juri Linkov Message-id: <83boe5lzul.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: > From: Juri Linkov > Date: Sat, 08 Dec 2012 01:55:22 +0200 > Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org > > This reminds an idea that searching should take into account the text > displayed with the `display' property and other display-related properties. > It seems this is more difficult to implement. [...] Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [80.179.55.172 listed in list.dnswl.org] 0.7 SPF_SOFTFAIL SPF: sender does not match SPF record (softfail) 0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60% [score: 0.4981] X-Debbugs-Envelope-To: 13041 Cc: rudalics@gmx.at, 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.7 (/)

> From: Juri Linkov
> Date: Sat, 08 Dec 2012 01:55:22 +0200
> Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org
>
> This reminds me of the idea that searching should take into account the text
> displayed with the `display' property and other display-related properties.
> It seems this is more difficult to implement.

I don't know if it's more difficult. After all, the primitives you
need to (a) find out whether there's a display string at given buffer
position, and (b) access its text, are already there, ready to be
used. Moreover, there's even a C function that searches the current
buffer for a specific Lisp string, which you could use as a model for
this feature.

What is definitely true, though, is that searching display strings is a
separate feature, with an entirely different implementation. I
suggest therefore to keep it in mind, but not mix it with what's being
discussed here.

> > I have no idea how many mappings like "ß" -> "ss" exist. The problem is
> > that we don't get them from UnicodeData.txt IIUC.
>
> I can't find them in UnicodeData.txt either. Looking at the files in
> http://www.unicode.org/Public/UNIDATA/ I can find them in the file
>
> http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> that is derived from
>
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

Maybe we should extend ucs-normalize.el to include that as well.

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 06:22:23 2012 Received: (at 13041) by debbugs.gnu.org; 8 Dec 2012 11:22:23 +0000 Received: from localhost ([127.0.0.1]:59909 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThIUB-0004sg-GK for submit@debbugs.gnu.org; Sat, 08 Dec 2012 06:22:23 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:54130) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1ThIU9-0004sY-Dk for 13041@debbugs.gnu.org; Sat, 08 Dec 2012 06:22:22 -0500 Received: (qmail invoked by alias); 08 Dec 2012 11:21:56 -0000 Received: from 62-47-55-242.adsl.highway.telekom.at (EHLO [62.47.55.242]) [62.47.55.242] by mail.gmx.net (mp034) with SMTP; 08 Dec 2012 12:21:56 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18TAKQ6w3cTELhfMybYBiM0IM4AijNWZPhA8CpXoO SRhym0njx5W8Ae Message-ID: <50C322CC.1000806@gmx.at> Date: Sat, 08 Dec 2012 12:21:48 +0100 From: martin rudalics MIME-Version: 1.0 To: Juri Linkov Subject: Re: bug#13041: 24.2; diacritic-fold-search References:
<20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> In-Reply-To: <87ehj18l9p.fsf@mail.jurta.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: Kenichi Handa , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/)

>> - leave the text alone but give each string that should be handled
>>   specially a text property with the normalized form. In this case
>>   searching has to pay attention to these properties, if present.
>>
>> - normalize the text and give each normalized string a text property
>>   with the original text. In this case searching will proceed as usual
>>   but you have to restore the original text when done.
>
> This reminds me of the idea that searching should take into account the text
> displayed with the `display' property and other display-related properties.
> It seems this is more difficult to implement.

... and probably should include searching for overlays too.

>> Also I don't know how to handle the return value and/or highlighting
>> when, for example, finding a match for "suf" within "suﬀer". For
>> example, replacing each occurrence of "suf" with the empty string should
>> leave us with "fer" here.
>
> I believe such ligature characters should be handled as a whole,
> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.

This means that when you type the second "f" you might get a match
before the present one. Consider a buffer containing the two lines

suﬀer
suffer

Typing "suf" as search string would go to "suffer". Adding an "f" to
the search string now would go back to "suﬀer" (or not). Disconcerting
in any case.

>> I have no idea how many mappings like "ß" -> "ss" exist. The problem is
>> that we don't get them from UnicodeData.txt IIUC.
>
> I can't find them in UnicodeData.txt either. Looking at the files in
> http://www.unicode.org/Public/UNIDATA/ I can find them in the file
>
> http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> that is derived from
>
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

Case folding "ß" to "SS" (upper case "S") is not what I had in mind. I
was talking about the (weak?) equivalence of "ß" and "ss" (lower case
"s") which is much more important when searching. In particular so,
because many German words that were earlier written with an "ß" are now
written with "ss".

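
The question of what a match inside a ligature should return can be made concrete with a small sketch that folds a string while remembering where each folded character came from. Both functions are hypothetical illustrations, not Emacs functions; they assume per-character NFKD decomposition with nonspacing marks dropped, and they work on strings rather than buffers.

```elisp
(require 'ucs-normalize)

(defun my-fold-with-positions (string)
  "Return (FOLDED . INDEXES) for STRING.
FOLDED is STRING decomposed character by character with NFKD and
stripped of nonspacing marks.  INDEXES is a vector giving, for each
character of FOLDED, the index in STRING it came from.
Hypothetical helper for illustration; not part of Emacs."
  (let ((chars nil) (indexes nil) (i 0))
    (while (< i (length string))
      (dolist (ch (string-to-list
                   (ucs-normalize-NFKD-string (string (aref string i)))))
        (unless (eq (get-char-code-property ch 'general-category) 'Mn)
          (push ch chars)
          (push i indexes)))
      (setq i (1+ i)))
    (cons (concat (nreverse chars)) (vconcat (nreverse indexes)))))

(defun my-folded-match (needle haystack)
  "Return (START . END) of the first folded match of NEEDLE in HAYSTACK.
START and END index the original HAYSTACK.  Return nil if there is no
match or if NEEDLE folds to the empty string.
Hypothetical helper for illustration; not part of Emacs."
  (let* ((folded (my-fold-with-positions haystack))
         (key (car (my-fold-with-positions needle)))
         (case-fold-search nil)
         (pos (and (> (length key) 0)
                   (string-match (regexp-quote key) (car folded)))))
    (when pos
      (let ((indexes (cdr folded)))
        ;; Map both ends of the folded match back to original indexes.
        (cons (aref indexes pos)
              (1+ (aref indexes (+ pos (length key) -1))))))))

(my-folded-match "suf" "su\ufb00er")          ; => (0 . 3)
(my-folded-match "res" "r\u00e9sum\u00e9")    ; => (0 . 3)
```

In the ligature case the returned range necessarily covers the whole ligature character, so deleting the match leaves "er", not "fer"; in the accented case deleting it leaves "umé" as expected above. Normalizing one character at a time is slow and is done here only to keep the index bookkeeping simple.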
martin From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 06:36:12 2012 Received: (at 13041) by debbugs.gnu.org; 8 Dec 2012 11:36:12 +0000 Received: from localhost ([127.0.0.1]:59925 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThIhX-0005EP-RU for submit@debbugs.gnu.org; Sat, 08 Dec 2012 06:36:12 -0500 Received: from mailout-de.gmx.net ([213.165.64.22]:41164) by debbugs.gnu.org with smtp (Exim 4.72) (envelope-from ) id 1ThIhW-0005EJ-IM for 13041@debbugs.gnu.org; Sat, 08 Dec 2012 06:36:11 -0500 Received: (qmail invoked by alias); 08 Dec 2012 11:35:45 -0000 Received: from 62-47-55-242.adsl.highway.telekom.at (EHLO [62.47.55.242]) [62.47.55.242] by mail.gmx.net (mp002) with SMTP; 08 Dec 2012 12:35:45 +0100 X-Authenticated: #14592706 X-Provags-ID: V01U2FsdGVkX18/O2Yt5I0/J8dUqf1zplFn9zdR1lJiCkVd9QRFzK x74JmDDoXxg5SS Message-ID: <50C32609.8000704@gmx.at> Date: Sat, 08 Dec 2012 12:35:37 +0100 From: martin rudalics MIME-Version: 1.0 To: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <83boe5lzul.fsf@gnu.org> In-Reply-To: <83boe5lzul.fsf@gnu.org> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Y-GMX-Trusted: 0 X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: Juri Linkov , perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/) > I don't know if it's more difficult. 
> After all, the primitives you
> need to (a) find out whether there's a display string at given buffer
> position, and (b) access its text, are already there, ready to be
> used.  Moreover, there's even a C function that searches the current
> buffer for a specific Lisp string, which you could use as a model for
> this feature.

I think that mirroring/cloning (part of) the current buffer in a special
search buffer would be the cheapest solution.  The search buffer would
contain the normalized text, be built only when normalization is
needed and be rebuilt whenever a search option or the buffer text
changes.  I don't know whether `buffer-swap-text' could be used here.

martin

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 07:40:55 2012
Date: Sat, 08 Dec 2012 14:40:05 +0200
From: Eli Zaretskii
Subject: Re: bug#13041: 24.2; diacritic-fold-search
In-reply-to: <50C32609.8000704@gmx.at>
To: martin rudalics
Message-id: <83obi4lntm.fsf@gnu.org>
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at>
 <87ehj18l9p.fsf@mail.jurta.org> <83boe5lzul.fsf@gnu.org> <50C32609.8000704@gmx.at>
Cc: juri@jurta.org, perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com
Reply-To: Eli Zaretskii

> Date: Sat, 08 Dec 2012 12:35:37 +0100
> From: martin rudalics
> CC: Juri Linkov , 13041@debbugs.gnu.org, perin@panix.com,
>  perin@acm.org
>
> > I don't know if it's more difficult.  After all, the primitives you
> > need to (a) find out whether there's a display string at given buffer
> > position, and (b) access its text, are already there, ready to be
> > used.  Moreover, there's even a C function that searches the current
> > buffer for a specific Lisp string, which you could use as a model for
> > this feature.
>
> I think that mirroring/cloning (part of) the current buffer in a special
> search buffer would be the cheapest solution.  The search buffer would
> contain the normalized text, be built only when normalization is
> needed and be rebuilt whenever a search option or the buffer text
> changes.

Maybe this is the cheapest, but it still needs the same support the
other alternatives do.
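
As an illustration of the mirror-buffer idea discussed above, here is a
minimal sketch in Python (not Emacs Lisp, and not code from this thread).
It assumes the "normalized text" is produced by NFKD decomposition, case
folding, and dropping combining marks; the names `build_search_mirror` and
`folded_search` are hypothetical.  Alongside the folded text it keeps, for
every folded character, the index of the original character it came from,
so that a match in the mirror can be mapped back to the real text.

```python
import unicodedata


def build_search_mirror(text):
    """Return (folded, offsets) for TEXT.

    folded  -- the normalized search copy of TEXT
    offsets -- offsets[i] is the index in TEXT of the character that
               produced folded[i]
    (Hypothetical helper; the folding rules are an assumption.)
    """
    folded_chars = []
    offsets = []
    for index, char in enumerate(text):
        # NFKD expands compatibility characters such as the "ff" ligature
        # and splits accented letters into base letter + combining mark;
        # casefold() applies full case folding (sharp s becomes "ss").
        expanded = unicodedata.normalize("NFKD", char).casefold()
        for piece in expanded:
            if unicodedata.combining(piece):
                continue  # drop the diacritic itself
            folded_chars.append(piece)
            offsets.append(index)
    return "".join(folded_chars), offsets


def folded_search(text, query):
    """Return (start, end) of the first folded match in TEXT, or None.

    The returned span is expressed in positions of the original TEXT and
    always covers whole original characters.
    """
    folded, offsets = build_search_mirror(text)
    folded_query, _ = build_search_mirror(query)
    if not folded_query:
        return None
    pos = folded.find(folded_query)
    if pos < 0:
        return None
    last = pos + len(folded_query) - 1
    return offsets[pos], offsets[last] + 1
```

Under these assumptions, searching a text made of the two lines from the
example for "suf" and for "suff" yields the same span, the one ending
right after the ligature, so adding the second "f" does not move the match.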
From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 18:20:13 2012
From: Juri Linkov
To: martin rudalics
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Organization: JURTA
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <50C322CC.1000806@gmx.at>
Date: Sun, 09 Dec 2012 01:07:12 +0200
In-Reply-To: <50C322CC.1000806@gmx.at> (martin rudalics's message of "Sat, 08 Dec 2012 12:21:48 +0100")
Message-ID: <87ip8cz2zu.fsf@mail.jurta.org>
Cc: Kenichi Handa , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org

> This means that when you type the second "f" you might get a match
> before the present one.
> Consider a buffer containing the two lines
>
> suﬀer
> suffer
>
> Typing "suf" as search string would go to "suffer".  Adding an "f" to
> the search string now would go back to "suﬀer" (or not).

Going back looks like backtracking in the regexp search.

OTOH, instead of using an approach of matching only a full match
like in Chromium, we could do like GEdit and OpenOffice that
match the whole ligature character in a partial match
(i.e. to match "ﬀ" when the search string is just "f").

Though this has a problem of highlighting the whole character for
a partial match that looks wrong, but perhaps no one can do better.

>> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
>
> Case folding "ß" to "SS" (upper case "S") is not what I had in mind.  I
> was talking about the (weak?) equivalence of "ß" and "ss" (lower case
> "s") which is much more important when searching.  In particular so,
> because many German words that were earlier written with an "ß" are now
> written with "ss".

Yes, this is what I meant too.  It is surprising but
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
defines the equivalence of "ß" and "ss" (lower case "s")
instead of case-folding.  The following line in CaseFolding.txt:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

maps 00DF (LATIN SMALL LETTER SHARP S) to two characters
0073 0073 (LATIN SMALL LETTER S) keeping the lower case.
Maybe this is a bug in Unicode data?
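
For reference, the CaseFolding.txt entry quoted above can be read
mechanically.  The sketch below is Python with a hypothetical helper name;
it assumes only the `code; status; mapping; # name` layout visible in that
line.  In the legend of that file, status F marks a full case folding (one
that may change the length of the string), and Python's built-in
`str.casefold()` applies the same kind of full folding.

```python
def parse_case_folding_line(line):
    """Split one CaseFolding.txt entry into (source, status, target).

    Hypothetical helper: assumes the "code; status; mapping; # name"
    layout of the entry quoted in the message above.
    """
    data = line.split("#", 1)[0]          # strip the trailing comment
    code, status, mapping = [f.strip() for f in data.split(";")[:3]]
    source = chr(int(code, 16))           # e.g. "00DF" -> sharp s
    target = "".join(chr(int(cp, 16)) for cp in mapping.split())
    return source, status, target


source, status, target = parse_case_folding_line(
    "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S")

# The parsed mapping agrees with the built-in full case folding.
same_as_builtin = (source.casefold() == target)

# Folding both spellings gives equal search keys.
keys_match = ("Maße".casefold() == "Masse".casefold())
```

Here `source` is the sharp s, `status` is "F", and `target` is the
two-letter string "ss", so old and new German spellings compare equal once
both sides have been folded.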
From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 18:54:50 2012
From: Stefan Monnier
To: Juri Linkov
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Message-ID:
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org>
Date: Sat, 08 Dec 2012 18:54:15 -0500
In-Reply-To: <87ehj18l9p.fsf@mail.jurta.org> (Juri Linkov's message of "Sat, 08 Dec 2012 01:55:22 +0200")
Cc: martin rudalics , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org

> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.

I completely disagree here.  "suf" should match "suﬀer".

        Stefan

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 19:05:22 2012
From: "Drew Adams"
To: "'Juri Linkov'" , "'martin rudalics'"
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <50C322CC.1000806@gmx.at> <87ip8cz2zu.fsf@mail.jurta.org>
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 8 Dec 2012 16:04:38 -0800
Message-ID:
 <18FE346ED0D14D3EA18A23E867356CAA@us.oracle.com>
In-reply-to: <87ip8cz2zu.fsf@mail.jurta.org>
Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com

> > Typing "suf" as search string would go to "suffer".  Adding
> > an "f" to the search string now would go back to "suﬀer" (or not).
>
> Going back looks like backtracking in the regexp search.
>
> OTOH, instead of using an approach of matching only a full match
> like in Chromium, we could do like GEdit and OpenOffice that
> match the whole ligature character in a partial match
> (i.e. to match "ﬀ" when the search string is just "f").

Seems to me that the starting point should be the Unicode Regexp spec, which
outlines the behavior of level 1 and level 2 searches.

Emacs Dev can choose what it wants to do, of course, but that is a good place to
start, I think.
From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 19:15:10 2012
From: "Drew Adams"
To: "'Stefan Monnier'" , "'Juri Linkov'"
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org>
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 8 Dec 2012 16:14:28 -0800
Message-ID: <104BC54C3EC34891820EDBAB804A380E@us.oracle.com>
In-reply-to:
Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com

> > i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>
> I completely disagree here.  "suf" should match "suﬀer".

The Unicode Regexp spec says that it is best, if possible, to let users do
either.  It discusses such different search possibilities explicitly.

We might not be able to support that superior level (level 2) for Emacs search,
but the point is that each kind of matching can be useful here.

At this stage of the discussion it should not, I think, be a case of "I
completely disagree" (or completely agree), unless you have already decided
something wrt design/implementation etc.

Better to look at the possibilities for users and then discuss what it might
take to be able to support this or that kind of search matching.
From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 08 19:54:00 2012
From: Juri Linkov
To: Stefan Monnier
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Organization: JURTA
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org>
Date: Sun, 09 Dec 2012 02:35:46 +0200
In-Reply-To: (Stefan Monnier's message of "Sat, 08 Dec 2012 18:54:15 -0500")
Message-ID: <87hanwuk3x.fsf@mail.jurta.org>
Cc: martin rudalics , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org

>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>
> I completely disagree here.  "suf" should match "suﬀer".

AFAIS, there are more programs that find a partial match,
but neither of them can do the right highlighting:
both possibilities (to highlight the whole ligature and not to highlight)
are wrong, and highlighting a part of the ligature is impossible.

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 06:36:39 2012
From: Stephen Berman
To: Juri Linkov
Subject: Re: bug#13041: 24.2; diacritic-fold-search
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <87hanwuk3x.fsf@mail.jurta.org>
Date: Sun, 09 Dec 2012 12:35:59 +0100
In-Reply-To: <87hanwuk3x.fsf@mail.jurta.org> (Juri Linkov's message of "Sun, 09 Dec 2012 02:35:46 +0200")
Message-ID: <8738zfpie8.fsf@rosalinde.fritz.box>
Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com, Stefan Monnier

On Sun, 09 Dec 2012 02:35:46 +0200 Juri Linkov wrote:

>>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>>
>> I completely disagree here.  "suf" should match "suﬀer".
>
> AFAIS, there are more programs that find a partial match,
> but neither of them can do the right highlighting:
> both possibilities (to highlight the whole ligature and not to highlight)
> are wrong, and highlighting a part of the ligature is impossible.

Could a ligature be highlighted in a different way (different color or
additional attribute such as underlining) to indicate a partial or
potential match?

Steve Berman

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 10:42:49 2012
From: Stefan Monnier
To: "Drew Adams"
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Message-ID:
References:
 <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <104BC54C3EC34891820EDBAB804A380E@us.oracle.com>
Date: Sun, 09 Dec 2012 10:42:15 -0500
In-Reply-To: <104BC54C3EC34891820EDBAB804A380E@us.oracle.com> (Drew Adams's message of "Sat, 8 Dec 2012 16:14:28 -0800")
Cc: 'Juri Linkov' , perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com

> The Unicode Regexp spec says that it is best, if possible, to let users do
> either.

We're talking about the (now misnamed) "diacritic-fold" search.
If the user wants to be more strict, there's always going to be
the "non-diacritic-fold" search.

        Stefan

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 10:45:40 2012
From: Stefan Monnier
To: Juri Linkov
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Message-ID:
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <87hanwuk3x.fsf@mail.jurta.org>
Date: Sun, 09 Dec 2012 10:45:07 -0500
In-Reply-To: <87hanwuk3x.fsf@mail.jurta.org> (Juri Linkov's message of "Sun, 09 Dec 2012 02:35:46 +0200")
Cc: martin rudalics , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org

>>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>> I completely disagree here.  "suf" should match "suﬀer".
> AFAIS, there are more programs that find a partial match,
> but neither of them can do the right highlighting:
> both possibilities (to highlight the whole ligature and not to highlight)
> are wrong, and highlighting a part of the ligature is impossible.

One step at a time: first, let's make sure we can match it.
Then we'll worry about what the match-boundaries should be and how to
display it (when we get to this point, we can even consider displaying
suﬀer as suffer temporarily, just like we do when point is in the middle
of a composition).

        Stefan

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 12:53:04 2012
Message-ID: <50C4CFD1.9000101@gmx.at>
Date: Sun, 09 Dec 2012 18:52:17 +0100
From: martin rudalics
To: Juri Linkov
Subject: Re: bug#13041: 24.2; diacritic-fold-search
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org>
 <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <50C322CC.1000806@gmx.at> <87ip8cz2zu.fsf@mail.jurta.org>
In-Reply-To: <87ip8cz2zu.fsf@mail.jurta.org>
Cc: Kenichi Handa , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org

> OTOH, instead of using an approach of matching only a full match
> like in Chromium, we could do like GEdit and OpenOffice that
> match the whole ligature character in a partial match
> (i.e. to match "ﬀ" when the search string is just "f").

Strictly speaking, they should match the first "f" in "ﬀ".  When matching
"suf" against "suﬀer", the `match-string' would be "suf", with
`match-end' after "ﬀ".  That is, the match length would not increase
when adding an "f" to the search string now.  But I don't know what
`match-string' should return - "suﬀ" or "suff".

> Though this has a problem of highlighting the whole character for
> a partial match that looks wrong, but perhaps no one can do better.

We would need a display string "ff" replacing "ﬀ" during highlighting and
highlight only the first "f" in it.

> Yes, this is what I meant too.  It is surprising but
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> defines the equivalence of "ß" and "ss" (lower case "s")
> instead of case-folding.  The following line in CaseFolding.txt:
>
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
>
> maps 00DF (LATIN SMALL LETTER SHARP S) to two characters
> 0073 0073 (LATIN SMALL LETTER S) keeping the lower case.
> Maybe this is a bug in Unicode data?

Maybe it's explained here

http://www.unicode.org/faq/idn.html

in the answer to

Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map eszett
(ß) to "ss", and delete ZWJ/ZWNJ?

One possible interpretation of this is that mapping "ß" to "SS" would
imply that downcasing "SS" should produce "ß" and this is unwanted.  But
I still wonder whether we are supposed to apply mappings recursively.

martin

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 12:53:17 2012
Message-ID: <50C4CFE1.7050803@gmx.at>
Date: Sun, 09 Dec 2012 18:52:33 +0100
From: martin rudalics
To: Stephen Berman
Subject: Re: bug#13041: 24.2; diacritic-fold-search
References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <87hanwuk3x.fsf@mail.jurta.org> <8738zfpie8.fsf@rosalinde.fritz.box>
In-Reply-To: <8738zfpie8.fsf@rosalinde.fritz.box>
Cc: Juri Linkov ,
perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -0.0 (/) > Could a ligature be highlighted in a different way (different color or > additional attribute such as underlining) to indicate a partial or > potential match? I think ligatures can be easily handled by displaying the corresponding decomposed string. But a different color could be used to highlight the "ß" with an incremental search string "Mas" and a match in "Maße". martin From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 13:01:07 2012 Received: (at 13041) by debbugs.gnu.org; 9 Dec 2012 18:01:07 +0000 Received: from localhost ([127.0.0.1]:34321 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThlBZ-0007jS-OS for submit@debbugs.gnu.org; Sun, 09 Dec 2012 13:01:06 -0500 Received: from userp1040.oracle.com ([156.151.31.81]:42437) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThlBW-0007jD-Ol for 13041@debbugs.gnu.org; Sun, 09 Dec 2012 13:01:05 -0500 Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by userp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB9I0St4019746 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sun, 9 Dec 2012 18:00:29 GMT Received: from acsmt358.oracle.com (acsmt358.oracle.com [141.146.40.158]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB9I0Rtv022458 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 9 Dec 2012 18:00:27 GMT Received: from abhmt115.oracle.com (abhmt115.oracle.com [141.146.116.67]) by acsmt358.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB9I0Qh5024600; Sun, 9 Dec 2012 12:00:27 -0600 Received: from dradamslap1 
(/71.202.147.44) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Sun, 09 Dec 2012 10:00:26 -0800 From: "Drew Adams" To: "'Stefan Monnier'" References: <20121130182205.C722F14B8D@panix1.panix.com><87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org><50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org><104BC54C3EC34891820EDBAB804A380E@us.oracle.com> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Sun, 9 Dec 2012 10:00:16 -0800 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Microsoft Office Outlook 11 In-reply-to: X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Thread-Index: Ac3WI8WKg1px29sIQGKesetEr3aYFAAESlug X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: 'Juri Linkov' , perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.3 (--) > > The Unicode Regexp spec says that it is best, if possible, > > to let users do either. > > We're talking about the (now misnamed) "diacritic-fold" search. > If the user wants to be more strict, there's always going to be > the "non-diacritic-fold" search. Yes, and? That ignoring of diacritics etc. is essentially what the Unicode Regexp spec refers to as "loose matching", IIUC. And that means "at least the simple, default Unicode case folding." You are considering, among other things, whether `f' should match the ? ligature or whether only `ff' should match it. The standard deals with this question, I believe. (BTW, I cannot actually see that ligature with my mail client. So I copied the char from another mail message and pasted it, above. 
If that copy+paste didn't work, what I meant was the ligature for ff.) From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 09 13:07:33 2012 Received: (at 13041) by debbugs.gnu.org; 9 Dec 2012 18:07:33 +0000 Received: from localhost ([127.0.0.1]:34325 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThlHo-0007sK-WC for submit@debbugs.gnu.org; Sun, 09 Dec 2012 13:07:33 -0500 Received: from aserp1040.oracle.com ([141.146.126.69]:39793) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThlHm-0007sC-Cx for 13041@debbugs.gnu.org; Sun, 09 Dec 2012 13:07:31 -0500 Received: from acsinet22.oracle.com (acsinet22.oracle.com [141.146.126.238]) by aserp1040.oracle.com (Sentrion-MTA-4.2.2/Sentrion-MTA-4.2.2) with ESMTP id qB9I6uHV019205 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Sun, 9 Dec 2012 18:06:56 GMT Received: from acsmt357.oracle.com (acsmt357.oracle.com [141.146.40.157]) by acsinet22.oracle.com (8.14.4+Sun/8.14.4) with ESMTP id qB9I6tAl027396 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Sun, 9 Dec 2012 18:06:55 GMT Received: from abhmt119.oracle.com (abhmt119.oracle.com [141.146.116.71]) by acsmt357.oracle.com (8.12.11.20060308/8.12.11) with ESMTP id qB9I6suN010513; Sun, 9 Dec 2012 12:06:54 -0600 Received: from dradamslap1 (/71.202.147.44) by default (Oracle Beehive Gateway v4.0) with ESMTP ; Sun, 09 Dec 2012 10:06:54 -0800 From: "Drew Adams" To: "'martin rudalics'" , "'Juri Linkov'" References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org><871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at><87ehj18l9p.fsf@mail.jurta.org> <50C322CC.1000806@gmx.at><87ip8cz2zu.fsf@mail.jurta.org> <50C4CFD1.9000101@gmx.at> Subject: RE: bug#13041: 24.2; diacritic-fold-search Date: Sun, 9 Dec 2012 10:06:44 -0800 Message-ID: <94267C51F6B04F83948A5F4E5A812A7C@us.oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 
quoted-printable X-Mailer: Microsoft Office Outlook 11 In-reply-to: <50C4CFD1.9000101@gmx.at> X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.6157 Thread-Index: Ac3WNh5jHvTiK+VPQKWeEn7nUt8a/wAAY4jg X-Source-IP: acsinet22.oracle.com [141.146.126.238] X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -2.3 (--) > Maybe it's explained here > http://www.unicode.org/faq/idn.html > in the answer to > > Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map > eszett (ß) to "ss", and delete ZWJ/ZWNJ? > > One possible interpretation of this is that mapping "ß" to "SS" would > imply that downcasing "SS" should produce "ß" and this is > unwanted. This is also covered in the Unicode Regexp spec. 
http://www.unicode.org/reports/tr18/ From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 10 03:10:38 2012 Received: (at 13041) by debbugs.gnu.org; 10 Dec 2012 08:10:38 +0000 Received: from localhost ([127.0.0.1]:34812 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThyRg-0003M7-AH for submit@debbugs.gnu.org; Mon, 10 Dec 2012 03:10:38 -0500 Received: from ps18281.dreamhost.com ([69.163.218.105]:36190 helo=ps18281.dreamhostps.com) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1ThyRc-0003Lw-VD for 13041@debbugs.gnu.org; Mon, 10 Dec 2012 03:10:34 -0500 Received: from localhost (ps18281.dreamhostps.com [69.163.218.105]) by ps18281.dreamhostps.com (Postfix) with ESMTP id 651CF451E19C; Mon, 10 Dec 2012 00:09:56 -0800 (PST) From: Juri Linkov To: Stefan Monnier Subject: Re: bug#13041: 24.2; diacritic-fold-search Organization: JURTA References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <87hanwuk3x.fsf@mail.jurta.org> Date: Mon, 10 Dec 2012 09:57:49 +0200 In-Reply-To: (Stefan Monnier's message of "Sun, 09 Dec 2012 10:45:07 -0500") Message-ID: <87hanu1jiq.fsf@mail.jurta.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3.50 (x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.8 (/) X-Debbugs-Envelope-To: 13041 Cc: martin rudalics , 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: 0.8 (/) > One step at a time: first, let's make sure we can match it. 
Then we'll > worry about what the match-boundaries should be and how to display it > (when we get to this point, we can even consider displaying suﬀer as > suffer temporarily, just like we do when point is in the middle of > a composition). Isearch used to decompose a composition of a character with a combining accent and display them separately in the middle of a composition in Emacs 23. But as I see now, in the latest version Isearch in the middle of a composition doesn't decompose them. It highlights the matched character together with the still unmatched combining accent as a whole. It seems the current behavior is better than the earlier one because it doesn't change the displayed characters. This is more WYSIWYG. From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 10 03:22:16 2012 Received: (at 13041) by debbugs.gnu.org; 10 Dec 2012 08:22:16 +0000 Received: from localhost ([127.0.0.1]:34825 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Thycy-0003cc-IO for submit@debbugs.gnu.org; Mon, 10 Dec 2012 03:22:16 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:37733) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1Thycx-0003cU-1E for 13041@debbugs.gnu.org; Mon, 10 Dec 2012 03:22:15 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MET0060034NCB00@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Mon, 10 Dec 2012 10:21:05 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MET005E1374X860@a-mtaout22.012.net.il>; Mon, 10 Dec 2012 10:21:04 +0200 (IST) Date: Mon, 10 Dec 2012 10:20:55 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <87hanu1jiq.fsf@mail.jurta.org> X-012-Sender: halo1@inter.net.il To: Juri Linkov Message-id: <83txrub9nc.fsf@gnu.org> References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> 
<871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <87hanwuk3x.fsf@mail.jurta.org> <87hanu1jiq.fsf@mail.jurta.org> X-Spam-Score: 0.7 (/) X-Debbugs-Envelope-To: 13041 Cc: perin@acm.org, 13041@debbugs.gnu.org, perin@panix.com, monnier@iro.umontreal.ca X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.2 (-) > From: Juri Linkov > Date: Mon, 10 Dec 2012 09:57:49 +0200 > Cc: 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org > > Isearch used to decompose a composition of a character with a combining > accent and display them separately in the middle of a composition > in Emacs 23. AFAIR, this was due to problems in the display engine wrt composite characters, and problems with composition support in general, problems which are now solved. 
From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 11 02:20:47 2012 Received: (at 13041) by debbugs.gnu.org; 11 Dec 2012 07:20:47 +0000 Received: from localhost ([127.0.0.1]:36196 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TiK91-00051F-4b for submit@debbugs.gnu.org; Tue, 11 Dec 2012 02:20:47 -0500 Received: from mtaout22.012.net.il ([80.179.55.172]:34198) by debbugs.gnu.org with esmtp (Exim 4.72) (envelope-from ) id 1TiK8x-000517-Vg for 13041@debbugs.gnu.org; Tue, 11 Dec 2012 02:20:45 -0500 Received: from conversion-daemon.a-mtaout22.012.net.il by a-mtaout22.012.net.il (HyperSendmail v2007.08) id <0MEU00L00UZKEL00@a-mtaout22.012.net.il> for 13041@debbugs.gnu.org; Tue, 11 Dec 2012 09:20:02 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout22.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0MEU00KG4V1EJZJ0@a-mtaout22.012.net.il>; Tue, 11 Dec 2012 09:20:02 +0200 (IST) Date: Tue, 11 Dec 2012 09:19:55 +0200 From: Eli Zaretskii Subject: Re: bug#13041: 24.2; diacritic-fold-search In-reply-to: <94267C51F6B04F83948A5F4E5A812A7C@us.oracle.com> To: Drew Adams Message-id: <83obi19ht0.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=iso-8859-1 Content-transfer-encoding: QUOTED-PRINTABLE X-012-Sender: halo1@inter.net.il References: <20121130182205.C722F14B8D@panix1.panix.com> <87ip8fjzwn.fsf@gnu.org> <871uf2647i.fsf@mail.jurta.org> <50C1C6CC.9020103@gmx.at> <87ehj18l9p.fsf@mail.jurta.org> <50C322CC.1000806@gmx.at> <87ip8cz2zu.fsf@mail.jurta.org> <50C4CFD1.9000101@gmx.at> <94267C51F6B04F83948A5F4E5A812A7C@us.oracle.com> X-Spam-Score: 0.7 (/) X-Debbugs-Envelope-To: 13041 Cc: juri@jurta.org, rudalics@gmx.at, 13041@debbugs.gnu.org, perin@panix.com, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.13 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: debbugs-submit-bounces@debbugs.gnu.org 
Errors-To: debbugs-submit-bounces@debbugs.gnu.org X-Spam-Score: -1.2 (-) > From: "Drew Adams" > Date: Sun, 9 Dec 2012 10:06:44 -0800 > Cc: perin@panix.com, 13041@debbugs.gnu.org, perin@acm.org > > > Maybe it's explained here > > http://www.unicode.org/faq/idn.html > > in the answer to > > > > Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map > > eszett (ß) to "ss", and delete ZWJ/ZWNJ? > > > > One possible interpretation of this is that mapping "ß" to "SS" would > > imply that downcasing "SS" should produce "ß" and this is > > unwanted. > > This is also covered in the Unicode Regexp spec. > http://www.unicode.org/reports/tr18/ Another relevant Unicode document is the Unicode Collation Algorithm. For the latest (yet unapproved) draft, see http://www.unicode.org/reports/tr10/proposed.html From debbugs-submit-bounces@debbugs.gnu.org Wed Aug 31 10:46:09 2016 Received: (at 13041) by debbugs.gnu.org; 31 Aug 2016 14:46:09 +0000 Received: from localhost ([127.0.0.1]:45563 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bf6mD-0000L5-6X for submit@debbugs.gnu.org; Wed, 31 Aug 2016 10:46:09 -0400 Received: from mout.gmx.net ([212.227.17.22]:51206) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bf6m7-0000K8-SX for 13041@debbugs.gnu.org; Wed, 31 Aug 2016 10:46:04 -0400 Received: from detlef.gmx.de ([87.146.48.45]) by mail.gmx.com (mrgmx103) with ESMTPSA (Nemesis) id 0MNqcR-1blRUO0VJ5-007VhX; Wed, 31 Aug 2016 16:45:47 +0200 From: Michael Albinus To: Lewis Perin Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> Date: Wed, 31 Aug 2016 16:45:44 +0200 In-Reply-To: <20121130182205.C722F14B8D@panix1.panix.com> (Lewis Perin's message of "Fri, 30 Nov 2012 13:22:05 -0500 (EST)") Message-ID: <87mvjtngiv.fsf@gmx.de> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 
Content-Transfer-Encoding: quoted-printable X-Provags-ID: V03:K0:3p2ukdBP6jf1vYeboo/zhewpiBeamE2K5J+OXzbwDbwr8jPWC0B FOvrlT1u/XkzIRW2fOYNPj9g3p5m7ssKemNHAiur/mPc98exN4x3E7cdqFovAx0Kw7Sw8kD WVSWOutwaGSnDi3GARCKGMQ7ABn5A76BDdY7TJq7uLOp9l+NM162BAhWXZQ208FeMs4YkSp o77qdIL6O0K7LQIMd5J+g== X-UI-Out-Filterresults: notjunk:1;V01:K0:e1qEVyS8CpU=:1JKhhv9MEEOZ/PR9COcjdt jX/zdf59qCHwbZSYgsDL0taArAvFFNZzEVBkOF3XkWJ6R9GoEnJ3jFgQ1aS6vKbfG+7LO8E+I j8jLJ19b5RcMCMA7Om/VIM/rkJmTMQ5vyCCxMUEcGZzTS58cRPAVSaWb6n7kTgHN8F2Qm+Ns/ bPNgBubHNlJYwIE3QatnY12zVdsAiaGi2MIjxLBX80jbfsCNLNZV80/t3oCNapL0uf0ubuWvG uQA6mPNL6kXm3Ac/TpjzrOgDIGfn65fKKSRN7P5g1B8wpELZxQAuCilx+F6gwAjAsCwS+ytrM 5uqbYeKCC+CqQD1GRm2jGPfhNiPN5iOiTICNUkLJrvcVpESPH2cBAmMLlyp1bxp97kO94qsYH HdJwsJLGHLfK72FxBnvNi3B6XC/forkxuaY5SdMsAJ9rgw5WU9i3Xs0dY3RItyrACauMPDDbl kElyPuaI+UYH8WdB57uBiZ4kS3cP7C3UW/eDaOZ+AnNqxZykQlHq01UOFhMspboitw/sfQdov OjqJxVqIrn0SNTWGUpOw/EVDHmc6cTGwF233FQH3m4x2leRGQMiaLf/HGyfA703n0/2shYT7m QbYimVns0tGL5ySy4YE9QpaUby83KJSHknG1M28xv6Q/cIvMsUlfcRjE6k39l/YfumWS0yfN1 IHV2GgXbK3+GoK4qD2FOYTtf/aWgsTXbQM2TE4lU4EDZ8/altaRelKDHq+zjSOSN3dnNpfdP+ W/t5rWkA04c6qnWuzrAoOPciYosx9H6EavXoG8iOCa9cKhuPbiFrkVbAVNPd2vfOgp9s8EFVZ RdU1k/s X-Spam-Score: 0.3 (/) X-Debbugs-Envelope-To: 13041 Cc: 13041@debbugs.gnu.org, perin@acm.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.3 (/) Lewis Perin writes: > Emacs search has long been able to toggle between (a) ignoring the > distinction between upper- and lower-case characters > (case-fold-search) and (b) searching for only one of the pair. One > could say Emacs offers the choice between (a) searching for all > members of a (2-member) equivalence class and (b) searching for only > one member. 
> > There are larger equivalence classes of characters with practical use > which Emacs is currently unaware of: the groups of characters > consisting of an unadorned (ASCII) character plus all its > diacritic-adorned versions. Currently, if I want to search for both > “apres” and “après”, I need an additive regular expression. I would > like to do this as easily as I can search for “apres” and “Apres”. I > would be delighted if Emacs implemented the equivalence classes > spelled out here: > > http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html > > I might add that diacritics folding is the default in web search > engines. It is also a feature of at least one Web browser in > searching the text of a displayed page (Chrome.) Emacs 25.1 has introduced the new user option `search-default-mode'. If set to `char-fold-to-regexp', the requested feature is available. See etc/NEWS for further information. So I propose to close this bug. There was a long discussion in the bug's log back in 2012, but AFAICS, all proposals have been implemented. > /Lew Best regards, Michael. 
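For readers who want to try the option named above, here is a minimal sketch of an init-file setting, assuming Emacs 25.1 or later. Only `search-default-mode' and `char-fold-to-regexp' come from the message itself; the sample strings are illustrative, and the second form assumes the default folding tables.

```elisp
;; Make character folding the default search mode (assumes Emacs 25.1+).
(setq search-default-mode #'char-fold-to-regexp)

;; Illustrative check only: build a folding regexp from the unaccented
;; spelling and test it against the accented one from the bug report.
(string-match-p (char-fold-to-regexp "apres") "après")
```

With this setting, an incremental search for the unaccented spelling is expected to also find accented variants; an ordinary literal search remains available for strict matching.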
From debbugs-submit-bounces@debbugs.gnu.org Sat Sep 03 03:06:45 2016 Received: (at 13041-done) by debbugs.gnu.org; 3 Sep 2016 07:06:45 +0000 Received: from localhost ([127.0.0.1]:48213 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bg52L-0001Af-3f for submit@debbugs.gnu.org; Sat, 03 Sep 2016 03:06:45 -0400 Received: from mout.gmx.net ([212.227.17.22]:57651) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bg52J-0001AR-LG for 13041-done@debbugs.gnu.org; Sat, 03 Sep 2016 03:06:44 -0400 Received: from detlef.gmx.de ([79.195.12.73]) by mail.gmx.com (mrgmx101) with ESMTPSA (Nemesis) id 0LmOLO-1b6T6B36tm-00ZygD; Sat, 03 Sep 2016 09:06:25 +0200 From: Michael Albinus To: perin@acm.org Subject: Re: bug#13041: 24.2; diacritic-fold-search References: <20121130182205.C722F14B8D@panix1.panix.com> <87mvjtngiv.fsf@gmx.de> <22473.57245.883865.68491@panix5.panix.com> Date: Sat, 03 Sep 2016 09:06:21 +0200 In-Reply-To: <22473.57245.883865.68491@panix5.panix.com> (nobody's message of "Fri, 2 Sep 2016 16:22:53 -0400") Message-ID: <87twdxtqc2.fsf@gmx.de> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.1.50 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Provags-ID: V03:K0:DJc9SvAqNuTGwsnWt6J1DWSlzd5KTsC/J/MOiE4KoADeVXgZ5xU 2NAdrX7yB9BE/a6pdbuzF65JEtuc85mlHWx2Tl/Iza53ZDSHtxr1uJpWb5Nj3UrmNzUW0bn 6Mq4BLRwuzLJi/pndsLTb0E65aHwdp7aVaSAnSauGZT6pN/wv6ETKHiXkpdoPy68zOdcW0Z g39NK2fykpIysPzU4TRCg== X-UI-Out-Filterresults: notjunk:1;V01:K0:9BAlY17GqbI=:Hf6i0EmKCYcv58pFC8a8zk F8YlqoS5kUjRCEpy/XTxSvbbD7HWidtHetvEEtDldn+gvu8iVmpo2LJGmKe6C44Ev//qWHJbC X014TKVspYnpjiKkWGITyegGzE3tX2wqTDTVhRhkwA9sEx6xHvqAXN5Klo8JdNfWR/rk+UadE WzR77fpKk8taYoSdcUJBtKhcl5NEw+J43zW2o4vv8FZ8wxFwz6exzhW8u/iuLBNeP42krw85t tXsh0HNaYXOduOOtvrcHA3O+q6TL7ZL/iS24GNL0ldFnfkwauZ1n3y1EMOBQuu4ZAABSNKSjG Pc/5tcL8+/niS2A76aGnW0gLxB3WMvxdgiPIhN6PKCbbmPhb8gyZdz40oi/PYIaLdISLEqIw9 
NqbTkHedUys+RJB3sJMGORruzFwUCwmwu/RC2cGhNTy1I/wQC8gc8mDEf6MnaLMVqWs7x4lSJ zqWZxpV4i6jrPXjY7hnMPhac6x4wtS66YtQ+1OkWOljnDJFvjOrjsyvyc2DS/zvFVZKl/+2S0 L5MBFTtleUvelgU4WEMB4unAUVYudQAsNTPk4mOSYqd0a9/lBYQWIHc+KqdOtoKKwkUCaozVW 1OvLwxKUQknI3t5MfdSL4DcNGWSqHKk9TuhOWdupi4k1aCJ8ykrwOffhtebr4lEJrC4XuUSMz 8oveLMBeqYpsz8Vo6YvzhZ/TTmJ1SLdOZIvNX1McCkqOrARWADEyWeE8+5/OtlYWxYb+wirmP CDSUfQ4NSb90dA0pWXwtKpxty4fFugBdpyShX9UvzlCTNrL2XS/FhcOPpYL49CFVQGhw5f3Qr rFmSfqo X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 13041-done Cc: 13041-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) Version: 25.1 nobody writes: > This is great news! I’m afraid I’m not in a position to use 25.1 yet, > but I look forward to it eagerly. Closing the bug seems right to me; > if the new functionality has flaws, then they would be *new* bugs. So I'm closing the bug. > Thanks very much for letting me know! > > /Lew Best regards, Michael. From unknown Sat Sep 06 09:27:54 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Sat, 01 Oct 2016 11:24:03 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator