From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 08:17:26 2019
From: Mattias Engdegård
Subject: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 14:17:15 +0200
To: bug-gnu-emacs@gnu.org

The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.

It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).

The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.

Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.

Content-Disposition: attachment; filename=0001-Fix-ASCII-and-Latin-character-categories.patch

From 9dbb98c7d2f7856a16efcfacdfae7890db3c45fe Mon Sep 17 00:00:00 2001
From: Mattias Engdegård
Date: Thu, 15 Aug 2019 14:04:03 +0200
Subject: [PATCH] Fix ASCII and Latin character categories

* lisp/international/characters.el:
Make the ASCII (a) category include all ASCII characters.
Make the Latin (l) category include only letters from the range 00-ff.
---
 lisp/international/characters.el | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 012827ba1c..379a6a170b 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -127,11 +127,8 @@ ?L
 ^L
 ;;; Setting syntax and category.
 
-;; ASCII
-
-;; All ASCII characters have the category `a' (ASCII) and `l' (Latin).
-(modify-category-entry '(32 . 127) ?a)
-(modify-category-entry '(32 . 127) ?l)
+;; All ASCII characters have the category `a' (ASCII).
+(modify-category-entry '(0 . 127) ?a)
 
 ;; Deal with the CJK charsets first.  Since the syntax of blocks is
 ;; defined per charset, and the charsets may contain e.g. Latin
@@ -510,7 +507,13 @@ ?L
 
 ;; Latin
 
-(modify-category-entry '(#x80 . #x024F) ?l)
+;; ASCII
+(modify-category-entry '(?A . ?Z) ?l)
+(modify-category-entry '(?a . ?z) ?l)
+;; Latin-1 Supplement
+(modify-category-entry '(#xc0 . #xd6) ?l)
+(modify-category-entry '(#xd8 . #xf6) ?l)
+(modify-category-entry '(#xf8 . #xff) ?l)
 
 (let ((tbl (standard-case-table)) c)
 
-- 
2.20.1 (Apple Git-117)

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 11:27:44 2019
Date: Thu, 15 Aug 2019 18:27:28 +0300
Message-Id: <83zhkaphy7.fsf@gnu.org>
From: Eli Zaretskii
To: Mattias Engdegård
Cc: 37036@debbugs.gnu.org
In-reply-to: (message from Mattias Engdegård on Thu, 15 Aug 2019 14:17:15 +0200)
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories

> From: Mattias Engdegård
> Date: Thu, 15 Aug 2019 14:17:15 +0200
>
> The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.
>
> It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).
>
> The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.
>
> Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.

Did you try moving by words after these changes?  What happens in
words that consist of ASCII and non-ASCII Latin characters, for
example?

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 11:46:41 2019
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård
In-Reply-To: <83zhkaphy7.fsf@gnu.org>
Date: Thu, 15 Aug 2019 17:46:35 +0200
Message-Id: <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org>
References: <83zhkaphy7.fsf@gnu.org>
To: Eli Zaretskii
Cc: 37036@debbugs.gnu.org

On 15 Aug 2019, at 17:27, Eli Zaretskii wrote:
>
> Did you try moving by words after these changes?  What happens in
> words that consist of ASCII and non-ASCII Latin characters, for
> example?

No change in behaviour observed in any such case.

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 12:23:17 2019
Date: Thu, 15 Aug 2019 19:23:01 +0300
Message-Id: <83v9uypfdm.fsf@gnu.org>
From: Eli Zaretskii
To: Mattias Engdegård
Cc: 37036@debbugs.gnu.org
In-reply-to: <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> (message from Mattias Engdegård on Thu, 15 Aug 2019 17:46:35 +0200)
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org>

> From: Mattias Engdegård
> Date: Thu, 15 Aug 2019 17:46:35 +0200
> Cc: 37036@debbugs.gnu.org
>
> On 15 Aug 2019, at 17:27, Eli Zaretskii wrote:
> >
> > Did you try moving by words after these changes?  What happens in
> > words that consist of ASCII and non-ASCII Latin characters, for
> > example?
>
> No change in behaviour observed in any such case.

In any case, how to justify the fact that, say, "naïve", has
characters from different scripts?

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 12:30:55 2019
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård
In-Reply-To: <83v9uypfdm.fsf@gnu.org>
Date: Thu, 15 Aug 2019 18:30:47 +0200
References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org>
To: Eli Zaretskii
Cc: 37036@debbugs.gnu.org

On 15 Aug 2019, at 18:23, Eli Zaretskii wrote:
>
> In any case, how to justify the fact that, say, "naïve", has
> characters from different scripts?

The proposed change does not change the categories of any character in that string. Or did you mean something else?

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 13:00:11 2019
Date: Thu, 15 Aug 2019 19:59:53 +0300
Message-Id: <83pnl6pdo6.fsf@gnu.org>
From: Eli Zaretskii
To: Mattias Engdegård
Cc: 37036@debbugs.gnu.org
In-reply-to: (message from Mattias Engdegård on Thu, 15 Aug 2019 18:30:47 +0200)
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org>

> From: Mattias Engdegård
> Date: Thu, 15 Aug 2019 18:30:47 +0200
> Cc: 37036@debbugs.gnu.org
>
> On 15 Aug 2019, at 18:23, Eli Zaretskii wrote:
> >
> > In any case, how to justify the fact that, say, "naïve", has
> > characters from different scripts?
>
> The proposed change does not change the categories of any character in that string.

What about "abcdef^A^B"?  Does M-f stop before the control characters?

I guess I don't understand the rationale for the change.  Categories
are Emacs's invention, and their purpose is mostly to allow us to use
regexps for searching certain characters, and other similar
subtleties.  Your rationale seems to be some attempt to be formally
"consistent".  But this is not a formal attribute, it is entirely
ad-hoc, as can be easily seen by just looking at the list of the
categories.

So I wonder why would we want to rock that particular boat.
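
For anyone who wants to look at the categories under discussion, the following is a minimal sketch of how the category membership of individual characters could be inspected. It assumes only the built-in functions char-category-set and category-set-mnemonics; the helper name my-char-categories is made up for illustration and is not part of Emacs. The results depend on the Emacs version and on whether the patch in this report is applied, so none are shown.

```elisp
;; Hypothetical helper (not part of Emacs): return the category
;; mnemonics of CH in the current buffer's category table.
(defun my-char-categories (ch)
  "Return a string of the category mnemonics that CH belongs to."
  (category-set-mnemonics (char-category-set ch)))

;; Compare a letter, a digit, a punctuation mark, a control character
;; and a non-ASCII Latin letter.  Each element of the result is a cons
;; of the character code and its mnemonic string.
(mapcar (lambda (ch) (cons ch (my-char-categories ch)))
        (list ?a ?7 ?. ?\C-a ?ï))
```

The full list of defined categories can also be browsed interactively with M-x describe-categories.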

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 13:38:02 2019
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
From: Mattias Engdegård
In-Reply-To: <83pnl6pdo6.fsf@gnu.org>
Date: Thu, 15 Aug 2019 19:37:49 +0200
Message-Id: <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org>
References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org>
To: Eli Zaretskii
Cc: 37036@debbugs.gnu.org

On 15 Aug 2019, at 18:59, Eli Zaretskii wrote:
>
> What about "abcdef^A^B"?  Does M-f stop before the control characters?

Yes. Does forward-word use categories?

> I guess I don't understand the rationale for the change.  Categories
> are Emacs's invention, and their purpose is mostly to allow us to use
> regexps for searching certain characters, and other similar
> subtleties.  Your rationale seems to be some attempt to be formally
> "consistent".  But this is not a formal attribute, it is entirely
> ad-hoc, as can be easily seen by just looking at the list of the
> categories.

The more categories are arbitrary, the less useful they are. Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure? If 'Latin' means 'Latin letters, some symbols, some whitespace, some control chars, Indo-Arabic digits and the occasional Greek letter', which it does today, then who can use it correctly?

Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that. Those who reviewed that function thought it looked reasonable, as did I when I read it.

It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.

From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 15:23:17 2019
Date: Thu, 15 Aug 2019 22:23:00 +0300
Message-Id: <83o90qp71n.fsf@gnu.org>
From: Eli Zaretskii
To: Mattias Engdegård
Cc: 37036@debbugs.gnu.org
In-reply-to: <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org> (message from Mattias Engdegård on Thu, 15 Aug 2019 19:37:49 +0200)
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org>

> From: Mattias Engdegård
> Date: Thu, 15 Aug 2019 19:37:49 +0200
> Cc: 37036@debbugs.gnu.org
>
> On 15 Aug 2019, at 18:59, Eli Zaretskii wrote:
> >
> > What about "abcdef^A^B"?  Does M-f stop before the control characters?
>
> Yes. Does forward-word use categories?

No.  Sorry, it was my faulty memory.  It uses char-script-table
instead.

> The more categories are arbitrary, the less useful they are.

I think they should become entirely useless, i.e. we should stop using
them.  We have the entire Unicode database with all the character
properties for quite some time now, and should favor using that
instead.  Categories are an old kludgey hack, which goes back to
pre-Unicode Emacs; it can never be anything but arbitrary, and we will
never be able to fix that anywhere near completely.

> Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure?

I don't know why anyone should.  My recommendation is to just say no.

> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.

Can you tell the details of where this function doesn't work?  I'd
like to understand why fixing it needs to change the categories.

> It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.

I don't think we should fix those mistakes, because that's an
impossible goal.  We should instead gradually stop using categories
for anything serious, certainly for any new code.  We should use the
UCD properties and the various char-tables built upon that instead.
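
As a rough illustration of the UCD-based approach, here is a minimal sketch of a "Latin letter" test that avoids categories. It assumes the built-in char-script-table and get-char-code-property; the helper name my-latin-letter-p is hypothetical and not an Emacs function, and the choice of general-category values (all L* classes) is an assumption about what "letter" should mean here.

```elisp
;; Hypothetical helper (not part of Emacs): a Latin-letter predicate
;; built on Unicode data rather than on the `l' category.
(defun my-latin-letter-p (ch)
  "Return t if CH is a letter belonging to the Latin script."
  (and
   ;; `char-script-table' maps each character to a script symbol.
   (eq (aref char-script-table ch) 'latin)
   ;; The script alone is not enough, since digits and punctuation in
   ;; the same ranges share it; also require a letter general category.
   (memq (get-char-code-property ch 'general-category)
         '(Lu Ll Lt Lm Lo))
   t))
```

With this definition one would expect (my-latin-letter-p ?a) and (my-latin-letter-p ?ï) to be non-nil, and (my-latin-letter-p ?.) and (my-latin-letter-p ?1) to be nil.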
From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 15:47:07 2019 Received: (at 37036) by debbugs.gnu.org; 15 Aug 2019 19:47:07 +0000 Received: from localhost ([127.0.0.1]:52487 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyLiE-00060j-Pf for submit@debbugs.gnu.org; Thu, 15 Aug 2019 15:47:07 -0400 Received: from eggs.gnu.org ([209.51.188.92]:34729) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyLiD-00060G-0u for 37036@debbugs.gnu.org; Thu, 15 Aug 2019 15:47:05 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:53680) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hyLi7-0000s8-Ja; Thu, 15 Aug 2019 15:46:59 -0400 Received: from [176.228.60.248] (port=1187 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1hyLi7-0007Z5-4a; Thu, 15 Aug 2019 15:46:59 -0400 Date: Thu, 15 Aug 2019 22:46:52 +0300 Message-Id: <83k1bep5xv.fsf@gnu.org> From: Eli Zaretskii To: mattiase@acm.org In-reply-to: <83o90qp71n.fsf@gnu.org> (message from Eli Zaretskii on Thu, 15 Aug 2019 22:23:00 +0300) Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org> <83o90qp71n.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 37036 Cc: 37036@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) > Date: Thu, 15 Aug 2019 22:23:00 +0300 > From: Eli Zaretskii 
> Cc: 37036@debbugs.gnu.org > > > From: Mattias Engdegård > > Date: Thu, 15 Aug 2019 19:37:49 +0200 > > Cc: 37036@debbugs.gnu.org > > > > 15 aug. 2019 kl. 18.59 skrev Eli Zaretskii : > > > > > > What about "abcdef^A^B"? Does M-f stop before the control characters? > > > > Yes. Does forward-word use categories? > > No. Sorry, it was my faulty memory. It uses char-script-table > instead. Actually, it uses categories indirectly, via word-combining-categories and word-separating-categories. From debbugs-submit-bounces@debbugs.gnu.org Thu Aug 15 18:19:56 2019 Received: (at 37036) by debbugs.gnu.org; 15 Aug 2019 22:19:56 +0000 Received: from localhost ([127.0.0.1]:52589 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyO67-0002No-NX for submit@debbugs.gnu.org; Thu, 15 Aug 2019 18:19:55 -0400 Received: from mail1430c50.megamailservers.eu ([91.136.14.30]:44248 helo=mail118c50.megamailservers.eu) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyO64-0002NZ-Jg for 37036@debbugs.gnu.org; Thu, 15 Aug 2019 18:19:54 -0400 X-Authenticated-User: mattiase@bredband.net DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=megamailservers.eu; s=maildub; t=1565907586; bh=T58pz2vh6mCNYWvFhuZzQDdQb6WrDChjluGQgLn8Hnk=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From; b=CYys2uaCIgt5IkbK4XcEBKU62uolG/YI6DFcfFXttnTo+fYK1JIX/riDZv+SZ5yPD KRKh5sU4/RdFAMqpdkwwWFCtoovcGvhq0WsgHuh5i2gjsEpebnY+OECj8ol00BNKzl FTAKzBEdJEyx8zD9wnoMe5C2WpgjZFUR58lvO/zk= Feedback-ID: mattiase@acm.or Received: from [192.168.0.4] ([188.150.171.71]) (authenticated bits=0) by mail118c50.megamailservers.eu (8.14.9/8.13.1) with ESMTP id x7FMJhio013928; Thu, 15 Aug 2019 22:19:45 +0000 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.11\)) Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <83o90qp71n.fsf@gnu.org> Date: Fri, 16 Aug 2019 
00:19:43 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org> <83o90qp71n.fsf@gnu.org> To: Eli Zaretskii X-Mailer: Apple Mail (2.3445.104.11) X-CTCH-RefID: str=0001.0A0B0215.5D55DA82.0006, ss=1, re=0.000, recu=0.000, reip=0.000, cl=1, cld=1, fgs=0 X-CTCH-VOD: Unknown X-CTCH-Spam: Unknown X-CTCH-Score: 0.000 X-CTCH-Rules: X-CTCH-Flags: 0 X-CTCH-ScoreCust: 0.000 X-CSC: 0 X-CHA: v=2.3 cv=Mqx8FVSe c=1 sm=1 tr=0 a=SF+I6pRkHZhrawxbOkkvaA==:117 a=SF+I6pRkHZhrawxbOkkvaA==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=kj9zAlcOel0A:10 a=M51BFTxLslgA:10 a=mDV3o1hIAAAA:8 a=OeO-1JlaohBvpmAH3J8A:9 a=CjuIK1q_8ugA:10 a=_FVE-zBwftR9WsbkzFJk:22 X-Spam-Score: 0.3 (/) X-Debbugs-Envelope-To: 37036 Cc: 37036@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) 15 aug. 2019 kl. 21.23 skrev Eli Zaretskii : > I think they should become entirely useless, i.e. we should stop using > them. We have the entire Unicode database with all the character > properties for quite some time now, and should favor using that > instead. Categories are an old kludgey hack, which goes back to > pre-Unicode Emacs; it can never be anything but arbitrary, and we will > never be able to fix that anywhere near completely. Thank you, I see what you mean, and I agree that Unicode properties = probably are better for most purposes. In any case, I wasn't aiming for perfection; that is indeed a fool's = errand. It was just a discovery of a rather obvious mistake, and = evidence of code that doesn't work properly because of it. I thought the = patch would be rather uncontroversial. 
>> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
>
> Can you tell the details of where this function doesn't work? I'd
> like to understand why fixing it needs to change the categories.

Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

Of course it doesn't require the categories to be fixed. The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

> I don't think we should fix those mistakes, because that's an
> impossible goal. We should instead gradually stop using categories
> for anything serious, certainly for any new code. We should use the
> UCD properties and the various char-tables built upon that instead.

Perhaps, but categories still have one thing going for them: they have fairly good regexp support.
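
[Editorial sketch, not part of the original message.] The behaviour described above can be probed from Lisp. The helper below is hypothetical (the name my-matches-cl-p is an assumption, not an Emacs function); it only asks whether the regexp \cl matches a given character under the current buffer's category table. According to the message above, punctuation and digits from ASCII are expected to match as well as letters, but the result depends on the Emacs version and category table in use, so no output is claimed here.

```elisp
;; Hypothetical helper for illustration; not part of Emacs.
(defun my-matches-cl-p (char)
  "Return t if the regexp \\cl matches CHAR, nil otherwise.
The result depends on the current buffer's category table."
  (and (string-match-p "\\cl" (char-to-string char)) t))

;; Print the result for a few letters, punctuation marks and a digit.
(dolist (ch '(?a ?ż ?. ?\) ?1))
  (message "%c: %s" ch (my-matches-cl-p ch)))
```

Evaluating the loop in a scratch buffer shows which of these characters currently carry category l, which is the property the single-letter-word test relies on.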
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 16 05:33:25 2019 Received: (at 37036) by debbugs.gnu.org; 16 Aug 2019 09:33:25 +0000 Received: from localhost ([127.0.0.1]:53066 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyYbs-0006BL-M1 for submit@debbugs.gnu.org; Fri, 16 Aug 2019 05:33:25 -0400 Received: from eggs.gnu.org ([209.51.188.92]:46580) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyYbr-0006B9-Cs for 37036@debbugs.gnu.org; Fri, 16 Aug 2019 05:33:23 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:37751) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hyYbk-00022F-8q; Fri, 16 Aug 2019 05:33:18 -0400 Received: from [176.228.60.248] (port=3671 helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1hyYbi-0004wz-Mk; Fri, 16 Aug 2019 05:33:15 -0400 Date: Fri, 16 Aug 2019 12:33:08 +0300 Message-Id: <835zmxpi97.fsf@gnu.org> From: Eli Zaretskii To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= In-reply-to: (message from Mattias =?utf-8?Q?Engdeg=C3=A5rd?= on Fri, 16 Aug 2019 00:19:43 +0200) Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org> <83o90qp71n.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 37036 Cc: 37036@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) > From: Mattias Engdegård > Date: Fri, 16 Aug 
2019 00:19:43 +0200
> Cc: 37036@debbugs.gnu.org
>
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch made all the non-letter characters excluded from the Latin category, is that right? If so, it's a pretty significant change IMO; who knows what it could break, including outside of the core Emacs. The fact that the Latin category is not well defined doesn't yet mean we are at liberty of changing that (implied) definition at will. Categories are currently used for a small number of core Emacs features, and AFAIR were created incrementally as the ad-hoc need for each one of them arose, so we also risk breaking our own code. Do we really have a good reason to wake those sleeping dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> >
> > Can you tell the details of where this function doesn't work? I'd
> > like to understand why fixing it needs to change the categories.
>
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and symbols that are part of the Latin blocks? That just means it shouldn't use \cl in the first place (and yes, my suggestion to use that in the bug discussion was wrong, sorry), it should use the general-category Unicode property to filter out punctuation characters. Or it could use explicit ranges of codepoints. Or we could extend [:punct:] to support non-ASCII punctuation in a more meaningful way.

Either way, that's not a reason good enough to make significant changes in how the categories are defined. If any extensions are needed, I'd rather we made it in more modern and less ad-hoc features.

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal. We should instead gradually stop using categories
> > for anything serious, certainly for any new code. We should use the
> > UCD properties and the various char-tables built upon that instead.
>
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO in a regexp just makes the code access some char-table. But the same is true for get-char-code-property and for accessing char-script-table from Lisp, to mention just two alternatives. And we all know that using regular expressions for solving a problem sometimes _adds_ a problem instead of solving one.

If we have some functionality in regular expressions that's supported by categories, but is unavailable or inconvenient with Unicode properties, I'd rather we extended our regex engine to support the likes of \p{Po} and \p{script=greek}, see http://unicode.org/reports/tr18/, instead of wasting our resources on "fixing" the categories.
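
[Editorial sketch, not part of the original message.] The alternative suggested above, consulting char-script-table and the general-category property instead of \cl, might look roughly like this. The function name my-latin-letter-p and the choice of the letter categories Lu, Ll, Lt, Lm and Lo are assumptions made for illustration; this is not the code that was or would be used in fill-polish-nobreak-p.

```elisp
;; Hypothetical helper; the name and the set of accepted general
;; categories are assumptions, not anything defined by Emacs.
(defun my-latin-letter-p (char)
  "Return t if CHAR is a letter of the Latin script per Unicode data."
  (and
   ;; Script as recorded in `char-script-table'.
   (eq (aref char-script-table char) 'latin)
   ;; General category must be one of the letter categories.
   (memq (get-char-code-property char 'general-category)
         '(Lu Ll Lt Lm Lo))
   t))
```

With a predicate of this kind, a single-letter word such as "z" or "w" still qualifies, while a full stop, a closing parenthesis or a digit does not, because their general categories are not letter categories.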
From debbugs-submit-bounces@debbugs.gnu.org Fri Aug 16 06:48:48 2019 Received: (at 37036) by debbugs.gnu.org; 16 Aug 2019 10:48:48 +0000 Received: from localhost ([127.0.0.1]:53180 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyZmp-0008Ca-MD for submit@debbugs.gnu.org; Fri, 16 Aug 2019 06:48:47 -0400 Received: from mail1427c50.megamailservers.eu ([91.136.14.27]:37074 helo=mail118c50.megamailservers.eu) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hyZmm-0008CF-7Z; Fri, 16 Aug 2019 06:48:45 -0400 X-Authenticated-User: mattiase@bredband.net DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=megamailservers.eu; s=maildub; t=1565952517; bh=savIJunojn17JqTmpPRi6XfwyoAJv/XKKhH6gaspsJI=; h=Subject:From:In-Reply-To:Date:Cc:References:To:From; b=j1B/eVTipuVHHdkCX4XCR7zoLbx5Tetnuh9vpTJ4KEuKd/jo8Qkt80LTbyvE2mhYi Y7hiDII+Y8CDjam23gpP7Vbuzvd2quS9WFfKyRbCrQNCjyjf6PmLyxTXfqffHhkgbk ycd/UyoXSvX0nJtoln7C5zZlBM0YcGOkPquNjypc= Feedback-ID: mattiase@acm.or Received: from [192.168.0.4] ([188.150.171.71]) (authenticated bits=0) by mail118c50.megamailservers.eu (8.14.9/8.13.1) with ESMTP id x7GAmYwi024025; Fri, 16 Aug 2019 10:48:36 +0000 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.11\)) Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <835zmxpi97.fsf@gnu.org> Date: Fri, 16 Aug 2019 12:48:34 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <0B532B75-600B-4FA0-AE9C-BF1980295363@acm.org> References: <83zhkaphy7.fsf@gnu.org> <183B7811-9B30-4D6B-BFCA-36A13CE8B6DB@acm.org> <83v9uypfdm.fsf@gnu.org> <83pnl6pdo6.fsf@gnu.org> <2B0EDC85-CAAE-4658-AA6D-85AF4842BFCF@acm.org> <83o90qp71n.fsf@gnu.org> <835zmxpi97.fsf@gnu.org> To: Eli Zaretskii X-Mailer: Apple Mail (2.3445.104.11) X-CTCH-RefID: str=0001.0A0B0211.5D568A05.0024, ss=1, re=0.000, recu=0.000, reip=0.000, cl=1, cld=1, fgs=0 X-CTCH-VOD: 
Unknown X-CTCH-Spam: Unknown X-CTCH-Score: 0.000 X-CTCH-Rules: X-CTCH-Flags: 0 X-CTCH-ScoreCust: 0.000 X-CSC: 0 X-CHA: v=2.3 cv=Mqx8FVSe c=1 sm=1 tr=0 a=SF+I6pRkHZhrawxbOkkvaA==:117 a=SF+I6pRkHZhrawxbOkkvaA==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=kj9zAlcOel0A:10 a=M51BFTxLslgA:10 a=mDV3o1hIAAAA:8 a=SmCIVH_NMYko8KMcmZMA:9 a=CjuIK1q_8ugA:10 a=_FVE-zBwftR9WsbkzFJk:22 X-Spam-Score: 0.3 (/) X-Debbugs-Envelope-To: 37036 Cc: 37036@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

tags 37036 wontfix
close 37036
stop

On 16 Aug 2019, at 11:33, Eli Zaretskii wrote:
>
>> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.
>
> This argument goes both ways: there could be code out there which
> relies on the current "broken" definition of the Latin category.

Well, that's an argument against fixing any bug. In general, code is more likely to depend on correctness than on errors.

That said, this is nothing I feel strongly about; let's not waste any more time. Maybe the manual section about categories should be amended to discourage would-be users.

From unknown Fri Jun 13 10:28:23 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Fri, 13 Sep 2019 11:24:08 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator