From unknown Fri Aug 15 17:20:35 2025 X-Loop: help-debbugs@gnu.org Subject: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?UTF-8?Q?=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0?= as letter Resent-From: mohammad.mahmoudi@gmail.com Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Sun, 15 Feb 2015 19:25:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 19878 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: 19878@debbugs.gnu.org X-Debbugs-Original-To: bug-gnu-emacs@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.142402829929792 (code B ref -1); Sun, 15 Feb 2015 19:25:02 +0000 Received: (at submit) by debbugs.gnu.org; 15 Feb 2015 19:24:59 +0000 Received: from localhost ([127.0.0.1]:44919 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YN4oM-0007kS-LX for submit@debbugs.gnu.org; Sun, 15 Feb 2015 14:24:59 -0500 Received: from eggs.gnu.org ([208.118.235.92]:43140) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YN1Wm-00031g-Hw for submit@debbugs.gnu.org; Sun, 15 Feb 2015 10:54:37 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YN1Wc-0002aB-Dz for submit@debbugs.gnu.org; Sun, 15 Feb 2015 10:54:31 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_05,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:48696) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YN1Wc-0002a7-BH for submit@debbugs.gnu.org; Sun, 15 Feb 2015 10:54:26 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:44861) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YN1WX-0007qf-9m for bug-gnu-emacs@gnu.org; Sun, 15 Feb 2015 10:54:26 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned 
(Exim 4.71) (envelope-from ) id 1YN1WS-0002ZO-Ab for bug-gnu-emacs@gnu.org; Sun, 15 Feb 2015 10:54:21 -0500 Received: from mail-pd0-f180.google.com ([209.85.192.180]:32877) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YN1WS-0002YZ-4M for bug-gnu-emacs@gnu.org; Sun, 15 Feb 2015 10:54:16 -0500 Received: by pdjz10 with SMTP id z10so29872312pdj.0 for ; Sun, 15 Feb 2015 07:54:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:subject:message-id:mime-version:content-type :content-transfer-encoding; bh=AIcquOe5gYDorBVXKWTVdRgf3nvAjHcOk+Hz1LOXpqM=; b=rBoDCQ40lQnT9Q+U/4SAOODGpSMWe/Pa0NH+GzXewih1fVtK7D1W6ymvJ1f0Bh6ST2 hNBiamyviENWquOBxdrx0biwDK94Qe7OHOtI4QqIHgjCWEm+bGLzKbTligiFmP2ceK1D isiXbvW6Rt2dinyZ7Kv2Fik8TkQUQ5LaLz6j6CKlAU7VWqTkVrgEOoTQgsrEJdCDeU/l AhlVoi20Tc9m2QqdsNpHnTnIrniDYN714Aja0Ikbx3Yjqk9IgZr3B2vajldOvtn7gEaE YpZI5Z/SSeuB/pjwA5Ng4EwnWTm2B2Af3nj7IsYaB1+nX0oJHQ+VAA3wIZewrmp296ef YCLw== X-Received: by 10.70.25.228 with SMTP id f4mr26878865pdg.90.1424015654483; Sun, 15 Feb 2015 07:54:14 -0800 (PST) Received: from name ([81.31.164.154]) by mx.google.com with ESMTPSA id ge7sm12162947pbc.16.2015.02.15.07.54.11 for (version=TLSv1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sun, 15 Feb 2015 07:54:14 -0800 (PST) Date: Sun, 15 Feb 2015 19:14:57 +0330 (Iran Standard Time) From: mohammad.mahmoudi@gmail.com Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=UTF-8 Content-Transfer-Encoding: 8BIT X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). 
X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Mailman-Approved-At: Sun, 15 Feb 2015 14:24:58 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----)

This is to report that the Syntax class [:alpha:] wrongly matches the
Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter.

In GNU Emacs 24.4.1 (i686-pc-mingw32)
 of 2014-10-24 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
 `configure --prefix=/c/usr'

Important settings:
  value of $LANG: ENU
  locale-coding-system: cp1256

From unknown Fri Aug 15 17:20:35 2025 X-Loop: help-debbugs@gnu.org Subject: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?UTF-8?Q?=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0?= as letter Resent-From: Andreas Politz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Sun, 15 Feb 2015 20:17:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 19878 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: mohammad.mahmoudi@gmail.com Cc: 19878@debbugs.gnu.org Received: via spool by 19878-submit@debbugs.gnu.org id=B19878.14240313901780 (code B ref 19878); Sun, 15 Feb 2015 20:17:02 +0000 Received: (at 19878) by debbugs.gnu.org; 15 Feb 2015 20:16:30 +0000 Received: from localhost ([127.0.0.1]:44928 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YN5cD-0000Sd-TD for submit@debbugs.gnu.org; Sun, 15 Feb 2015 15:16:30 -0500 Received: from gateway-b.fh-trier.de ([143.93.54.182]:60567) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YN5cB-0000SO-QS for 19878@debbugs.gnu.org; Sun, 15 Feb 2015 15:16:28 -0500 X-Virus-Scanned: by Amavisd-new + McAfee uvscan + ClamAV [Rechenzentrum Hochschule Trier] Received: 
from luca (dslb-092-074-088-169.092.074.pools.vodafone-ip.de [92.74.88.169]) (using TLSv1 with cipher DHE-RSA-AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: politza) by gateway-b.fh-trier.de (Postfix) with ESMTPSA id BFD5517B4A4; Sun, 15 Feb 2015 21:16:13 +0100 (CET) DKIM-Signature: v=1; a=rsa-sha1; c=simple/simple; d=hochschule-trier.de; s=default; t=1424031373; bh=fYDz6T25b4cT9j+dgrUNEFD5H4s=; h=From:To:Cc:Subject:References:Date:In-Reply-To:Message-ID: MIME-Version:Content-Type; b=axsF8/YtweCmNb+Xua4Rr4el6sW2dxbBdZJR94f10PhcYp7mMQDwD6vj4vQviVsXt 96M4ATgCT8zidBx1f7wf0pR+TThQ+Q/yPrHISm7tJ4GPDm5dYtNPXVvibIpxImdJ4C kZkt2sLJpomd020w1qT7iULGE3dnKQwyZ3cEo/Hc= Received: from politza by luca with local (Exim 4.80) (envelope-from ) id 1YN5bx-0006sm-41; Sun, 15 Feb 2015 21:16:13 +0100 From: Andreas Politz References: Date: Sun, 15 Feb 2015 21:16:13 +0100 In-Reply-To: (mohammad mahmoudi's message of "Sun, 15 Feb 2015 19:14:57 +0330 (Iran Standard Time)") Message-ID: <87k2zjj5gy.fsf@hochschule-trier.de> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.0.50 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

I think this is supposed to be:

,----[ (info "(elisp) Char Classes") ]
| `[:alpha:]'
|      This matches any letter. (At present, for multibyte characters, it
|      matches anything that has word syntax.)
`----

-ap

From unknown Fri Aug 15 17:20:35 2025 X-Loop: help-debbugs@gnu.org Subject: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?UTF-8?Q?=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0?= as letter Resent-From: Eli Zaretskii Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 17 Feb 2015 16:14:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 19878 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: Andreas Politz Cc: mohammad.mahmoudi@gmail.com, 19878@debbugs.gnu.org Reply-To: Eli Zaretskii Received: via spool by 19878-submit@debbugs.gnu.org id=B19878.142418958820982 (code B ref 19878); Tue, 17 Feb 2015 16:14:01 +0000 Received: (at 19878) by debbugs.gnu.org; 17 Feb 2015 16:13:08 +0000 Received: from localhost ([127.0.0.1]:46474 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YNkln-0005SM-73 for submit@debbugs.gnu.org; Tue, 17 Feb 2015 11:13:07 -0500 Received: from mtaout21.012.net.il ([80.179.55.169]:48004) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YNklj-0005Rl-5I for 19878@debbugs.gnu.org; Tue, 17 Feb 2015 11:13:04 -0500 Received: from conversion-daemon.a-mtaout21.012.net.il by a-mtaout21.012.net.il (HyperSendmail v2007.08) id <0NJX00M00BJP0P00@a-mtaout21.012.net.il> for 19878@debbugs.gnu.org; Tue, 17 Feb 2015 18:12:56 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by a-mtaout21.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0NJX00LJ8BPKZ840@a-mtaout21.012.net.il>; Tue, 17 Feb 2015 18:12:56 +0200 (IST) Date: Tue, 17 Feb 2015 18:13:05 +0200 From: Eli Zaretskii In-reply-to: <87k2zjj5gy.fsf@hochschule-trier.de> X-012-Sender: halo1@inter.net.il Message-id: <838ufw7bzi.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-transfer-encoding: 8bit References: <87k2zjj5gy.fsf@hochschule-trier.de> <838ufw7bzi.fsf@gnu.org> X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org 
X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+)

> From: Andreas Politz
> Date: Sun, 15 Feb 2015 21:16:13 +0100
> Cc: 19878@debbugs.gnu.org
>
>
> I think this is supposed to be:
>
> ,----[ (info "(elisp) Char Classes") ]
> | `[:alpha:]'
> |      This matches any letter. (At present, for multibyte characters, it
> |      matches anything that has word syntax.)
> `----

Indeed, which doesn't sound very nice.

Does someone object to the changes below (to be installed on master)?
They make [:alpha:] and [:alnum:] closer to the Unicode
recommendations in UTS #18, although we are still very far from
supporting even Level 1 of conformance. But these two seem like
low-hanging fruit to me.

The modified definitions of these two sets are not 100% compatible
with the old ones for the multibyte characters. However, if it turns
out that some code used these to get word-constituent characters,
those places should simply be changed to use \sw instead.

Also, does someone see any potential problem to make [:digit:] be a
superset of the current ASCII-only set, to match UTS #18 as well? The
comment in regex.c says it is "only used for single-byte characters",
but it isn't clear to me whether this is a requirement, i.e. there's
some code in Emacs that relies on that, or just a statement of facts.

Please note that this is my first serious change in regex.c, so I'd
appreciate review from people "in the know". TIA.

--- src/regex.c~0 2015-01-04 10:44:36 +0200
+++ src/regex.c 2015-02-17 17:40:56 +0200
@@ -324,12 +324,12 @@ enum syntaxcode { Swhitespace = 0, Sword
                     ? (((c) >= 'a' && (c) <= 'z') \
                        || ((c) >= 'A' && (c) <= 'Z') \
                        || ((c) >= '0' && (c) <= '9')) \
-                    : SYNTAX (c) == Sword)
+                    : (alphabeticp (c) || decimalnump (c)))

 # define ISALPHA(c) (IS_REAL_ASCII (c) \
                     ? (((c) >= 'a' && (c) <= 'z') \
                        || ((c) >= 'A' && (c) <= 'Z')) \
-                    : SYNTAX (c) == Sword)
+                    : alphabeticp (c))

 # define ISLOWER(c) lowercasep (c)

@@ -1872,6 +1872,8 @@ struct range_table_work_area
 #define BIT_SPACE 0x8
 #define BIT_UPPER 0x10
 #define BIT_MULTIBYTE 0x20
+#define BIT_ALPHA 0x40
+#define BIT_ALNUM 0x80

 /* Set the bit for character C in a list. */

@@ -2072,7 +2074,9 @@ re_wctype_to_bit (re_wctype_t cc)
     {
     case RECC_NONASCII: case RECC_PRINT: case RECC_GRAPH:
     case RECC_MULTIBYTE: return BIT_MULTIBYTE;
-    case RECC_ALPHA: case RECC_ALNUM: case RECC_WORD: return BIT_WORD;
+    case RECC_ALPHA: return BIT_ALPHA;
+    case RECC_ALNUM: return BIT_ALNUM;
+    case RECC_WORD: return BIT_WORD;
     case RECC_LOWER: return BIT_LOWER;
     case RECC_UPPER: return BIT_UPPER;
     case RECC_PUNCT: return BIT_PUNCT;
@@ -2930,7 +2934,7 @@ regex_compile (const_re_char *pattern, s
 #endif /* emacs */
                     /* In most cases the matching rule for char classes
                        only uses the syntax table for multibyte chars,
-                       so that the content of the syntax-table it is not
+                       so that the content of the syntax-table is not
                        hardcoded in the range_table. SPACE and WORD are the
                        two exceptions. */
                     if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
@@ -2945,7 +2949,7 @@ regex_compile (const_re_char *pattern, s
                     p = class_beg;
                     SET_LIST_BIT ('[');

-                    /* Because the `:' may starts the range, we
+                    /* Because the `:' may start the range, we
                        can't simply set bit and repeat the loop.
                        Instead, just set it to C and handle below. */
                     c = ':';
@@ -5513,7 +5517,9 @@ re_match_2_internal (struct re_pattern_b
                     | (class_bits & BIT_PUNCT && ISPUNCT (c))
                     | (class_bits & BIT_SPACE && ISSPACE (c))
                     | (class_bits & BIT_UPPER && ISUPPER (c))
-                    | (class_bits & BIT_WORD && ISWORD (c)))
+                    | (class_bits & BIT_WORD && ISWORD (c))
+                    | (class_bits & BIT_ALPHA && ISALPHA (c))
+                    | (class_bits & BIT_ALNUM && ISALNUM (c)))
                   not = !not;
                 else
                   CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
--- src/character.c~0 2015-01-13 06:48:01 +0200
+++ src/character.c 2015-02-17 17:05:20 +0200
@@ -984,6 +984,50 @@ character is not ASCII nor 8-bit charact

 #ifdef emacs

+/* Return 'true' if C is an alphabetic character as defined by its
+   Unicode properties. */
+bool
+alphabeticp (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18. There are additional characters that should be
+         here, those designated as Other_uppercase, Other_lowercase,
+         and Other_alphabetic; FIXME. */
+      return (gen_cat == UNICODE_CATEGORY_Lu
+              || gen_cat == UNICODE_CATEGORY_Ll
+              || gen_cat == UNICODE_CATEGORY_Lt
+              || gen_cat == UNICODE_CATEGORY_Lm
+              || gen_cat == UNICODE_CATEGORY_Lo
+              || gen_cat == UNICODE_CATEGORY_Mn
+              || gen_cat == UNICODE_CATEGORY_Mc
+              || gen_cat == UNICODE_CATEGORY_Me
+              || gen_cat == UNICODE_CATEGORY_Nl) ? true : false;
+    }
+  return false;
+}
+
+/* Return 'true' if C is a decimal-number character as defined by its
+   Unicode properties. */
+bool
+decimalnump (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18. */
+      return (gen_cat == UNICODE_CATEGORY_Nd) ? true : false;
+    }
+  return false;
+}
+
 void
 syms_of_character (void)
 {
--- src/character.h~0 2015-01-06 10:15:13 +0200
+++ src/character.h 2015-02-17 17:05:33 +0200
@@ -660,6 +660,9 @@ extern Lisp_Object Vchar_unify_table;

 extern Lisp_Object string_escape_byte8 (Lisp_Object);

+extern bool alphabeticp (int);
+extern bool decimalnump (int);
+
 /* Return a translation table of id number ID. */
 #define GET_TRANSLATION_TABLE(id) \
   (XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))

From unknown Fri Aug 15 17:20:35 2025 X-Loop: help-debbugs@gnu.org Subject: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?UTF-8?Q?=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0?= as letter Resent-From: Ivan Shmakov Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 17 Feb 2015 18:16:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 19878 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: 19878@debbugs.gnu.org Received: via spool by 19878-submit@debbugs.gnu.org id=B19878.142419692532467 (code B ref 19878); Tue, 17 Feb 2015 18:16:02 +0000 Received: (at 19878) by debbugs.gnu.org; 17 Feb 2015 18:15:25 +0000 Received: from localhost ([127.0.0.1]:46539 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YNmg8-0008Rb-B5 for submit@debbugs.gnu.org; Tue, 17 Feb 2015 13:15:24 -0500 Received: from fely.am-1.org ([78.47.74.50]:44304) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YNmg6-0008RT-HK for 19878@debbugs.gnu.org; Tue, 17 Feb 2015 13:15:22 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=siamics.net; s=a2013295; h=Content-Transfer-Encoding:Content-Type:MIME-Version:Message-ID:In-Reply-To:Date:Sender:References:Subject:To:From; bh=u2hklHOMCLyBuKEH/CE+dXAf3q/86otdyRV3hC4o0M8=; 
b=p+p70L4aw3lF53ae/36ObFrQ9ErxuvQija/6roWunvjO0etlBU7dWgFga19XvpAhUaraFvT4VFBu7BqVBuinfKSZ7Nf/dEOOaUebzF+xtkKsxEGbW2kKnUzBNaVjA00r4MFU+edzQ6GRsmBf5XhC4uILgz5Xm7v26xL6CW2JwUM=; Received: from [2a02:2560:6d4:26ca::1:1d] (helo=violet.siamics.net) by fely.am-1.org with esmtps (TLS1.2:DHE_RSA_AES_128_CBC_SHA1:128) (Exim 4.80) (envelope-from ) id 1YNmg5-0003v1-5p for 19878@debbugs.gnu.org; Tue, 17 Feb 2015 18:15:21 +0000 Received: from localhost ([::1] helo=violet.siamics.net) by violet.siamics.net with esmtps (TLS1.2:RSA_AES_128_CBC_SHA1:128) (Exim 4.80) (envelope-from ) id 1YNmft-0006Fd-Vw for 19878@debbugs.gnu.org; Wed, 18 Feb 2015 01:15:10 +0700 From: Ivan Shmakov References: <87k2zjj5gy.fsf@hochschule-trier.de> <838ufw7bzi.fsf@gnu.org> Mail-Followup-To: 19878@debbugs.gnu.org Date: Tue, 17 Feb 2015 18:15:09 +0000 In-Reply-To: <838ufw7bzi.fsf@gnu.org> (Eli Zaretskii's message of "Tue, 17 Feb 2015 18:13:05 +0200") Message-ID: <87k2zg8kwi.fsf_-_@violet.siamics.net> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: 0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.7 (/)

>>>>> Eli Zaretskii writes:

[…]

 > Also, does someone see any potential problem to make [:digit:] be a
 > superset of the current ASCII-only set, to match UTS #18 as well?
 > The comment in regex.c says it is "only used for single-byte
 > characters", but it isn't clear to me whether this is a requirement,
 > i. e. there's some code in Emacs that relies on that, or just a
 > statement of facts.

Just for a random data point, my own preference was to always use
[0-9] when the intent is to discern a number for a later use of
number-to-string, etc.
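A rough C analogue of the point above, offered only as an illustrative sketch: the helper names are hypothetical and none of this is Emacs code. A matcher restricted to ASCII digits hands a converter exactly what it can parse, whereas the UTF-8 bytes of the digits from the original report are not converted by `strtol` in the default "C" locale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Emacs): true only for ASCII '0'..'9'. */
static bool
ascii_digit_p (unsigned char c)
{
  return c >= '0' && c <= '9';
}

/* Hypothetical helper: length of the leading run of ASCII digits in S,
   i.e. what a [0-9]* prefix match would cover. */
static size_t
ascii_digit_span (const char *s)
{
  assert (s != NULL);
  size_t n = 0;
  while (ascii_digit_p ((unsigned char) s[n]))
    n++;
  return n;
}

/* Hypothetical helper: parse S as a base-10 long.  Returns true only if
   strtol consumed the whole, non-empty string.  Overflow is not checked
   in this sketch. */
static bool
parse_decimal (const char *s, long *out)
{
  char *end;
  long value = strtol (s, &end, 10);

  if (end == s || *end != '\0')
    return false;
  *out = value;
  return true;
}
```

With these helpers, `parse_decimal ("123", &v)` succeeds, while the bytes `"\xDB\xB1\xDB\xB2\xDB\xB3"` (the UTF-8 encoding of ۱۲۳) produce no conversion at all, so a pattern that also accepted such digits would pass along text the converter cannot use.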
Frankly, I can’t even readily suggest any reasonable examples where
one’d want to use [:digit:] in the first place.

[…]

-- 
FSF associate member #7257 http://boycottsystemd.org/ … 3013 B6A0 230E 334A

From unknown Fri Aug 15 17:20:35 2025 X-Loop: help-debbugs@gnu.org Subject: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?UTF-8?Q?=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0?= as letter Resent-From: Eli Zaretskii Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Tue, 17 Feb 2015 18:46:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 19878 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: To: Ivan Shmakov Cc: 19878@debbugs.gnu.org Reply-To: Eli Zaretskii Received: via spool by 19878-submit@debbugs.gnu.org id=B19878.14241987432663 (code B ref 19878); Tue, 17 Feb 2015 18:46:02 +0000 Received: (at 19878) by debbugs.gnu.org; 17 Feb 2015 18:45:43 +0000 Received: from localhost ([127.0.0.1]:46543 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YNn9T-0000gt-AO for submit@debbugs.gnu.org; Tue, 17 Feb 2015 13:45:43 -0500 Received: from mtaout27.012.net.il ([80.179.55.183]:37049) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YNn9O-0000ga-8M for 19878@debbugs.gnu.org; Tue, 17 Feb 2015 13:45:40 -0500 Received: from conversion-daemon.mtaout27.012.net.il by mtaout27.012.net.il (HyperSendmail v2007.08) id <0NJX00F00II9VQ00@mtaout27.012.net.il> for 19878@debbugs.gnu.org; Tue, 17 Feb 2015 20:39:51 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by mtaout27.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0NJX009H6IIFGY50@mtaout27.012.net.il>; Tue, 17 Feb 2015 20:39:51 +0200 (IST) Date: Tue, 17 Feb 2015 20:45:40 +0200 From: Eli Zaretskii In-reply-to: <87k2zg8kwi.fsf_-_@violet.siamics.net> X-012-Sender: halo1@inter.net.il Message-id: <83mw4c5qcr.fsf@gnu.org> MIME-version: 1.0 
Content-type: text/plain; charset=utf-8 Content-transfer-encoding: 8BIT References: <87k2zjj5gy.fsf@hochschule-trier.de> <838ufw7bzi.fsf@gnu.org> <87k2zg8kwi.fsf_-_@violet.siamics.net> X-Spam-Score: 1.0 (+) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+)

> From: Ivan Shmakov
> Date: Tue, 17 Feb 2015 18:15:09 +0000
>
> Frankly, I can’t even readily suggest any reasonable examples
> where one’d want to use [:digit:] in the first place.

Interactive search is one obvious use case, I think.

From unknown Fri Aug 15 17:20:35 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.503 (Entity 5.503) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: mohammad.mahmoudi@gmail.com Subject: bug#19878: closed (Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?UTF-8?Q?=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0?= as letter) Message-ID: References: <83bnkete0v.fsf@gnu.org> X-Gnu-PR-Message: they-closed 19878 X-Gnu-PR-Package: emacs Reply-To: 19878@debbugs.gnu.org Date: Sat, 28 Feb 2015 12:31:03 +0000 Content-Type: multipart/mixed; boundary="----------=_1425126663-16984-1"

This is a multi-part message in MIME format...

------------=_1425126663-16984-1
Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8"

Your bug report

#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 19878@debbugs.gnu.org.

-- 
19878: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19878
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1425126663-16984-1
Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit

Received: (at 19878-done) by debbugs.gnu.org; 28 Feb 2015 12:30:08 +0000 Received: from localhost ([127.0.0.1]:60427 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YRgX1-0003P3-2M for submit@debbugs.gnu.org; Sat, 28 Feb 2015 07:30:07 -0500 Received: from mtaout27.012.net.il ([80.179.55.183]:53335) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YRgWw-0003MV-ER for 19878-done@debbugs.gnu.org; Sat, 28 Feb 2015 07:30:04 -0500 Received: from conversion-daemon.mtaout27.012.net.il by mtaout27.012.net.il (HyperSendmail v2007.08) id <0NKH00600DK78Z00@mtaout27.012.net.il> for 19878-done@debbugs.gnu.org; Sat, 28 Feb 2015 14:24:27 +0200 (IST) Received: from HOME-C4E4A596F7 ([87.69.4.28]) by mtaout27.012.net.il (HyperSendmail v2007.08) with ESMTPA id <0NKH00OWNEGRFS80@mtaout27.012.net.il>; Sat, 28 Feb 2015 14:24:27 +0200 (IST) Date: Sat, 28 Feb 2015 14:29:52 +0200 From: Eli Zaretskii Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits =?utf-8?B?27Hbstuz27Tbtdu227fbuNu527A=?= as letter In-reply-to: <838ufw7bzi.fsf@gnu.org> X-012-Sender: halo1@inter.net.il To: politza@hochschule-trier.de, mohammad.mahmoudi@gmail.com Message-id: <83bnkete0v.fsf@gnu.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-transfer-encoding: 8bit References: <87k2zjj5gy.fsf@hochschule-trier.de> <838ufw7bzi.fsf@gnu.org> X-Spam-Score: 1.0 (+) X-Debbugs-Envelope-To: 19878-done Cc: 19878-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list Reply-To: Eli Zaretskii List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org 
Sender: "Debbugs-submit" X-Spam-Score: 1.0 (+)

> Date: Tue, 17 Feb 2015 18:13:05 +0200
> From: Eli Zaretskii
> Cc: mohammad.mahmoudi@gmail.com, 19878@debbugs.gnu.org
>
> > From: Andreas Politz
> > Date: Sun, 15 Feb 2015 21:16:13 +0100
> > Cc: 19878@debbugs.gnu.org
> >
> >
> > I think this is supposed to be:
> >
> > ,----[ (info "(elisp) Char Classes") ]
> > | `[:alpha:]'
> > |      This matches any letter. (At present, for multibyte characters, it
> > |      matches anything that has word syntax.)
> > `----
>
> Indeed, which doesn't sound very nice.
>
> Does someone object to the changes below (to be installed on master)?
> They make [:alpha:] and [:alnum:] closer to the Unicode
> recommendations in UTS #18, although we are still very far from
> supporting even Level 1 of conformance. But these two seem like
> low-hanging fruit to me.
>
> The modified definitions of these two sets are not 100% compatible
> with the old ones for the multibyte characters. However, if it turns
> out that some code used these to get word-constituent characters,
> those places should simply be changed to use \sw instead.

No further comments, so I pushed the changes as commit 1a50945 on the
master branch, and I'm marking this bug closed.

> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well? The
> comment in regex.c says it is "only used for single-byte characters",
> but it isn't clear to me whether this is a requirement, i.e. there's
> some code in Emacs that relies on that, or just a statement of facts.

I'd still like to hear an answer and/or opinions about this. If I
hear no comments, I will look into making a similar change to
[:digit:] soon.
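The effect of the change discussed in this thread on the reported digits can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions: the category table is a hand-picked toy (real code would consult the full Unicode data, as the patch does through Emacs's category table), and every `toy_`/`TOY_` name is hypothetical rather than taken from Emacs. The idea it shows is the one from the patch: [:alpha:] and [:alnum:] get their own class bits and are decided by general category, so a decimal digit (Nd) counts for [:alnum:] but no longer for [:alpha:].

```c
#include <assert.h>
#include <stdbool.h>

/* Toy subset of Unicode general categories; illustration only. */
enum toy_category { CAT_LU, CAT_LL, CAT_LO, CAT_MN, CAT_ND, CAT_OTHER };

/* Hand-picked sample code points standing in for a real category table. */
static enum toy_category
toy_category_of (int c)
{
  assert (c >= 0);
  if (c >= 'A' && c <= 'Z')
    return CAT_LU;
  if (c >= 'a' && c <= 'z')
    return CAT_LL;
  if (c >= '0' && c <= '9')
    return CAT_ND;
  if (c >= 0x06F0 && c <= 0x06F9)   /* EXTENDED ARABIC-INDIC DIGIT ZERO..NINE */
    return CAT_ND;
  if (c == 0x0628)                  /* ARABIC LETTER BEH */
    return CAT_LO;
  if (c == 0x0301)                  /* COMBINING ACUTE ACCENT */
    return CAT_MN;
  return CAT_OTHER;
}

/* Alphabetic by category: letters and marks (subset of the patch's list). */
static bool
toy_alphabeticp (int c)
{
  enum toy_category cat = toy_category_of (c);
  return cat == CAT_LU || cat == CAT_LL || cat == CAT_LO || cat == CAT_MN;
}

/* Decimal number by category: Nd only. */
static bool
toy_decimalnump (int c)
{
  return toy_category_of (c) == CAT_ND;
}

/* Separate bits per class, mirroring the idea of BIT_ALPHA/BIT_ALNUM. */
#define TOY_BIT_ALPHA 0x40
#define TOY_BIT_ALNUM 0x80

/* True if C belongs to any class whose bit is set in CLASS_BITS. */
static bool
toy_class_match (int class_bits, int c)
{
  return ((class_bits & TOY_BIT_ALPHA) && toy_alphabeticp (c))
         || ((class_bits & TOY_BIT_ALNUM)
             && (toy_alphabeticp (c) || toy_decimalnump (c)));
}
```

Under these assumptions, `toy_class_match (TOY_BIT_ALPHA, 0x06F1)` is false while `toy_class_match (TOY_BIT_ALNUM, 0x06F1)` is true, which is the behaviour the original report asked for.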
------------=_1425126663-16984-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 15 Feb 2015 19:24:59 +0000 Received: from localhost ([127.0.0.1]:44919 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YN4oM-0007kS-LX for submit@debbugs.gnu.org; Sun, 15 Feb 2015 14:24:59 -0500 Received: from eggs.gnu.org ([208.118.235.92]:43140) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YN1Wm-00031g-Hw for submit@debbugs.gnu.org; Sun, 15 Feb 2015 10:54:37 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YN1Wc-0002aB-Dz for submit@debbugs.gnu.org; Sun, 15 Feb 2015 10:54:31 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_05,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:48696) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YN1Wc-0002a7-BH for submit@debbugs.gnu.org; Sun, 15 Feb 2015 10:54:26 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:44861) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YN1WX-0007qf-9m for bug-gnu-emacs@gnu.org; Sun, 15 Feb 2015 10:54:26 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YN1WS-0002ZO-Ab for bug-gnu-emacs@gnu.org; Sun, 15 Feb 2015 10:54:21 -0500 Received: from mail-pd0-f180.google.com ([209.85.192.180]:32877) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YN1WS-0002YZ-4M for bug-gnu-emacs@gnu.org; Sun, 15 Feb 2015 10:54:16 -0500 Received: by pdjz10 with SMTP id z10so29872312pdj.0 for ; Sun, 15 Feb 2015 07:54:14 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:subject:message-id:mime-version:content-type :content-transfer-encoding; 
bh=AIcquOe5gYDorBVXKWTVdRgf3nvAjHcOk+Hz1LOXpqM=; b=rBoDCQ40lQnT9Q+U/4SAOODGpSMWe/Pa0NH+GzXewih1fVtK7D1W6ymvJ1f0Bh6ST2 hNBiamyviENWquOBxdrx0biwDK94Qe7OHOtI4QqIHgjCWEm+bGLzKbTligiFmP2ceK1D isiXbvW6Rt2dinyZ7Kv2Fik8TkQUQ5LaLz6j6CKlAU7VWqTkVrgEOoTQgsrEJdCDeU/l AhlVoi20Tc9m2QqdsNpHnTnIrniDYN714Aja0Ikbx3Yjqk9IgZr3B2vajldOvtn7gEaE YpZI5Z/SSeuB/pjwA5Ng4EwnWTm2B2Af3nj7IsYaB1+nX0oJHQ+VAA3wIZewrmp296ef YCLw== X-Received: by 10.70.25.228 with SMTP id f4mr26878865pdg.90.1424015654483; Sun, 15 Feb 2015 07:54:14 -0800 (PST) Received: from name ([81.31.164.154]) by mx.google.com with ESMTPSA id ge7sm12162947pbc.16.2015.02.15.07.54.11 for (version=TLSv1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Sun, 15 Feb 2015 07:54:14 -0800 (PST) Date: Sun, 15 Feb 2015 19:14:57 +0330 (Iran Standard Time) From: mohammad.mahmoudi@gmail.com To: bug-gnu-emacs@gnu.org Subject: =?UTF-8?Q?24=2E4=3B_Syntax_class_=5B=3Aalpha=3A=5D_wrongly_matches_the_Indian_digits_=DB=B1=DB=B2=DB=B3=DB=B4=DB=B5=DB=B6=DB=B7=DB=B8=DB=B9=DB=B0_as_letter?= Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=UTF-8 Content-Transfer-Encoding: 8BIT X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Sun, 15 Feb 2015 14:24:58 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) This is to report that the Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter. 
In GNU Emacs 24.4.1 (i686-pc-mingw32) of 2014-10-24 on LEG570 Windowing system distributor `Microsoft Corp.', version 6.1.7601 Configured using: `configure --prefix=/c/usr' Important settings: value of $LANG: ENU locale-coding-system: cp1256 ------------=_1425126663-16984-1--