GNU bug report logs - #19878
24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter

Package: emacs

Reported by: mohammad.mahmoudi <at> gmail.com

Date: Sun, 15 Feb 2015 19:25:02 UTC

Severity: normal

Found in version 24.4

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19878 in the body.
You can then email your comments to 19878 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#19878; Package emacs. (Sun, 15 Feb 2015 19:25:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to mohammad.mahmoudi <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 15 Feb 2015 19:25:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: mohammad.mahmoudi <at> gmail.com
To: bug-gnu-emacs <at> gnu.org
Subject: 24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Sun, 15 Feb 2015 19:14:57 +0330 (Iran Standard Time)

This is to report that the Syntax class [:alpha:] wrongly matches the
Indian digits ۱۲۳۴۵۶۷۸۹۰ as letters.


In GNU Emacs 24.4.1 (i686-pc-mingw32)
 of 2014-10-24 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
 Configured using:
 `configure --prefix=/c/usr'

 Important settings:
  value of $LANG: ENU
  locale-coding-system: cp1256

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19878; Package emacs. (Sun, 15 Feb 2015 20:17:02 GMT) Full text and rfc822 format available.

Message #8 received at 19878 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Politz <politza <at> hochschule-trier.de>
To: mohammad.mahmoudi <at> gmail.com
Cc: 19878 <at> debbugs.gnu.org
Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the
 Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Sun, 15 Feb 2015 21:16:13 +0100

I think this is supposed to be:

,----[ (info "(elisp) Char Classes") ]
| `[:alpha:]'
|      This matches any letter.  (At present, for multibyte characters, it
|      matches anything that has word syntax.)
`----

-ap

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19878; Package emacs. (Tue, 17 Feb 2015 16:14:01 GMT) Full text and rfc822 format available.

Message #11 received at 19878 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Andreas Politz <politza <at> hochschule-trier.de>
Cc: mohammad.mahmoudi <at> gmail.com, 19878 <at> debbugs.gnu.org
Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the
 Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Tue, 17 Feb 2015 18:13:05 +0200

> From: Andreas Politz <politza <at> hochschule-trier.de>
> Date: Sun, 15 Feb 2015 21:16:13 +0100
> Cc: 19878 <at> debbugs.gnu.org
> 
> 
> I think this is supposed to be:
> 
> ,----[ (info "(elisp) Char Classes") ]
> | `[:alpha:]'
> |      This matches any letter.  (At present, for multibyte characters, it
> |      matches anything that has word syntax.)
> `----

Indeed, which doesn't sound very nice.

Does someone object to the changes below (to be installed on master)?
They make [:alpha:] and [:alnum:] closer to the Unicode
recommendations in UTS #18, although we are still very far from
supporting even Level 1 of conformance.  But these two seem like
low-hanging fruit to me.

The modified definitions of these two sets are not 100% compatible
with the old ones for the multibyte characters.  However, if it turns
out that some code used these to get word-constituent characters,
those places should simply be changed to use \sw instead.

Also, does someone see any potential problem to make [:digit:] be a
superset of the current ASCII-only set, to match UTS #18 as well?  The
comment in regex.c says it is "only used for single-byte characters",
but it isn't clear to me whether this is a requirement, i.e. there's
some code in Emacs that relies on that, or just a statement of facts.

Please note that this is my first serious change in regex.c, so I'd
appreciate review from people "in the know".  TIA.

--- src/regex.c~0	2015-01-04 10:44:36 +0200
+++ src/regex.c	2015-02-17 17:40:56 +0200
@@ -324,12 +324,12 @@ enum syntaxcode { Swhitespace = 0, Sword
 		    ? (((c) >= 'a' && (c) <= 'z')	\
 		       || ((c) >= 'A' && (c) <= 'Z')	\
 		       || ((c) >= '0' && (c) <= '9'))	\
-		    : SYNTAX (c) == Sword)
+		    : (alphabeticp (c) || decimalnump (c)))
 
 # define ISALPHA(c) (IS_REAL_ASCII (c)			\
 		    ? (((c) >= 'a' && (c) <= 'z')	\
 		       || ((c) >= 'A' && (c) <= 'Z'))	\
-		    : SYNTAX (c) == Sword)
+		    : alphabeticp (c))
 
 # define ISLOWER(c) lowercasep (c)
 
@@ -1872,6 +1872,8 @@ struct range_table_work_area
 #define BIT_SPACE	0x8
 #define BIT_UPPER	0x10
 #define BIT_MULTIBYTE	0x20
+#define BIT_ALPHA	0x40
+#define BIT_ALNUM	0x80
 
 
 /* Set the bit for character C in a list.  */
@@ -2072,7 +2074,9 @@ re_wctype_to_bit (re_wctype_t cc)
     {
     case RECC_NONASCII: case RECC_PRINT: case RECC_GRAPH:
     case RECC_MULTIBYTE: return BIT_MULTIBYTE;
-    case RECC_ALPHA: case RECC_ALNUM: case RECC_WORD: return BIT_WORD;
+    case RECC_ALPHA: return BIT_ALPHA;
+    case RECC_ALNUM: return BIT_ALNUM;
+    case RECC_WORD: return BIT_WORD;
     case RECC_LOWER: return BIT_LOWER;
     case RECC_UPPER: return BIT_UPPER;
     case RECC_PUNCT: return BIT_PUNCT;
@@ -2930,7 +2934,7 @@ regex_compile (const_re_char *pattern, s
 #endif	/* emacs */
 			/* In most cases the matching rule for char classes
 			   only uses the syntax table for multibyte chars,
-			   so that the content of the syntax-table it is not
+			   so that the content of the syntax-table is not
 			   hardcoded in the range_table.  SPACE and WORD are
 			   the two exceptions.  */
 			if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
@@ -2945,7 +2949,7 @@ regex_compile (const_re_char *pattern, s
 			p = class_beg;
 			SET_LIST_BIT ('[');
 
-			/* Because the `:' may starts the range, we
+			/* Because the `:' may start the range, we
 			   can't simply set bit and repeat the loop.
 			   Instead, just set it to C and handle below.  */
 			c = ':';
@@ -5513,7 +5517,9 @@ re_match_2_internal (struct re_pattern_b
 		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
 		    | (class_bits & BIT_SPACE && ISSPACE (c))
 		    | (class_bits & BIT_UPPER && ISUPPER (c))
-		    | (class_bits & BIT_WORD  && ISWORD (c)))
+		    | (class_bits & BIT_WORD  && ISWORD  (c))
+		    | (class_bits & BIT_ALPHA && ISALPHA (c))
+		    | (class_bits & BIT_ALNUM && ISALNUM (c)))
 		  not = !not;
 		else
 		  CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);

--- src/character.c~0	2015-01-13 06:48:01 +0200
+++ src/character.c	2015-02-17 17:05:20 +0200
@@ -984,6 +984,50 @@ character is not ASCII nor 8-bit charact
 
 #ifdef emacs
 
+/* Return 'true' if C is an alphabetic character as defined by its
+   Unicode properties.  */
+bool
+alphabeticp (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18.  There are additional characters that should be
+	 here, those designated as Other_uppercase, Other_lowercase,
+	 and Other_alphabetic; FIXME.  */
+      return (gen_cat == UNICODE_CATEGORY_Lu
+	      || gen_cat == UNICODE_CATEGORY_Ll
+	      || gen_cat == UNICODE_CATEGORY_Lt
+	      || gen_cat == UNICODE_CATEGORY_Lm
+	      || gen_cat == UNICODE_CATEGORY_Lo
+	      || gen_cat == UNICODE_CATEGORY_Mn
+	      || gen_cat == UNICODE_CATEGORY_Mc
+	      || gen_cat == UNICODE_CATEGORY_Me
+	      || gen_cat == UNICODE_CATEGORY_Nl) ? true : false;
+    }
+  return false;
+}
+
+/* Return 'true' if C is a decimal-number character as defined by its
+   Unicode properties.  */
+bool
+decimalnump (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (INTEGERP (category))
+    {
+      unicode_category_t gen_cat = XINT (category);
+
+      /* See UTS #18.  */
+      return (gen_cat == UNICODE_CATEGORY_Nd) ? true : false;
+    }
+  return false;
+}
+
 void
 syms_of_character (void)
 {


--- src/character.h~0	2015-01-06 10:15:13 +0200
+++ src/character.h	2015-02-17 17:05:33 +0200
@@ -660,6 +660,9 @@
 extern Lisp_Object Vchar_unify_table;
 extern Lisp_Object string_escape_byte8 (Lisp_Object);
 
+extern bool alphabeticp (int);
+extern bool decimalnump (int);
+
 /* Return a translation table of id number ID.  */
 #define GET_TRANSLATION_TABLE(id) \
   (XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))
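
To illustrate the behavioral change described above, here is a minimal standalone sketch of the old and new rules for [:alpha:] and [:alnum:]. It is not Emacs code: the enum, the toy_* helpers, and the old_/new_/sketch_ function names are all hypothetical, and the toy category lookup covers only ASCII, the Arabic letters U+0627..U+063A, and the Extended Arabic-Indic digits U+06F0..U+06F9 (the digits from the bug title). The sketch assumes that a character with no known category should be treated as neither a letter nor a digit.

```c
#include <stdbool.h>

/* Hypothetical stand-in for Emacs's unicode_category_t.  Only the
   general categories tested in the patch are named.  */
enum toy_category
{
  TOY_Lu, TOY_Ll, TOY_Lt, TOY_Lm, TOY_Lo,
  TOY_Mn, TOY_Mc, TOY_Me, TOY_Nd, TOY_Nl,
  TOY_OTHER,	/* any category the patch does not test */
  TOY_UNKNOWN	/* models a table entry that is not an integer */
};

/* Toy replacement for CHAR_TABLE_REF (Vunicode_category_table, c).
   It knows only a few ranges; real code consults the full table.  */
static enum toy_category
toy_category_of (int c)
{
  if (c < 0)
    return TOY_UNKNOWN;
  if (c >= 'A' && c <= 'Z')
    return TOY_Lu;
  if (c >= 'a' && c <= 'z')
    return TOY_Ll;
  if (c >= '0' && c <= '9')
    return TOY_Nd;
  if (c >= 0x0627 && c <= 0x063A)	/* Arabic letters alef..ghain */
    return TOY_Lo;
  if (c >= 0x06F0 && c <= 0x06F9)	/* Extended Arabic-Indic digits */
    return TOY_Nd;
  return TOY_OTHER;
}

/* Toy word-syntax test (an assumption of this sketch): letters and
   digits alike count as word constituents.  */
static bool
toy_word_syntax_p (int c)
{
  enum toy_category cat = toy_category_of (c);
  return cat != TOY_OTHER && cat != TOY_UNKNOWN;
}

static bool
ascii_letter_p (int c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Old rule: outside ASCII, [:alpha:] means "has word syntax".  */
static bool
old_isalpha (int c)
{
  if (c >= 0 && c < 0x80)
    return ascii_letter_p (c);
  return toy_word_syntax_p (c);
}

/* Category test in the spirit of the patch's alphabeticp.  */
static bool
sketch_alphabeticp (int c)
{
  switch (toy_category_of (c))
    {
    case TOY_Lu: case TOY_Ll: case TOY_Lt: case TOY_Lm: case TOY_Lo:
    case TOY_Mn: case TOY_Mc: case TOY_Me: case TOY_Nl:
      return true;
    default:
      return false;	/* also covers the "no category" case */
    }
}

/* Category test in the spirit of the patch's decimalnump.  */
static bool
sketch_decimalnump (int c)
{
  return toy_category_of (c) == TOY_Nd;
}

/* New rule: outside ASCII, [:alpha:] follows the general category.  */
static bool
new_isalpha (int c)
{
  if (c >= 0 && c < 0x80)
    return ascii_letter_p (c);
  return sketch_alphabeticp (c);
}

/* New rule for [:alnum:]: alphabetic or decimal number.  */
static bool
new_isalnum (int c)
{
  if (c >= 0 && c < 0x80)
    return ascii_letter_p (c) || (c >= '0' && c <= '9');
  return sketch_alphabeticp (c) || sketch_decimalnump (c);
}
```

With these toy tables, old_isalpha (0x06F1) is true because the digit counts as a word constituent, whereas new_isalpha (0x06F1) is false and new_isalnum (0x06F1) is true.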

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19878; Package emacs. (Tue, 17 Feb 2015 18:16:02 GMT) Full text and rfc822 format available.

Message #14 received at 19878 <at> debbugs.gnu.org (full text, mbox):

From: Ivan Shmakov <ivan <at> siamics.net>
To: 19878 <at> debbugs.gnu.org
Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the
 Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter 
Date: Tue, 17 Feb 2015 18:15:09 +0000

>>>>> Eli Zaretskii <eliz <at> gnu.org> writes:

[…]

 > Also, does someone see any potential problem to make [:digit:] be a
 > superset of the current ASCII-only set, to match UTS #18 as well?
 > The comment in regex.c says it is "only used for single-byte
 > characters", but it isn't clear to me whether this is a requirement,
 > i. e. there's some code in Emacs that relies on that, or just a
 > statement of facts.

	Just for a random data point, my own preference was to always
	use [0-9] when the intent is to discern a number for a later use
	of number-to-string, etc.  Frankly, I can’t even readily suggest
	any reasonable examples where one’d want to use [:digit:] in the
	first place.

[…]

-- 
FSF associate member #7257  http://boycottsystemd.org/  … 3013 B6A0 230E 334A
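
The concern about [0-9] versus a wider [:digit:] can be illustrated with a small C sketch, offered here as an analogy rather than as Emacs code. It assumes UTF-8 input and the default "C" locale, and the helper names strtol_consumes_p and digit_value are hypothetical. The standard strtol accepts only ASCII digits, so text matched by a Unicode-aware digit class would need an explicit mapping before it could be converted to a number.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Return true if strtol (base 10) consumes at least one byte of S.  */
static bool
strtol_consumes_p (const char *s)
{
  char *end;
  (void) strtol (s, &end, 10);
  return end != s;
}

/* Return the numeric value of code point C if it is an ASCII digit or
   an Extended Arabic-Indic digit (U+06F0..U+06F9), else -1.  A full
   implementation would cover every Unicode Nd character.  */
static int
digit_value (int c)
{
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 0x06F0 && c <= 0x06F9)
    return c - 0x06F0;
  return -1;
}
```

For example, strtol_consumes_p ("123") is true, while strtol_consumes_p ("\xDB\xB1\xDB\xB2\xDB\xB3") (the UTF-8 encoding of ۱۲۳) is false; digit_value (0x06F3) yields 3.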

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19878; Package emacs. (Tue, 17 Feb 2015 18:46:02 GMT) Full text and rfc822 format available.

Message #17 received at 19878 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Ivan Shmakov <ivan <at> siamics.net>
Cc: 19878 <at> debbugs.gnu.org
Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the
 Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Tue, 17 Feb 2015 20:45:40 +0200

> From: Ivan Shmakov <ivan <at> siamics.net>
> Date: Tue, 17 Feb 2015 18:15:09 +0000
> 
> 	Frankly, I can’t even readily suggest any reasonable examples
> 	where one’d want to use [:digit:] in the first place.

Interactive search is one obvious use case, I think.

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 28 Feb 2015 12:31:02 GMT) Full text and rfc822 format available.

Notification sent to mohammad.mahmoudi <at> gmail.com:
bug acknowledged by developer. (Sat, 28 Feb 2015 12:31:03 GMT) Full text and rfc822 format available.

Message #22 received at 19878-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: politza <at> hochschule-trier.de, mohammad.mahmoudi <at> gmail.com
Cc: 19878-done <at> debbugs.gnu.org
Subject: Re: bug#19878: 24.4; Syntax class [:alpha:] wrongly matches the
 Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Date: Sat, 28 Feb 2015 14:29:52 +0200

> Date: Tue, 17 Feb 2015 18:13:05 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: mohammad.mahmoudi <at> gmail.com, 19878 <at> debbugs.gnu.org
> 
> > From: Andreas Politz <politza <at> hochschule-trier.de>
> > Date: Sun, 15 Feb 2015 21:16:13 +0100
> > Cc: 19878 <at> debbugs.gnu.org
> > 
> > 
> > I think this is supposed to be:
> > 
> > ,----[ (info "(elisp) Char Classes") ]
> > | `[:alpha:]'
> > |      This matches any letter.  (At present, for multibyte characters, it
> > |      matches anything that has word syntax.)
> > `----
> 
> Indeed, which doesn't sound very nice.
> 
> Does someone object to the changes below (to be installed on master)?
> They make [:alpha:] and [:alnum:] closer to the Unicode
> recommendations in UTS #18, although we are still very far from
> supporting even Level 1 of conformance.  But these two seem like
> low-hanging fruit to me.
> 
> The modified definitions of these two sets are not 100% compatible
> with the old ones for the multibyte characters.  However, if it turns
> out that some code used these to get word-constituent characters,
> those places should simply be changed to use \sw instead.

No further comments, so I pushed the changes as commit 1a50945 on the
master branch, and I'm marking this bug closed.

> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well?  The
> comment in regex.c says it is "only used for single-byte characters",
> but it isn't clear to me whether this is a requirement, i.e. there's
> some code in Emacs that relies on that, or just a statement of facts.

I'd still like to hear an answer and/or opinions about this.  If I
hear no comments, I will look into making a similar change to
[:digit:] soon.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 29 Mar 2015 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 10 years and 136 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
