GNU bug report logs - #19878
24.4; Syntax class [:alpha:] wrongly matches the Indian digits ۱۲۳۴۵۶۷۸۹۰ as letter
Reported by: mohammad.mahmoudi <at> gmail.com
Date: Sun, 15 Feb 2015 19:25:02 UTC
Severity: normal
Found in version 24.4
Done: Eli Zaretskii <eliz <at> gnu.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19878 in the body.
You can then email your comments to 19878 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#19878; Package emacs. (Sun, 15 Feb 2015 19:25:02 GMT)
Acknowledgement sent to mohammad.mahmoudi <at> gmail.com: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 15 Feb 2015 19:25:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
This is to report that the Syntax class [:alpha:] wrongly matches the
Indian digits ۱۲۳۴۵۶۷۸۹۰ as letters.
In GNU Emacs 24.4.1 (i686-pc-mingw32)
of 2014-10-24 on LEG570
Windowing system distributor `Microsoft Corp.', version 6.1.7601
Configured using:
`configure --prefix=/c/usr'
Important settings:
value of $LANG: ENU
locale-coding-system: cp1256
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#19878; Package emacs. (Sun, 15 Feb 2015 20:17:02 GMT)
Message #8 received at 19878 <at> debbugs.gnu.org (full text, mbox):
I think this is supposed to be:
,----[ (info "(elisp) Char Classes") ]
| `[:alpha:]'
| This matches any letter. (At present, for multibyte characters, it
| matches anything that has word syntax.)
`----
-ap
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#19878; Package emacs. (Tue, 17 Feb 2015 16:14:01 GMT)
Message #11 received at 19878 <at> debbugs.gnu.org (full text, mbox):
> From: Andreas Politz <politza <at> hochschule-trier.de>
> Date: Sun, 15 Feb 2015 21:16:13 +0100
> Cc: 19878 <at> debbugs.gnu.org
>
>
> I think this is supposed to be:
>
> ,----[ (info "(elisp) Char Classes") ]
> | `[:alpha:]'
> | This matches any letter. (At present, for multibyte characters, it
> | matches anything that has word syntax.)
> `----
Indeed, which doesn't sound very nice.
Does someone object to the changes below (to be installed on master)?
They make [:alpha:] and [:alnum:] closer to the Unicode
recommendations in UTS #18, although we are still very far from
supporting even Level 1 of conformance. But these two seem like
low-hanging fruit to me.
The modified definitions of these two sets are not 100% compatible
with the old ones for the multibyte characters. However, if it turns
out that some code used these to get word-constituent characters,
those places should simply be changed to use \sw instead.
Also, does someone see any potential problem to make [:digit:] be a
superset of the current ASCII-only set, to match UTS #18 as well? The
comment in regex.c says it is "only used for single-byte characters",
but it isn't clear to me whether this is a requirement, i.e. there's
some code in Emacs that relies on that, or just a statement of facts.
Please note that this is my first serious change in regex.c, so I'd
appreciate review from people "in the know". TIA.
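
For readers who want to see the intended rule in isolation, here is a minimal standalone sketch. It is not the code in regex.c or character.c: the `toy_category` lookup and its handful of hard-coded code points are hypothetical stand-ins for the Unicode category table, and the `toy_` function names are invented for illustration. Only the category sets (Lu, Ll, Lt, Lm, Lo, Mn, Mc, Me, Nl for alphabetic; Nd for decimal digits) are taken from the patch that follows.

```c
#include <stdbool.h>

/* Subset of Unicode general categories used by the proposed rule.  */
enum gen_cat { CAT_Lu, CAT_Ll, CAT_Lt, CAT_Lm, CAT_Lo,
               CAT_Mn, CAT_Mc, CAT_Me, CAT_Nl, CAT_Nd, CAT_OTHER };

/* Hypothetical stand-in for the Unicode category table: it knows only
   a few code points, enough to show the behaviour.  */
enum gen_cat
toy_category (int c)
{
  if (c >= 0x06F0 && c <= 0x06F9)
    return CAT_Nd;          /* EXTENDED ARABIC-INDIC DIGIT ZERO..NINE */
  if (c >= 0x0660 && c <= 0x0669)
    return CAT_Nd;          /* ARABIC-INDIC DIGIT ZERO..NINE */
  if (c == 0x0627 || c == 0x0628)
    return CAT_Lo;          /* ARABIC LETTER ALEF, ARABIC LETTER BEH */
  if (c == 0x00E9)
    return CAT_Ll;          /* LATIN SMALL LETTER E WITH ACUTE */
  if (c == 0x2160)
    return CAT_Nl;          /* ROMAN NUMERAL ONE */
  return CAT_OTHER;         /* everything this sketch does not know */
}

/* Alphabetic: any letter, mark, or letter-number category.  */
bool
toy_alphabeticp (int c)
{
  switch (toy_category (c))
    {
    case CAT_Lu: case CAT_Ll: case CAT_Lt: case CAT_Lm: case CAT_Lo:
    case CAT_Mn: case CAT_Mc: case CAT_Me: case CAT_Nl:
      return true;
    default:
      return false;
    }
}

/* Decimal number: category Nd only.  */
bool
toy_decimalnump (int c)
{
  return toy_category (c) == CAT_Nd;
}

/* Alphanumeric: alphabetic or decimal number.  */
bool
toy_alnump (int c)
{
  return toy_alphabeticp (c) || toy_decimalnump (c);
}
```

Under this rule the reported digits (U+06F0..U+06F9) fall in Nd, so they would count as alphanumeric but not as alphabetic. The sketch covers only the non-ASCII branch; in the patch, ASCII characters keep their explicit range checks.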
--- src/regex.c~0 2015-01-04 10:44:36 +0200
+++ src/regex.c 2015-02-17 17:40:56 +0200
@@ -324,12 +324,12 @@ enum syntaxcode { Swhitespace = 0, Sword
? (((c) >= 'a' && (c) <= 'z') \
|| ((c) >= 'A' && (c) <= 'Z') \
|| ((c) >= '0' && (c) <= '9')) \
- : SYNTAX (c) == Sword)
+ : (alphabeticp (c) || decimalnump (c)))
# define ISALPHA(c) (IS_REAL_ASCII (c) \
? (((c) >= 'a' && (c) <= 'z') \
|| ((c) >= 'A' && (c) <= 'Z')) \
- : SYNTAX (c) == Sword)
+ : alphabeticp (c))
# define ISLOWER(c) lowercasep (c)
@@ -1872,6 +1872,8 @@ struct range_table_work_area
#define BIT_SPACE 0x8
#define BIT_UPPER 0x10
#define BIT_MULTIBYTE 0x20
+#define BIT_ALPHA 0x40
+#define BIT_ALNUM 0x80
/* Set the bit for character C in a list. */
@@ -2072,7 +2074,9 @@ re_wctype_to_bit (re_wctype_t cc)
{
case RECC_NONASCII: case RECC_PRINT: case RECC_GRAPH:
case RECC_MULTIBYTE: return BIT_MULTIBYTE;
- case RECC_ALPHA: case RECC_ALNUM: case RECC_WORD: return BIT_WORD;
+ case RECC_ALPHA: return BIT_ALPHA;
+ case RECC_ALNUM: return BIT_ALNUM;
+ case RECC_WORD: return BIT_WORD;
case RECC_LOWER: return BIT_LOWER;
case RECC_UPPER: return BIT_UPPER;
case RECC_PUNCT: return BIT_PUNCT;
@@ -2930,7 +2934,7 @@ regex_compile (const_re_char *pattern, s
#endif /* emacs */
/* In most cases the matching rule for char classes
only uses the syntax table for multibyte chars,
- so that the content of the syntax-table it is not
+ so that the content of the syntax-table is not
hardcoded in the range_table. SPACE and WORD are
the two exceptions. */
if ((1 << cc) & ((1 << RECC_SPACE) | (1 << RECC_WORD)))
@@ -2945,7 +2949,7 @@ regex_compile (const_re_char *pattern, s
p = class_beg;
SET_LIST_BIT ('[');
- /* Because the `:' may starts the range, we
+ /* Because the `:' may start the range, we
can't simply set bit and repeat the loop.
Instead, just set it to C and handle below. */
c = ':';
@@ -5513,7 +5517,9 @@ re_match_2_internal (struct re_pattern_b
| (class_bits & BIT_PUNCT && ISPUNCT (c))
| (class_bits & BIT_SPACE && ISSPACE (c))
| (class_bits & BIT_UPPER && ISUPPER (c))
- | (class_bits & BIT_WORD && ISWORD (c)))
+ | (class_bits & BIT_WORD && ISWORD (c))
+ | (class_bits & BIT_ALPHA && ISALPHA (c))
+ | (class_bits & BIT_ALNUM && ISALNUM (c)))
not = !not;
else
CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
--- src/character.c~0 2015-01-13 06:48:01 +0200
+++ src/character.c 2015-02-17 17:05:20 +0200
@@ -984,6 +984,48 @@ character is not ASCII nor 8-bit charact
#ifdef emacs
+/* Return 'true' if C is an alphabetic character as defined by its
+   Unicode properties.  */
+bool
+alphabeticp (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (!INTEGERP (category))
+    return false;
+
+  unicode_category_t gen_cat = XINT (category);
+
+  /* See UTS #18.  There are additional characters that should be
+     here, those designated as Other_uppercase, Other_lowercase,
+     and Other_alphabetic; FIXME.  */
+  return (gen_cat == UNICODE_CATEGORY_Lu
+          || gen_cat == UNICODE_CATEGORY_Ll
+          || gen_cat == UNICODE_CATEGORY_Lt
+          || gen_cat == UNICODE_CATEGORY_Lm
+          || gen_cat == UNICODE_CATEGORY_Lo
+          || gen_cat == UNICODE_CATEGORY_Mn
+          || gen_cat == UNICODE_CATEGORY_Mc
+          || gen_cat == UNICODE_CATEGORY_Me
+          || gen_cat == UNICODE_CATEGORY_Nl) ? true : false;
+}
+
+/* Return 'true' if C is a decimal-number character as defined by its
+   Unicode properties.  */
+bool
+decimalnump (int c)
+{
+  Lisp_Object category = CHAR_TABLE_REF (Vunicode_category_table, c);
+
+  if (!INTEGERP (category))
+    return false;
+
+  unicode_category_t gen_cat = XINT (category);
+
+  /* See UTS #18.  */
+  return (gen_cat == UNICODE_CATEGORY_Nd) ? true : false;
+}
+
void
syms_of_character (void)
{
--- src/character.h~0 2015-01-06 10:15:13 +0200
+++ src/character.h 2015-02-17 17:05:33 +0200
@@ -660,6 +660,9 @@
extern Lisp_Object Vchar_unify_table;
extern Lisp_Object string_escape_byte8 (Lisp_Object);
+extern bool alphabeticp (int);
+extern bool decimalnump (int);
+
/* Return a translation table of id number ID. */
#define GET_TRANSLATION_TABLE(id) \
(XCDR (XVECTOR (Vtranslation_table_vector)->contents[(id)]))
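
The regex.c part of the patch above gives [:alpha:] and [:alnum:] their own class bits instead of letting them share BIT_WORD. A rough standalone sketch of why that matters follows. The bit values 0x40 and 0x80 come from the patch; the value of BIT_WORD is an assumption (the patch does not show it), and the `toy_is_*` predicates and `class_matches` are hypothetical simplifications of the real macros and of the test in re_match_2_internal.

```c
#include <stdbool.h>

#define BIT_WORD  0x1   /* assumed value; not shown in the patch */
#define BIT_ALPHA 0x40  /* from the patch */
#define BIT_ALNUM 0x80  /* from the patch */

/* Hypothetical predicates that know a few non-ASCII code points only.  */
bool
toy_is_digit (int c)
{
  return c >= 0x06F0 && c <= 0x06F9;   /* Extended Arabic-Indic digits */
}

bool
toy_is_letter (int c)
{
  return c == 0x0627 || c == 0x0628;   /* ARABIC LETTER ALEF, BEH */
}

/* In this toy model, word syntax covers both letters and digits.  */
bool
toy_is_word (int c)
{
  return toy_is_letter (c) || toy_is_digit (c);
}

/* Mirrors the shape of the class test: each bit set in CLASS_BITS
   enables one predicate, and any enabled predicate may match.  */
bool
class_matches (int class_bits, int c)
{
  return ((class_bits & BIT_WORD) && toy_is_word (c))
         || ((class_bits & BIT_ALPHA) && toy_is_letter (c))
         || ((class_bits & BIT_ALNUM)
             && (toy_is_letter (c) || toy_is_digit (c)));
}
```

Before the patch, RECC_ALPHA, RECC_ALNUM and RECC_WORD all mapped to BIT_WORD, so in this model a digit such as U+06F1 matched whenever [:alpha:] was requested. With separate bits, `class_matches (BIT_ALPHA, 0x06F1)` is false while `class_matches (BIT_ALNUM, 0x06F1)` is true.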
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#19878; Package emacs. (Tue, 17 Feb 2015 18:16:02 GMT)
Message #14 received at 19878 <at> debbugs.gnu.org (full text, mbox):
>>>>> Eli Zaretskii <eliz <at> gnu.org> writes:
[…]
> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well?
> The comment in regex.c says it is "only used for single-byte
> characters", but it isn't clear to me whether this is a requirement,
> i. e. there's some code in Emacs that relies on that, or just a
> statement of facts.
Just for a random data point, my own preference was to always
use [0-9] when the intent is to discern a number for a later use
of number-to-string, etc. Frankly, I can’t even readily suggest
any reasonable examples where one’d want to use [:digit:] in the
first place.
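
To make that trade-off concrete: a pattern that accepts every Unicode decimal digit is only safe ahead of a numeric conversion if the converter understands those digits too. The sketch below is a hypothetical helper, not Emacs code. It handles only ASCII digits and the Extended Arabic-Indic block (U+06F0..U+06F9), and the names `ascii_digit_p` and `toy_digit_value` are invented for illustration.

```c
#include <stdbool.h>

/* True only for ASCII '0'..'9', i.e. what [0-9] accepts.  */
bool
ascii_digit_p (int c)
{
  return c >= '0' && c <= '9';
}

/* Numeric value of a decimal digit, or -1 if C is not one of the
   digits this sketch knows about.  Only two blocks are covered.  */
int
toy_digit_value (int c)
{
  if (ascii_digit_p (c))
    return c - '0';
  if (c >= 0x06F0 && c <= 0x06F9)   /* Extended Arabic-Indic digits */
    return c - 0x06F0;
  return -1;
}
```

A caller that matched with the wider class would need something like `toy_digit_value` before doing arithmetic, whereas a caller that matched with [0-9] can subtract '0' directly.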
[…]
--
FSF associate member #7257 http://boycottsystemd.org/ … 3013 B6A0 230E 334A
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#19878; Package emacs. (Tue, 17 Feb 2015 18:46:02 GMT)
Message #17 received at 19878 <at> debbugs.gnu.org (full text, mbox):
> From: Ivan Shmakov <ivan <at> siamics.net>
> Date: Tue, 17 Feb 2015 18:15:09 +0000
>
> Frankly, I can’t even readily suggest any reasonable examples
> where one’d want to use [:digit:] in the first place.
Interactive search is one obvious use case, I think.
Reply sent to Eli Zaretskii <eliz <at> gnu.org>: You have taken responsibility. (Sat, 28 Feb 2015 12:31:02 GMT)
Notification sent to mohammad.mahmoudi <at> gmail.com: bug acknowledged by developer. (Sat, 28 Feb 2015 12:31:03 GMT)
Message #22 received at 19878-done <at> debbugs.gnu.org (full text, mbox):
> Date: Tue, 17 Feb 2015 18:13:05 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: mohammad.mahmoudi <at> gmail.com, 19878 <at> debbugs.gnu.org
>
> > From: Andreas Politz <politza <at> hochschule-trier.de>
> > Date: Sun, 15 Feb 2015 21:16:13 +0100
> > Cc: 19878 <at> debbugs.gnu.org
> >
> >
> > I think this is supposed to be:
> >
> > ,----[ (info "(elisp) Char Classes") ]
> > | `[:alpha:]'
> > | This matches any letter. (At present, for multibyte characters, it
> > | matches anything that has word syntax.)
> > `----
>
> Indeed, which doesn't sound very nice.
>
> Does someone object to the changes below (to be installed on master)?
> They make [:alpha:] and [:alnum:] closer to the Unicode
> recommendations in UTS #18, although we are still very far from
> supporting even Level 1 of conformance. But these two seem like
> low-hanging fruit to me.
>
> The modified definitions of these two sets are not 100% compatible
> with the old ones for the multibyte characters. However, if it turns
> out that some code used these to get word-constituent characters,
> those places should simply be changed to use \sw instead.
No further comments, so I pushed the changes as commit 1a50945 on the
master branch, and I'm marking this bug closed.
> Also, does someone see any potential problem to make [:digit:] be a
> superset of the current ASCII-only set, to match UTS #18 as well? The
> comment in regex.c says it is "only used for single-byte characters",
> but it isn't clear to me whether this is a requirement, i.e. there's
> some code in Emacs that relies on that, or just a statement of facts.
I'd still like to hear an answer and/or opinions about this. If I
hear no comments, I will look into making a similar change to
[:digit:] soon.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 29 Mar 2015 11:24:05 GMT)