GNU bug report logs - #24020
[PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’

Package: emacs;

Reported by: Michal Nazarewicz <mina86 <at> mina86.com>

Date: Mon, 18 Jul 2016 14:06:01 UTC

Severity: normal

Tags: patch

Done: Michal Nazarewicz <mina86 <at> mina86.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24020 in the body.
You can then email your comments to 24020 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Mon, 18 Jul 2016 14:06:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Michal Nazarewicz <mina86 <at> mina86.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 18 Jul 2016 14:06:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date: Mon, 18 Jul 2016 16:04:44 +0200

mutually_exclusive_p did not check for the claass bits of an charset
opcode when comparing it with an exactn which resulted in situation
where it thought a multibyte character could not match the character
class.

This assumption caused incorrect optimisation of the regular expression
and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.

The issue affects all multibyte word characters as well as other
character classes which may match multibyte characters.

* src/regex.c (executing_charset): A new function for executing the
charset and charset_not opcodes.  It performs check on the character
taking into consideration existing bitmap, rang table and class bits.
It also advances the pointer in the regex bytecode past the parsed
opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed.
Code now included in executing_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to take advantage
of executing_charset function.

* test/src/regex-tests.el: New file with tests for the character class
matching.
---
 Unless there are objections I’ll push it within a week or so.

 src/regex.c             | 209 +++++++++++++++++++++---------------------------
 test/src/regex-tests.el |  75 +++++++++++++++++
 2 files changed, 168 insertions(+), 116 deletions(-)
 create mode 100644 test/src/regex-tests.el

diff --git a/src/regex.c b/src/regex.c
index f92bcb7..9f999a7 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source)
    and end.  */
 #define CHARSET_RANGE_TABLE_END(range_table, count)	\
   ((range_table) + (count) * 2 * 3)
-
-/* Test if C is in RANGE_TABLE.  A flag NOT is negated if C is in.
-   COUNT is number of ranges in RANGE_TABLE.  */
-#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count)	\
-  do									\
-    {									\
-      re_wchar_t range_start, range_end;				\
-      re_char *rtp;							\
-      re_char *range_table_end						\
-	= CHARSET_RANGE_TABLE_END ((range_table), (count));		\
-									\
-      for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3)	\
-	{								\
-	  EXTRACT_CHARACTER (range_start, rtp);				\
-	  EXTRACT_CHARACTER (range_end, rtp + 3);			\
-									\
-	  if (range_start <= (c) && (c) <= range_end)			\
-	    {								\
-	      (not) = !(not);						\
-	      break;							\
-	    }								\
-	}								\
-    }									\
-  while (0)
-
-/* Test if C is in range table of CHARSET.  The flag NOT is negated if
-   C is listed in it.  */
-#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset)			\
-  do									\
-    {									\
-      /* Number of ranges in range table. */				\
-      int count;							\
-      re_char *range_table = CHARSET_RANGE_TABLE (charset);		\
-      									\
-      EXTRACT_NUMBER_AND_INCR (count, range_table);			\
-      CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count);	\
-    }									\
-  while (0)
 
 /* If DEBUG is defined, Regex prints many voluminous messages about what
    it is doing (if the variable `debug' is nonzero).  If linked with the
@@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend)
   return p;
 }
 
+/* Test if C matches charset op.  *PP points to the charset or chraset_not
+   opcode.  When the function finishes, *PP will be advanced past that opcode.
+   C is character to test (possibly after translations) and CORIG is original
+   character (i.e. without any translations).  UNIBYTE denotes whether c is
+   unibyte or multibyte character. */
+static bool
+execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
+{
+  re_char *p = *pp, *rtp = NULL;
+  bool not = (re_opcode_t) *p == charset_not;
+
+  if (CHARSET_RANGE_TABLE_EXISTS_P (p))
+    {
+      int count;
+      rtp = CHARSET_RANGE_TABLE (p);
+      EXTRACT_NUMBER_AND_INCR (count, rtp);
+      *pp = CHARSET_RANGE_TABLE_END ((rtp), (count));
+    }
+  else
+    *pp += 2 + CHARSET_BITMAP_SIZE (p);
+
+  if (unibyte && c < (1 << BYTEWIDTH))
+    {			/* Lookup bitmap.  */
+      /* Cast to `unsigned' instead of `unsigned char' in
+	 case the bit list is a full 32 bytes long.  */
+      if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH)
+	  && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
+	return !not;
+    }
+#ifdef emacs
+  else if (rtp)
+    {
+      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
+      re_wchar_t range_start, range_end;
+
+  /* Sort tests by the most commonly used classes with some adjustment to which
+     tests are easiest to perform.  Frequencies of character class names as of
+     2016-07-15:
+
+     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
+           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
+         213 [:alnum:]
+         104 [:alpha:]
+          62 [:space:]
+          39 [:digit:]
+          36 [:blank:]
+          26 [:upper:]
+          24 [:word:]
+          21 [:lower:]
+          10 [:punct:]
+          10 [:ascii:]
+           9 [:xdigit:]
+           4 [:nonascii:]
+           4 [:graph:]
+           2 [:print:]
+           2 [:cntrl:]
+           1 [:ff:]
+   */
+
+      if ((class_bits & BIT_MULTIBYTE) ||
+	  (class_bits & BIT_ALNUM && ISALNUM (c)) ||
+	  (class_bits & BIT_ALPHA && ISALPHA (c)) ||
+	  (class_bits & BIT_SPACE && ISSPACE (c)) ||
+	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
+	  ((class_bits & BIT_UPPER) &&
+	   (ISUPPER (c) || (corig != c &&
+			    c == downcase (corig) && ISLOWER (c)))) ||
+	  ((class_bits & BIT_LOWER) &&
+	   (ISLOWER (c) || (corig != c &&
+			    c == upcase (corig) && ISUPPER(c)))) ||
+	  (class_bits & BIT_PUNCT && ISPUNCT (c)) ||
+	  (class_bits & BIT_GRAPH && ISGRAPH (c)) ||
+	  (class_bits & BIT_PRINT && ISPRINT (c)))
+	return !not;
+
+      for (p = *pp; rtp < p; rtp += 2 * 3)
+	{
+	  EXTRACT_CHARACTER (range_start, rtp);
+	  EXTRACT_CHARACTER (range_end, rtp + 3);
+	  if (range_start <= c && c <= range_end)
+	    return !not;
+	}
+    }
+#endif /* emacs */
+  return not;
+}
+
 /* Non-zero if "p1 matches something" implies "p2 fails".  */
 static int
 mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
@@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
 	else if ((re_opcode_t) *p1 == charset
 		 || (re_opcode_t) *p1 == charset_not)
 	  {
-	    int not = (re_opcode_t) *p1 == charset_not;
-
-	    /* Test if C is listed in charset (or charset_not)
-	       at `p1'.  */
-	    if (! multibyte || IS_REAL_ASCII (c))
-	      {
-		if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH
-		    && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-		  not = !not;
-	      }
-	    else if (CHARSET_RANGE_TABLE_EXISTS_P (p1))
-	      CHARSET_LOOKUP_RANGE_TABLE (not, c, p1);
-
-	    /* `not' is equal to 1 if c would match, which means
-	       that we can't change to pop_failure_jump.  */
-	    if (!not)
+	    if (!execute_charset (&p1, c, c, !multibyte))
 	      {
 		DEBUG_PRINT ("	 No match => fast loop.\n");
 		return 1;
@@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset_not:
 	  {
 	    register unsigned int c, corig;
-	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;
 
-	    /* Start of actual range_table, or end of bitmap if there is no
-	       range table.  */
-	    re_char *range_table UNINIT;
-
-	    /* Nonzero if there is a range table.  */
-	    int range_table_exists;
-
-	    /* Number of ranges of range table.  This is not included
-	       in the initial byte-length of the command.  */
-	    int count = 0;
-
 	    /* Whether matching against a unibyte character.  */
 	    boolean unibyte_char = false;
 
-	    DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : "");
-
-	    range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]);
-
-	    if (range_table_exists)
-	      {
-		range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap.  */
-		EXTRACT_NUMBER_AND_INCR (count, range_table);
-	      }
+	    DEBUG_PRINT ("EXECUTING charset%s.\n",
+			 (re_opcode_t) *(p - 1) == charset_not ? "_not" : "");
 
 	    PREFETCH ();
 	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
@@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 		  unibyte_char = true;
 	      }
 
-	    if (unibyte_char && c < (1 << BYTEWIDTH))
-	      {			/* Lookup bitmap.  */
-		/* Cast to `unsigned' instead of `unsigned char' in
-		   case the bit list is a full 32 bytes long.  */
-		if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH)
-		    && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-		  not = !not;
-	      }
-#ifdef emacs
-	    else if (range_table_exists)
-	      {
-		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
-
-		if (  (class_bits & BIT_LOWER
-		       && (ISLOWER (c)
-			   || (corig != c
-			       && c == upcase (corig) && ISUPPER(c))))
-		    | (class_bits & BIT_MULTIBYTE)
-		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
-		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER
-		       && (ISUPPER (c)
-			   || (corig != c
-			       && c == downcase (corig) && ISLOWER (c))))
-		    | (class_bits & BIT_WORD  && ISWORD  (c))
-		    | (class_bits & BIT_ALPHA && ISALPHA (c))
-		    | (class_bits & BIT_ALNUM && ISALNUM (c))
-		    | (class_bits & BIT_GRAPH && ISGRAPH (c))
-		    | (class_bits & BIT_PRINT && ISPRINT (c)))
-		  not = !not;
-		else
-		  CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
-	      }
-#endif /* emacs */
-
-	    if (range_table_exists)
-	      p = CHARSET_RANGE_TABLE_END (range_table, count);
-	    else
-	      p += CHARSET_BITMAP_SIZE (&p[-1]) + 1;
-
-	    if (!not) goto fail;
+	    p -= 1;
+	    if (!execute_charset (&p, c, corig, unibyte_char))
+	      goto fail;
 
 	    d += len;
 	  }
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
new file mode 100644
index 0000000..a2dd4f0
--- /dev/null
+++ b/test/src/regex-tests.el
@@ -0,0 +1,75 @@
+;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
+
+;; Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+
+(ert-deftest regex-word-cc-fallback-test ()
+  (dolist (class '("[[:word:]]" "\\sw"))
+    (dolist (repeat '("*" "+"))
+      (dolist (suffix '("" "b" "bar" "\u2620"))
+        (should (string-match (concat "^" class repeat suffix "$")
+                              (concat "foo" suffix)))))))
+
+(defun regex--test-cc (name matching not-matching)
+  (should (string-match-p (concat "^[[:" name ":]]*$") matching))
+  (should (string-match-p (concat "^[[:" name ":]]*?\u2622$")
+                          (concat matching "\u2622")))
+  (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching))
+  (should (string-match-p (concat "^[^[:" name ":]]*\u2622$")
+                          (concat not-matching "\u2622")))
+  (with-temp-buffer
+    (insert matching)
+    (let ((p (point)))
+      (insert not-matching)
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]"))
+      (should (equal (point) p))
+      (skip-chars-forward (concat "^[:" name ":]"))
+      (should (equal (point) (point-max)))
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]\u2622"))
+      (should (or (equal (point) p) (equal (point) (1+ p)))))))
+
+(ert-deftest regex-character-classes ()
+  (let (case-fold-search)
+    (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n")
+    (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n")
+    (regex--test-cc "digit" "012" "abcABCłąka-, \t\n")
+    (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n")
+    (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n")
+    (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n")
+
+    (regex--test-cc "word" "abcABC012\u2620" "-, \t\n")
+
+    (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n")
+    (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ")
+    (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1")
+    (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1")
+
+    (regex--test-cc "space" " \t\n\u2001" "abcABCł0123")
+    (regex--test-cc "blank" " \t" "\n\u2001")
+
+    (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620")
+    (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1")
+    (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622")
+    (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1")))
+
+;;; buffer-tests.el ends here
-- 
2.8.0.rc3.226.g39d4020

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Mon, 18 Jul 2016 15:04:02 GMT) Full text and rfc822 format available.

Message #8 received at 24020 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Michal Nazarewicz <mina86 <at> mina86.com>
Cc: 24020 <at> debbugs.gnu.org
Subject: Re: bug#24020: [PATCH] Fix
 ‘[[:word:]]*\u2620’ failing to
 match ‘foo\u2620’
Date: Mon, 18 Jul 2016 18:03:13 +0300

> From: Michal Nazarewicz <mina86 <at> mina86.com>
> Date: Mon, 18 Jul 2016 16:04:44 +0200
> 
> mutually_exclusive_p did not check for the claass bits of an charset
> opcode when comparing it with an exactn which resulted in situation
> where it thought a multibyte character could not match the character
> class.
> 
> This assumption caused incorrect optimisation of the regular expression
> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
> 
> The issue affects all multibyte word characters as well as other
> character classes which may match multibyte characters.

Thanks.

Unfortunately, the above description is too terse for me to understand
the issue and the way you propose to fix it.  Could you please provide
more details, including what problems you saw in classes other than
[:word:]?

Note that some of the classes deliberately don't work on multibyte
characters, and are documented as such.  So if we are changing that,
there should be documentation changes and an entry in NEWS as well
(but I suggest not to make such changes too easily, not without
measuring the impact on performance, if any).

> * src/regex.c (executing_charset): A new function for executing the
> charset and charset_not opcodes.  It performs check on the character
> taking into consideration existing bitmap, rang table and class bits.
                                             ^^^^
A typo.

> +#ifdef emacs
> +  else if (rtp)
> +    {
> +      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
> +      re_wchar_t range_start, range_end;
> +
> +  /* Sort tests by the most commonly used classes with some adjustment to which
> +     tests are easiest to perform.  Frequencies of character class names as of
> +     2016-07-15:

Not sure what files you used for this.  Are those Emacs source files?

> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
> new file mode 100644
> index 0000000..a2dd4f0
> --- /dev/null
> +++ b/test/src/regex-tests.el
> @@ -0,0 +1,75 @@
> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
       ^^^^^^^^^^^^^^^

Copy-paste error.

> +;;; buffer-tests.el ends here

And another one.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Mon, 18 Jul 2016 18:08:02 GMT) Full text and rfc822 format available.

Message #11 received at 24020 <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 24020 <at> debbugs.gnu.org
Subject: Re: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date: Mon, 18 Jul 2016 20:07:18 +0200

On Mon, Jul 18 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz <mina86 <at> mina86.com>
>> Date: Mon, 18 Jul 2016 16:04:44 +0200
>> 
>> mutually_exclusive_p did not check for the claass bits of an charset
>> opcode when comparing it with an exactn which resulted in situation
>> where it thought a multibyte character could not match the character
>> class.
>> 
>> This assumption caused incorrect optimisation of the regular expression
>> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
>> 
>> The issue affects all multibyte word characters as well as other
>> character classes which may match multibyte characters.
>
> Thanks.
>
> Unfortunately, the above description is too terse for me to understand
> the issue and the way you propose to fix it.  Could you please provide
> more details,

‘[[:word:]]*\u2620’ ends up being as:

0:	/on_failure_keep_string_jump to 28
3:	/charset [ !extends past end of pattern! $-%0-9A-Za-z]has-range-table
25:	/jump to 3
28:	/exactn/3/â// 
33:	/succeed
34:	end of pattern.

while ‘\sw*\u2620’ as:

0:	/on_failure_jump to 8
3:	/syntaxspec/2
5:	/jump to 0
8:	/exactn/3/â// 
13:	/succeed
14:	end of pattern.

Apart from a different opcode to match the word class the crux of the
difference is the first opcode: on_failure_keep_string_jump
vs. on_failure_jump.

As a matter of fact, regex_compile puts a on_failure_jump_smart opcode
at the beginning which is optimised by re_match_2_internal (debug code
removed for brevity):

	/* This operation is used for greedy *.
	   Compare the beginning of the repeat with what in the
	   pattern follows its end. If we can establish that there
	   is nothing that they would both match, i.e., that we
	   would have to backtrack because of (as in, e.g., `a*a')
	   then we can use a non-backtracking loop based on
	   on_failure_keep_string_jump instead of on_failure_jump.  */
	case on_failure_jump_smart:
	  EXTRACT_NUMBER_AND_INCR (mcnt, p);
	  {
	    re_char *p1 = p; /* Next operation.  */
	    /* Here, we discard `const', making re_match non-reentrant.  */
	    unsigned char *p2 = (unsigned char*) p + mcnt; /* Jump dest.  */
	    unsigned char *p3 = (unsigned char*) p - 3; /* opcode location.  */

	    p -= 3;		/* Reset so that we will re-execute the
				   instruction once it's been changed. */

	    EXTRACT_NUMBER (mcnt, p2 - 2);

	    /* Ensure this is a indeed the trivial kind of loop
	       we are expecting.  */
	    if (mutually_exclusive_p (bufp, p1, p2))
	      {
		/* Use a fast `on_failure_keep_string_jump' loop.  */
		*p3 = (unsigned char) on_failure_keep_string_jump;
		STORE_NUMBER (p2 - 2, mcnt + 3);
	      }
	    else
	      {
		/* Default to a safe `on_failure_jump' loop.  */
		*p3 = (unsigned char) on_failure_jump;
	      }
	  }

In other words, in our example, the code checks whether ‘[[:word:]]’ can
match ‘💀’.  If it cannot than we can be greedy about matching
‘[[:word:]]*’ and never backtrace looking for a shorter match; if it
can, we may need to backtrace if the overall matching fails.

mutually_exclusive_p concludes that ‘[[:word:]]’ cannot match ‘💀’ (or
any non-ASCII characters really) but as a matter of fact, word class
does match skull character.

So when ‘[[:word:]]*💀’ matches ‘foo💀’ the following happens:
1. ‘[[:word:]]*’ matches the whole string.
2. String is now empty so ‘💀’ doesn’t match.
3. Because of incorrect assumptions, the engine does not shorten the
   initial ‘[[:word:]]*’ match.

(I may be butchering the exact terms and algorithm that is being applied
but the general idea is, I hope, shown).

> Note that some of the classes deliberately don't work on multibyte
> characters, and are documented as such.

This is irrelevant.  ‘[[:word:]]*’ matches ‘foo’ thus ‘[[:word:]]*b’
must match ‘foob’ (which it does) and ‘[[:word:]]*☠’ must match ‘foo☠’
(which it doesn’t).

> including what problems you saw in classes other than [:word:]?

The problem happens for any class which matches multibyte characters.
For example:

    (string-match-p "[[:alpha:]]*" "żółć")
    => 0
    (string-match-p "[[:alpha:]]*w" "żółw")
    => 0
    (string-match-p "[[:alpha:]]*ć" "żółć")
    => nil  (should be 0)
    ;; or even more simply:
    (string-match-p "[[:alpha:]]*ć" "ż")
    => nil  (should be 0)

In general, for a class FOO, if a multibyte character C matches that
class regex "[[:FOO]]*C") should match the character itself but it
doesn’t.

> So if we are changing that, there should be documentation changes and
> an entry in NEWS as well (but I suggest not to make such changes too
> easily, not without measuring the impact on performance, if any).
>
>> * src/regex.c (executing_charset): A new function for executing the
>> charset and charset_not opcodes.  It performs check on the character
>> taking into consideration existing bitmap, rang table and class bits.
>                                              ^^^^
> A typo.
>
>> +#ifdef emacs
>> +  else if (rtp)
>> +    {
>> +      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
>> +      re_wchar_t range_start, range_end;
>> +
>> +  /* Sort tests by the most commonly used classes with some adjustment to which
>> +     tests are easiest to perform.  Frequencies of character class names as of
>> +     2016-07-15:
>
> Not sure what files you used for this.  Are those Emacs source files?
>
>> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
>> new file mode 100644
>> index 0000000..a2dd4f0
>> --- /dev/null
>> +++ b/test/src/regex-tests.el
>> @@ -0,0 +1,75 @@
>> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
>        ^^^^^^^^^^^^^^^
>
> Copy-paste error.
>
>> +;;; buffer-tests.el ends here
>
> And another one.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Mon, 18 Jul 2016 23:31:01 GMT) Full text and rfc822 format available.

Message #14 received at 24020 <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: 24020 <at> debbugs.gnu.org
Cc: eliz <at> gnu.org
Subject: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
Date: Tue, 19 Jul 2016 01:30:01 +0200

The regex engine tries to optimise (greedy) Kleene star by avoiding
backtracking when it can detect that portion of the expression after
the star cannot match if the repeated portion does match.

For example, take regular expression ‘[[:alpha:]]*1’ trying to match
a string ‘foo’.  Since the Kleene star is greedy, the engine will test
the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point
though, the string being matched is empty while there’s still
a literal digit one in the pattern.

The engine will not, however, attempt to back-trace testing a shorter
match for the character class (namely ‘fo’ leaving ‘o’ in the string)
because it knows that whatever will be left in the string cannot match
literal digit one.

In the regexes of the form ‘[[:CC:]]*X’, the optimisation can be
applied if and only if the regex engine can prove that the character
class CC does not match character X (as is the case with alpha
character class not matching digit 1).

In the code, the proof is performed by mutually_exclusive_p function.

However, it did not check class bits of a charset opcode which
resulted in it assuming that character classes cannot match multibyte
characters.  For example, it would assume that [[:alpha:]] cannot
match ‘ż’ even though ‘ż’ is indeed an alphanumeric character matching
the alpha character class.

This assumption caused incorrect optimisation of the regular
expression and eventually failure of ‘[[:alpha:]]*żółw’ to match
‘żółw’.

This issue affects any character class witch matches multibyte
characters.

* src/regex.c (executing_charset): A new function for executing the
charset and charset_not opcodes.  It performs check on the character
taking into consideration existing bitmap, range table and class bits.
It also advances the pointer in the regex bytecode past the parsed
opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed.
Code now included in executing_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to take advantage
of executing_charset function.

* test/src/regex-tests.el: New file with tests for the character class
matching.
---
 src/regex.c             | 209 +++++++++++++++++++++---------------------------
 test/src/regex-tests.el |  92 +++++++++++++++++++++
 2 files changed, 185 insertions(+), 116 deletions(-)
 create mode 100644 test/src/regex-tests.el

diff --git a/src/regex.c b/src/regex.c
index f92bcb7..297bf71 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source)
    and end.  */
 #define CHARSET_RANGE_TABLE_END(range_table, count)	\
   ((range_table) + (count) * 2 * 3)
-
-/* Test if C is in RANGE_TABLE.  A flag NOT is negated if C is in.
-   COUNT is number of ranges in RANGE_TABLE.  */
-#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count)	\
-  do									\
-    {									\
-      re_wchar_t range_start, range_end;				\
-      re_char *rtp;							\
-      re_char *range_table_end						\
-	= CHARSET_RANGE_TABLE_END ((range_table), (count));		\
-									\
-      for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3)	\
-	{								\
-	  EXTRACT_CHARACTER (range_start, rtp);				\
-	  EXTRACT_CHARACTER (range_end, rtp + 3);			\
-									\
-	  if (range_start <= (c) && (c) <= range_end)			\
-	    {								\
-	      (not) = !(not);						\
-	      break;							\
-	    }								\
-	}								\
-    }									\
-  while (0)
-
-/* Test if C is in range table of CHARSET.  The flag NOT is negated if
-   C is listed in it.  */
-#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset)			\
-  do									\
-    {									\
-      /* Number of ranges in range table. */				\
-      int count;							\
-      re_char *range_table = CHARSET_RANGE_TABLE (charset);		\
-      									\
-      EXTRACT_NUMBER_AND_INCR (count, range_table);			\
-      CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count);	\
-    }									\
-  while (0)
 
 /* If DEBUG is defined, Regex prints many voluminous messages about what
    it is doing (if the variable `debug' is nonzero).  If linked with the
@@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend)
   return p;
 }
 
+/* Test if C matches charset op.  *PP points to the charset or chraset_not
+   opcode.  When the function finishes, *PP will be advanced past that opcode.
+   C is character to test (possibly after translations) and CORIG is original
+   character (i.e. without any translations).  UNIBYTE denotes whether c is
+   unibyte or multibyte character. */
+static bool
+execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
+{
+  re_char *p = *pp, *rtp = NULL;
+  bool not = (re_opcode_t) *p == charset_not;
+
+  if (CHARSET_RANGE_TABLE_EXISTS_P (p))
+    {
+      int count;
+      rtp = CHARSET_RANGE_TABLE (p);
+      EXTRACT_NUMBER_AND_INCR (count, rtp);
+      *pp = CHARSET_RANGE_TABLE_END ((rtp), (count));
+    }
+  else
+    *pp += 2 + CHARSET_BITMAP_SIZE (p);
+
+  if (unibyte && c < (1 << BYTEWIDTH))
+    {			/* Lookup bitmap.  */
+      /* Cast to `unsigned' instead of `unsigned char' in
+	 case the bit list is a full 32 bytes long.  */
+      if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH)
+	  && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
+	return !not;
+    }
+#ifdef emacs
+  else if (rtp)
+    {
+      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
+      re_wchar_t range_start, range_end;
+
+  /* Sort tests by the most commonly used classes with some adjustment to which
+     tests are easiest to perform.  Frequencies of character class names used in
+     Emacs sources as of 2016-07-15:
+
+     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
+           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
+         213 [:alnum:]
+         104 [:alpha:]
+          62 [:space:]
+          39 [:digit:]
+          36 [:blank:]
+          26 [:upper:]
+          24 [:word:]
+          21 [:lower:]
+          10 [:punct:]
+          10 [:ascii:]
+           9 [:xdigit:]
+           4 [:nonascii:]
+           4 [:graph:]
+           2 [:print:]
+           2 [:cntrl:]
+           1 [:ff:]
+   */
+
+      if ((class_bits & BIT_MULTIBYTE) ||
+	  (class_bits & BIT_ALNUM && ISALNUM (c)) ||
+	  (class_bits & BIT_ALPHA && ISALPHA (c)) ||
+	  (class_bits & BIT_SPACE && ISSPACE (c)) ||
+	  (class_bits & BIT_WORD  && ISWORD  (c)) ||
+	  ((class_bits & BIT_UPPER) &&
+	   (ISUPPER (c) || (corig != c &&
+			    c == downcase (corig) && ISLOWER (c)))) ||
+	  ((class_bits & BIT_LOWER) &&
+	   (ISLOWER (c) || (corig != c &&
+			    c == upcase (corig) && ISUPPER(c)))) ||
+	  (class_bits & BIT_PUNCT && ISPUNCT (c)) ||
+	  (class_bits & BIT_GRAPH && ISGRAPH (c)) ||
+	  (class_bits & BIT_PRINT && ISPRINT (c)))
+	return !not;
+
+      for (p = *pp; rtp < p; rtp += 2 * 3)
+	{
+	  EXTRACT_CHARACTER (range_start, rtp);
+	  EXTRACT_CHARACTER (range_end, rtp + 3);
+	  if (range_start <= c && c <= range_end)
+	    return !not;
+	}
+    }
+#endif /* emacs */
+  return not;
+}
+
 /* Non-zero if "p1 matches something" implies "p2 fails".  */
 static int
 mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
@@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
 	else if ((re_opcode_t) *p1 == charset
 		 || (re_opcode_t) *p1 == charset_not)
 	  {
-	    int not = (re_opcode_t) *p1 == charset_not;
-
-	    /* Test if C is listed in charset (or charset_not)
-	       at `p1'.  */
-	    if (! multibyte || IS_REAL_ASCII (c))
-	      {
-		if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH
-		    && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-		  not = !not;
-	      }
-	    else if (CHARSET_RANGE_TABLE_EXISTS_P (p1))
-	      CHARSET_LOOKUP_RANGE_TABLE (not, c, p1);
-
-	    /* `not' is equal to 1 if c would match, which means
-	       that we can't change to pop_failure_jump.  */
-	    if (!not)
+	    if (!execute_charset (&p1, c, c, !multibyte))
 	      {
 		DEBUG_PRINT ("	 No match => fast loop.\n");
 		return 1;
@@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset_not:
 	  {
 	    register unsigned int c, corig;
-	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;
 
-	    /* Start of actual range_table, or end of bitmap if there is no
-	       range table.  */
-	    re_char *range_table UNINIT;
-
-	    /* Nonzero if there is a range table.  */
-	    int range_table_exists;
-
-	    /* Number of ranges of range table.  This is not included
-	       in the initial byte-length of the command.  */
-	    int count = 0;
-
 	    /* Whether matching against a unibyte character.  */
 	    boolean unibyte_char = false;
 
-	    DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : "");
-
-	    range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]);
-
-	    if (range_table_exists)
-	      {
-		range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap.  */
-		EXTRACT_NUMBER_AND_INCR (count, range_table);
-	      }
+	    DEBUG_PRINT ("EXECUTING charset%s.\n",
+			 (re_opcode_t) *(p - 1) == charset_not ? "_not" : "");
 
 	    PREFETCH ();
 	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
@@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 		  unibyte_char = true;
 	      }
 
-	    if (unibyte_char && c < (1 << BYTEWIDTH))
-	      {			/* Lookup bitmap.  */
-		/* Cast to `unsigned' instead of `unsigned char' in
-		   case the bit list is a full 32 bytes long.  */
-		if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH)
-		    && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-		  not = !not;
-	      }
-#ifdef emacs
-	    else if (range_table_exists)
-	      {
-		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
-
-		if (  (class_bits & BIT_LOWER
-		       && (ISLOWER (c)
-			   || (corig != c
-			       && c == upcase (corig) && ISUPPER(c))))
-		    | (class_bits & BIT_MULTIBYTE)
-		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
-		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER
-		       && (ISUPPER (c)
-			   || (corig != c
-			       && c == downcase (corig) && ISLOWER (c))))
-		    | (class_bits & BIT_WORD  && ISWORD  (c))
-		    | (class_bits & BIT_ALPHA && ISALPHA (c))
-		    | (class_bits & BIT_ALNUM && ISALNUM (c))
-		    | (class_bits & BIT_GRAPH && ISGRAPH (c))
-		    | (class_bits & BIT_PRINT && ISPRINT (c)))
-		  not = !not;
-		else
-		  CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
-	      }
-#endif /* emacs */
-
-	    if (range_table_exists)
-	      p = CHARSET_RANGE_TABLE_END (range_table, count);
-	    else
-	      p += CHARSET_BITMAP_SIZE (&p[-1]) + 1;
-
-	    if (!not) goto fail;
+	    p -= 1;
+	    if (!execute_charset (&p, c, corig, unibyte_char))
+	      goto fail;
 
 	    d += len;
 	  }
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
new file mode 100644
index 0000000..00165ab
--- /dev/null
+++ b/test/src/regex-tests.el
@@ -0,0 +1,92 @@
+;;; regex-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
+
+;; Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <http://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+
+(ert-deftest regex-word-cc-fallback-test ()
+  "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020).
+
+Test that a regex of the form \"[[:cc:]]*x\" where CC is
+a character class which matches a multibyte character X, matches
+string \"x\".
+
+For example, ‘[[:word:]]*\u2620’ regex (note: \u2620 is a word
+character) must match a string \"\u2420\"."
+  (dolist (class '("[[:word:]]" "\\sw"))
+    (dolist (repeat '("*" "+"))
+      (dolist (suffix '("" "b" "bar" "\u2620"))
+        (dolist (string '("" "foo"))
+          (when (not (and (string-equal repeat "+")
+                          (string-equal string "")))
+            (should (string-match (concat "^" class repeat suffix "$")
+                                  (concat string suffix)))))))))
+
+(defun regex--test-cc (name matching not-matching)
+  (should (string-match-p (concat "^[[:" name ":]]*$") matching))
+  (should (string-match-p (concat "^[[:" name ":]]*?\u2622$")
+                          (concat matching "\u2622")))
+  (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching))
+  (should (string-match-p (concat "^[^[:" name ":]]*\u2622$")
+                          (concat not-matching "\u2622")))
+  (with-temp-buffer
+    (insert matching)
+    (let ((p (point)))
+      (insert not-matching)
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]"))
+      (should (equal (point) p))
+      (skip-chars-forward (concat "^[:" name ":]"))
+      (should (equal (point) (point-max)))
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]\u2622"))
+      (should (or (equal (point) p) (equal (point) (1+ p)))))))
+
+(ert-deftest regex-character-classes ()
+  "Perform sanity test of regexes using character classes.
+
+Go over all the supported character classes and test whether the
+classes and their inversions match what they are supposed to
+match.  The test is done using `string-match-p' as well as
+`skip-chars-forward'."
+  (let (case-fold-search)
+    (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n")
+    (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n")
+    (regex--test-cc "digit" "012" "abcABCłąka-, \t\n")
+    (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n")
+    (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n")
+    (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n")
+
+    (regex--test-cc "word" "abcABC012\u2620" "-, \t\n")
+
+    (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n")
+    (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ")
+    (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1")
+    (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1")
+
+    (regex--test-cc "space" " \t\n\u2001" "abcABCł0123")
+    (regex--test-cc "blank" " \t" "\n\u2001")
+
+    (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620")
+    (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1")
+    (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622")
+    (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1")))
+
+;;; regex-tests.el ends here
-- 
2.8.0.rc3.226.g39d4020

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Tue, 19 Jul 2016 08:01:01 GMT) Full text and rfc822 format available.

Message #17 received at 24020 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> suse.de>
To: Michal Nazarewicz <mina86 <at> mina86.com>
Cc: 24020 <at> debbugs.gnu.org
Subject: Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
Date: Tue, 19 Jul 2016 10:00:46 +0200

Michal Nazarewicz <mina86 <at> mina86.com> writes:

> For example, take regular expression ‘[[:alpha:]]*1’ trying to match
> a string ‘foo’.  Since the Kleene star is greedy, the engine will test
> the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point

Did you mean "the longest match"?

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab <at> suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Wed, 20 Jul 2016 12:37:01 GMT) Full text and rfc822 format available.

Message #20 received at 24020 <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: Andreas Schwab <schwab <at> suse.de>
Cc: 24020 <at> debbugs.gnu.org
Subject: Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
Date: Wed, 20 Jul 2016 14:36:39 +0200

> Michal Nazarewicz <mina86 <at> mina86.com> writes:
>> For example, take regular expression ‘[[:alpha:]]*1’ trying to match
>> a string ‘foo’.  Since the Kleene star is greedy, the engine will test
>> the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point

On Tue, Jul 19 2016, Andreas Schwab wrote:
> Did you mean "the longest match"?

Yep, thanks for spotting this.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

Reply sent to Michal Nazarewicz <mina86 <at> mina86.com>:
You have taken responsibility. (Mon, 25 Jul 2016 21:55:01 GMT) Full text and rfc822 format available.

Notification sent to Michal Nazarewicz <mina86 <at> mina86.com>:
bug acknowledged by developer. (Mon, 25 Jul 2016 21:55:01 GMT) Full text and rfc822 format available.

Message #25 received at 24020-done <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: 24020-done <at> debbugs.gnu.org
Subject: Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
Date: Mon, 25 Jul 2016 23:54:08 +0200

Pushed.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#24020; Package emacs. (Wed, 27 Jul 2016 16:23:01 GMT) Full text and rfc822 format available.

Message #28 received at 24020 <at> debbugs.gnu.org (full text, mbox):

From: Michal Nazarewicz <mina86 <at> mina86.com>
To: 24020 <at> debbugs.gnu.org
Subject: [PATCH] Fix ‘is multibyte’ test regex.c’s mutually_exclusive_p (bug#24020)
Date: Wed, 27 Jul 2016 18:22:17 +0200

* src/regex.c (mutually_exclusive_p): Fix how whether character is
unibyte is tested when calling execute_charset function.  This bug has
been introduced by [6dc6b00: Fix ‘[[:cc:]]*literal’ regex failing to
match ‘literal’] which dropped a call to IS_REAL_ASCII (c) macro.
Reinstitute it.
---
 src/regex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

 FYI, I’ve just submitted this patch.  The bug has been caught by
 tests from dima_regex_embedded_modifiers branch which I’m working on
 pulling into master branch so they will come in another commits.

diff --git a/src/regex.c b/src/regex.c
index 297bf71..1f2a1f08 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -4767,7 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
 	else if ((re_opcode_t) *p1 == charset
 		 || (re_opcode_t) *p1 == charset_not)
 	  {
-	    if (!execute_charset (&p1, c, c, !multibyte))
+	    if (!execute_charset (&p1, c, c, !multibyte || IS_REAL_ASCII (c)))
 	      {
 		DEBUG_PRINT ("	 No match => fast loop.\n");
 		return 1;
-- 
2.8.0.rc3.226.g39d4020

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 25 Aug 2016 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 358 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #24020 [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’

GNU bug report logs - #24020
[PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’