Package: emacs;
Reported by: Michal Nazarewicz <mina86 <at> mina86.com>
Date: Mon, 18 Jul 2016 14:06:01 UTC
Severity: normal
Tags: patch
Done: Michal Nazarewicz <mina86 <at> mina86.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 24020 in the body.
You can then email your comments to 24020 AT debbugs.gnu.org in the normal way.
Toggle the display of automated, internal messages from the tracker.
View this report as an mbox folder, status mbox, maintainer mbox
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Mon, 18 Jul 2016 14:06:02 GMT) Full text and rfc822 format available.Michal Nazarewicz <mina86 <at> mina86.com>
:bug-gnu-emacs <at> gnu.org
.
(Mon, 18 Jul 2016 14:06:02 GMT) Full text and rfc822 format available.Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
From: Michal Nazarewicz <mina86 <at> mina86.com> To: bug-gnu-emacs <at> gnu.org Subject: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ Date: Mon, 18 Jul 2016 16:04:44 +0200
mutually_exclusive_p did not check for the claass bits of an charset opcode when comparing it with an exactn which resulted in situation where it thought a multibyte character could not match the character class. This assumption caused incorrect optimisation of the regular expression and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’. The issue affects all multibyte word characters as well as other character classes which may match multibyte characters. * src/regex.c (executing_charset): A new function for executing the charset and charset_not opcodes. It performs check on the character taking into consideration existing bitmap, rang table and class bits. It also advances the pointer in the regex bytecode past the parsed opcode. (CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed. Code now included in executing_charset. (mutually_exclusive_p, re_match_2_internal): Changed to take advantage of executing_charset function. * test/src/regex-tests.el: New file with tests for the character class matching. --- Unless there are objections I’ll push it within a week or so. src/regex.c | 209 +++++++++++++++++++++--------------------------- test/src/regex-tests.el | 75 +++++++++++++++++ 2 files changed, 168 insertions(+), 116 deletions(-) create mode 100644 test/src/regex-tests.el diff --git a/src/regex.c b/src/regex.c index f92bcb7..9f999a7 100644 --- a/src/regex.c +++ b/src/regex.c @@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source) and end. */ #define CHARSET_RANGE_TABLE_END(range_table, count) \ ((range_table) + (count) * 2 * 3) - -/* Test if C is in RANGE_TABLE. A flag NOT is negated if C is in. - COUNT is number of ranges in RANGE_TABLE. */ -#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count) \ - do \ - { \ - re_wchar_t range_start, range_end; \ - re_char *rtp; \ - re_char *range_table_end \ - = CHARSET_RANGE_TABLE_END ((range_table), (count)); \ - \ - for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3) \ - { \ - EXTRACT_CHARACTER (range_start, rtp); \ - EXTRACT_CHARACTER (range_end, rtp + 3); \ - \ - if (range_start <= (c) && (c) <= range_end) \ - { \ - (not) = !(not); \ - break; \ - } \ - } \ - } \ - while (0) - -/* Test if C is in range table of CHARSET. The flag NOT is negated if - C is listed in it. */ -#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset) \ - do \ - { \ - /* Number of ranges in range table. */ \ - int count; \ - re_char *range_table = CHARSET_RANGE_TABLE (charset); \ - \ - EXTRACT_NUMBER_AND_INCR (count, range_table); \ - CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count); \ - } \ - while (0) /* If DEBUG is defined, Regex prints many voluminous messages about what it is doing (if the variable `debug' is nonzero). If linked with the @@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend) return p; } +/* Test if C matches charset op. *PP points to the charset or chraset_not + opcode. When the function finishes, *PP will be advanced past that opcode. + C is character to test (possibly after translations) and CORIG is original + character (i.e. without any translations). UNIBYTE denotes whether c is + unibyte or multibyte character. */ +static bool +execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte) +{ + re_char *p = *pp, *rtp = NULL; + bool not = (re_opcode_t) *p == charset_not; + + if (CHARSET_RANGE_TABLE_EXISTS_P (p)) + { + int count; + rtp = CHARSET_RANGE_TABLE (p); + EXTRACT_NUMBER_AND_INCR (count, rtp); + *pp = CHARSET_RANGE_TABLE_END ((rtp), (count)); + } + else + *pp += 2 + CHARSET_BITMAP_SIZE (p); + + if (unibyte && c < (1 << BYTEWIDTH)) + { /* Lookup bitmap. */ + /* Cast to `unsigned' instead of `unsigned char' in + case the bit list is a full 32 bytes long. */ + if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH) + && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + return !not; + } +#ifdef emacs + else if (rtp) + { + int class_bits = CHARSET_RANGE_TABLE_BITS (p); + re_wchar_t range_start, range_end; + + /* Sort tests by the most commonly used classes with some adjustment to which + tests are easiest to perform. Frequencies of character class names as of + 2016-07-15: + + $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + | + sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr + 213 [:alnum:] + 104 [:alpha:] + 62 [:space:] + 39 [:digit:] + 36 [:blank:] + 26 [:upper:] + 24 [:word:] + 21 [:lower:] + 10 [:punct:] + 10 [:ascii:] + 9 [:xdigit:] + 4 [:nonascii:] + 4 [:graph:] + 2 [:print:] + 2 [:cntrl:] + 1 [:ff:] + */ + + if ((class_bits & BIT_MULTIBYTE) || + (class_bits & BIT_ALNUM && ISALNUM (c)) || + (class_bits & BIT_ALPHA && ISALPHA (c)) || + (class_bits & BIT_SPACE && ISSPACE (c)) || + (class_bits & BIT_WORD && ISWORD (c)) || + ((class_bits & BIT_UPPER) && + (ISUPPER (c) || (corig != c && + c == downcase (corig) && ISLOWER (c)))) || + ((class_bits & BIT_LOWER) && + (ISLOWER (c) || (corig != c && + c == upcase (corig) && ISUPPER(c)))) || + (class_bits & BIT_PUNCT && ISPUNCT (c)) || + (class_bits & BIT_GRAPH && ISGRAPH (c)) || + (class_bits & BIT_PRINT && ISPRINT (c))) + return !not; + + for (p = *pp; rtp < p; rtp += 2 * 3) + { + EXTRACT_CHARACTER (range_start, rtp); + EXTRACT_CHARACTER (range_end, rtp + 3); + if (range_start <= c && c <= range_end) + return !not; + } + } +#endif /* emacs */ + return not; +} + /* Non-zero if "p1 matches something" implies "p2 fails". */ static int mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, @@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, else if ((re_opcode_t) *p1 == charset || (re_opcode_t) *p1 == charset_not) { - int not = (re_opcode_t) *p1 == charset_not; - - /* Test if C is listed in charset (or charset_not) - at `p1'. */ - if (! multibyte || IS_REAL_ASCII (c)) - { - if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH - && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } - else if (CHARSET_RANGE_TABLE_EXISTS_P (p1)) - CHARSET_LOOKUP_RANGE_TABLE (not, c, p1); - - /* `not' is equal to 1 if c would match, which means - that we can't change to pop_failure_jump. */ - if (!not) + if (!execute_charset (&p1, c, c, !multibyte)) { DEBUG_PRINT (" No match => fast loop.\n"); return 1; @@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, case charset_not: { register unsigned int c, corig; - boolean not = (re_opcode_t) *(p - 1) == charset_not; int len; - /* Start of actual range_table, or end of bitmap if there is no - range table. */ - re_char *range_table UNINIT; - - /* Nonzero if there is a range table. */ - int range_table_exists; - - /* Number of ranges of range table. This is not included - in the initial byte-length of the command. */ - int count = 0; - /* Whether matching against a unibyte character. */ boolean unibyte_char = false; - DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : ""); - - range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]); - - if (range_table_exists) - { - range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap. */ - EXTRACT_NUMBER_AND_INCR (count, range_table); - } + DEBUG_PRINT ("EXECUTING charset%s.\n", + (re_opcode_t) *(p - 1) == charset_not ? "_not" : ""); PREFETCH (); corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte); @@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, unibyte_char = true; } - if (unibyte_char && c < (1 << BYTEWIDTH)) - { /* Lookup bitmap. */ - /* Cast to `unsigned' instead of `unsigned char' in - case the bit list is a full 32 bytes long. */ - if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH) - && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } -#ifdef emacs - else if (range_table_exists) - { - int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]); - - if ( (class_bits & BIT_LOWER - && (ISLOWER (c) - || (corig != c - && c == upcase (corig) && ISUPPER(c)))) - | (class_bits & BIT_MULTIBYTE) - | (class_bits & BIT_PUNCT && ISPUNCT (c)) - | (class_bits & BIT_SPACE && ISSPACE (c)) - | (class_bits & BIT_UPPER - && (ISUPPER (c) - || (corig != c - && c == downcase (corig) && ISLOWER (c)))) - | (class_bits & BIT_WORD && ISWORD (c)) - | (class_bits & BIT_ALPHA && ISALPHA (c)) - | (class_bits & BIT_ALNUM && ISALNUM (c)) - | (class_bits & BIT_GRAPH && ISGRAPH (c)) - | (class_bits & BIT_PRINT && ISPRINT (c))) - not = !not; - else - CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count); - } -#endif /* emacs */ - - if (range_table_exists) - p = CHARSET_RANGE_TABLE_END (range_table, count); - else - p += CHARSET_BITMAP_SIZE (&p[-1]) + 1; - - if (!not) goto fail; + p -= 1; + if (!execute_charset (&p, c, corig, unibyte_char)) + goto fail; d += len; } diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el new file mode 100644 index 0000000..a2dd4f0 --- /dev/null +++ b/test/src/regex-tests.el @@ -0,0 +1,75 @@ +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- + +;; Copyright (C) 2015-2016 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. + +;;; Code: + +(require 'ert) + +(ert-deftest regex-word-cc-fallback-test () + (dolist (class '("[[:word:]]" "\\sw")) + (dolist (repeat '("*" "+")) + (dolist (suffix '("" "b" "bar" "\u2620")) + (should (string-match (concat "^" class repeat suffix "$") + (concat "foo" suffix))))))) + +(defun regex--test-cc (name matching not-matching) + (should (string-match-p (concat "^[[:" name ":]]*$") matching)) + (should (string-match-p (concat "^[[:" name ":]]*?\u2622$") + (concat matching "\u2622"))) + (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching)) + (should (string-match-p (concat "^[^[:" name ":]]*\u2622$") + (concat not-matching "\u2622"))) + (with-temp-buffer + (insert matching) + (let ((p (point))) + (insert not-matching) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]")) + (should (equal (point) p)) + (skip-chars-forward (concat "^[:" name ":]")) + (should (equal (point) (point-max))) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]\u2622")) + (should (or (equal (point) p) (equal (point) (1+ p))))))) + +(ert-deftest regex-character-classes () + (let (case-fold-search) + (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n") + (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n") + (regex--test-cc "digit" "012" "abcABCłąka-, \t\n") + (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n") + (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n") + (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n") + + (regex--test-cc "word" "abcABC012\u2620" "-, \t\n") + + (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n") + (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ") + (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1") + (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1") + + (regex--test-cc "space" " \t\n\u2001" "abcABCł0123") + (regex--test-cc "blank" " \t" "\n\u2001") + + (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620") + (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1") + (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622") + (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1"))) + +;;; buffer-tests.el ends here -- 2.8.0.rc3.226.g39d4020
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Mon, 18 Jul 2016 15:04:02 GMT) Full text and rfc822 format available.Message #8 received at 24020 <at> debbugs.gnu.org (full text, mbox):
From: Eli Zaretskii <eliz <at> gnu.org> To: Michal Nazarewicz <mina86 <at> mina86.com> Cc: 24020 <at> debbugs.gnu.org Subject: Re: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ Date: Mon, 18 Jul 2016 18:03:13 +0300
> From: Michal Nazarewicz <mina86 <at> mina86.com> > Date: Mon, 18 Jul 2016 16:04:44 +0200 > > mutually_exclusive_p did not check for the claass bits of an charset > opcode when comparing it with an exactn which resulted in situation > where it thought a multibyte character could not match the character > class. > > This assumption caused incorrect optimisation of the regular expression > and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’. > > The issue affects all multibyte word characters as well as other > character classes which may match multibyte characters. Thanks. Unfortunately, the above description is too terse for me to understand the issue and the way you propose to fix it. Could you please provide more details, including what problems you saw in classes other than [:word:]? Note that some of the classes deliberately don't work on multibyte characters, and are documented as such. So if we are changing that, there should be documentation changes and an entry in NEWS as well (but I suggest not to make such changes too easily, not without measuring the impact on performance, if any). > * src/regex.c (executing_charset): A new function for executing the > charset and charset_not opcodes. It performs check on the character > taking into consideration existing bitmap, rang table and class bits. ^^^^ A typo. > +#ifdef emacs > + else if (rtp) > + { > + int class_bits = CHARSET_RANGE_TABLE_BITS (p); > + re_wchar_t range_start, range_end; > + > + /* Sort tests by the most commonly used classes with some adjustment to which > + tests are easiest to perform. Frequencies of character class names as of > + 2016-07-15: Not sure what files you used for this. Are those Emacs source files? > diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el > new file mode 100644 > index 0000000..a2dd4f0 > --- /dev/null > +++ b/test/src/regex-tests.el > @@ -0,0 +1,75 @@ > +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- ^^^^^^^^^^^^^^^ Copy-paste error. > +;;; buffer-tests.el ends here And another one.
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Mon, 18 Jul 2016 18:08:02 GMT) Full text and rfc822 format available.Message #11 received at 24020 <at> debbugs.gnu.org (full text, mbox):
From: Michal Nazarewicz <mina86 <at> mina86.com> To: Eli Zaretskii <eliz <at> gnu.org> Cc: 24020 <at> debbugs.gnu.org Subject: Re: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ Date: Mon, 18 Jul 2016 20:07:18 +0200
On Mon, Jul 18 2016, Eli Zaretskii wrote: >> From: Michal Nazarewicz <mina86 <at> mina86.com> >> Date: Mon, 18 Jul 2016 16:04:44 +0200 >> >> mutually_exclusive_p did not check for the claass bits of an charset >> opcode when comparing it with an exactn which resulted in situation >> where it thought a multibyte character could not match the character >> class. >> >> This assumption caused incorrect optimisation of the regular expression >> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’. >> >> The issue affects all multibyte word characters as well as other >> character classes which may match multibyte characters. > > Thanks. > > Unfortunately, the above description is too terse for me to understand > the issue and the way you propose to fix it. Could you please provide > more details, ‘[[:word:]]*\u2620’ ends up being as: 0: /on_failure_keep_string_jump to 28 3: /charset [ !extends past end of pattern! $-%0-9A-Za-z]has-range-table 25: /jump to 3 28: /exactn/3/â// 33: /succeed 34: end of pattern. while ‘\sw*\u2620’ as: 0: /on_failure_jump to 8 3: /syntaxspec/2 5: /jump to 0 8: /exactn/3/â// 13: /succeed 14: end of pattern. Apart from a different opcode to match the word class the crux of the difference is the first opcode: on_failure_keep_string_jump vs. on_failure_jump. As a matter of fact, regex_compile puts a on_failure_jump_smart opcode at the beginning which is optimised by re_match_2_internal (debug code removed for brevity): /* This operation is used for greedy *. Compare the beginning of the repeat with what in the pattern follows its end. If we can establish that there is nothing that they would both match, i.e., that we would have to backtrack because of (as in, e.g., `a*a') then we can use a non-backtracking loop based on on_failure_keep_string_jump instead of on_failure_jump. */ case on_failure_jump_smart: EXTRACT_NUMBER_AND_INCR (mcnt, p); { re_char *p1 = p; /* Next operation. */ /* Here, we discard `const', making re_match non-reentrant. */ unsigned char *p2 = (unsigned char*) p + mcnt; /* Jump dest. */ unsigned char *p3 = (unsigned char*) p - 3; /* opcode location. */ p -= 3; /* Reset so that we will re-execute the instruction once it's been changed. */ EXTRACT_NUMBER (mcnt, p2 - 2); /* Ensure this is a indeed the trivial kind of loop we are expecting. */ if (mutually_exclusive_p (bufp, p1, p2)) { /* Use a fast `on_failure_keep_string_jump' loop. */ *p3 = (unsigned char) on_failure_keep_string_jump; STORE_NUMBER (p2 - 2, mcnt + 3); } else { /* Default to a safe `on_failure_jump' loop. */ *p3 = (unsigned char) on_failure_jump; } } In other words, in our example, the code checks whether ‘[[:word:]]’ can match ‘💀’. If it cannot than we can be greedy about matching ‘[[:word:]]*’ and never backtrace looking for a shorter match; if it can, we may need to backtrace if the overall matching fails. mutually_exclusive_p concludes that ‘[[:word:]]’ cannot match ‘💀’ (or any non-ASCII characters really) but as a matter of fact, word class does match skull character. So when ‘[[:word:]]*💀’ matches ‘foo💀’ the following happens: 1. ‘[[:word:]]*’ matches the whole string. 2. String is now empty so ‘💀’ doesn’t match. 3. Because of incorrect assumptions, the engine does not shorten the initial ‘[[:word:]]*’ match. (I may be butchering the exact terms and algorithm that is being applied but the general idea is, I hope, shown). > Note that some of the classes deliberately don't work on multibyte > characters, and are documented as such. This is irrelevant. ‘[[:word:]]*’ matches ‘foo’ thus ‘[[:word:]]*b’ must match ‘foob’ (which it does) and ‘[[:word:]]*☠’ must match ‘foo☠’ (which it doesn’t). > including what problems you saw in classes other than [:word:]? The problem happens for any class which matches multibyte characters. For example: (string-match-p "[[:alpha:]]*" "żółć") => 0 (string-match-p "[[:alpha:]]*w" "żółw") => 0 (string-match-p "[[:alpha:]]*ć" "żółć") => nil (should be 0) ;; or even more simply: (string-match-p "[[:alpha:]]*ć" "ż") => nil (should be 0) In general, for a class FOO, if a multibyte character C matches that class regex "[[:FOO]]*C") should match the character itself but it doesn’t. > So if we are changing that, there should be documentation changes and > an entry in NEWS as well (but I suggest not to make such changes too > easily, not without measuring the impact on performance, if any). > >> * src/regex.c (executing_charset): A new function for executing the >> charset and charset_not opcodes. It performs check on the character >> taking into consideration existing bitmap, rang table and class bits. > ^^^^ > A typo. > >> +#ifdef emacs >> + else if (rtp) >> + { >> + int class_bits = CHARSET_RANGE_TABLE_BITS (p); >> + re_wchar_t range_start, range_end; >> + >> + /* Sort tests by the most commonly used classes with some adjustment to which >> + tests are easiest to perform. Frequencies of character class names as of >> + 2016-07-15: > > Not sure what files you used for this. Are those Emacs source files? > >> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el >> new file mode 100644 >> index 0000000..a2dd4f0 >> --- /dev/null >> +++ b/test/src/regex-tests.el >> @@ -0,0 +1,75 @@ >> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- > ^^^^^^^^^^^^^^^ > > Copy-paste error. > >> +;;; buffer-tests.el ends here > > And another one. -- Best regards ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ «If at first you don’t succeed, give up skydiving»
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Mon, 18 Jul 2016 23:31:01 GMT) Full text and rfc822 format available.Message #14 received at 24020 <at> debbugs.gnu.org (full text, mbox):
From: Michal Nazarewicz <mina86 <at> mina86.com> To: 24020 <at> debbugs.gnu.org Cc: eliz <at> gnu.org Subject: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020) Date: Tue, 19 Jul 2016 01:30:01 +0200
The regex engine tries to optimise (greedy) Kleene star by avoiding backtracking when it can detect that portion of the expression after the star cannot match if the repeated portion does match. For example, take regular expression ‘[[:alpha:]]*1’ trying to match a string ‘foo’. Since the Kleene star is greedy, the engine will test the shortest match for ‘[[:alpha:]]*’ which is ‘foo’. At this point though, the string being matched is empty while there’s still a literal digit one in the pattern. The engine will not, however, attempt to back-trace testing a shorter match for the character class (namely ‘fo’ leaving ‘o’ in the string) because it knows that whatever will be left in the string cannot match literal digit one. In the regexes of the form ‘[[:CC:]]*X’, the optimisation can be applied if and only if the regex engine can prove that the character class CC does not match character X (as is the case with alpha character class not matching digit 1). In the code, the proof is performed by mutually_exclusive_p function. However, it did not check class bits of a charset opcode which resulted in it assuming that character classes cannot match multibyte characters. For example, it would assume that [[:alpha:]] cannot match ‘ż’ even though ‘ż’ is indeed an alphanumeric character matching the alpha character class. This assumption caused incorrect optimisation of the regular expression and eventually failure of ‘[[:alpha:]]*żółw’ to match ‘żółw’. This issue affects any character class witch matches multibyte characters. * src/regex.c (executing_charset): A new function for executing the charset and charset_not opcodes. It performs check on the character taking into consideration existing bitmap, range table and class bits. It also advances the pointer in the regex bytecode past the parsed opcode. (CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed. Code now included in executing_charset. (mutually_exclusive_p, re_match_2_internal): Changed to take advantage of executing_charset function. * test/src/regex-tests.el: New file with tests for the character class matching. --- src/regex.c | 209 +++++++++++++++++++++--------------------------- test/src/regex-tests.el | 92 +++++++++++++++++++++ 2 files changed, 185 insertions(+), 116 deletions(-) create mode 100644 test/src/regex-tests.el diff --git a/src/regex.c b/src/regex.c index f92bcb7..297bf71 100644 --- a/src/regex.c +++ b/src/regex.c @@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source) and end. */ #define CHARSET_RANGE_TABLE_END(range_table, count) \ ((range_table) + (count) * 2 * 3) - -/* Test if C is in RANGE_TABLE. A flag NOT is negated if C is in. - COUNT is number of ranges in RANGE_TABLE. */ -#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count) \ - do \ - { \ - re_wchar_t range_start, range_end; \ - re_char *rtp; \ - re_char *range_table_end \ - = CHARSET_RANGE_TABLE_END ((range_table), (count)); \ - \ - for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3) \ - { \ - EXTRACT_CHARACTER (range_start, rtp); \ - EXTRACT_CHARACTER (range_end, rtp + 3); \ - \ - if (range_start <= (c) && (c) <= range_end) \ - { \ - (not) = !(not); \ - break; \ - } \ - } \ - } \ - while (0) - -/* Test if C is in range table of CHARSET. The flag NOT is negated if - C is listed in it. */ -#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset) \ - do \ - { \ - /* Number of ranges in range table. */ \ - int count; \ - re_char *range_table = CHARSET_RANGE_TABLE (charset); \ - \ - EXTRACT_NUMBER_AND_INCR (count, range_table); \ - CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count); \ - } \ - while (0) /* If DEBUG is defined, Regex prints many voluminous messages about what it is doing (if the variable `debug' is nonzero). If linked with the @@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend) return p; } +/* Test if C matches charset op. *PP points to the charset or chraset_not + opcode. When the function finishes, *PP will be advanced past that opcode. + C is character to test (possibly after translations) and CORIG is original + character (i.e. without any translations). UNIBYTE denotes whether c is + unibyte or multibyte character. */ +static bool +execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte) +{ + re_char *p = *pp, *rtp = NULL; + bool not = (re_opcode_t) *p == charset_not; + + if (CHARSET_RANGE_TABLE_EXISTS_P (p)) + { + int count; + rtp = CHARSET_RANGE_TABLE (p); + EXTRACT_NUMBER_AND_INCR (count, rtp); + *pp = CHARSET_RANGE_TABLE_END ((rtp), (count)); + } + else + *pp += 2 + CHARSET_BITMAP_SIZE (p); + + if (unibyte && c < (1 << BYTEWIDTH)) + { /* Lookup bitmap. */ + /* Cast to `unsigned' instead of `unsigned char' in + case the bit list is a full 32 bytes long. */ + if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH) + && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + return !not; + } +#ifdef emacs + else if (rtp) + { + int class_bits = CHARSET_RANGE_TABLE_BITS (p); + re_wchar_t range_start, range_end; + + /* Sort tests by the most commonly used classes with some adjustment to which + tests are easiest to perform. Frequencies of character class names used in + Emacs sources as of 2016-07-15: + + $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + | + sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr + 213 [:alnum:] + 104 [:alpha:] + 62 [:space:] + 39 [:digit:] + 36 [:blank:] + 26 [:upper:] + 24 [:word:] + 21 [:lower:] + 10 [:punct:] + 10 [:ascii:] + 9 [:xdigit:] + 4 [:nonascii:] + 4 [:graph:] + 2 [:print:] + 2 [:cntrl:] + 1 [:ff:] + */ + + if ((class_bits & BIT_MULTIBYTE) || + (class_bits & BIT_ALNUM && ISALNUM (c)) || + (class_bits & BIT_ALPHA && ISALPHA (c)) || + (class_bits & BIT_SPACE && ISSPACE (c)) || + (class_bits & BIT_WORD && ISWORD (c)) || + ((class_bits & BIT_UPPER) && + (ISUPPER (c) || (corig != c && + c == downcase (corig) && ISLOWER (c)))) || + ((class_bits & BIT_LOWER) && + (ISLOWER (c) || (corig != c && + c == upcase (corig) && ISUPPER(c)))) || + (class_bits & BIT_PUNCT && ISPUNCT (c)) || + (class_bits & BIT_GRAPH && ISGRAPH (c)) || + (class_bits & BIT_PRINT && ISPRINT (c))) + return !not; + + for (p = *pp; rtp < p; rtp += 2 * 3) + { + EXTRACT_CHARACTER (range_start, rtp); + EXTRACT_CHARACTER (range_end, rtp + 3); + if (range_start <= c && c <= range_end) + return !not; + } + } +#endif /* emacs */ + return not; +} + /* Non-zero if "p1 matches something" implies "p2 fails". */ static int mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, @@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, else if ((re_opcode_t) *p1 == charset || (re_opcode_t) *p1 == charset_not) { - int not = (re_opcode_t) *p1 == charset_not; - - /* Test if C is listed in charset (or charset_not) - at `p1'. */ - if (! multibyte || IS_REAL_ASCII (c)) - { - if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH - && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } - else if (CHARSET_RANGE_TABLE_EXISTS_P (p1)) - CHARSET_LOOKUP_RANGE_TABLE (not, c, p1); - - /* `not' is equal to 1 if c would match, which means - that we can't change to pop_failure_jump. */ - if (!not) + if (!execute_charset (&p1, c, c, !multibyte)) { DEBUG_PRINT (" No match => fast loop.\n"); return 1; @@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, case charset_not: { register unsigned int c, corig; - boolean not = (re_opcode_t) *(p - 1) == charset_not; int len; - /* Start of actual range_table, or end of bitmap if there is no - range table. */ - re_char *range_table UNINIT; - - /* Nonzero if there is a range table. */ - int range_table_exists; - - /* Number of ranges of range table. This is not included - in the initial byte-length of the command. */ - int count = 0; - /* Whether matching against a unibyte character. */ boolean unibyte_char = false; - DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : ""); - - range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]); - - if (range_table_exists) - { - range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap. */ - EXTRACT_NUMBER_AND_INCR (count, range_table); - } + DEBUG_PRINT ("EXECUTING charset%s.\n", + (re_opcode_t) *(p - 1) == charset_not ? "_not" : ""); PREFETCH (); corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte); @@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, unibyte_char = true; } - if (unibyte_char && c < (1 << BYTEWIDTH)) - { /* Lookup bitmap. */ - /* Cast to `unsigned' instead of `unsigned char' in - case the bit list is a full 32 bytes long. */ - if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH) - && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } -#ifdef emacs - else if (range_table_exists) - { - int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]); - - if ( (class_bits & BIT_LOWER - && (ISLOWER (c) - || (corig != c - && c == upcase (corig) && ISUPPER(c)))) - | (class_bits & BIT_MULTIBYTE) - | (class_bits & BIT_PUNCT && ISPUNCT (c)) - | (class_bits & BIT_SPACE && ISSPACE (c)) - | (class_bits & BIT_UPPER - && (ISUPPER (c) - || (corig != c - && c == downcase (corig) && ISLOWER (c)))) - | (class_bits & BIT_WORD && ISWORD (c)) - | (class_bits & BIT_ALPHA && ISALPHA (c)) - | (class_bits & BIT_ALNUM && ISALNUM (c)) - | (class_bits & BIT_GRAPH && ISGRAPH (c)) - | (class_bits & BIT_PRINT && ISPRINT (c))) - not = !not; - else - CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count); - } -#endif /* emacs */ - - if (range_table_exists) - p = CHARSET_RANGE_TABLE_END (range_table, count); - else - p += CHARSET_BITMAP_SIZE (&p[-1]) + 1; - - if (!not) goto fail; + p -= 1; + if (!execute_charset (&p, c, corig, unibyte_char)) + goto fail; d += len; } diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el new file mode 100644 index 0000000..00165ab --- /dev/null +++ b/test/src/regex-tests.el @@ -0,0 +1,92 @@ +;;; regex-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- + +;; Copyright (C) 2015-2016 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. + +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see <http://www.gnu.org/licenses/>. + +;;; Code: + +(require 'ert) + +(ert-deftest regex-word-cc-fallback-test () + "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020). + +Test that a regex of the form \"[[:cc:]]*x\" where CC is +a character class which matches a multibyte character X, matches +string \"x\". + +For example, ‘[[:word:]]*\u2620’ regex (note: \u2620 is a word +character) must match a string \"\u2420\"." + (dolist (class '("[[:word:]]" "\\sw")) + (dolist (repeat '("*" "+")) + (dolist (suffix '("" "b" "bar" "\u2620")) + (dolist (string '("" "foo")) + (when (not (and (string-equal repeat "+") + (string-equal string ""))) + (should (string-match (concat "^" class repeat suffix "$") + (concat string suffix))))))))) + +(defun regex--test-cc (name matching not-matching) + (should (string-match-p (concat "^[[:" name ":]]*$") matching)) + (should (string-match-p (concat "^[[:" name ":]]*?\u2622$") + (concat matching "\u2622"))) + (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching)) + (should (string-match-p (concat "^[^[:" name ":]]*\u2622$") + (concat not-matching "\u2622"))) + (with-temp-buffer + (insert matching) + (let ((p (point))) + (insert not-matching) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]")) + (should (equal (point) p)) + (skip-chars-forward (concat "^[:" name ":]")) + (should (equal (point) (point-max))) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]\u2622")) + (should (or (equal (point) p) (equal (point) (1+ p))))))) + +(ert-deftest regex-character-classes () + "Perform sanity test of regexes using character classes. + +Go over all the supported character classes and test whether the +classes and their inversions match what they are supposed to +match. The test is done using `string-match-p' as well as +`skip-chars-forward'." + (let (case-fold-search) + (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n") + (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n") + (regex--test-cc "digit" "012" "abcABCłąka-, \t\n") + (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n") + (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n") + (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n") + + (regex--test-cc "word" "abcABC012\u2620" "-, \t\n") + + (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n") + (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ") + (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1") + (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1") + + (regex--test-cc "space" " \t\n\u2001" "abcABCł0123") + (regex--test-cc "blank" " \t" "\n\u2001") + + (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620") + (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1") + (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622") + (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1"))) + +;;; regex-tests.el ends here -- 2.8.0.rc3.226.g39d4020
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Tue, 19 Jul 2016 08:01:01 GMT) Full text and rfc822 format available.Message #17 received at 24020 <at> debbugs.gnu.org (full text, mbox):
From: Andreas Schwab <schwab <at> suse.de> To: Michal Nazarewicz <mina86 <at> mina86.com> Cc: 24020 <at> debbugs.gnu.org Subject: Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020) Date: Tue, 19 Jul 2016 10:00:46 +0200
Michal Nazarewicz <mina86 <at> mina86.com> writes: > For example, take regular expression ‘[[:alpha:]]*1’ trying to match > a string ‘foo’. Since the Kleene star is greedy, the engine will test > the shortest match for ‘[[:alpha:]]*’ which is ‘foo’. At this point Did you mean "the longest match"? Andreas. -- Andreas Schwab, SUSE Labs, schwab <at> suse.de GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE 1748 E4D4 88E3 0EEA B9D7 "And now for something completely different."
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Wed, 20 Jul 2016 12:37:01 GMT) Full text and rfc822 format available.Message #20 received at 24020 <at> debbugs.gnu.org (full text, mbox):
From: Michal Nazarewicz <mina86 <at> mina86.com> To: Andreas Schwab <schwab <at> suse.de> Cc: 24020 <at> debbugs.gnu.org Subject: Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020) Date: Wed, 20 Jul 2016 14:36:39 +0200
> Michal Nazarewicz <mina86 <at> mina86.com> writes: >> For example, take regular expression ‘[[:alpha:]]*1’ trying to match >> a string ‘foo’. Since the Kleene star is greedy, the engine will test >> the shortest match for ‘[[:alpha:]]*’ which is ‘foo’. At this point On Tue, Jul 19 2016, Andreas Schwab wrote: > Did you mean "the longest match"? Yep, thanks for spotting this. -- Best regards ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ «If at first you don’t succeed, give up skydiving»
Michal Nazarewicz <mina86 <at> mina86.com>
:Michal Nazarewicz <mina86 <at> mina86.com>
:Message #25 received at 24020-done <at> debbugs.gnu.org (full text, mbox):
From: Michal Nazarewicz <mina86 <at> mina86.com> To: 24020-done <at> debbugs.gnu.org Subject: Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020) Date: Mon, 25 Jul 2016 23:54:08 +0200
Pushed. -- Best regards ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ «If at first you don’t succeed, give up skydiving»
bug-gnu-emacs <at> gnu.org
:bug#24020
; Package emacs
.
(Wed, 27 Jul 2016 16:23:01 GMT) Full text and rfc822 format available.Message #28 received at 24020 <at> debbugs.gnu.org (full text, mbox):
From: Michal Nazarewicz <mina86 <at> mina86.com> To: 24020 <at> debbugs.gnu.org Subject: [PATCH] Fix ‘is multibyte’ test regex.c’s mutually_exclusive_p (bug#24020) Date: Wed, 27 Jul 2016 18:22:17 +0200
* src/regex.c (mutually_exclusive_p): Fix how whether character is unibyte is tested when calling execute_charset function. This bug has been introduced by [6dc6b00: Fix ‘[[:cc:]]*literal’ regex failing to match ‘literal’] which dropped a call to IS_REAL_ASCII (c) macro. Reinstitute it. --- src/regex.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) FYI, I’ve just submitted this patch. The bug has been caught by tests from dima_regex_embedded_modifiers branch which I’m working on pulling into master branch so they will come in another commits. diff --git a/src/regex.c b/src/regex.c index 297bf71..1f2a1f08 100644 --- a/src/regex.c +++ b/src/regex.c @@ -4767,7 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, else if ((re_opcode_t) *p1 == charset || (re_opcode_t) *p1 == charset_not) { - if (!execute_charset (&p1, c, c, !multibyte)) + if (!execute_charset (&p1, c, c, !multibyte || IS_REAL_ASCII (c))) { DEBUG_PRINT (" No match => fast loop.\n"); return 1; -- 2.8.0.rc3.226.g39d4020
Debbugs Internal Request <help-debbugs <at> gnu.org>
to internal_control <at> debbugs.gnu.org
.
(Thu, 25 Aug 2016 11:24:04 GMT) Full text and rfc822 format available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.