From unknown Sun Jun 15 13:01:53 2025
From: Michal Nazarewicz
To: 24020@debbugs.gnu.org
Subject: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date: Mon, 18 Jul 2016 16:04:44 +0200
Message-Id: <1468850684-17867-1-git-send-email-mina86@mina86.com>
X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020

mutually_exclusive_p did not check for the claass bits of an charset
opcode when comparing it with an exactn which resulted in situation
where it thought a multibyte character could not match the character
class.

This assumption caused incorrect optimisation of the regular expression
and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.

The issue affects all multibyte word characters as well as other
character classes which may match multibyte characters.

* src/regex.c (executing_charset): A new function for executing the
charset and charset_not opcodes.  It performs check on the character
taking into consideration existing bitmap, rang table and class bits.
It also advances the pointer in the regex bytecode past the parsed
opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed.
Code now included in executing_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to take advantage
of executing_charset function.

* test/src/regex-tests.el: New file with tests for the character class
matching.
---
 Unless there are objections I’ll push it within a week or so.

src/regex.c | 209 +++++++++++++++++++++--------------------------- test/src/regex-tests.el | 75 +++++++++++++++++ 2 files changed, 168 insertions(+), 116 deletions(-) create mode 100644 test/src/regex-tests.el diff --git a/src/regex.c b/src/regex.c index f92bcb7..9f999a7 100644 --- a/src/regex.c +++ b/src/regex.c @@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source) and end. */ #define CHARSET_RANGE_TABLE_END(range_table, count) \ ((range_table) + (count) * 2 * 3) - -/* Test if C is in RANGE_TABLE. A flag NOT is negated if C is in. - COUNT is number of ranges in RANGE_TABLE. */ -#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count) \ - do \ - { \ - re_wchar_t range_start, range_end; \ - re_char *rtp; \ - re_char *range_table_end \ - = CHARSET_RANGE_TABLE_END ((range_table), (count)); \ - \ - for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3) \ - { \ - EXTRACT_CHARACTER (range_start, rtp); \ - EXTRACT_CHARACTER (range_end, rtp + 3); \ - \ - if (range_start <= (c) && (c) <= range_end) \ - { \ - (not) = !(not); \ - break; \ - } \ - } \ - } \ - while (0) - -/* Test if C is in range table of CHARSET. The flag NOT is negated if - C is listed in it. */ -#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset) \ - do \ - { \ - /* Number of ranges in range table. */ \ - int count; \ - re_char *range_table = CHARSET_RANGE_TABLE (charset); \ - \ - EXTRACT_NUMBER_AND_INCR (count, range_table); \ - CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count); \ - } \ - while (0) /* If DEBUG is defined, Regex prints many voluminous messages about what it is doing (if the variable `debug' is nonzero). If linked with the @@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend) return p; } +/* Test if C matches charset op. *PP points to the charset or chraset_not + opcode. When the function finishes, *PP will be advanced past that opcode. + C is character to test (possibly after translations) and CORIG is original + character (i.e. 
without any translations). UNIBYTE denotes whether c is + unibyte or multibyte character. */ +static bool +execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte) +{ + re_char *p = *pp, *rtp = NULL; + bool not = (re_opcode_t) *p == charset_not; + + if (CHARSET_RANGE_TABLE_EXISTS_P (p)) + { + int count; + rtp = CHARSET_RANGE_TABLE (p); + EXTRACT_NUMBER_AND_INCR (count, rtp); + *pp = CHARSET_RANGE_TABLE_END ((rtp), (count)); + } + else + *pp += 2 + CHARSET_BITMAP_SIZE (p); + + if (unibyte && c < (1 << BYTEWIDTH)) + { /* Lookup bitmap. */ + /* Cast to `unsigned' instead of `unsigned char' in + case the bit list is a full 32 bytes long. */ + if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH) + && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + return !not; + } +#ifdef emacs + else if (rtp) + { + int class_bits = CHARSET_RANGE_TABLE_BITS (p); + re_wchar_t range_start, range_end; + + /* Sort tests by the most commonly used classes with some adjustment to which + tests are easiest to perform. 
Frequencies of character class names as of + 2016-07-15: + + $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + | + sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr + 213 [:alnum:] + 104 [:alpha:] + 62 [:space:] + 39 [:digit:] + 36 [:blank:] + 26 [:upper:] + 24 [:word:] + 21 [:lower:] + 10 [:punct:] + 10 [:ascii:] + 9 [:xdigit:] + 4 [:nonascii:] + 4 [:graph:] + 2 [:print:] + 2 [:cntrl:] + 1 [:ff:] + */ + + if ((class_bits & BIT_MULTIBYTE) || + (class_bits & BIT_ALNUM && ISALNUM (c)) || + (class_bits & BIT_ALPHA && ISALPHA (c)) || + (class_bits & BIT_SPACE && ISSPACE (c)) || + (class_bits & BIT_WORD && ISWORD (c)) || + ((class_bits & BIT_UPPER) && + (ISUPPER (c) || (corig != c && + c == downcase (corig) && ISLOWER (c)))) || + ((class_bits & BIT_LOWER) && + (ISLOWER (c) || (corig != c && + c == upcase (corig) && ISUPPER(c)))) || + (class_bits & BIT_PUNCT && ISPUNCT (c)) || + (class_bits & BIT_GRAPH && ISGRAPH (c)) || + (class_bits & BIT_PRINT && ISPRINT (c))) + return !not; + + for (p = *pp; rtp < p; rtp += 2 * 3) + { + EXTRACT_CHARACTER (range_start, rtp); + EXTRACT_CHARACTER (range_end, rtp + 3); + if (range_start <= c && c <= range_end) + return !not; + } + } +#endif /* emacs */ + return not; +} + /* Non-zero if "p1 matches something" implies "p2 fails". */ static int mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, @@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, else if ((re_opcode_t) *p1 == charset || (re_opcode_t) *p1 == charset_not) { - int not = (re_opcode_t) *p1 == charset_not; - - /* Test if C is listed in charset (or charset_not) - at `p1'. */ - if (! 
multibyte || IS_REAL_ASCII (c)) - { - if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH - && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } - else if (CHARSET_RANGE_TABLE_EXISTS_P (p1)) - CHARSET_LOOKUP_RANGE_TABLE (not, c, p1); - - /* `not' is equal to 1 if c would match, which means - that we can't change to pop_failure_jump. */ - if (!not) + if (!execute_charset (&p1, c, c, !multibyte)) { DEBUG_PRINT (" No match => fast loop.\n"); return 1; @@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, case charset_not: { register unsigned int c, corig; - boolean not = (re_opcode_t) *(p - 1) == charset_not; int len; - /* Start of actual range_table, or end of bitmap if there is no - range table. */ - re_char *range_table UNINIT; - - /* Nonzero if there is a range table. */ - int range_table_exists; - - /* Number of ranges of range table. This is not included - in the initial byte-length of the command. */ - int count = 0; - /* Whether matching against a unibyte character. */ boolean unibyte_char = false; - DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : ""); - - range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]); - - if (range_table_exists) - { - range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap. */ - EXTRACT_NUMBER_AND_INCR (count, range_table); - } + DEBUG_PRINT ("EXECUTING charset%s.\n", + (re_opcode_t) *(p - 1) == charset_not ? "_not" : ""); PREFETCH (); corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte); @@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, unibyte_char = true; } - if (unibyte_char && c < (1 << BYTEWIDTH)) - { /* Lookup bitmap. */ - /* Cast to `unsigned' instead of `unsigned char' in - case the bit list is a full 32 bytes long. 
*/ - if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH) - && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } -#ifdef emacs - else if (range_table_exists) - { - int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]); - - if ( (class_bits & BIT_LOWER - && (ISLOWER (c) - || (corig != c - && c == upcase (corig) && ISUPPER(c)))) - | (class_bits & BIT_MULTIBYTE) - | (class_bits & BIT_PUNCT && ISPUNCT (c)) - | (class_bits & BIT_SPACE && ISSPACE (c)) - | (class_bits & BIT_UPPER - && (ISUPPER (c) - || (corig != c - && c == downcase (corig) && ISLOWER (c)))) - | (class_bits & BIT_WORD && ISWORD (c)) - | (class_bits & BIT_ALPHA && ISALPHA (c)) - | (class_bits & BIT_ALNUM && ISALNUM (c)) - | (class_bits & BIT_GRAPH && ISGRAPH (c)) - | (class_bits & BIT_PRINT && ISPRINT (c))) - not = !not; - else - CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count); - } -#endif /* emacs */ - - if (range_table_exists) - p = CHARSET_RANGE_TABLE_END (range_table, count); - else - p += CHARSET_BITMAP_SIZE (&p[-1]) + 1; - - if (!not) goto fail; + p -= 1; + if (!execute_charset (&p, c, corig, unibyte_char)) + goto fail; d += len; } diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el new file mode 100644 index 0000000..a2dd4f0 --- /dev/null +++ b/test/src/regex-tests.el @@ -0,0 +1,75 @@ +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- + +;; Copyright (C) 2015-2016 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. 
+ +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see . + +;;; Code: + +(require 'ert) + +(ert-deftest regex-word-cc-fallback-test () + (dolist (class '("[[:word:]]" "\\sw")) + (dolist (repeat '("*" "+")) + (dolist (suffix '("" "b" "bar" "\u2620")) + (should (string-match (concat "^" class repeat suffix "$") + (concat "foo" suffix))))))) + +(defun regex--test-cc (name matching not-matching) + (should (string-match-p (concat "^[[:" name ":]]*$") matching)) + (should (string-match-p (concat "^[[:" name ":]]*?\u2622$") + (concat matching "\u2622"))) + (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching)) + (should (string-match-p (concat "^[^[:" name ":]]*\u2622$") + (concat not-matching "\u2622"))) + (with-temp-buffer + (insert matching) + (let ((p (point))) + (insert not-matching) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]")) + (should (equal (point) p)) + (skip-chars-forward (concat "^[:" name ":]")) + (should (equal (point) (point-max))) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]\u2622")) + (should (or (equal (point) p) (equal (point) (1+ p))))))) + +(ert-deftest regex-character-classes () + (let (case-fold-search) + (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n") + (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n") + (regex--test-cc "digit" "012" "abcABCłąka-, \t\n") + (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n") + (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n") + (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n") + + (regex--test-cc "word" "abcABC012\u2620" "-, \t\n") + + (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n") + (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ") + (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1") + (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1") + + (regex--test-cc "space" " \t\n\u2001" "abcABCł0123") + (regex--test-cc "blank" " \t" "\n\u2001") + + (regex--test-cc "ascii" "abcABC012 
\t\n\1" "łą\u2620") + (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1") + (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622") + (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1"))) + +;;; buffer-tests.el ends here -- 2.8.0.rc3.226.g39d4020

From unknown Sun Jun 15 13:01:53 2025
From: Eli Zaretskii
To: Michal Nazarewicz
Cc: 24020@debbugs.gnu.org
Subject: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date: Mon, 18 Jul 2016 18:03:13 +0300
Message-Id: <83r3ar0z0u.fsf@gnu.org>
In-reply-to: <1468850684-17867-1-git-send-email-mina86@mina86.com>
References: <1468850684-17867-1-git-send-email-mina86@mina86.com>

> From: Michal Nazarewicz
> Date: Mon, 18 Jul 2016 16:04:44 +0200
>
> mutually_exclusive_p did not check for the claass bits of an charset
> opcode when comparing it with an exactn which resulted in situation
> where it thought a multibyte character could not match the character
> class.
>
> This assumption caused incorrect optimisation of the regular expression
> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
>
> The issue affects all multibyte word characters as well as other
> character classes which may match multibyte characters.

Thanks.

Unfortunately, the above description is too terse for me to understand
the issue and the way you propose to fix it.  Could you please provide
more details, including what problems you saw in classes other than
[:word:]?

Note that some of the classes deliberately don't work on multibyte
characters, and are documented as such.
So if we are changing that, there should be documentation changes and
an entry in NEWS as well (but I suggest not to make such changes too
easily, not without measuring the impact on performance, if any).

> * src/regex.c (executing_charset): A new function for executing the
> charset and charset_not opcodes.  It performs check on the character
> taking into consideration existing bitmap, rang table and class bits.
                                             ^^^^
A typo.

> +#ifdef emacs
> +  else if (rtp)
> +    {
> +      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
> +      re_wchar_t range_start, range_end;
> +
> +  /* Sort tests by the most commonly used classes with some adjustment to which
> +     tests are easiest to perform.  Frequencies of character class names as of
> +     2016-07-15:

Not sure what files you used for this.  Are those Emacs source files?

> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
> new file mode 100644
> index 0000000..a2dd4f0
> --- /dev/null
> +++ b/test/src/regex-tests.el
> @@ -0,0 +1,75 @@
> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
       ^^^^^^^^^^^^^^^

Copy-paste error.

> +;;; buffer-tests.el ends here

And another one.

From unknown Sun Jun 15 13:01:53 2025
From: Michal Nazarewicz
To: Eli Zaretskii
Cc: 24020@debbugs.gnu.org
Subject: bug#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’
Date: Mon, 18 Jul 2016 20:07:18 +0200
In-Reply-To: <83r3ar0z0u.fsf@gnu.org>
References: <1468850684-17867-1-git-send-email-mina86@mina86.com> <83r3ar0z0u.fsf@gnu.org>
Organization: http://mina86.com/

On Mon, Jul 18 2016, Eli Zaretskii wrote:
>> From: Michal Nazarewicz
>> Date: Mon, 18 Jul 2016 16:04:44 +0200
>>
>> mutually_exclusive_p did not check for the claass bits of an charset
>> opcode when comparing it with an exactn which resulted in situation
>> where it thought a multibyte character could not match the character
>> class.
>>
>> This assumption caused incorrect optimisation of the regular expression
>> and eventually failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
>>
>> The issue affects all multibyte word characters as well as other
>> character classes which may match multibyte characters.
>
> Thanks.
>
> Unfortunately, the above description is too terse for me to understand
> the issue and the way you propose to fix it.  Could you please provide
> more details,

‘[[:word:]]*\u2620’ compiles to:

    0:   /on_failure_keep_string_jump to 28
    3:   /charset [ !extends past end of pattern! $-%0-9A-Za-z]has-range-table
    25:  /jump to 3
    28:  /exactn/3/â// 
    33:  /succeed
    34:  end of pattern.

while ‘\sw*\u2620’ compiles to:

    0:   /on_failure_jump to 8
    3:   /syntaxspec/2
    5:   /jump to 0
    8:   /exactn/3/â// 
    13:  /succeed
    14:  end of pattern.

Apart from the different opcode used to match the word class, the crux
of the difference is the first opcode: on_failure_keep_string_jump
vs. on_failure_jump.

In fact, regex_compile emits an on_failure_jump_smart opcode at the
beginning, which re_match_2_internal then rewrites (debug code removed
for brevity):

        /* This operation is used for greedy *.
           Compare the beginning of the repeat with what in the
           pattern follows its end. If we can establish that there
           is nothing that they would both match, i.e., that we
           would have to backtrack because of (as in, e.g., `a*a')
           then we can use a non-backtracking loop based on
           on_failure_keep_string_jump instead of on_failure_jump.  */
        case on_failure_jump_smart:
          EXTRACT_NUMBER_AND_INCR (mcnt, p);
          {
            re_char *p1 = p; /* Next operation.  */
            /* Here, we discard `const', making re_match non-reentrant.  */
            unsigned char *p2 = (unsigned char*) p + mcnt; /* Jump dest.  */
            unsigned char *p3 = (unsigned char*) p - 3; /* opcode location.  */

            p -= 3;             /* Reset so that we will re-execute the
                                   instruction once it's been changed. */

            EXTRACT_NUMBER (mcnt, p2 - 2);

            /* Ensure this is a indeed the trivial kind of loop
               we are expecting.  */
            if (mutually_exclusive_p (bufp, p1, p2))
              {
                /* Use a fast `on_failure_keep_string_jump' loop.  */
                *p3 = (unsigned char) on_failure_keep_string_jump;
                STORE_NUMBER (p2 - 2, mcnt + 3);
              }
            else
              {
                /* Default to a safe `on_failure_jump' loop.  */
                *p3 = (unsigned char) on_failure_jump;
              }
          }

In other words, in our example, the code checks whether ‘[[:word:]]’
can match ‘💀’.  If it cannot, then we can be greedy about matching
‘[[:word:]]*’ and never backtrack looking for a shorter match; if it
can, we may need to backtrack if the overall match fails.

mutually_exclusive_p concludes that ‘[[:word:]]’ cannot match ‘💀’ (or
any non-ASCII character, really), but in fact the word class does match
the skull character.

So when ‘[[:word:]]*💀’ is matched against ‘foo💀’, the following
happens:
1. ‘[[:word:]]*’ matches the whole string.
2. The remaining input is now empty, so ‘💀’ doesn’t match.
3. Because of the incorrect assumption, the engine does not shorten the
   initial ‘[[:word:]]*’ match.

(I may be butchering the exact terms and the algorithm being applied,
but the general idea is, I hope, clear.)

> Note that some of the classes deliberately don't work on multibyte
> characters, and are documented as such.

This is irrelevant.  ‘[[:word:]]*’ matches ‘foo’, thus ‘[[:word:]]*b’
must match ‘foob’ (which it does) and ‘[[:word:]]*☠’ must match ‘foo☠’
(which it doesn’t).

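To make the three steps above concrete, here is a minimal sketch in C.
It is not code from regex.c: is_word and match_star_then_char are
hypothetical names, characters are modelled as plain code points, and
the only non-ASCII word character the model knows is U+2620.  The
BACKTRACK flag stands for the loop kind that on_failure_jump_smart
selects (false for on_failure_keep_string_jump, true for
on_failure_jump).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the [[:word:]] test: ASCII alphanumerics
   plus U+2620, which the thread says the word class matches.  */
static bool
is_word (unsigned c)
{
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
         || (c >= 'a' && c <= 'z') || c == 0x2620;
}

/* Try to match "[[:word:]]*" followed by the character TAIL at the
   start of TEXT (LEN code points).  With BACKTRACK false the star
   keeps everything it consumed, like the on_failure_keep_string_jump
   loop.  With BACKTRACK true it gives characters back one at a time,
   like the on_failure_jump loop.  */
static bool
match_star_then_char (const unsigned *text, size_t len,
                      unsigned tail, bool backtrack)
{
  assert (text != NULL || len == 0);

  size_t i = 0;
  while (i < len && is_word (text[i]))
    i++;                        /* Greedy: consume every word character.  */

  for (;;)
    {
      if (i < len && text[i] == tail)
        return true;            /* TAIL matched right after the star.  */
      if (!backtrack || i == 0)
        return false;           /* Nothing left to give back.  */
      i--;                      /* Shorten the star's match and retry.  */
    }
}
```

In this model, matching against the four code points f, o, o, U+2620
with TAIL set to U+2620 fails when BACKTRACK is false, because the
star has already swallowed the skull, and succeeds when BACKTRACK is
true.  That is the difference between the two compiled forms shown
earlier.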
> including what problems you saw in classes other than [:word:]?

The problem happens for any class which matches multibyte characters.
For example:

    (string-match-p "[[:alpha:]]*" "żółć")   => 0
    (string-match-p "[[:alpha:]]*w" "żółw")  => 0
    (string-match-p "[[:alpha:]]*ć" "żółć")  => nil (should be 0)
    ;; or even more simply:
    (string-match-p "[[:alpha:]]*ć" "ć")     => nil (should be 0)

In general, for a class FOO, if a multibyte character C matches that
class, the regex "[[:FOO:]]*C" should match the character itself, but
it doesn’t.

> So if we are changing that, there should be documentation changes and
> an entry in NEWS as well (but I suggest not to make such changes too
> easily, not without measuring the impact on performance, if any).
>
>> * src/regex.c (executing_charset): A new function for executing the
>> charset and charset_not opcodes.  It performs check on the character
>> taking into consideration existing bitmap, rang table and class bits.
>                                             ^^^^
> A typo.
>
>> +#ifdef emacs
>> +  else if (rtp)
>> +    {
>> +      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
>> +      re_wchar_t range_start, range_end;
>> +
>> +  /* Sort tests by the most commonly used classes with some adjustment to which
>> +     tests are easiest to perform.  Frequencies of character class names as of
>> +     2016-07-15:
>
> Not sure what files you used for this.  Are those Emacs source files?
>
>> diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
>> new file mode 100644
>> index 0000000..a2dd4f0
>> --- /dev/null
>> +++ b/test/src/regex-tests.el
>> @@ -0,0 +1,75 @@
>> +;;; buffer-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
>       ^^^^^^^^^^^^^^^
>
> Copy-paste error.
>
>> +;;; buffer-tests.el ends here
>
> And another one.

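Why the old check reached the wrong conclusion for every such class
can be sketched with a simplified model of a compiled charset.  This
is an illustration under stated assumptions, not the layout used by
regex.c: struct charset_model, CLASS_WORD, can_match_old and
can_match_new are hypothetical names, the bitmap covers ASCII only,
the charset_not case is left out, and U+2620 is the only non-ASCII
word character the model knows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { CLASS_WORD = 1 << 0 };   /* Hypothetical flag modelled on BIT_WORD.  */

struct range { unsigned start, end; };

/* Simplified, hypothetical model of a compiled charset opcode.  */
struct charset_model
{
  unsigned char bitmap[16];     /* One bit per ASCII code 0..127.  */
  const struct range *ranges;   /* Explicit multibyte ranges.  */
  size_t n_ranges;
  int class_bits;               /* Classes to test at match time.  */
};

static bool
is_word (unsigned c)
{
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z')
         || (c >= 'a' && c <= 'z') || c == 0x2620;
}

/* Model of "[[:word:]]": ASCII word characters go in the bitmap, all
   other characters are covered by the class bit alone.  */
static struct charset_model
make_word_charset (void)
{
  struct charset_model cs = { { 0 }, NULL, 0, CLASS_WORD };
  for (unsigned c = 0; c < 128; c++)
    if (is_word (c))
      cs.bitmap[c / 8] |= (unsigned char) (1u << (c % 8));
  return cs;
}

static bool
in_bitmap (const struct charset_model *cs, unsigned c)
{
  return (cs->bitmap[c / 8] & (1u << (c % 8))) != 0;
}

static bool
in_ranges (const struct charset_model *cs, unsigned c)
{
  for (size_t i = 0; i < cs->n_ranges; i++)
    if (cs->ranges[i].start <= c && c <= cs->ranges[i].end)
      return true;
  return false;
}

/* Pre-patch style check: bitmap and range table only.  */
static bool
can_match_old (const struct charset_model *cs, unsigned c)
{
  assert (cs != NULL);
  return c < 128 ? in_bitmap (cs, c) : in_ranges (cs, c);
}

/* Post-patch style check: class bits are consulted as well.  */
static bool
can_match_new (const struct charset_model *cs, unsigned c)
{
  assert (cs != NULL);
  if (c < 128)
    return in_bitmap (cs, c);
  if ((cs->class_bits & CLASS_WORD) && is_word (c))
    return true;
  return in_ranges (cs, c);
}
```

For the charset returned by make_word_charset, can_match_old reports
that U+2620 cannot match while can_match_new reports that it can.
Both agree on ASCII input such as 'b', which is why only patterns
ending in a multibyte character were affected.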
-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

From unknown Sun Jun 15 13:01:53 2025
Subject: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
From: Michal Nazarewicz
To: 24020@debbugs.gnu.org
Cc: eliz@gnu.org
Date: Tue, 19 Jul 2016 01:30:01 +0200
Message-Id: <1468884601-31164-1-git-send-email-mina86@mina86.com>
In-Reply-To: <83r3ar0z0u.fsf@gnu.org>
References: <83r3ar0z0u.fsf@gnu.org>

The regex engine tries to optimise
(greedy) Kleene star by avoiding backtracking when it can detect that
the portion of the expression after the star cannot match if the
repeated portion does match.

For example, take regular expression ‘[[:alpha:]]*1’ trying to match
a string ‘foo’.  Since the Kleene star is greedy, the engine will test
the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point
though, the remaining string is empty while there is still a literal
digit one in the pattern.  The engine will not, however, attempt to
backtrack and test a shorter match for the character class (namely
‘fo’, leaving ‘o’ in the string) because it knows that whatever is
left in the string cannot match the literal digit one.

In regexes of the form ‘[[:CC:]]*X’, the optimisation can be applied
if and only if the regex engine can prove that the character class CC
does not match the character X (as is the case with the alpha
character class not matching the digit 1).

In the code, the proof is performed by the mutually_exclusive_p
function.  However, it did not check the class bits of a charset
opcode, which resulted in it assuming that character classes cannot
match multibyte characters.  For example, it would assume that
[[:alpha:]] cannot match ‘ż’ even though ‘ż’ is indeed an alphabetic
character matching the alpha character class.

This assumption caused incorrect optimisation of the regular
expression and eventually the failure of ‘[[:alpha:]]*żółw’ to match
‘żółw’.

This issue affects any character class which matches multibyte
characters.

* src/regex.c (execute_charset): A new function for executing the
charset and charset_not opcodes.  It performs the check on the
character, taking into consideration the existing bitmap, range table
and class bits.  It also advances the pointer in the regex bytecode
past the parsed opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed.
Code now included in execute_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to take advantage
of the execute_charset function.
* test/src/regex-tests.el: New file with tests for the character class
matching.
---
 src/regex.c             | 209 +++++++++++++++++++++---------------------------
 test/src/regex-tests.el |  92 +++++++++++++++++++++
 2 files changed, 185 insertions(+), 116 deletions(-)
 create mode 100644 test/src/regex-tests.el

diff --git a/src/regex.c b/src/regex.c
index f92bcb7..297bf71 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source)
    and end.  */
 #define CHARSET_RANGE_TABLE_END(range_table, count) \
   ((range_table) + (count) * 2 * 3)
-
-/* Test if C is in RANGE_TABLE.  A flag NOT is negated if C is in.
-   COUNT is number of ranges in RANGE_TABLE.  */
-#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count) \
-  do \
-    { \
-      re_wchar_t range_start, range_end; \
-      re_char *rtp; \
-      re_char *range_table_end \
-        = CHARSET_RANGE_TABLE_END ((range_table), (count)); \
- \
-      for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3) \
-        { \
-          EXTRACT_CHARACTER (range_start, rtp); \
-          EXTRACT_CHARACTER (range_end, rtp + 3); \
- \
-          if (range_start <= (c) && (c) <= range_end) \
-            { \
-              (not) = !(not); \
-              break; \
-            } \
-        } \
-    } \
-  while (0)
-
-/* Test if C is in range table of CHARSET.  The flag NOT is negated if
-   C is listed in it.  */
-#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset) \
-  do \
-    { \
-      /* Number of ranges in range table. */ \
-      int count; \
-      re_char *range_table = CHARSET_RANGE_TABLE (charset); \
- \
-      EXTRACT_NUMBER_AND_INCR (count, range_table); \
-      CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count); \
-    } \
-  while (0)
 
 /* If DEBUG is defined, Regex prints many voluminous messages about
    what it is doing (if the variable `debug' is nonzero).  If linked with the
@@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend)
   return p;
 }
 
+/* Test if C matches charset op.  *PP points to the charset or charset_not
+   opcode.  When the function finishes, *PP will be advanced past that opcode.
+   C is character to test (possibly after translations) and CORIG is original
+   character (i.e. without any translations).  UNIBYTE denotes whether c is
+   unibyte or multibyte character. */
+static bool
+execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte)
+{
+  re_char *p = *pp, *rtp = NULL;
+  bool not = (re_opcode_t) *p == charset_not;
+
+  if (CHARSET_RANGE_TABLE_EXISTS_P (p))
+    {
+      int count;
+      rtp = CHARSET_RANGE_TABLE (p);
+      EXTRACT_NUMBER_AND_INCR (count, rtp);
+      *pp = CHARSET_RANGE_TABLE_END ((rtp), (count));
+    }
+  else
+    *pp += 2 + CHARSET_BITMAP_SIZE (p);
+
+  if (unibyte && c < (1 << BYTEWIDTH))
+    {                   /* Lookup bitmap.  */
+      /* Cast to `unsigned' instead of `unsigned char' in
+         case the bit list is a full 32 bytes long.  */
+      if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH)
+          && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
+        return !not;
+    }
+#ifdef emacs
+  else if (rtp)
+    {
+      int class_bits = CHARSET_RANGE_TABLE_BITS (p);
+      re_wchar_t range_start, range_end;
+
+  /* Sort tests by the most commonly used classes with some adjustment to which
+     tests are easiest to perform.  Frequencies of character class names used in
+     Emacs sources as of 2016-07-15:
+
+     $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + |
+           sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr
+         213 [:alnum:]
+         104 [:alpha:]
+          62 [:space:]
+          39 [:digit:]
+          36 [:blank:]
+          26 [:upper:]
+          24 [:word:]
+          21 [:lower:]
+          10 [:punct:]
+          10 [:ascii:]
+           9 [:xdigit:]
+           4 [:nonascii:]
+           4 [:graph:]
+           2 [:print:]
+           2 [:cntrl:]
+           1 [:ff:]
+   */
+
+      if ((class_bits & BIT_MULTIBYTE) ||
+          (class_bits & BIT_ALNUM && ISALNUM (c)) ||
+          (class_bits & BIT_ALPHA && ISALPHA (c)) ||
+          (class_bits & BIT_SPACE && ISSPACE (c)) ||
+          (class_bits & BIT_WORD  && ISWORD  (c)) ||
+          ((class_bits & BIT_UPPER) &&
+           (ISUPPER (c) || (corig != c &&
+                            c == downcase (corig) && ISLOWER (c)))) ||
+          ((class_bits & BIT_LOWER) &&
+           (ISLOWER (c) || (corig != c &&
+                            c == upcase (corig) && ISUPPER(c)))) ||
+          (class_bits & BIT_PUNCT && ISPUNCT (c)) ||
+          (class_bits & BIT_GRAPH && ISGRAPH (c)) ||
+          (class_bits & BIT_PRINT && ISPRINT (c)))
+        return !not;
+
+      for (p = *pp; rtp < p; rtp += 2 * 3)
+        {
+          EXTRACT_CHARACTER (range_start, rtp);
+          EXTRACT_CHARACTER (range_end, rtp + 3);
+          if (range_start <= c && c <= range_end)
+            return !not;
+        }
+    }
+#endif /* emacs */
+  return not;
+}
+
 /* Non-zero if "p1 matches something" implies "p2 fails".  */
 static int
 mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
@@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
         else if ((re_opcode_t) *p1 == charset
                  || (re_opcode_t) *p1 == charset_not)
           {
-            int not = (re_opcode_t) *p1 == charset_not;
-
-            /* Test if C is listed in charset (or charset_not)
-               at `p1'.  */
-            if (! multibyte || IS_REAL_ASCII (c))
-              {
-                if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH
-                    && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-                  not = !not;
-              }
-            else if (CHARSET_RANGE_TABLE_EXISTS_P (p1))
-              CHARSET_LOOKUP_RANGE_TABLE (not, c, p1);
-
-            /* `not' is equal to 1 if c would match, which means
-               that we can't change to pop_failure_jump.  */
-            if (!not)
+            if (!execute_charset (&p1, c, c, !multibyte))
               {
                 DEBUG_PRINT ("   No match => fast loop.\n");
                 return 1;
@@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
         case charset_not:
           {
             register unsigned int c, corig;
-            boolean not = (re_opcode_t) *(p - 1) == charset_not;
             int len;
 
-            /* Start of actual range_table, or end of bitmap if there is no
-               range table.  */
-            re_char *range_table UNINIT;
-
-            /* Nonzero if there is a range table.  */
-            int range_table_exists;
-
-            /* Number of ranges of range table.  This is not included
-               in the initial byte-length of the command.  */
-            int count = 0;
-
             /* Whether matching against a unibyte character.  */
             boolean unibyte_char = false;
 
-            DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : "");
-
-            range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]);
-
-            if (range_table_exists)
-              {
-                range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap.  */
-                EXTRACT_NUMBER_AND_INCR (count, range_table);
-              }
+            DEBUG_PRINT ("EXECUTING charset%s.\n",
+                         (re_opcode_t) *(p - 1) == charset_not ? "_not" : "");
 
             PREFETCH ();
             corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
@@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
                 unibyte_char = true;
               }
 
-            if (unibyte_char && c < (1 << BYTEWIDTH))
-              {                 /* Lookup bitmap.  */
-                /* Cast to `unsigned' instead of `unsigned char' in
-                   case the bit list is a full 32 bytes long.  */
-                if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH)
-                    && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH)))
-                  not = !not;
-              }
-#ifdef emacs
-            else if (range_table_exists)
-              {
-                int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
-
-                if (  (class_bits & BIT_LOWER
-                       && (ISLOWER (c)
-                           || (corig != c
-                               && c == upcase (corig) && ISUPPER(c))))
-                    | (class_bits & BIT_MULTIBYTE)
-                    | (class_bits & BIT_PUNCT && ISPUNCT (c))
-                    | (class_bits & BIT_SPACE && ISSPACE (c))
-                    | (class_bits & BIT_UPPER
-                       && (ISUPPER (c)
-                           || (corig != c
-                               && c == downcase (corig) && ISLOWER (c))))
-                    | (class_bits & BIT_WORD && ISWORD (c))
-                    | (class_bits & BIT_ALPHA && ISALPHA (c))
-                    | (class_bits & BIT_ALNUM && ISALNUM (c))
-                    | (class_bits & BIT_GRAPH && ISGRAPH (c))
-                    | (class_bits & BIT_PRINT && ISPRINT (c)))
-                  not = !not;
-                else
-                  CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count);
-              }
-#endif /* emacs */
-
-            if (range_table_exists)
-              p = CHARSET_RANGE_TABLE_END (range_table, count);
-            else
-              p += CHARSET_BITMAP_SIZE (&p[-1]) + 1;
-
-            if (!not) goto fail;
+            p -= 1;
+            if (!execute_charset (&p, c, corig, unibyte_char))
+              goto fail;
 
             d += len;
           }
diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el
new file mode 100644
index 0000000..00165ab
--- /dev/null
+++ b/test/src/regex-tests.el
@@ -0,0 +1,92 @@
+;;; regex-tests.el --- tests for regex.c functions -*- lexical-binding: t -*-
+
+;; Copyright (C) 2015-2016 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see .
+
+;;; Code:
+
+(require 'ert)
+
+(ert-deftest regex-word-cc-fallback-test ()
+  "Test that ‘[[:cc:]]*x’ matches ‘x’ (bug#24020).
+
+Test that a regex of the form \"[[:cc:]]*x\" where CC is
+a character class which matches a multibyte character X, matches
+string \"x\".
+
+For example, ‘[[:word:]]*\u2620’ regex (note: \u2620 is a word
+character) must match a string \"\u2620\"."
+  (dolist (class '("[[:word:]]" "\\sw"))
+    (dolist (repeat '("*" "+"))
+      (dolist (suffix '("" "b" "bar" "\u2620"))
+        (dolist (string '("" "foo"))
+          (when (not (and (string-equal repeat "+")
+                          (string-equal string "")))
+            (should (string-match (concat "^" class repeat suffix "$")
+                                  (concat string suffix)))))))))
+
+(defun regex--test-cc (name matching not-matching)
+  (should (string-match-p (concat "^[[:" name ":]]*$") matching))
+  (should (string-match-p (concat "^[[:" name ":]]*?\u2622$")
+                          (concat matching "\u2622")))
+  (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching))
+  (should (string-match-p (concat "^[^[:" name ":]]*\u2622$")
+                          (concat not-matching "\u2622")))
+  (with-temp-buffer
+    (insert matching)
+    (let ((p (point)))
+      (insert not-matching)
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]"))
+      (should (equal (point) p))
+      (skip-chars-forward (concat "^[:" name ":]"))
+      (should (equal (point) (point-max)))
+      (goto-char (point-min))
+      (skip-chars-forward (concat "[:" name ":]\u2622"))
+      (should (or (equal (point) p) (equal (point) (1+ p)))))))
+
+(ert-deftest regex-character-classes ()
+  "Perform sanity test of regexes using character classes.
+
+Go over all the supported character classes and test whether the
+classes and their inversions match what they are supposed to
+match.  The test is done using `string-match-p' as well as
+`skip-chars-forward'."
+  (let (case-fold-search)
+    (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n")
+    (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n")
+    (regex--test-cc "digit" "012" "abcABCłąka-, \t\n")
+    (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n")
+    (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n")
+    (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n")
+
+    (regex--test-cc "word" "abcABC012\u2620" "-, \t\n")
+
+    (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n")
+    (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ")
+    (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1")
+    (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1")
+
+    (regex--test-cc "space" " \t\n\u2001" "abcABCł0123")
+    (regex--test-cc "blank" " \t" "\n\u2001")
+
+    (regex--test-cc "ascii" "abcABC012 \t\n\1" "łą\u2620")
+    (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1")
+    (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622")
+    (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1")))
+
+;;; regex-tests.el ends here
-- 
2.8.0.rc3.226.g39d4020

From unknown Sun Jun 15 13:01:53 2025
Subject: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
From: Andreas Schwab
To: Michal Nazarewicz
Cc: 24020@debbugs.gnu.org
Date: Tue, 19 Jul 2016 10:00:46 +0200
In-Reply-To: <1468884601-31164-1-git-send-email-mina86@mina86.com> (Michal Nazarewicz's message of "Tue, 19 Jul 2016 01:30:01 +0200")
References: <83r3ar0z0u.fsf@gnu.org> <1468884601-31164-1-git-send-email-mina86@mina86.com>

Michal Nazarewicz writes:

> For example, take regular expression ‘[[:alpha:]]*1’ trying to match
> a string ‘foo’.  Since the Kleene star is greedy, the engine will test
> the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point

Did you mean "the longest match"?

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab@suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

From unknown Sun Jun 15 13:01:53 2025
Subject: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020)
From: Michal Nazarewicz
To: Andreas Schwab
Cc: 24020@debbugs.gnu.org
Date: Wed, 20 Jul 2016 14:36:39 +0200
References: <83r3ar0z0u.fsf@gnu.org> <1468884601-31164-1-git-send-email-mina86@mina86.com>

> Michal Nazarewicz writes:
>> For example, take regular expression ‘[[:alpha:]]*1’ trying to match
>> a string ‘foo’.  Since the Kleene star is greedy, the engine will test
>> the shortest match for ‘[[:alpha:]]*’ which is ‘foo’.  At this point

On Tue, Jul 19 2016, Andreas Schwab wrote:
> Did you mean "the longest match"?

Yep, thanks for spotting this.

-- 
Best regards
ミハウ “𝓶𝓲𝓷𝓪86” ナザレヴイツ
«If at first you don’t succeed, give up skydiving»

From unknown Sun Jun 15 13:01:53 2025
MIME-Version: 1.0
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Michal Nazarewicz
Subject: bug#24020: closed (Re: bug#24020: [PATCHv2] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’ (bug#24020))
References: <1468850684-17867-1-git-send-email-mina86@mina86.com>
Reply-To: 24020@debbugs.gnu.org
Date: Mon, 25 Jul 2016 21:55:01 +0000
Content-Type: multipart/mixed; boundary="----------=_1469483701-13937-1"

This is a multi-part message in MIME format...

------------=_1469483701-13937-1
Content-Disposition: inline
Content-Type: text/plain; charset="utf-8"

Your bug report

#24020: [PATCH] Fix ‘[[:word:]]*\u2620’ failing to match ‘foo\u2620’

which was filed against the emacs package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 24020@debbugs.gnu.org.
--=20 24020: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D24020 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1469483701-13937-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 24020-done) by debbugs.gnu.org; 25 Jul 2016 21:54:19 +0000 Received: from localhost ([127.0.0.1]:37261 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bRnpK-0003be-Tu for submit@debbugs.gnu.org; Mon, 25 Jul 2016 17:54:19 -0400 Received: from mail-wm0-f50.google.com ([74.125.82.50]:35970) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bRnpJ-0003bN-Do for 24020-done@debbugs.gnu.org; Mon, 25 Jul 2016 17:54:17 -0400 Received: by mail-wm0-f50.google.com with SMTP id q128so150698147wma.1 for <24020-done@debbugs.gnu.org>; Mon, 25 Jul 2016 14:54:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:in-reply-to:organization:references :user-agent:face:date:message-id:mime-version :content-transfer-encoding; bh=nfoQon4tyv45eVCgXz/JX5++bl/EJ+BPN0P+OMeH+rU=; b=WMvLG3LOQZLxw7NYrHoj4R2dXmTnEvNND1mYz3ArbBZ8hRErqcX2feDBfFDoD4nPKl 2XUSkSEGuBCqbCarEkB5uQ303kyftZqcXry/uLYjY5ct7IY3N8KeRBBSTvpd2I0HY79o 1JC7z/hfadViMiyx3gE+BGGFmithcahwxEuCx367zpz93yZLLqhke6LcGjGbmuGXwuJz LuO2vnsiNAGOQeyEWBbb8jhTCdEjxe+4mAh/6RCm4ojzTI17Xg5SBhSA+Dj1u9cYItUz mpC/yym1ri2yLhnAgCf/lolDKVsCzbxQ0dnb3S0f70D9DLZO2A43oCS8Avdsw1rZ8wY+ 87Fw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:in-reply-to:organization :references:user-agent:face:date:message-id:mime-version :content-transfer-encoding; bh=nfoQon4tyv45eVCgXz/JX5++bl/EJ+BPN0P+OMeH+rU=; b=g3SotD3p+ho/6glhO0QeshlVUxlVp8Ahd8WvlAQUCA61m0b+UV4jwg82+pO5jIQimd Q/jpzsrHaMzoHch0YNpVNLeT0mhyTiUlNeHOSkfbHKH1sPwzHKFpnopJdlex5AuaToOj iVug9oA+V260ZxTmXvj9Qdhe0+t7/gR5aiHwSVLlnFG4QL8SBxoozCvQHRokUuCOvZLB 
fnIeD6EvXXq6+T4GpzKvGilMZd65RsOuucs9DXxy7qCBUE04S+fgsaj2Gfk+wlDMJi4b 6su/xKBdbeWwpPy3jzNc3TqDT2rIU9Jj9eCKMWIO23kDqRkWi95fggpt/8V9fr7tjQHF 9ztg== X-Gm-Message-State: AEkoouv1TOB+3mZ6VUS5717mgDChOx1Dic7HlapcQ3wZUsPPs6jgRQS3pY3Uxl+l0/fYhld3 X-Received: by 10.28.232.149 with SMTP id f21mr20888303wmi.51.1469483651178; Mon, 25 Jul 2016 14:54:11 -0700 (PDT) Received: from mpn-glaptop ([172.28.88.8]) by smtp.gmail.com with ESMTPSA id hb8sm17734018wjd.13.2016.07.25.14.54.09 for <24020-done@debbugs.gnu.org> (version=TLS1_2 cipher=AES128-SHA bits=128/128); Mon, 25 Jul 2016 14:54:09 -0700 (PDT) From: Michal Nazarewicz To: 24020-done@debbugs.gnu.org Subject: Re: bug#24020: [PATCHv2] Fix =?utf-8?Q?=E2=80=98=5B=5B=3Aword=3A?= =?utf-8?Q?=5D=5D*=5Cu2620=E2=80=99?= failing to match =?utf-8?B?4oCYZm9v?= =?utf-8?B?XHUyNjIw4oCZ?= (bug#24020) In-Reply-To: Organization: http://mina86.com/ References: <83r3ar0z0u.fsf@gnu.org> <1468884601-31164-1-git-send-email-mina86@mina86.com> User-Agent: Notmuch/0.19+53~g2e63a09 (http://notmuchmail.org) Emacs/25.1.50.1 (x86_64-unknown-linux-gnu) Face: iVBORw0KGgoAAAANSUhEUgAAADAAAAAwBAMAAAClLOS0AAAAJFBMVEWbfGlUPDDHgE57V0jUupKjgIObY0PLrom9mH4dFRK4gmjPs41MxjOgAAACP0lEQVQ4T23Sv2vbQBQHcBk1xE6WyALX107VUEgmn6+ouUwpEQQ6uRjttkWP4CkBg2M0BQLBdPFZYPsyFYo7qEtKDQ7on+t7+nF2Ux8ahD587717OmNYrOvycHsZ+o2r051wHTHysAvGb8ygvgu4QWT0sCmkgZCIEnlV2X8BtyraazFGDuxhmKSQJMlwHQ7v5MHSNxmz78rfElwAa3ieVD9e+hBhjaPDDG6NgFo2f4wBMNIo5YmRtF0RyDgFjJjlMIWbnuM4x9MMfABGTlN4qgIQB4A1DEyA1BHWtfeWNUMwiVJKoqh97KrkOO+qzgluVYLvFCUKAX73nONeBr7BGMdM6Sg0kuep03VywLaIzRiVr+GAzKlpQIsAFnWAG2e6DT5WmWDiudZMIc6hYrMOmeMQK9WX0B+/RfjzL9DI7Y9/Iayn29Ci0r2i4f9gMimMSZLCDMalgQGU5hnUtqAN0OGvEmO1Wnl0C0wWSCEHnuHBqmygxdxA8oWXwbipoc1EoNR9DqOpBpOJrnr0criQab9ZT4LL+wI+K7GBQH30CrhUruilgP9DRTrhVWZCiAyILP+wiuLeCKGTD6r/nc8LOJcAwR6IBTUs+7CASw3QFZ0MdA2PI3zNziH4ZKVhXCRMBjeZ1DWMekKwDCASwExy+NQ86TaykaDAFHO4aP48y4fIcDM5yOG8GcTLbOyp8A8azjJI93JFd1EA6yN8sSxMQJWoABqniRZVykYgRXErzrdqExAoUrRb0xfRp8p2A/4XmfilTtkDZ4cAAAAASUVORK5CYII= X-Face: 
-TR8(rDTHy/(xl?SfWd1|3:TTgDIatE^t'vop%*gVg[kn$t{EpK(P"VQ=~T2#ysNmJKN$"yTRLB4YQs$4{[.]Fc1)*O]3+XO^oXM>Q#b^ix, O)Zbn)q[y06$`e3?C)`CwR9y5riE=fv^X@x$y?D:XO6L&x4f-}}I4=VRNwiA^t1-ZrVK^07.Pi/57c_du'& X-PGP: 50751FF4 X-PGP-FP: AC1F 5F5C D418 88F8 CC84 5858 2060 4012 5075 1FF4 X-Hashcash: 1:20:160725:24020@debbugs.gnu.org::C5u8KmAD16e9Mhu8:00000000000000000000000000000000000000006mFc X-Hashcash: 1:20:160725:schwab@suse.de::6Py1v2asPFOqCzun:0005Ic/ Date: Mon, 25 Jul 2016 23:54:08 +0200 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: base64 X-Spam-Score: -2.0 (--) X-Debbugs-Envelope-To: 24020-done X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.0 (--) UHVzaGVkLg0KDQotLSANCkJlc3QgcmVnYXJkcw0K44Of44OP44KmIOKAnPCdk7bwnZOy8J2Tt/Cd k6o4NuKAnSDjg4rjgrbjg6zjg7TjgqTjg4QNCsKrSWYgYXQgZmlyc3QgeW91IGRvbuKAmXQgc3Vj Y2VlZCwgZ2l2ZSB1cCBza3lkaXZpbmfCuw0K ------------=_1469483701-13937-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 18 Jul 2016 14:05:32 +0000 Received: from localhost ([127.0.0.1]:55852 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bP9Al-0007JZ-MS for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:31 -0400 Received: from eggs.gnu.org ([208.118.235.92]:52871) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bP9Ah-0007JI-1G for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:26 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bP9AZ-00086X-S8 for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:17 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 
tests=BAYES_50,T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:46365) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9AZ-000867-PD for submit@debbugs.gnu.org; Mon, 18 Jul 2016 10:05:15 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:43768) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9AT-0001tv-3v for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:05:15 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bP9AN-0007y8-SZ for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:05:08 -0400 Received: from mail-wm0-x235.google.com ([2a00:1450:400c:c09::235]:37997) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bP9AN-0007xv-FK for bug-gnu-emacs@gnu.org; Mon, 18 Jul 2016 10:05:03 -0400 Received: by mail-wm0-x235.google.com with SMTP id o80so118557409wme.1 for ; Mon, 18 Jul 2016 07:05:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:date:message-id:mime-version :content-transfer-encoding; bh=gj8lQHRmO27VzkOaKA84h29qN91LttkfA7ZPPHSnzMQ=; b=FZHaLYAD1k1xL/0wz2gDRMSYnhbXV4uNFjOYq0wD0J0KjNvx8ANQQlCtV2sR8HrYN9 EQm8YYFmvpGmUh06CTsjmkyDDNOJKp8Wh33ppSewZWttWU43H0rVSLbPP05EevZ8rp2K qUYuQLM4DRBFe0oFF8IxB4K7fTlVbXh95EM7wqdTV+2ErK2L68sw8b9/Hso7OFeSwTmL r5pwzu8KwJPm7zLh9GNCOm8AWqesZy0snIrgyvsG1Vx37WUj5iIzx21n3Mz4XkPEJ+4V rPpdgymnApPT6We8dIOW6qvhKZPBNTDSaKrM94CuQfRBPkLUp1DJc3U3uovebnu7Shbc evCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:date:message-id :mime-version:content-transfer-encoding; bh=gj8lQHRmO27VzkOaKA84h29qN91LttkfA7ZPPHSnzMQ=; b=gJJm2EUQ96Mr/6F05JGpXzhJfoda/EKWXdxDpXmxS7c7bZ7RWXnKRbRzT5qmGTbv8r gHAkp1ANwgw9v5r0I8k1x+M1oBo0+GOOyKf1tD6DDScpdFkRI8TsU3rEnqMIAI6fNseP FDNK0bjEKvWLeVR0D0xaPdjQ9VFUnObG1bNXpazF7WBFmtCb9OLPzjdBXIfaJQ3Wa+2X 
3wBn7L5sKDZ4oWVzswIdKwUx2HlF9U+ipFxRG2TDyzq4peUSIjofqqtX/Xk8N/kgKzhw T27chptcr3eNhhkd4gZA9BFcSDsG/GNGQqtOG8+zt4P/08VNOyzV1PT5b4AZ9/a1XGU5 jCCA== X-Gm-Message-State: ALyK8tJJbC3Pcrblo2cfC9GnaMuSEWw5aFLcEMWZBnLMPgmWHjVyPn1rL+MgMyuIjZdJN2fj X-Received: by 10.194.222.230 with SMTP id qp6mr1419132wjc.102.1468850701839; Mon, 18 Jul 2016 07:05:01 -0700 (PDT) Received: from mpn.zrh.corp.google.com ([172.16.113.135]) by smtp.gmail.com with ESMTPSA id z5sm16880107wme.5.2016.07.18.07.04.59 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 18 Jul 2016 07:05:00 -0700 (PDT) Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id 2CA881E0270; Mon, 18 Jul 2016 16:04:59 +0200 (CEST) From: Michal Nazarewicz To: bug-gnu-emacs@gnu.org Subject: [PATCH] =?UTF-8?q?Fix=20=E2=80=98[[:word:]]*\u2620=E2=80=99=20fai?= =?UTF-8?q?ling=20to=20match=20=E2=80=98foo\u2620=E2=80=99?= Date: Mon, 18 Jul 2016 16:04:44 +0200 Message-Id: <1468850684-17867-1-git-send-email-mina86@mina86.com> X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----)

mutually_exclusive_p did not check the class bits of a charset opcode when comparing it with an exactn, so it concluded that a multibyte character could not match the character class. This wrong conclusion caused an incorrect optimisation of the regular expression and, ultimately, the failure of ‘[[:word:]]*\u2620’ to match ‘foo\u2620’.
The issue affects all multibyte word characters as well as other character classes which may match multibyte characters.

* src/regex.c (execute_charset): New function for executing the charset and charset_not opcodes. It checks the character against the bitmap, the range table and the class bits, and advances the pointer in the regex bytecode past the parsed opcode.
(CHARSET_LOOKUP_RANGE_TABLE_RAW, CHARSET_LOOKUP_RANGE_TABLE): Removed. Their code is now part of execute_charset.
(mutually_exclusive_p, re_match_2_internal): Changed to use execute_charset.
* test/src/regex-tests.el: New file with tests for character class matching.
---
Unless there are objections I’ll push it within a week or so.

 src/regex.c             | 209 +++++++++++++++++++++---------------------------
 test/src/regex-tests.el |  75 +++++++++++++++++
 2 files changed, 168 insertions(+), 116 deletions(-)
 create mode 100644 test/src/regex-tests.el

diff --git a/src/regex.c b/src/regex.c index f92bcb7..9f999a7 100644 --- a/src/regex.c +++ b/src/regex.c @@ -783,44 +783,6 @@ extract_number_and_incr (re_char **source) and end. */ #define CHARSET_RANGE_TABLE_END(range_table, count) \ ((range_table) + (count) * 2 * 3) - -/* Test if C is in RANGE_TABLE. A flag NOT is negated if C is in. - COUNT is number of ranges in RANGE_TABLE. */ -#define CHARSET_LOOKUP_RANGE_TABLE_RAW(not, c, range_table, count) \ - do \ - { \ - re_wchar_t range_start, range_end; \ - re_char *rtp; \ - re_char *range_table_end \ - = CHARSET_RANGE_TABLE_END ((range_table), (count)); \ - \ - for (rtp = (range_table); rtp < range_table_end; rtp += 2 * 3) \ - { \ - EXTRACT_CHARACTER (range_start, rtp); \ - EXTRACT_CHARACTER (range_end, rtp + 3); \ - \ - if (range_start <= (c) && (c) <= range_end) \ - { \ - (not) = !(not); \ - break; \ - } \ - } \ - } \ - while (0) - -/* Test if C is in range table of CHARSET. The flag NOT is negated if - C is listed in it. 
*/ -#define CHARSET_LOOKUP_RANGE_TABLE(not, c, charset) \ - do \ - { \ - /* Number of ranges in range table. */ \ - int count; \ - re_char *range_table = CHARSET_RANGE_TABLE (charset); \ - \ - EXTRACT_NUMBER_AND_INCR (count, range_table); \ - CHARSET_LOOKUP_RANGE_TABLE_RAW ((not), (c), range_table, count); \ - } \ - while (0) /* If DEBUG is defined, Regex prints many voluminous messages about what it is doing (if the variable `debug' is nonzero). If linked with the @@ -4661,6 +4623,93 @@ skip_noops (const_re_char *p, const_re_char *pend) return p; } +/* Test if C matches charset op. *PP points to the charset or charset_not + opcode. When the function finishes, *PP will be advanced past that opcode. + C is the character to test (possibly after translations) and CORIG is the + original character (i.e. without any translations). UNIBYTE denotes whether + C is a unibyte or multibyte character. */ +static bool +execute_charset (const_re_char **pp, unsigned c, unsigned corig, bool unibyte) +{ + re_char *p = *pp, *rtp = NULL; + bool not = (re_opcode_t) *p == charset_not; + + if (CHARSET_RANGE_TABLE_EXISTS_P (p)) + { + int count; + rtp = CHARSET_RANGE_TABLE (p); + EXTRACT_NUMBER_AND_INCR (count, rtp); + *pp = CHARSET_RANGE_TABLE_END ((rtp), (count)); + } + else + *pp += 2 + CHARSET_BITMAP_SIZE (p); + + if (unibyte && c < (1 << BYTEWIDTH)) + { /* Lookup bitmap. */ + /* Cast to `unsigned' instead of `unsigned char' in + case the bit list is a full 32 bytes long. */ + if (c < (unsigned) (CHARSET_BITMAP_SIZE (p) * BYTEWIDTH) + && p[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + return !not; + } +#ifdef emacs + else if (rtp) + { + int class_bits = CHARSET_RANGE_TABLE_BITS (p); + re_wchar_t range_start, range_end; + + /* Sort tests by the most commonly used classes with some adjustment to which + tests are easiest to perform. 
Frequencies of character class names as of + 2016-07-15: + + $ find \( -name \*.c -o -name \*.el \) -exec grep -h '\[:[a-z]*:]' {} + | + sed 's/]/]\n/g' |grep -o '\[:[a-z]*:]' |sort |uniq -c |sort -nr + 213 [:alnum:] + 104 [:alpha:] + 62 [:space:] + 39 [:digit:] + 36 [:blank:] + 26 [:upper:] + 24 [:word:] + 21 [:lower:] + 10 [:punct:] + 10 [:ascii:] + 9 [:xdigit:] + 4 [:nonascii:] + 4 [:graph:] + 2 [:print:] + 2 [:cntrl:] + 1 [:ff:] + */ + + if ((class_bits & BIT_MULTIBYTE) || + (class_bits & BIT_ALNUM && ISALNUM (c)) || + (class_bits & BIT_ALPHA && ISALPHA (c)) || + (class_bits & BIT_SPACE && ISSPACE (c)) || + (class_bits & BIT_WORD && ISWORD (c)) || + ((class_bits & BIT_UPPER) && + (ISUPPER (c) || (corig != c && + c == downcase (corig) && ISLOWER (c)))) || + ((class_bits & BIT_LOWER) && + (ISLOWER (c) || (corig != c && + c == upcase (corig) && ISUPPER(c)))) || + (class_bits & BIT_PUNCT && ISPUNCT (c)) || + (class_bits & BIT_GRAPH && ISGRAPH (c)) || + (class_bits & BIT_PRINT && ISPRINT (c))) + return !not; + + for (p = *pp; rtp < p; rtp += 2 * 3) + { + EXTRACT_CHARACTER (range_start, rtp); + EXTRACT_CHARACTER (range_end, rtp + 3); + if (range_start <= c && c <= range_end) + return !not; + } + } +#endif /* emacs */ + return not; +} + /* Non-zero if "p1 matches something" implies "p2 fails". */ static int mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, @@ -4718,22 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1, else if ((re_opcode_t) *p1 == charset || (re_opcode_t) *p1 == charset_not) { - int not = (re_opcode_t) *p1 == charset_not; - - /* Test if C is listed in charset (or charset_not) - at `p1'. */ - if (! 
multibyte || IS_REAL_ASCII (c)) - { - if (c < CHARSET_BITMAP_SIZE (p1) * BYTEWIDTH - && p1[2 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } - else if (CHARSET_RANGE_TABLE_EXISTS_P (p1)) - CHARSET_LOOKUP_RANGE_TABLE (not, c, p1); - - /* `not' is equal to 1 if c would match, which means - that we can't change to pop_failure_jump. */ - if (!not) + if (!execute_charset (&p1, c, c, !multibyte)) { DEBUG_PRINT (" No match => fast loop.\n"); return 1; @@ -5439,32 +5473,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, case charset_not: { register unsigned int c, corig; - boolean not = (re_opcode_t) *(p - 1) == charset_not; int len; - /* Start of actual range_table, or end of bitmap if there is no - range table. */ - re_char *range_table UNINIT; - - /* Nonzero if there is a range table. */ - int range_table_exists; - - /* Number of ranges of range table. This is not included - in the initial byte-length of the command. */ - int count = 0; - /* Whether matching against a unibyte character. */ boolean unibyte_char = false; - DEBUG_PRINT ("EXECUTING charset%s.\n", not ? "_not" : ""); - - range_table_exists = CHARSET_RANGE_TABLE_EXISTS_P (&p[-1]); - - if (range_table_exists) - { - range_table = CHARSET_RANGE_TABLE (&p[-1]); /* Past the bitmap. */ - EXTRACT_NUMBER_AND_INCR (count, range_table); - } + DEBUG_PRINT ("EXECUTING charset%s.\n", + (re_opcode_t) *(p - 1) == charset_not ? "_not" : ""); PREFETCH (); corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte); @@ -5498,47 +5513,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1, unibyte_char = true; } - if (unibyte_char && c < (1 << BYTEWIDTH)) - { /* Lookup bitmap. */ - /* Cast to `unsigned' instead of `unsigned char' in - case the bit list is a full 32 bytes long. 
*/ - if (c < (unsigned) (CHARSET_BITMAP_SIZE (&p[-1]) * BYTEWIDTH) - && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) - not = !not; - } -#ifdef emacs - else if (range_table_exists) - { - int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]); - - if ( (class_bits & BIT_LOWER - && (ISLOWER (c) - || (corig != c - && c == upcase (corig) && ISUPPER(c)))) - | (class_bits & BIT_MULTIBYTE) - | (class_bits & BIT_PUNCT && ISPUNCT (c)) - | (class_bits & BIT_SPACE && ISSPACE (c)) - | (class_bits & BIT_UPPER - && (ISUPPER (c) - || (corig != c - && c == downcase (corig) && ISLOWER (c)))) - | (class_bits & BIT_WORD && ISWORD (c)) - | (class_bits & BIT_ALPHA && ISALPHA (c)) - | (class_bits & BIT_ALNUM && ISALNUM (c)) - | (class_bits & BIT_GRAPH && ISGRAPH (c)) - | (class_bits & BIT_PRINT && ISPRINT (c))) - not = !not; - else - CHARSET_LOOKUP_RANGE_TABLE_RAW (not, c, range_table, count); - } -#endif /* emacs */ - - if (range_table_exists) - p = CHARSET_RANGE_TABLE_END (range_table, count); - else - p += CHARSET_BITMAP_SIZE (&p[-1]) + 1; - - if (!not) goto fail; + p -= 1; + if (!execute_charset (&p, c, corig, unibyte_char)) + goto fail; d += len; } diff --git a/test/src/regex-tests.el b/test/src/regex-tests.el new file mode 100644 index 0000000..a2dd4f0 --- /dev/null +++ b/test/src/regex-tests.el @@ -0,0 +1,75 @@ +;;; regex-tests.el --- tests for regex.c functions -*- lexical-binding: t -*- + +;; Copyright (C) 2015-2016 Free Software Foundation, Inc. + +;; This file is part of GNU Emacs. + +;; GNU Emacs is free software: you can redistribute it and/or modify +;; it under the terms of the GNU General Public License as published by +;; the Free Software Foundation, either version 3 of the License, or +;; (at your option) any later version. + +;; GNU Emacs is distributed in the hope that it will be useful, +;; but WITHOUT ANY WARRANTY; without even the implied warranty of +;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +;; GNU General Public License for more details. 
+ +;; You should have received a copy of the GNU General Public License +;; along with GNU Emacs. If not, see . + +;;; Code: + +(require 'ert) + +(ert-deftest regex-word-cc-fallback-test () + (dolist (class '("[[:word:]]" "\\sw")) + (dolist (repeat '("*" "+")) + (dolist (suffix '("" "b" "bar" "\u2620")) + (should (string-match (concat "^" class repeat suffix "$") + (concat "foo" suffix))))))) + +(defun regex--test-cc (name matching not-matching) + (should (string-match-p (concat "^[[:" name ":]]*$") matching)) + (should (string-match-p (concat "^[[:" name ":]]*?\u2622$") + (concat matching "\u2622"))) + (should (string-match-p (concat "^[^[:" name ":]]*$") not-matching)) + (should (string-match-p (concat "^[^[:" name ":]]*\u2622$") + (concat not-matching "\u2622"))) + (with-temp-buffer + (insert matching) + (let ((p (point))) + (insert not-matching) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]")) + (should (equal (point) p)) + (skip-chars-forward (concat "^[:" name ":]")) + (should (equal (point) (point-max))) + (goto-char (point-min)) + (skip-chars-forward (concat "[:" name ":]\u2622")) + (should (or (equal (point) p) (equal (point) (1+ p))))))) + +(ert-deftest regex-character-classes () + (let (case-fold-search) + (regex--test-cc "alnum" "abcABC012łąka" "-, \t\n") + (regex--test-cc "alpha" "abcABCłąka" "-,012 \t\n") + (regex--test-cc "digit" "012" "abcABCłąka-, \t\n") + (regex--test-cc "xdigit" "0123aBc" "łąk-, \t\n") + (regex--test-cc "upper" "ABCŁĄKA" "abc012-, \t\n") + (regex--test-cc "lower" "abcłąka" "ABC012-, \t\n") + + (regex--test-cc "word" "abcABC012\u2620" "-, \t\n") + + (regex--test-cc "punct" ".,-" "abcABC012\u2620 \t\n") + (regex--test-cc "cntrl" "\1\2\t\n" ".,-abcABC012\u2620 ") + (regex--test-cc "graph" "abcłąka\u2620-," " \t\n\1") + (regex--test-cc "print" "abcłąka\u2620-, " "\t\n\1") + + (regex--test-cc "space" " \t\n\u2001" "abcABCł0123") + (regex--test-cc "blank" " \t" "\n\u2001") + + (regex--test-cc "ascii" "abcABC012 
\t\n\1" "łą\u2620") + (regex--test-cc "nonascii" "łą\u2622" "abcABC012 \t\n\1") + (regex--test-cc "unibyte" "abcABC012 \t\n\1" "łą\u2622") + (regex--test-cc "multibyte" "łą\u2622" "abcABC012 \t\n\1"))) + +;;; regex-tests.el ends here -- 2.8.0.rc3.226.g39d4020 ------------=_1469483701-13937-1-- From unknown Sun Jun 15 13:01:53 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24020: [PATCH] Fix =?UTF-8?Q?=E2=80=98is_?= =?UTF-8?Q?multibyte=E2=80=99?= test =?UTF-8?Q?regex.c=E2=80=99s?= mutually_exclusive_p (bug#24020) References: <1468850684-17867-1-git-send-email-mina86@mina86.com> In-Reply-To: <1468850684-17867-1-git-send-email-mina86@mina86.com> Resent-From: Michal Nazarewicz Original-Sender: "Debbugs-submit" Resent-CC: bug-gnu-emacs@gnu.org Resent-Date: Wed, 27 Jul 2016 16:23:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24020 X-GNU-PR-Package: emacs X-GNU-PR-Keywords: patch To: 24020@debbugs.gnu.org Received: via spool by 24020-submit@debbugs.gnu.org id=B24020.146963655027525 (code B ref 24020); Wed, 27 Jul 2016 16:23:01 +0000 Received: (at 24020) by debbugs.gnu.org; 27 Jul 2016 16:22:30 +0000 Received: from localhost ([127.0.0.1]:39351 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bSRbJ-00079t-OD for submit@debbugs.gnu.org; Wed, 27 Jul 2016 12:22:29 -0400 Received: from mail-wm0-f53.google.com ([74.125.82.53]:37632) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bSRbI-00079f-BV for 24020@debbugs.gnu.org; Wed, 27 Jul 2016 12:22:28 -0400 Received: by mail-wm0-f53.google.com with SMTP id i5so70342176wmg.0 for <24020@debbugs.gnu.org>; Wed, 27 Jul 2016 09:22:28 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=sender:from:to:subject:date:message-id:mime-version :content-transfer-encoding; bh=XrdpG488N4+Xkp13Ny5Q3NSGypRUHmzdrZoWJlLGa1Y=; b=UQz7MzDmacP98PoxIGM1ATp4wyBsiZvEms86QuG0Lpw8pPK2p/0vekTDYrRLIgR4bm 
Sn6W2lw47v+UTuEyP9esaTWPCIIy3//yDp55rlt8uCdLzUNfX/7l7pYXkf3zkTFEp31X LgcErEzDIWjVgneFDRG8IxxxxdLPk6cCVnTHgkhwphPVhSTjuxGhVIURX9T6SEJ67Xrw /xEovA7ZYpC5wAv88zeYQS1KFpRMotswg82LMebx10FK75v+l28ThEu4WEHoBpRTr8Ys 5o/W1X+pMbKvMrAtTk5QsplfjIyCYTzZOk0zBZakdX9WDLxd02rgAw3oa13EAVBZCJIF Tkag== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:sender:from:to:subject:date:message-id :mime-version:content-transfer-encoding; bh=XrdpG488N4+Xkp13Ny5Q3NSGypRUHmzdrZoWJlLGa1Y=; b=AUMGkZ/VoykUoLuv6nRPm5FFKeiNAAbeZgqBth72iBGjUIJlOiDbRsuaPx8OcRsmaR 6HlhIXtRJ0bx1ASO0cA2Oc0AojupeiyaNX71UOMNeS5QUz0qHaYOB89+sqTZUur7PrVA KomXtebjZlaMQQ8fajHGuLax9nV2+iwoikYq97w0hNMePcU0VFtloUjQDfeLasvOEiRl AYE5kRMT4cdlLw1nh4ShxxMzM+PGXHnbqr9DjDfEBrj5Q+R9sdZKPkeoJD9OumNqJpY2 cJ4DXiL6NDmMhCn2BQVF3eGqQfbN206jG9tArY5vEuJqJeOxSkctkb8auG2TT9ko4Nmo HbkA== X-Gm-Message-State: ALyK8tLffoDEj/JpEyOfP9/NkOhbO1gnO37xjLqFa9GN52w8KMBU8g0u+OQdseB1RNaByqXb X-Received: by 10.28.54.229 with SMTP id y98mr51280903wmh.96.1469636542261; Wed, 27 Jul 2016 09:22:22 -0700 (PDT) Received: from mpn.zrh.corp.google.com ([2620:0:105f:301:2815:7585:2e57:6c3b]) by smtp.gmail.com with ESMTPSA id f4sm39135730wmf.8.2016.07.27.09.22.20 for <24020@debbugs.gnu.org> (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Wed, 27 Jul 2016 09:22:20 -0700 (PDT) Received: by mpn.zrh.corp.google.com (Postfix, from userid 126942) id CC52C1E0268; Wed, 27 Jul 2016 18:22:19 +0200 (CEST) From: Michal Nazarewicz Date: Wed, 27 Jul 2016 18:22:17 +0200 Message-Id: <1469636537-17036-1-git-send-email-mina86@mina86.com> X-Mailer: git-send-email 2.8.0.rc3.226.g39d4020 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Score: -2.0 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org 
Sender: "Debbugs-submit"
X-Spam-Score: -2.0 (--)

* src/regex.c (mutually_exclusive_p): Fix the test for whether the character is unibyte when calling execute_charset.

This bug was introduced by [6dc6b00: Fix ‘[[:cc:]]*literal’ regex failing to match ‘literal’], which dropped a call to the IS_REAL_ASCII (c) macro. Reinstate it.
---
 src/regex.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

FYI, I’ve just submitted this patch. The bug was caught by tests from the dima_regex_embedded_modifiers branch, which I’m working on pulling into the master branch, so they will come in later commits.

diff --git a/src/regex.c b/src/regex.c
index 297bf71..1f2a1f08 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -4767,7 +4767,7 @@ mutually_exclusive_p (struct re_pattern_buffer *bufp, const_re_char *p1,
         else if ((re_opcode_t) *p1 == charset
                  || (re_opcode_t) *p1 == charset_not)
           {
-            if (!execute_charset (&p1, c, c, !multibyte))
+            if (!execute_charset (&p1, c, c, !multibyte || IS_REAL_ASCII (c)))
               {
                 DEBUG_PRINT (" No match => fast loop.\n");
                 return 1;
-- 
2.8.0.rc3.226.g39d4020
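
To illustrate why the dropped IS_REAL_ASCII check matters, here is a minimal, self-contained sketch in C. It is not the regex.c code: struct toy_charset and all toy_* names are hypothetical, and it assumes that IS_REAL_ASCII (c) amounts to c < 0x80. The sketch models a charset as a bitmap for codes below 256 plus an optional range table, with the bitmap consulted only when the caller flags the character as unibyte, which is how execute_charset chooses between the two lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a compiled charset: a bitmap for
   character codes below 256 plus an optional table of inclusive ranges.  */
struct toy_charset
{
  unsigned char bitmap[32];     /* one bit per code 0..255 */
  const unsigned (*ranges)[2];  /* inclusive [start, end] pairs, or NULL */
  size_t nranges;
};

/* Assumption: IS_REAL_ASCII (c) amounts to c < 0x80.  */
static bool
toy_is_real_ascii (unsigned c)
{
  return c < 0x80;
}

/* Add a code below 256 to the bitmap part of the set.  */
static void
toy_charset_add (struct toy_charset *cs, unsigned c)
{
  assert (c < 256);
  cs->bitmap[c / 8] |= (unsigned char) (1u << (c % 8));
}

/* Simplified lookup: the bitmap is consulted only when UNIBYTE is true;
   otherwise only the range table is searched.  */
static bool
toy_execute_charset (const struct toy_charset *cs, unsigned c, bool unibyte)
{
  if (unibyte && c < 256)
    return (cs->bitmap[c / 8] >> (c % 8)) & 1;

  for (size_t i = 0; i < cs->nranges; i++)
    if (cs->ranges[i][0] <= c && c <= cs->ranges[i][1])
      return true;
  return false;
}

/* The flag as passed before the fix: ASCII characters in a multibyte
   pattern are not treated as unibyte, so the bitmap is skipped.  */
static bool
toy_unibyte_flag_before (bool multibyte, unsigned c)
{
  (void) c;
  return !multibyte;
}

/* The flag as passed after the fix: real ASCII always uses the bitmap.  */
static bool
toy_unibyte_flag_after (bool multibyte, unsigned c)
{
  return !multibyte || toy_is_real_ascii (c);
}
```

With a multibyte pattern and a set whose bitmap contains only ‘a’, toy_execute_charset (&cs, 'a', toy_unibyte_flag_before (true, 'a')) reports no match because the bitmap is never consulted. In this simplified model that is the situation that would let mutually_exclusive_p wrongly choose the fast loop. With toy_unibyte_flag_after the bitmap is consulted and the match is found.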