From unknown Mon Aug 18 06:58:02 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24009: [PATCH] grep: use fastmap in regex Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 16 Jul 2016 16:59:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 24009 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: 24009@debbugs.gnu.org X-Debbugs-Original-To: Received: via spool by submit@debbugs.gnu.org id=B.14686882834295 (code B ref -1); Sat, 16 Jul 2016 16:59:02 +0000 Received: (at submit) by debbugs.gnu.org; 16 Jul 2016 16:58:03 +0000 Received: from localhost ([127.0.0.1]:53989 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOSuh-00017D-Ee for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:58:03 -0400 Received: from eggs.gnu.org ([208.118.235.92]:51444) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOSuf-00016X-68 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:58:01 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bOSuZ-0003Zx-07 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:57:55 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:35724) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOSuY-0003Zt-T4 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:57:54 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:42415) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOSuW-0004fa-F0 for bug-grep@gnu.org; Sat, 16 Jul 2016 12:57:53 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bOSuR-0003Zh-I8 for bug-grep@gnu.org; Sat, 16 Jul 2016 12:57:51 -0400 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:38021) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOSuR-0003Zb-8x for bug-grep@gnu.org; Sat, 16 Jul 2016 12:57:47 -0400 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id CFABC804C9 for ; Sun, 17 Jul 2016 01:57:36 +0900 (JST) X-matriXscan-loop-detect: 089dc09e96448a539c23c38f5bfe0935893dffd7 Received: from mail06.kcn.ne.jp ([61.86.6.185]) by mxs02-s with ESMTP; Sun, 17 Jul 2016 01:57:35 +0900 (JST) Received: from [10.120.1.14] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id 19AD01BF0090 for ; Sun, 17 Jul 2016 01:57:35 +0900 (JST) Date: Sun, 17 Jul 2016 01:57:30 +0900 From: Norihiro Tanaka Message-Id: <20160717015717.AF60.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_578A609600000000AF51_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) --------_578A609600000000AF51_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit sed and gawk use fastmap in regex, but grep does not. By using fastmap, I expect that grep speeds up for patterns as regex is used. before: $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k real 7.83 user 7.62 sys 0.07 after: $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k real 0.46 user 0.38 sys 0.07 However, if grep uses fastmap, fails in case-fold-titlecase test. It means that grep's behavior differ from sed and gawk, as they use fastmap, although it seems to be a bug in regex. --------_578A609600000000AF51_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-use-fastmap-in-regex.patch" Content-Disposition: attachment; filename="0001-grep-use-fastmap-in-regex.patch" Content-Transfer-Encoding: base64 RnJvbSAxMzM3MDA2NTk3YTdkN2UxNDk5M2FmMTRlNTdkNDdkNmI0ODNmYjBkIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBTdW4sIDE3IEp1bCAyMDE2IDAxOjI1OjE4ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZ3Jl cDogdXNlIGZhc3RtYXAgaW4gcmVnZXgKCiogc3JjL2RmYXNlYXJjaC5jIChHRUFjb21waWxlKTog VXNlIGZhc3RtYXAgaW4gcmVnZXguCi0tLQogc3JjL2RmYXNlYXJjaC5jIHwgICAgMyArKysKIDEg ZmlsZXMgY2hhbmdlZCwgMyBpbnNlcnRpb25zKCspLCAwIGRlbGV0aW9ucygtKQoKZGlmZiAtLWdp dCBhL3NyYy9kZmFzZWFyY2guYyBiL3NyYy9kZmFzZWFyY2guYwppbmRleCA4MDUyZWYwLi5lNTIy M2U1IDEwMDY0NAotLS0gYS9zcmMvZGZhc2VhcmNoLmMKKysrIGIvc3JjL2RmYXNlYXJjaC5jCkBA IC0xNTQsNiArMTU0LDkgQEAgR0VBY29tcGlsZSAoY2hhciBjb25zdCAqcGF0dGVybiwgc2l6ZV90 IHNpemUsIHJlZ19zeW50YXhfdCBzeW50YXhfYml0cykKICAgICAgIHBhdHRlcm5zID0geG5yZWFs bG9jIChwYXR0ZXJucywgcGNvdW50ICsgMSwgc2l6ZW9mICpwYXR0ZXJucyk7CiAgICAgICBwYXR0 ZXJuc1twY291bnRdID0gcGF0dGVybnMwOwogCisgICAgICBwYXR0ZXJuc1twY291bnRdLnJlZ2V4 YnVmLmZhc3RtYXAgPQorICAgICAgICA9IHhtYWxsb2MgKChVQ0hBUl9NQVggKyAxKSAqIHNpemVv ZiAoY2hhcikpOworCiAgICAgICBjaGFyIGNvbnN0ICplcnIgPSByZV9jb21waWxlX3BhdHRlcm4g KHAsIGxlbiwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgJihw YXR0ZXJuc1twY291bnRdLnJlZ2V4YnVmKSk7CiAgICAgICBpZiAoZXJyKQotLSAKMS43LjEKCg== --------_578A609600000000AF51_MULTIPART_MIXED_-- From unknown Mon Aug 18 06:58:02 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24009: [PATCH] grep: use fastmap in regex Resent-From: "Jens Schleusener" Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 16 Jul 2016 20:08:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24009 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: 24009@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.146869963521459 (code B ref -1); Sat, 16 Jul 2016 20:08:01 +0000 Received: (at submit) by debbugs.gnu.org; 16 Jul 2016 20:07:15 +0000 Received: from localhost ([127.0.0.1]:54054 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOVrn-0005a3-1d for submit@debbugs.gnu.org; Sat, 16 Jul 2016 16:07:15 -0400 Received: from eggs.gnu.org ([208.118.235.92]:50286) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOVrl-0005Zq-Ck for submit@debbugs.gnu.org; Sat, 16 Jul 2016 16:07:13 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bOVrf-0002vF-B0 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 16:07:08 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:38700) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOVrf-0002v1-87 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 16:07:07 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:41257) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOVrd-0002SW-4Z for bug-grep@gnu.org; Sat, 16 Jul 2016 16:07:06 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bOVrY-0002uW-2k for bug-grep@gnu.org; Sat, 16 Jul 2016 16:07:04 -0400 Received: from mailout07.t-online.de ([194.25.134.83]:47682) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOVrX-0002uG-Sz for bug-grep@gnu.org; Sat, 16 Jul 2016 16:07:00 -0400 Received: from fwd01.aul.t-online.de (fwd01.aul.t-online.de [172.20.27.147]) by mailout07.t-online.de (Postfix) with SMTP id DD93F4243A69 for ; Sat, 16 Jul 2016 22:06:56 +0200 (CEST) Received: from schiller.dip.t-dialin.net (Zk87NOZboh0FaqTGfcFT0sp0M0x6zYnNp7T2fysXNx8jTU1Xg9O3+OrtX1TJz+kwQT@[79.210.212.216]) by fwd01.t-online.de with (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384 encrypted) esmtp id 1bOVrS-1r64C80; Sat, 16 Jul 2016 22:06:54 +0200 Date: Sat, 16 Jul 2016 22:06:53 +0200 (CEST) From: "Jens Schleusener" In-Reply-To: <20160717015717.AF60.27F6AC2D@kcn.ne.jp> Message-ID: References: <20160717015717.AF60.27F6AC2D@kcn.ne.jp> User-Agent: Alpine 2.20 (LSU 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; CHARSET=US-ASCII; format=flowed X-ID: Zk87NOZboh0FaqTGfcFT0sp0M0x6zYnNp7T2fysXNx8jTU1Xg9O3+OrtX1TJz+kwQT X-TOI-MSGID: d2bc93d1-f989-44f0-83e2-1fcee1003f1c X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) Hi Norihiro. > sed and gawk use fastmap in regex, but grep does not. By using fastmap, > I expect that grep speeds up for patterns as regex is used. > > before: > $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k > real 7.83 > user 7.62 > sys 0.07 > > after: > $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k > real 0.46 > user 0.38 > sys 0.07 > > However, if grep uses fastmap, fails in case-fold-titlecase test. It > means that grep's behavior differ from sed and gawk, as they use fastmap, > although it seems to be a bug in regex. Wow, that is a spectacular speed improvement. Since I use grep with regex patterns heavily in some of my scripts I could not resist to make some first simple tests (including your example pattern with a back reference). The non-representative results using grep 2.25 shows a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25 itself was already a factor 1.4-2.8 faster than the grep 2.16 offered by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep outputs were identical. By the way I had to remove one of the two "=" in your patch otherwise gcc issued an error (but caution, I am a C-layman). Regards Jens From unknown Mon Aug 18 06:58:02 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24009: [PATCH] grep: use fastmap in regex Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 17 Jul 2016 00:56:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24009 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: "Jens Schleusener" Cc: 24009@debbugs.gnu.org Received: via spool by 24009-submit@debbugs.gnu.org id=B24009.146871691214811 (code B ref 24009); Sun, 17 Jul 2016 00:56:02 +0000 Received: (at 24009) by debbugs.gnu.org; 17 Jul 2016 00:55:12 +0000 Received: from localhost ([127.0.0.1]:54124 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOaMS-0003qp-HJ for submit@debbugs.gnu.org; Sat, 16 Jul 2016 20:55:12 -0400 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:58530) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOaMQ-0003qa-BM for 24009@debbugs.gnu.org; Sat, 16 Jul 2016 20:55:11 -0400 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id 4BD43805ED for <24009@debbugs.gnu.org>; Sun, 17 Jul 2016 09:55:03 +0900 (JST) X-matriXscan-loop-detect: 21caba6e177a569aee736933c8d7ece0b750bdb2 Received: from mail01.kcn.ne.jp ([61.86.6.180]) by mxs01-s with ESMTP; Sun, 17 Jul 2016 09:55:02 +0900 (JST) Received: from [10.120.1.67] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail01.kcn.ne.jp (Postfix) with ESMTPA id 98D355A830C; Sun, 17 Jul 2016 09:55:02 +0900 (JST) Date: Sun, 17 Jul 2016 09:55:00 +0900 From: Norihiro Tanaka In-Reply-To: References: <20160717015717.AF60.27F6AC2D@kcn.ne.jp> Message-Id: <20160717095452.7A88.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_578ACC6B000000007A7C_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -1.3 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.3 (-) --------_578ACC6B000000007A7C_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit On Sat, 16 Jul 2016 22:06:53 +0200 (CEST) "Jens Schleusener" wrote: > Wow, that is a spectacular speed improvement. Since I use grep with > regex patterns heavily in some of my scripts I could not resist to > make some first simple tests (including your example pattern with a > back reference). The non-representative results using grep 2.25 shows > a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25 > itself was already a factor 1.4-2.8 faster than the grep 2.16 offered > by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep > outputs were identical. I believe that cases to speed up by this patch is not so much, as grep makes a lot of other optimizations. In fact, I spent a little time to make a test case to demonstrate to speed up by this patch. So I have an interest with what kind of test cases you could confirm to speed up by this patch. > By the way I had to remove one of the two "=" in your patch otherwise > gcc issued an error (but caution, I am a C-layman). Thanks, I fixed it. I made a mistake before sending the patch. Of course, "=" should be one. --------_578ACC6B000000007A7C_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-use-fastmap-in-regex.patch" Content-Disposition: attachment; filename="0001-grep-use-fastmap-in-regex.patch" Content-Transfer-Encoding: base64 RnJvbSAxMzM3MDA2NTk3YTdkN2UxNDk5M2FmMTRlNTdkNDdkNmI0ODNmYjBkIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBTdW4sIDE3IEp1bCAyMDE2IDAxOjI1OjE4ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZ3Jl cDogdXNlIGZhc3RtYXAgaW4gcmVnZXgKCiogc3JjL2RmYXNlYXJjaC5jIChHRUFjb21waWxlKTog VXNlIGZhc3RtYXAgaW4gcmVnZXguCi0tLQogc3JjL2RmYXNlYXJjaC5jIHwgICAgMyArKysKIDEg ZmlsZXMgY2hhbmdlZCwgMyBpbnNlcnRpb25zKCspLCAwIGRlbGV0aW9ucygtKQoKZGlmZiAtLWdp dCBhL3NyYy9kZmFzZWFyY2guYyBiL3NyYy9kZmFzZWFyY2guYwppbmRleCA4MDUyZWYwLi5lNTIy M2U1IDEwMDY0NAotLS0gYS9zcmMvZGZhc2VhcmNoLmMKKysrIGIvc3JjL2RmYXNlYXJjaC5jCkBA IC0xNTQsNiArMTU0LDkgQEAgR0VBY29tcGlsZSAoY2hhciBjb25zdCAqcGF0dGVybiwgc2l6ZV90 IHNpemUsIHJlZ19zeW50YXhfdCBzeW50YXhfYml0cykKICAgICAgIHBhdHRlcm5zID0geG5yZWFs bG9jIChwYXR0ZXJucywgcGNvdW50ICsgMSwgc2l6ZW9mICpwYXR0ZXJucyk7CiAgICAgICBwYXR0 ZXJuc1twY291bnRdID0gcGF0dGVybnMwOwogCisgICAgICBwYXR0ZXJuc1twY291bnRdLnJlZ2V4 YnVmLmZhc3RtYXAKKyAgICAgICAgPSB4bWFsbG9jICgoVUNIQVJfTUFYICsgMSkgKiBzaXplb2Yg KGNoYXIpKTsKKwogICAgICAgY2hhciBjb25zdCAqZXJyID0gcmVfY29tcGlsZV9wYXR0ZXJuIChw LCBsZW4sCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICYocGF0 dGVybnNbcGNvdW50XS5yZWdleGJ1ZikpOwogICAgICAgaWYgKGVycikKLS0gCjEuNy4xCgo= --------_578ACC6B000000007A7C_MULTIPART_MIXED_-- From unknown Mon Aug 18 06:58:02 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24009: [PATCH] grep: use fastmap in regex Resent-From: Jens Schleusener Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 17 Jul 2016 10:22:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24009 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Norihiro Tanaka Cc: 24009@debbugs.gnu.org Received: via spool by 24009-submit@debbugs.gnu.org id=B24009.146875091128067 (code B ref 24009); Sun, 17 Jul 2016 10:22:02 +0000 Received: (at 24009) by debbugs.gnu.org; 17 Jul 2016 10:21:51 +0000 Received: from localhost ([127.0.0.1]:54271 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOjCo-0007Id-PA for submit@debbugs.gnu.org; Sun, 17 Jul 2016 06:21:50 -0400 Received: from mailout10.t-online.de ([194.25.134.21]:51816) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOjCm-0007IS-R1 for 24009@debbugs.gnu.org; Sun, 17 Jul 2016 06:21:49 -0400 Received: from fwd05.aul.t-online.de (fwd05.aul.t-online.de [172.20.27.149]) by mailout10.t-online.de (Postfix) with SMTP id C0CBB41E5C73; Sun, 17 Jul 2016 12:21:46 +0200 (CEST) Received: from schiller.dip.t-dialin.net (ESDsF0Z-QhAdi0Ix3BEED1rtjF9NnHl8T0+AWfJqzgSfeYGsP7Bdb6WYLV8ZWW8Qd-@[79.210.212.216]) by fwd05.t-online.de with (TLSv1.2:ECDHE-RSA-AES256-GCM-SHA384 encrypted) esmtp id 1bOjCk-25X8O80; Sun, 17 Jul 2016 12:21:46 +0200 Date: Sun, 17 Jul 2016 12:21:46 +0200 (CEST) From: Jens Schleusener X-X-Sender: rz7y@schiller.dip.t-dialin.net In-Reply-To: <20160717095452.7A88.27F6AC2D@kcn.ne.jp> Message-ID: References: <20160717015717.AF60.27F6AC2D@kcn.ne.jp> <20160717095452.7A88.27F6AC2D@kcn.ne.jp> User-Agent: Alpine 2.20 (LSU 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; CHARSET=US-ASCII; format=flowed X-ID: ESDsF0Z-QhAdi0Ix3BEED1rtjF9NnHl8T0+AWfJqzgSfeYGsP7Bdb6WYLV8ZWW8Qd- X-TOI-MSGID: bf4229e3-fa8e-4678-95fd-6c56be95c92c X-Spam-Score: -1.3 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.3 (-) On Sun, 17 Jul 2016, Norihiro Tanaka wrote: > On Sat, 16 Jul 2016 22:06:53 +0200 (CEST) > "Jens Schleusener" wrote: > >> Wow, that is a spectacular speed improvement. Since I use grep with >> regex patterns heavily in some of my scripts I could not resist to >> make some first simple tests (including your example pattern with a >> back reference). The non-representative results using grep 2.25 shows >> a gain of a factor 5-10 (while the unpatched self-compiled grep 2.25 >> itself was already a factor 1.4-2.8 faster than the grep 2.16 offered >> by the OS (OpenSUSE Leap 42.1). At least in my tests all the grep >> outputs were identical. > > I believe that cases to speed up by this patch is not so much, as grep > makes a lot of other optimizations. In fact, I spent a little time to > make a test case to demonstrate to speed up by this patch. So I have an > interest with what kind of test cases you could confirm to speed up by > this patch. First, as I mentioned in my mail, my "tests" are non-representative and done on a server system that runs also other jobs just to get a first impression. Currently I have redone some of the tests here are the more detailed results (hopefully readable in this mail). OS: OpenSUSE Leap 42.1 (64-bit) gcc: version 4.8.5 (SUSE Linux) Main testfile was an Apache access log file (in nearly combined log format) with a size of 157 MB and 673623 lines that looks like: 66.249.78.85 - - [16/Jul/2016:00:00:02 +0200] "GET /dox/phpMyAdmin-4.6.1-all-languages/namespacePMA_1_1libraries_1_1properties_1_1options_1_1items_1_1TextPropertyItem.html HTTP/1.1" 410 1977 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 - - The test command was time -p env LC_ALL=en_US.UTF-8 grep pattern logfile > /dev/shm/out The tested grep versions: 1: GNU grep 2.16 (OpenSuSE Leap 42.1) 2: GNU grep 2.25 (self-compiled) 3: GNU grep 2.25 (self-compiled with 0001-grep-use-fastmap-in-regex.patch) "self-compiled" means just "./configure; make; make install". Times are in seconds (rounded; so the sum user+sys sometimes is different from the real value) and at least 2 measurements were done. Naturally the output time and the current load may have had an influence (but probably not a drastic one). pattern vrs real user sys ------------ --- ---- ---- --- \([a-b]\)\1 1: 24.3 23.6 0.6 2: 18.2 18.2 0.0 3: 1.9 1.9 0.0 \([a-b]\) 1: 9.4 8.8 0.6 2: 3.4 3.4 0.0 3: 0.7 0.6 0.1 [a-b] 1: 8.8 8.1 0.7 2: 3.2 3.1 0.0 3: 0.4 0.4 0.1 "GET /dox/.*-[0-9\.]*.*/.*\.html.* HTTP/1.1" 410 1: 7.62 7.60 0.02 2: 0.33 0.32 0.01 3: 0.29 0.28 0.01 No idea if that values are meaningful but as a layman I have the impression grep version 3 is faster than 2 and 2 is faster than 1 ;-) >> By the way I had to remove one of the two "=" in your patch otherwise >> gcc issued an error (but caution, I am a C-layman). > > Thanks, I fixed it. I made a mistake before sending the patch. Of > course, "=" should be one. No problem, I used exact your new patch version. Regards Jens P.S.: OT and you are probably the wrong address: I would like to see some "agrep" functionality in GNU grep. From unknown Mon Aug 18 06:58:02 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24009: [PATCH] grep: use fastmap in regex Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 17 Jul 2016 12:29:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24009 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Jens Schleusener Cc: 24009@debbugs.gnu.org Received: via spool by 24009-submit@debbugs.gnu.org id=B24009.146875849213403 (code B ref 24009); Sun, 17 Jul 2016 12:29:02 +0000 Received: (at 24009) by debbugs.gnu.org; 17 Jul 2016 12:28:12 +0000 Received: from localhost ([127.0.0.1]:54330 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOlB5-0003U6-Mv for submit@debbugs.gnu.org; Sun, 17 Jul 2016 08:28:11 -0400 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:48639) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOlB2-0003Tn-Js for 24009@debbugs.gnu.org; Sun, 17 Jul 2016 08:28:09 -0400 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id 67C624A0811 for <24009@debbugs.gnu.org>; Sun, 17 Jul 2016 21:28:01 +0900 (JST) X-matriXscan-loop-detect: 723e65bc140fdced7c1e49327a6e47ab3b2aefe8 Received: from mail02.kcn.ne.jp ([61.86.6.181]) by mxs02-s with ESMTP; Sun, 17 Jul 2016 21:27:59 +0900 (JST) Received: from [10.120.1.67] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail02.kcn.ne.jp (Postfix) with ESMTPA id 90238BE8001; Sun, 17 Jul 2016 21:27:59 +0900 (JST) Date: Sun, 17 Jul 2016 21:27:55 +0900 From: Norihiro Tanaka In-Reply-To: References: <20160717095452.7A88.27F6AC2D@kcn.ne.jp> Message-Id: <20160717212754.7A90.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -1.3 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.3 (-) On Sun, 17 Jul 2016 12:21:46 +0200 (CEST) Jens Schleusener wrote: > First, as I mentioned in my mail, my "tests" are non-representative > and done on a server system that runs also other jobs just to get a first impression. > > Currently I have redone some of the tests here are the more detailed results (hopefully readable in this mail). > > OS: OpenSUSE Leap 42.1 (64-bit) > gcc: version 4.8.5 (SUSE Linux) > > Main testfile was an Apache access log file (in nearly combined log format) with a size of 157 MB and 673623 lines that looks like: > > 66.249.78.85 - - [16/Jul/2016:00:00:02 +0200] "GET /dox/phpMyAdmin-4.6.1-all-languages/namespacePMA_1_1libraries_1_1properties_1_1options_1_1items_1_1TextPropertyItem.html HTTP/1.1" 410 1977 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0 - - > > The test command was > > time -p env LC_ALL=en_US.UTF-8 grep pattern logfile > /dev/shm/out > > The tested grep versions: > > 1: GNU grep 2.16 (OpenSuSE Leap 42.1) > 2: GNU grep 2.25 (self-compiled) > 3: GNU grep 2.25 (self-compiled with 0001-grep-use-fastmap-in-regex.patch) > > "self-compiled" means just "./configure; make; make install". > > Times are in seconds (rounded; so the sum user+sys sometimes is different from the real value) and at least 2 measurements were done. Naturally the output time and the current load may have had an influence (but probably not a drastic one). > > pattern vrs real user sys > ------------ --- ---- ---- --- > \([a-b]\)\1 > 1: 24.3 23.6 0.6 > 2: 18.2 18.2 0.0 > 3: 1.9 1.9 0.0 > > \([a-b]\) > 1: 9.4 8.8 0.6 > 2: 3.4 3.4 0.0 > 3: 0.7 0.6 0.1 > > [a-b] > 1: 8.8 8.1 0.7 > 2: 3.2 3.1 0.0 > 3: 0.4 0.4 0.1 > > "GET /dox/.*-[0-9\.]*.*/.*\.html.* HTTP/1.1" 410 > 1: 7.62 7.60 0.02 > 2: 0.33 0.32 0.01 > 3: 0.29 0.28 0.01 > > No idea if that values are meaningful but as a layman I have the impression grep version 3 is faster than 2 and 2 is faster than 1 ;-) > > >> By the way I had to remove one of the two "=" in your patch otherwise >> gcc issued an error (but caution, I am a C-layman). > > > > Thanks, I fixed it. I made a mistake before sending the patch. Of > > course, "=" should be one. > > No problem, I used exact your new patch version. > > Regards > > Jens > > P.S.: OT and you are probably the wrong address: I would like to see some "agrep" functionality in GNU grep. Thanks. I expect that this patch improves pattern starting with range, character class, collating element and equivalent class in a multibyte locale as three of until third case. forth case seems to be mainly improved by other optimization. > However, if grep uses fastmap, fails in case-fold-titlecase test. It > means that grep's behavior differ from sed and gawk, as they use fastmap, > although it seems to be a bug in regex. However, I think this patch should be suspended because of this issue. I reported it to glibc developers. https://sourceware.org/bugzilla/show_bug.cgi?id=20381 From unknown Mon Aug 18 06:58:02 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: Norihiro Tanaka Subject: bug#24009: closed (Re: bug#24009: [PATCH] grep: use fastmap in regex) Message-ID: References: <1b4e4883-9a2e-b974-1a9c-c4409a84c1c2@cs.ucla.edu> <20160717015717.AF60.27F6AC2D@kcn.ne.jp> X-Gnu-PR-Message: they-closed 24009 X-Gnu-PR-Package: grep X-Gnu-PR-Keywords: patch Reply-To: 24009@debbugs.gnu.org Date: Fri, 02 Sep 2016 05:33:02 +0000 Content-Type: multipart/mixed; boundary="----------=_1472794382-12002-1" This is a multi-part message in MIME format... ------------=_1472794382-12002-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #24009: [PATCH] grep: use fastmap in regex which was filed against the grep package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 24009@debbugs.gnu.org. --=20 24009: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D24009 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1472794382-12002-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 24009-done) by debbugs.gnu.org; 2 Sep 2016 05:32:23 +0000 Received: from localhost ([127.0.0.1]:46876 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfh5S-00036X-Ms for submit@debbugs.gnu.org; Fri, 02 Sep 2016 01:32:23 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:34722) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfh5Q-00036J-DL for 24009-done@debbugs.gnu.org; Fri, 02 Sep 2016 01:32:21 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id C534A1613DA; Thu, 1 Sep 2016 22:32:13 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id lhVl02GyVMYS; Thu, 1 Sep 2016 22:32:12 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 7738F1613D6; Thu, 1 Sep 2016 22:32:12 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id NVKNxefJs3_a; Thu, 1 Sep 2016 22:32:12 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 4FC0A1613D0; Thu, 1 Sep 2016 22:32:12 -0700 (PDT) Subject: Re: bug#24009: [PATCH] grep: use fastmap in regex To: Norihiro Tanaka References: <20160717095452.7A88.27F6AC2D@kcn.ne.jp> <20160717212754.7A90.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <1b4e4883-9a2e-b974-1a9c-c4409a84c1c2@cs.ucla.edu> Date: Thu, 1 Sep 2016 22:32:12 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: <20160717212754.7A90.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------BD984F242F27989F9F20025E" X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 24009-done Cc: 24009-done@debbugs.gnu.org, Jens Schleusener X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.5 (-) This is a multi-part message in MIME format. --------------BD984F242F27989F9F20025E Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Norihiro Tanaka wrote: > I think this patch should be suspended because of this issue. > I reported it to glibc developers. https://sourceware.org/bugzilla/sho= w_bug.cgi?id=3D20381 After thinking about it a bit, I came up with a variant of the patch that= gives=20 the performance improvement unless -i is used, so I installed the attache= d=20 patches. The first patch is mostly just refactoring this somewhat-crufty = code=20 and fixing an O(N**2) reallocation problem. The second is the real improv= ement. The second patch just captures the low-hanging fruit. For example, even w= ith -i=20 we could use a fastmap if all the pattern's letters (including letters ma= tched=20 by ranges) happen to avoid the glibc bug. Something like that might be wo= rth=20 pursuing. Since the attached patch fixes the test case that prompted the bug report= I'm=20 closing the bug. We can reopen it, or open a new one, if someone wants to= fix=20 the remaining performance glitches. Thanks again for all these fixes! --------------BD984F242F27989F9F20025E Content-Type: text/x-diff; name="0001-grep-improve-dfasearch-storage-management.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0001-grep-improve-dfasearch-storage-management.patch" =46rom f3c82fc663f7155e04d94bc0dcdc16bb338605a1 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 1 Sep 2016 21:35:53 -0700 Subject: [PATCH 1/2] grep: improve dfasearch storage management This patch is mostly refactoring, with a bit of performance tweaking. It is done in preparation for a fix for Bug#24009. * src/dfasearch.c (patterns): Now of type struct re_pattern_buffer * instead of an anonymous struct pointer, since there is no longer any need to keep regs here. All uses changed. (GEAcompile): Use patlim instead of a hard-to-follow "total". Use x2nrealloc to avoid potential O(N**2) reallocation algorithm. Initialize just the pattern members that need clearing. (EGexecute): Put regs into a static variable, as this code did before 2001-02-18, as there is no need to have a separate set of regs for each pattern. Explain the "Q@#%!#" comment better. --- src/dfasearch.c | 75 +++++++++++++++++++++++++++++----------------------= ------ 1 file changed, 38 insertions(+), 37 deletions(-) diff --git a/src/dfasearch.c b/src/dfasearch.c index 61b1f70..b5ae623 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -40,13 +40,7 @@ static kwset_t kwset; static struct dfa *dfa; =20 /* The Regex compiled patterns. */ -static struct -{ - /* Regex compiled regexp. */ - struct re_pattern_buffer regexbuf; - struct re_registers regs; /* This is here on account of a BRAIN-DEAD - Q@#%!# library interface in regex.c. */ -} *patterns; +static struct re_pattern_buffer *patterns; static size_t pcount; =20 /* Number of compiled fixed strings known to exactly match the regexp. @@ -122,7 +116,6 @@ kwsmusts (void) void GEAcompile (char const *pattern, size_t size, reg_syntax_t syntax_bits) { - size_t total =3D size; char *motif; =20 dfa =3D dfaalloc (); @@ -137,28 +130,31 @@ GEAcompile (char const *pattern, size_t size, reg_s= yntax_t syntax_bits) this should be a syntax error. The same for backref, where the backref should be local to each pattern. */ char const *p =3D pattern; + char const *patlim =3D pattern + size; bool compilation_failed =3D false; + size_t palloc =3D 0; + do { size_t len; - char const *sep =3D memchr (p, '\n', total); + char const *sep =3D memchr (p, '\n', patlim - p); if (sep) { len =3D sep - p; sep++; - total -=3D (len + 1); } else - { - len =3D total; - total =3D 0; - } + len =3D patlim - p; =20 - patterns =3D xnrealloc (patterns, pcount + 1, sizeof *patterns); - memset (&patterns[pcount], 0, sizeof patterns[pcount]); + if (palloc <=3D pcount) + patterns =3D x2nrealloc (patterns, &palloc, sizeof *patterns); + struct re_pattern_buffer *pat =3D &patterns[pcount]; + pat->buffer =3D NULL; + pat->allocated =3D 0; + pat->fastmap =3D NULL; + pat->translate =3D NULL; =20 - char const *err =3D re_compile_pattern (p, len, - &(patterns[pcount].regexbuf)= ); + char const *err =3D re_compile_pattern (p, len, pat); if (err) { /* With patterns specified only on the command line, emit the = bare @@ -198,7 +194,7 @@ GEAcompile (char const *pattern, size_t size, reg_syn= tax_t syntax_bits) =20 strcpy (n, match_lines ? (bk ? line_beg_bk : line_beg_no_bk) : (bk ? word_beg_bk : word_beg_no_bk)); - total =3D strlen(n); + size_t total =3D strlen (n); memcpy (n + total, pattern, size); total +=3D size; strcpy (n + total, match_lines ? (bk ? line_end_bk : line_end_no_b= k) @@ -213,7 +209,7 @@ GEAcompile (char const *pattern, size_t size, reg_syn= tax_t syntax_bits) dfacomp (pattern, size, dfa, 1); kwsmusts (); =20 - free(motif); + free (motif); } =20 size_t @@ -355,17 +351,24 @@ EGexecute (char *buf, size_t size, size_t *match_si= ze, best_len =3D 0; for (i =3D 0; i < pcount; i++) { - patterns[i].regexbuf.not_eol =3D 0; - patterns[i].regexbuf.newline_anchor =3D eolbyte =3D=3D '\n'; - start =3D re_search (&(patterns[i].regexbuf), - beg, end - beg - 1, - ptr - beg, end - ptr - 1, - &(patterns[i].regs)); + /* This is static because of a BRAIN-DEAD Q@#%!# library + interface in regex.c, as later calls reuse the + dynamically allocated storage that REGS members point at + and the API provides no way to free this storage. + If grep is ever made multithreaded, REGS would have to be + per-thread or the library API changed or the library + encapsulation violated. */ + static struct re_registers regs; + + patterns[i].not_eol =3D 0; + patterns[i].newline_anchor =3D eolbyte =3D=3D '\n'; + start =3D re_search (&patterns[i], beg, end - beg - 1, + ptr - beg, end - ptr - 1, ®s); if (start < -1) xalloc_die (); else if (0 <=3D start) { - len =3D patterns[i].regs.end[0] - start; + len =3D regs.end[0] - start; match =3D beg + start; if (match > best_match) continue; @@ -396,11 +399,10 @@ EGexecute (char *buf, size_t size, size_t *match_si= ze, { /* Try a shorter length anchored at the same pla= ce. */ --len; - patterns[i].regexbuf.not_eol =3D 1; - shorter_len =3D re_match (&(patterns[i].regexbuf= ), - beg, match + len - ptr, - match - beg, - &(patterns[i].regs)); + patterns[i].not_eol =3D 1; + shorter_len =3D re_match (&patterns[i], beg, + match + len - ptr, match= - beg, + ®s); if (shorter_len < -1) xalloc_die (); } @@ -412,18 +414,17 @@ EGexecute (char *buf, size_t size, size_t *match_si= ze, if (match =3D=3D end - 1) break; match++; - patterns[i].regexbuf.not_eol =3D 0; - start =3D re_search (&(patterns[i].regexbuf), - beg, end - beg - 1, + patterns[i].not_eol =3D 0; + start =3D re_search (&patterns[i], beg, end - be= g - 1, match - beg, end - match - 1,= - &(patterns[i].regs)); + ®s); if (start < 0) { if (start < -1) xalloc_die (); break; } - len =3D patterns[i].regs.end[0] - start; + len =3D regs.end[0] - start; match =3D beg + start; } } /* while (match <=3D best_match) */ --=20 2.7.4 --------------BD984F242F27989F9F20025E Content-Type: text/x-diff; name="0002-grep-use-regex-fastmap-unless-i.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0002-grep-use-regex-fastmap-unless-i.patch" =46rom 77a846927e4be2d6a58df8d65c22ff2ab1540275 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 1 Sep 2016 21:49:51 -0700 Subject: [PATCH 2/2] grep: use regex fastmap unless -i This builds on a suggestion by Norihiro Tanaka (Bug#24009). * src/dfasearch.c (GEAcompile): Use a fastmap unless -i. This improves performance 20x for me using the first benchmark given in Bug#24009. --- src/dfasearch.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/dfasearch.c b/src/dfasearch.c index b5ae623..0838e1f 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -151,7 +151,10 @@ GEAcompile (char const *pattern, size_t size, reg_sy= ntax_t syntax_bits) struct re_pattern_buffer *pat =3D &patterns[pcount]; pat->buffer =3D NULL; pat->allocated =3D 0; - pat->fastmap =3D NULL; + + /* Do not use a fastmap with -i, to work around glibc Bug#20381. = */ + pat->fastmap =3D match_icase ? NULL : xmalloc (UCHAR_MAX + 1); + pat->translate =3D NULL; =20 char const *err =3D re_compile_pattern (p, len, pat); --=20 2.7.4 --------------BD984F242F27989F9F20025E-- ------------=_1472794382-12002-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 16 Jul 2016 16:58:03 +0000 Received: from localhost ([127.0.0.1]:53989 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOSuh-00017D-Ee for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:58:03 -0400 Received: from eggs.gnu.org ([208.118.235.92]:51444) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bOSuf-00016X-68 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:58:01 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bOSuZ-0003Zx-07 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:57:55 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:35724) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOSuY-0003Zt-T4 for submit@debbugs.gnu.org; Sat, 16 Jul 2016 12:57:54 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:42415) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOSuW-0004fa-F0 for bug-grep@gnu.org; Sat, 16 Jul 2016 12:57:53 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1bOSuR-0003Zh-I8 for bug-grep@gnu.org; Sat, 16 Jul 2016 12:57:51 -0400 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:38021) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1bOSuR-0003Zb-8x for bug-grep@gnu.org; Sat, 16 Jul 2016 12:57:47 -0400 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id CFABC804C9 for ; Sun, 17 Jul 2016 01:57:36 +0900 (JST) X-matriXscan-loop-detect: 089dc09e96448a539c23c38f5bfe0935893dffd7 Received: from mail06.kcn.ne.jp ([61.86.6.185]) by mxs02-s with ESMTP; Sun, 17 Jul 2016 01:57:35 +0900 (JST) Received: from [10.120.1.14] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id 19AD01BF0090 for ; Sun, 17 Jul 2016 01:57:35 +0900 (JST) Date: Sun, 17 Jul 2016 01:57:30 +0900 From: Norihiro Tanaka To: Subject: [PATCH] grep: use fastmap in regex Message-Id: <20160717015717.AF60.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_578A609600000000AF51_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) --------_578A609600000000AF51_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit sed and gawk use fastmap in regex, but grep does not. By using fastmap, I expect that grep speeds up for patterns as regex is used. before: $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k real 7.83 user 7.62 sys 0.07 after: $ time -p env LC_ALL=ja_JP.eucjp src/grep '\([a-b]\)\1' k real 0.46 user 0.38 sys 0.07 However, if grep uses fastmap, fails in case-fold-titlecase test. It means that grep's behavior differ from sed and gawk, as they use fastmap, although it seems to be a bug in regex. --------_578A609600000000AF51_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-use-fastmap-in-regex.patch" Content-Disposition: attachment; filename="0001-grep-use-fastmap-in-regex.patch" Content-Transfer-Encoding: base64 RnJvbSAxMzM3MDA2NTk3YTdkN2UxNDk5M2FmMTRlNTdkNDdkNmI0ODNmYjBkIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBTdW4sIDE3IEp1bCAyMDE2IDAxOjI1OjE4ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZ3Jl cDogdXNlIGZhc3RtYXAgaW4gcmVnZXgKCiogc3JjL2RmYXNlYXJjaC5jIChHRUFjb21waWxlKTog VXNlIGZhc3RtYXAgaW4gcmVnZXguCi0tLQogc3JjL2RmYXNlYXJjaC5jIHwgICAgMyArKysKIDEg ZmlsZXMgY2hhbmdlZCwgMyBpbnNlcnRpb25zKCspLCAwIGRlbGV0aW9ucygtKQoKZGlmZiAtLWdp dCBhL3NyYy9kZmFzZWFyY2guYyBiL3NyYy9kZmFzZWFyY2guYwppbmRleCA4MDUyZWYwLi5lNTIy M2U1IDEwMDY0NAotLS0gYS9zcmMvZGZhc2VhcmNoLmMKKysrIGIvc3JjL2RmYXNlYXJjaC5jCkBA IC0xNTQsNiArMTU0LDkgQEAgR0VBY29tcGlsZSAoY2hhciBjb25zdCAqcGF0dGVybiwgc2l6ZV90 IHNpemUsIHJlZ19zeW50YXhfdCBzeW50YXhfYml0cykKICAgICAgIHBhdHRlcm5zID0geG5yZWFs bG9jIChwYXR0ZXJucywgcGNvdW50ICsgMSwgc2l6ZW9mICpwYXR0ZXJucyk7CiAgICAgICBwYXR0 ZXJuc1twY291bnRdID0gcGF0dGVybnMwOwogCisgICAgICBwYXR0ZXJuc1twY291bnRdLnJlZ2V4 YnVmLmZhc3RtYXAgPQorICAgICAgICA9IHhtYWxsb2MgKChVQ0hBUl9NQVggKyAxKSAqIHNpemVv ZiAoY2hhcikpOworCiAgICAgICBjaGFyIGNvbnN0ICplcnIgPSByZV9jb21waWxlX3BhdHRlcm4g KHAsIGxlbiwKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgJihw YXR0ZXJuc1twY291bnRdLnJlZ2V4YnVmKSk7CiAgICAgICBpZiAoZXJyKQotLSAKMS43LjEKCg== --------_578A609600000000AF51_MULTIPART_MIXED_-- ------------=_1472794382-12002-1-- From unknown Mon Aug 18 06:58:02 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24009: [PATCH] grep: use fastmap in regex Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 02 Sep 2016 11:51:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24009 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 24009-done@debbugs.gnu.org, Jens Schleusener Received: via spool by 24009-done@debbugs.gnu.org id=D24009.147281703922088 (code D ref 24009); Fri, 02 Sep 2016 11:51:01 +0000 Received: (at 24009-done) by debbugs.gnu.org; 2 Sep 2016 11:50:39 +0000 Received: from localhost ([127.0.0.1]:47022 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfmzX-0005kA-7V for submit@debbugs.gnu.org; Fri, 02 Sep 2016 07:50:39 -0400 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:55300) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfmzV-0005jt-EM for 24009-done@debbugs.gnu.org; Fri, 02 Sep 2016 07:50:38 -0400 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id B86A34A0867 for <24009-done@debbugs.gnu.org>; Fri, 2 Sep 2016 20:50:28 +0900 (JST) X-matriXscan-loop-detect: e3de3c12308128ae8d8fb35ed242796f73924190 Received: from mail03.kcn.ne.jp ([61.86.6.182]) by mxs01-s with ESMTP; Fri, 02 Sep 2016 20:50:27 +0900 (JST) Received: from [10.120.1.18] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail03.kcn.ne.jp (Postfix) with ESMTPA id 35DBE1410099; Fri, 2 Sep 2016 20:50:27 +0900 (JST) Date: Fri, 02 Sep 2016 20:50:27 +0900 From: Norihiro Tanaka In-Reply-To: <1b4e4883-9a2e-b974-1a9c-c4409a84c1c2@cs.ucla.edu> References: <20160717212754.7A90.27F6AC2D@kcn.ne.jp> <1b4e4883-9a2e-b974-1a9c-c4409a84c1c2@cs.ucla.edu> Message-Id: <20160902205026.DE17.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -1.5 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.5 (-) On Thu, 1 Sep 2016 22:32:12 -0700 Paul Eggert wrote: > Norihiro Tanaka wrote: > > I think this patch should be suspended because of this issue. > > I reported it to glibc developers. https://sourceware.org/bugzilla/show_bug.cgi?id=20381 > > After thinking about it a bit, I came up with a variant of the patch that gives the performance improvement unless -i is used, so I installed the attached patches. The first patch is mostly just refactoring this somewhat-crufty code and fixing an O(N**2) reallocation problem. The second is the real improvement. > > The second patch just captures the low-hanging fruit. For example, even with -i we could use a fastmap if all the pattern's letters (including letters matched by ranges) happen to avoid the glibc bug. Something like that might be worth pursuing. > > Since the attached patch fixes the test case that prompted the bug report I'm closing the bug. We can reopen it, or open a new one, if someone wants to fix the remaining performance glitches. > > Thanks again for all these fixes! Nice! I could not find the workaround. I will occasionally check the bug in glibc, and if it is fixed, I will write a patch again. Thanks.