From unknown Mon Aug 18 08:27:49 2025
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.509 (Entity 5.509)
Content-Type: text/plain; charset=utf-8
From: bug#17025 <17025@debbugs.gnu.org>
To: bug#17025 <17025@debbugs.gnu.org>
Subject: Status: [PATCH] grep: matching line-by-line with regex
Reply-To: bug#17025 <17025@debbugs.gnu.org>
Date: Mon, 18 Aug 2025 15:27:49 +0000

retitle 17025 [PATCH] grep: matching line-by-line with regex
reassign 17025 grep
submitter 17025 Norihiro Tanaka
severity 17025 normal
tag 17025 patch
thanks

From debbugs-submit-bounces@debbugs.gnu.org Mon Mar 17 10:49:33 2014
Received: (at submit) by debbugs.gnu.org; 17 Mar 2014 14:49:33 +0000
Received: from localhost ([127.0.0.1]:39070 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from )
 id 1WPYr6-0002kZ-Rl for submit@debbugs.gnu.org;
 Mon, 17 Mar 2014 10:49:33 -0400
Received: from pbsg500.nifty.com ([202.248.238.70]:31189)
 by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from )
 id 1WPYr1-0002kN-S6 for submit@debbugs.gnu.org;
 Mon, 17 Mar 2014 10:49:31 -0400
Received: from [10.120.1.51]
 (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) (authenticated)
 by pbsg500.nifty.com with ESMTP id s2HEnLp7016960 for ;
 Mon, 17 Mar 2014 23:49:22 +0900
X-Nifty-SrcIP: [118.21.128.66]
Date: Mon, 17 Mar 2014 23:49:20 +0900
From: Norihiro Tanaka
To: submit@debbugs.gnu.org
Subject: [PATCH] grep: matching line-by-line with regex
Message-Id: <20140317234912.7261.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed;
 boundary="------_531AAC47000000000212_MULTIPART_MIXED_"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: 3.9 (+++)
X-Spam-Report: Spam detection software, running on the system
 "debbugs.gnu.org", has identified this incoming email as possible spam.
 The original message has been attached to this so you can view it (if it
 isn't spam) or label similar future email.
 Content analysis details: (3.9 points, 10.0 required)
  pts rule name              description
 ---- ---------------------- --------------------------------------------------
  2.7 RCVD_IN_PSBL           RBL: Received via a relay in PSBL
                             [202.248.238.70 listed in psbl.surriel.com]
  1.2 RCVD_IN_BL_SPAMCOP_NET RBL: Received via a relay in bl.spamcop.net
                             [Blocked - see ]
 -0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
 -0.0 T_RP_MATCHES_RCVD      Envelope sender domain matches handover relay domain
 -0.0 SPF_PASS               SPF: sender matches SPF record

--------_531AAC47000000000212_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Package: grep
Tags: patch

I ran the following test, which uses the regex engine in a non-UTF-8 locale.

$ yes abcd.abc | head -10000 > m
$ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
real 7.28
user 6.36
sys 0.57

It's extremely slow. When the regex engine is used in grep, the text is
split into lines. However, the whole buffer is passed to re_search and
re_match. That seems wrong to me.

Norihiro

--------_531AAC47000000000212_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Disposition: attachment; filename="patch1.txt"
Content-Transfer-Encoding: 7bit

>From 7187092186b982b95e94df81393e8fa72060985c Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka
Date: Mon, 17 Mar 2014 23:46:31 +0900
Subject: [PATCH] grep: matching line-by-line with regex

* src/dfasearch.c (EGexecute): matching line-by-line with regex.
---
 src/dfasearch.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/src/dfasearch.c b/src/dfasearch.c
index 0b56960..8697383 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -204,7 +204,7 @@ size_t
 EGexecute (char const *buf, size_t size, size_t *match_size,
            char const *start_ptr)
 {
-  char const *buflim, *beg, *end, *match, *best_match, *mb_start;
+  char const *buflim, *beg, *end, *ptr, *match, *best_match, *mb_start;
   char eol = eolbyte;
   int backref;
   regoff_t start;
@@ -272,18 +272,20 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
           /* Successful, no backreferences encountered! */
           if (!backref)
             goto success;
+          ptr = beg;
         }
       else
         {
           /* We are looking for the leftmost (then longest) exact match.
              We will go through the outer loop only once. */
-          beg = start_ptr;
+          beg = buf;
           end = buflim;
+          ptr = start_ptr;
         }
 
       /* If the "line" is longer than the maximum regexp offset,
          die as if we've run out of memory. */
-      if (TYPE_MAXIMUM (regoff_t) < end - buf - 1)
+      if (TYPE_MAXIMUM (regoff_t) < end - beg - 1)
         xalloc_die ();
 
       /* If we've made it to this point, this means DFA has seen
@@ -294,24 +296,24 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
         {
           patterns[i].regexbuf.not_eol = 0;
           start = re_search (&(patterns[i].regexbuf),
-                             buf, end - buf - 1,
-                             beg - buf, end - beg - 1,
+                             beg, end - beg - 1,
+                             ptr - beg, end - ptr - 1,
                              &(patterns[i].regs));
           if (start < -1)
             xalloc_die ();
           else if (0 <= start)
             {
               len = patterns[i].regs.end[0] - start;
-              match = buf + start;
+              match = beg + start;
               if (match > best_match)
                 continue;
               if (start_ptr && !match_words)
                 goto assess_pattern_match;
               if ((!match_lines && !match_words)
-                  || (match_lines && len == end - beg - 1))
+                  || (match_lines && len == end - ptr - 1))
                 {
-                  match = beg;
-                  len = end - beg;
+                  match = ptr;
+                  len = end - ptr;
                   goto assess_pattern_match;
                 }
               /* If -w, check if the match aligns with word boundaries.
@@ -325,8 +327,8 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
                 while (match <= best_match)
                   {
                     regoff_t shorter_len = 0;
-                    if ((match == buf || !WCHAR (to_uchar (match[-1])))
-                        && (start + len == end - buf - 1
+                    if ((match == beg || !WCHAR (to_uchar (match[-1])))
+                        && (start + len == end - beg - 1
                             || !WCHAR (to_uchar (match[len]))))
                       goto assess_pattern_match;
                     if (len > 0)
@@ -335,8 +337,8 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
                         --len;
                         patterns[i].regexbuf.not_eol = 1;
                         shorter_len = re_match (&(patterns[i].regexbuf),
-                                                buf, match + len - beg,
-                                                match - buf,
+                                                beg, match + len - ptr,
+                                                match - beg,
                                                 &(patterns[i].regs));
                         if (shorter_len < -1)
                           xalloc_die ();
@@ -351,8 +353,8 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
                         match++;
                         patterns[i].regexbuf.not_eol = 0;
                         start = re_search (&(patterns[i].regexbuf),
-                                           buf, end - buf - 1,
-                                           match - buf, end - match - 1,
+                                           beg, end - beg - 1,
+                                           match - beg, end - match - 1,
                                            &(patterns[i].regs));
                         if (start < 0)
                           {
@@ -361,7 +363,7 @@ EGexecute (char const *buf, size_t size, size_t *match_size,
                             break;
                           }
                         len = patterns[i].regs.end[0] - start;
-                        match = buf + start;
+                        match = beg + start;
                       }
                   } /* while (match <= best_match) */
               continue;
-- 
1.9.0

--------_531AAC47000000000212_MULTIPART_MIXED_--

From debbugs-submit-bounces@debbugs.gnu.org Tue Apr 01 05:10:46 2014
Received: (at 17025) by debbugs.gnu.org; 1 Apr 2014 09:10:46 +0000
Received: from localhost ([127.0.0.1]:58780 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from )
 id 1WUuiU-0004yL-5j for submit@debbugs.gnu.org;
 Tue, 01 Apr 2014 05:10:46 -0400
Received: from mail-we0-f182.google.com ([74.125.82.182]:48595)
 by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from )
 id 1WUuiR-0004yA-Js for 17025@debbugs.gnu.org;
 Tue, 01 Apr 2014 05:10:44 -0400
Received: by mail-we0-f182.google.com with SMTP id p61so6009777wes.13
 for <17025@debbugs.gnu.org>; Tue, 01 Apr 2014 02:10:42 -0700 (PDT)
DKIM-Signature: v=1;
 a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=sender:message-id:date:from:user-agent:mime-version:to:subject
 :references:in-reply-to:content-type:content-transfer-encoding;
 bh=9gns+IO+gPRayqBBNwgTQcw+qiWfvDhdtXL5nSvzQeU=;
 b=T8n6kdZlcaRMLT9EFEMGxLxLwXmguTU9LEDrJjm6lgKjE10AlSo2e19V3Ds6cTav4r
 B+i6WZkxSZ5p5O9CeoWg9xtwGRUWIwiCJRCHpZIrrsKm8jyLAUDhsf32tHPghsKuhkJj
 ps2drRX8R9w2fHJq32jNvQog4bYk6sNjy7Nz+/k2qyKM8/1k6MA6cobd/x9CPghmVc1J
 TydL0oD3vatfkR7BH4kW+1rMeAS2cbrpURkwjdKYxEociavhKarAygRxn7hHaECkvfJ4
 +L+fUBliUxBxHU0rUSW9/9ADi2LY9KzeiJL1yE8q8n359sXub7c/Ct8YRyzk0KjXfi7q
 JQeA==
X-Received: by 10.180.100.72 with SMTP id ew8mr18593536wib.16.1396343442196;
 Tue, 01 Apr 2014 02:10:42 -0700 (PDT)
Received: from yakj.usersys.redhat.com
 (net-37-117-156-129.cust.vodafonedsl.it. [37.117.156.129])
 by mx.google.com with ESMTPSA id 48sm39136431eei.24.2014.04.01.02.10.40
 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Tue, 01 Apr 2014 02:10:41 -0700 (PDT)
Message-ID: <533A828E.1030400@gnu.org>
Date: Tue, 01 Apr 2014 11:10:38 +0200
From: Paolo Bonzini
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101
 Thunderbird/24.4.0
MIME-Version: 1.0
To: Norihiro Tanaka , 17025@debbugs.gnu.org
Subject: Re: bug#17025: [PATCH] grep: matching line-by-line with regex
References: <20140317234912.7261.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20140317234912.7261.27F6AC2D@kcn.ne.jp>
X-Enigmail-Version: 1.6
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 17025
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"

On 17/03/2014 15:49, Norihiro Tanaka wrote:
> Package: grep
> Tags: patch
>
> I ran the following test, which uses the regex engine in a non-UTF-8 locale.
>
> $ yes abcd.abc | head -10000 > m
> $ time -p env LC_ALL=ja_JP.eucJP src/grep abcd.abd m
> real 7.28
> user 6.36
> sys 0.57
>
> It's extremely slow. When the regex engine is used in grep, the text is
> split into lines. However, the whole buffer is passed to re_search and
> re_match. That seems wrong to me.

Yes, very good catch. It's likely that the old bytecode matcher didn't
care, but the new one in glibc has to process even the "ignored" part of
the buffer to find the boundaries of multibyte characters.

Paolo

From debbugs-submit-bounces@debbugs.gnu.org Sun Apr 06 01:27:05 2014
Received: (at 17025-done) by debbugs.gnu.org; 6 Apr 2014 05:27:05 +0000
Received: from localhost ([127.0.0.1]:37357 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from )
 id 1WWfbk-0006Xg-J4 for submit@debbugs.gnu.org;
 Sun, 06 Apr 2014 01:27:05 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:46919)
 by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from )
 id 1WWfbj-0006XQ-0B; Sun, 06 Apr 2014 01:27:03 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 3C5FEA60001;
 Sat, 5 Apr 2014 22:27:02 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id DpuR60Vk9qSu; Sat, 5 Apr 2014 22:26:57 -0700 (PDT)
Received: from [192.168.1.9]
 (pool-108-0-233-62.lsanca.fios.verizon.net [108.0.233.62])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 8034C39E8014;
 Sat, 5 Apr 2014 22:26:57 -0700 (PDT)
Message-ID: <5340E5A0.6040107@cs.ucla.edu>
Date: Sat, 05 Apr 2014 22:26:56 -0700
From: Paul Eggert
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101
 Thunderbird/24.4.0
MIME-Version: 1.0
To: Norihiro Tanaka , 17025-done@debbugs.gnu.org
Subject: Re: bug#17025: [PATCH] grep: matching line-by-line with regex
References:
 <20140317234912.7261.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20140317234912.7261.27F6AC2D@kcn.ne.jp>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.9 (--)
X-Debbugs-Envelope-To: 17025-done
Cc: Paolo Bonzini , 17156@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"

Thanks for this bug report and patch. Paolo wrote it up in , and I
installed it into the savannah grep master and am marking Bug#17025 as
done.

From unknown Mon Aug 18 08:27:49 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Sun, 04 May 2014 11:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator