From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: poor performance since grep 2.19 when comparing files with grep Resent-From: "Bennett, Steve" Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 26 Oct 2015 14:19:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 21763@debbugs.gnu.org X-Debbugs-Original-To: "bug-grep@gnu.org" Received: via spool by submit@debbugs.gnu.org id=B.144586912931350 (code B ref -1); Mon, 26 Oct 2015 14:19:03 +0000 Received: (at submit) by debbugs.gnu.org; 26 Oct 2015 14:18:49 +0000 Received: from localhost ([127.0.0.1]:39287 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zqibn-00089Z-Tv for submit@debbugs.gnu.org; Mon, 26 Oct 2015 10:18:48 -0400 Received: from eggs.gnu.org ([208.118.235.92]:60697) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1ZqhHu-0005zM-8s for submit@debbugs.gnu.org; Mon, 26 Oct 2015 08:54:29 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZqhHs-0000Ck-Vo for submit@debbugs.gnu.org; Mon, 26 Oct 2015 08:54:09 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:45655) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZqhHs-0000Cg-Sr for submit@debbugs.gnu.org; Mon, 26 Oct 2015 08:54:08 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:42470) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZqhHr-00056c-Ts for bug-grep@gnu.org; Mon, 26 Oct 2015 08:54:08 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZqhHo-0000C8-ND for bug-grep@gnu.org; Mon, 26 Oct 2015 08:54:07 -0400 Received: from sideburn.lancs.ac.uk ([148.88.17.22]:39853) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZqhHo-0000BA-HE for bug-grep@gnu.org; Mon, 26 Oct 2015 08:54:04 -0400 Received: from ex-0-ht0.lancs.ac.uk ([10.42.18.47] helo=EX-0-HT0.lancs.local) by sideburn.lancs.ac.uk with esmtp (Exim 4.72) (envelope-from ) id 1ZqhHm-0007Wk-Bh for bug-grep@gnu.org; Mon, 26 Oct 2015 12:54:02 +0000 Received: from EX-0-MB1.lancs.local ([fe80::1c8:ad8:cba3:7976]) by EX-0-HT0.lancs.local ([fe80::7d10:114a:53b0:7f2f%12]) with mapi id 14.03.0248.002; Mon, 26 Oct 2015 12:54:02 +0000 From: "Bennett, Steve" Thread-Topic: poor performance since grep 2.19 when comparing files with grep Thread-Index: AdEP6Xx378J5j24uQGyB2jN30u7Jfg== Date: Mon, 26 Oct 2015 12:54:01 +0000 Message-ID: <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [148.88.134.80] x-iss-local-domain: 1 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.3 (----) X-Mailman-Approved-At: Mon, 26 Oct 2015 10:18:46 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.3 (----) Apologies in advance if this is more of a "discuss" question, but it looks = like a particular use-case shows a marked change in performance between rec= ent versions of grep. A colleague mentioned a performance issue with grep to me, and its puzzling= me a bit. It turns out that he was using "grep -Fvif" to find lines in one file that = are not present in another. Up until grep 2.18 this seems to work with linear performance and it takes = less than 50ms to compare files up to about 20,000 lines. With grep 2.19 and later, ever relatively small files are quite slow, runti= me (and memory use) increases exponentially (e.g. 300ms to compare 200 line= s, 1.5s to compare 400 lines, 5s to compare 600 lines). I've shown my colleague how to use sort and diff (and "comm", which I think= is vastly underrated), but it made me wonder if this is a reasonable thing= to expect grep to be able to do, and whether such a performance drop shoul= d be seen as a bug. The way he was using it, he had two (unsorted) data sets (about 6000 rows i= n each), with most lines being common, and he was just using: grep -Fvif FILE1 FILE2 In his case, the older version of grep took way less than a second to run, = but after he had upgraded his machine it took 20 minutes before running out= of swap and seg faulting.=20 In terms of comparing performance, I've found that the following works to c= ompare performance (vary N to try different sized data files): N=3D600; F=3D/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N >= $F; time grep -Fvif $F $F; rm $F Steve. From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: poor performance since grep 2.19 when comparing files with grep Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 26 Oct 2015 18:41:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: "Bennett, Steve" Cc: 21763@debbugs.gnu.org Received: via spool by 21763-submit@debbugs.gnu.org id=B21763.144588481722490 (code B ref 21763); Mon, 26 Oct 2015 18:41:02 +0000 Received: (at 21763) by debbugs.gnu.org; 26 Oct 2015 18:40:17 +0000 Received: from localhost ([127.0.0.1]:39384 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zqmgq-0005qf-Vi for submit@debbugs.gnu.org; Mon, 26 Oct 2015 14:40:17 -0400 Received: from mail-vk0-f53.google.com ([209.85.213.53]:33850) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zqmgo-0005qW-55 for 21763@debbugs.gnu.org; Mon, 26 Oct 2015 14:40:14 -0400 Received: by vkgs66 with SMTP id s66so44891684vkg.1 for <21763@debbugs.gnu.org>; Mon, 26 Oct 2015 11:40:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=vbot7Thu9fL782U8amVBUgFLpJpJtH5FJX9A+TAvQmc=; b=MC0sD+1VCD6xjrUlRqHAHDy4jvNDEPgP/txT0bPRecW0Qn4CC/B+qxuA6ma6dU8Nep mnNe7qnjYO9toxAwoLMYCvkIbnr1lVbnTCYPpl5hE4W++eKMV03guzat9QnOgSxH8C/y fsaOLlQUPMajKxfKCy4HsOTg7IRwLASNNo5ZUjcHEO2H+n0hdAvR08sDI2h00bELbbLb gMJuWHEDbXc2yK4HmM3GiJneEnUzVgopl803fEGe/6246F/JmTqHm8iORK9TMtmQJ+tY OwUQxuG/Wkt8FQdficfMPY2fGTzZmxLHgu0ZBR8T2J63mGmgSzf8CFodpjdIyKy1euAB k/5w== X-Received: by 10.31.178.198 with SMTP id b189mr22565426vkf.114.1445884813583; Mon, 26 Oct 2015 11:40:13 -0700 (PDT) MIME-Version: 1.0 Received: by 10.31.65.11 with HTTP; Mon, 26 Oct 2015 11:39:54 -0700 (PDT) In-Reply-To: <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> References: <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> From: Jim Meyering Date: Mon, 26 Oct 2015 11:39:54 -0700 X-Google-Sender-Auth: S0b8YcK2EWFWV_WkzOJlbjpv-fM Message-ID: Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve wrote: > Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep. > > A colleague mentioned a performance issue with grep to me, and its puzzling me a bit. > It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another. > > Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines. > With grep 2.19 and later, ever relatively small files are quite slow, runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines). > > I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug. > > The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using: > grep -Fvif FILE1 FILE2 > In his case, the older version of grep took way less than a second to run, but after he had upgraded his machine it took 20 minutes before running out of swap and seg faulting. > > In terms of comparing performance, I've found that the following works to compare performance (vary N to try different sized data files): > N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F Thank you for reporting that. Interesting: that progression (time vs. increasing N) is clearly quadratic or worse when using a multibyte locale, but is linear with LC_ALL=C. I suspect when you run "locale", it reports something like en_US.utf8. I.e., if you have no need for multi-byte matching, set LC_ALL=C, and that idiom will be very quick, even for a million lines: $ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F 0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata 131536maxresident)k 0inputs+0outputs (0major+32587minor)pagefaults 0swaps Currently, I am not planning even to investigate this for the imminent release. From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: poor performance since grep 2.19 when comparing files with grep Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Tue, 27 Oct 2015 00:08:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 21763@debbugs.gnu.org Cc: Jim Meyering , "Bennett, Steve" Received: via spool by 21763-submit@debbugs.gnu.org id=B21763.144590443618790 (code B ref 21763); Tue, 27 Oct 2015 00:08:01 +0000 Received: (at 21763) by debbugs.gnu.org; 27 Oct 2015 00:07:16 +0000 Received: from localhost ([127.0.0.1]:39510 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1ZqrnH-0004sz-4c for submit@debbugs.gnu.org; Mon, 26 Oct 2015 20:07:15 -0400 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:56722) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zqrmw-0004sI-L1 for 21763@debbugs.gnu.org; Mon, 26 Oct 2015 20:07:14 -0400 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 7B1AFC8010 for <21763@debbugs.gnu.org>; Tue, 27 Oct 2015 09:06:52 +0900 (JST) X-matriXscan-loop-detect: 0919ca1610846baeb3f7657d3ec32036264ec145 Received: from mail01.kcn.ne.jp ([61.86.6.180]) by mxs02-s with ESMTP; Tue, 27 Oct 2015 09:06:51 +0900 (JST) Received: from [10.120.1.48] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail01.kcn.ne.jp (Postfix) with ESMTPA id 510255A82CF; Tue, 27 Oct 2015 09:06:51 +0900 (JST) Date: Tue, 27 Oct 2015 09:06:51 +0900 From: Norihiro Tanaka In-Reply-To: References: <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> Message-Id: <20151027090650.F3C2.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Mon, 26 Oct 2015 11:39:54 -0700 Jim Meyering wrote: > On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve > wrote: > > Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep. > > > > A colleague mentioned a performance issue with grep to me, and its puzzling me a bit. > > It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another. > > > > Up until grep 2.18 this seems to work with linear performance and it takes less than 50ms to compare files up to about 20,000 lines. > > With grep 2.19 and later, ever relatively small files are quite slow, runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines). > > > > I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated), but it made me wonder if this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug. > > > > The way he was using it, he had two (unsorted) data sets (about 6000 rows in each), with most lines being common, and he was just using: > > grep -Fvif FILE1 FILE2 > > In his case, the older version of grep took way less than a second to run, but after he had upgraded his machine it took 20 minutes before running out of swap and seg faulting. > > > > In terms of comparing performance, I've found that the following works to compare performance (vary N to try different sized data files): > > N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F > > Thank you for reporting that. > Interesting: that progression (time vs. increasing N) is clearly > quadratic or worse when using a multibyte locale, but is linear with > LC_ALL=C. I suspect when you run "locale", it reports something like > en_US.utf8. > > I.e., if you have no need for multi-byte matching, set LC_ALL=C, and > that idiom will be very quick, even for a million lines: > > $ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 > $N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F > 0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata > 131536maxresident)k > 0inputs+0outputs (0major+32587minor)pagefaults 0swaps > > Currently, I am not planning even to investigate this for the imminent release. In grep 2.19 or later, grep -Fi is re-written to grep -i and corresponding regular expression in only multibyte-locales. Now, assume the case of N=2. -- 1 bottles of beer on the wall 2 bottles of beer on the wall -- "grep -Fif $F" to "grep -e 'bottles of beer on the wall\|2 bottles of - beer on the wall' $F" By the way, we may expect to "grep -e '\(1\|2\) bottles of beer on the - wall'", but grep does not do it. It will cause slow down. From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: poor performance since grep 2.19 when comparing files with grep Resent-From: "Bennett, Steve" Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Tue, 27 Oct 2015 09:07:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Jim Meyering Cc: "21763@debbugs.gnu.org" <21763@debbugs.gnu.org> Received: via spool by 21763-submit@debbugs.gnu.org id=B21763.14459367899405 (code B ref 21763); Tue, 27 Oct 2015 09:07:02 +0000 Received: (at 21763) by debbugs.gnu.org; 27 Oct 2015 09:06:29 +0000 Received: from localhost ([127.0.0.1]:39970 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zr0D7-0002Rc-3p for submit@debbugs.gnu.org; Tue, 27 Oct 2015 05:06:29 -0400 Received: from sideburn.lancs.ac.uk ([148.88.17.22]:51789) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zr0D4-0002RS-62 for 21763@debbugs.gnu.org; Tue, 27 Oct 2015 05:06:27 -0400 Received: from ex-1-ht0.lancs.ac.uk ([10.42.18.57] helo=EX-1-HT0.lancs.local) by sideburn.lancs.ac.uk with esmtp (Exim 4.72) (envelope-from ) id 1Zr0D2-0008Kp-Tn; Tue, 27 Oct 2015 09:06:24 +0000 Received: from EX-0-MB1.lancs.local ([fe80::1c8:ad8:cba3:7976]) by EX-1-HT0.lancs.local ([fe80::d9e8:ad10:d075:a6b6%12]) with mapi id 14.03.0248.002; Tue, 27 Oct 2015 09:06:24 +0000 From: "Bennett, Steve" Thread-Topic: bug#21763: poor performance since grep 2.19 when comparing files with grep Thread-Index: AdEP6Xx378J5j24uQGyB2jN30u7JfgANDgYAAB4dhkA= Date: Tue, 27 Oct 2015 09:06:24 +0000 Message-ID: <1F9F4161D543984DB0123972858E8344490E862A@EX-0-MB1.lancs.local> References: <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> In-Reply-To: Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [148.88.134.80] x-iss-local-domain: 1 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 MIME-Version: 1.0 X-Spam-Score: -1.6 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.6 (-) PiBUaGFuayB5b3UgZm9yIHJlcG9ydGluZyB0aGF0Lg0KPiBJbnRlcmVzdGluZzogdGhhdCBwcm9n cmVzc2lvbiAodGltZSB2cy4gaW5jcmVhc2luZyBOKSBpcyBjbGVhcmx5IHF1YWRyYXRpYw0KPiBv ciB3b3JzZSB3aGVuIHVzaW5nIGEgbXVsdGlieXRlIGxvY2FsZSwgYnV0IGlzIGxpbmVhciB3aXRo IExDX0FMTD1DLg0KPiBJIHN1c3BlY3Qgd2hlbiB5b3UgcnVuICJsb2NhbGUiLCBpdCByZXBvcnRz IHNvbWV0aGluZyBsaWtlIGVuX1VTLnV0ZjguDQoNClllcywgaXQncyBhIFVURjggbG9jYWxlLiBJ IGhhZG4ndCB0aG91Z2h0IG9mIHRoYXQgYW5kIEkgc2hvdWxkIGhhdmUgcmVhbGlzZWQsDQpnaXZl biB0aGF0IHRoYXQncyB3aGF0IHRoZSBtYWluIGNoYW5nZSB3YXMgaW4gZ3JlcCB2Mi4xOS4NCg0K PiBJLmUuLCBpZiB5b3UgaGF2ZSBubyBuZWVkIGZvciBtdWx0aS1ieXRlIG1hdGNoaW5nLCBzZXQg TENfQUxMPUMsIGFuZCB0aGF0IA0KPiBpZGlvbSB3aWxsIGJlIHZlcnkgcXVpY2ssIGV2ZW4gZm9y IGEgbWlsbGlvbiBsaW5lczoNCg0KWWVzLCB5b3UncmUgdG90YWxseSByaWdodCB0aGVyZS4gVGhh bmtzIQ0KDQo+IEN1cnJlbnRseSwgSSBhbSBub3QgcGxhbm5pbmcgZXZlbiB0byBpbnZlc3RpZ2F0 ZSB0aGlzIGZvciB0aGUgaW1taW5lbnQNCj4gcmVsZWFzZS4NCg0KSSB0b3RhbGx5IGFncmVlIQ0K DQpDaGVlcnMsDQoNClN0ZXZlLg0K From unknown Sun Jun 22 03:59:00 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: "Bennett, Steve" Subject: bug#21763: closed (Re: bug#22357: grep -f not only huge memory usage, but also huge time cost) Message-ID: References: <3a0eb353-96a4-80fd-5f3c-7fd77be096a0@cs.ucla.edu> <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> X-Gnu-PR-Message: they-closed 21763 X-Gnu-PR-Package: grep Reply-To: 21763@debbugs.gnu.org Date: Wed, 21 Dec 2016 06:46:02 +0000 Content-Type: multipart/mixed; boundary="----------=_1482302762-26538-1" This is a multi-part message in MIME format... ------------=_1482302762-26538-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #21763: poor performance since grep 2.19 when comparing files with grep which was filed against the grep package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 21763@debbugs.gnu.org. --=20 21763: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D21763 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1482302762-26538-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 21763-done) by debbugs.gnu.org; 21 Dec 2016 06:45:45 +0000 Received: from localhost ([127.0.0.1]:49596 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cJaem-0006t6-3r for submit@debbugs.gnu.org; Wed, 21 Dec 2016 01:45:45 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:39584) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cJZH5-0004FH-FU; Wed, 21 Dec 2016 00:17:12 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 497A1160059; Tue, 20 Dec 2016 21:17:04 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id WYG9VHYcmlCw; Tue, 20 Dec 2016 21:17:02 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 4B27216005F; Tue, 20 Dec 2016 21:17:02 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 2N8nxvMb7QRK; Tue, 20 Dec 2016 21:17:02 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D8F6C160059; Tue, 20 Dec 2016 21:17:01 -0800 (PST) Subject: Re: bug#22357: grep -f not only huge memory usage, but also huge time cost To: 22357-done@debbugs.gnu.org, 21763-done@debbugs.gnu.org References: <2051273399.9196271.1452605975684.JavaMail.zimbra@redhat.com> <20161209072419.GA30226@pog.tecnopolis.ca> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <3a0eb353-96a4-80fd-5f3c-7fd77be096a0@cs.ucla.edu> Date: Tue, 20 Dec 2016 21:17:01 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: <20161209072419.GA30226@pog.tecnopolis.ca> Content-Type: multipart/mixed; boundary="------------4476EE1989198F97E6DF6D47" X-Spam-Score: -0.1 (/) X-Debbugs-Envelope-To: 21763-done X-Mailman-Approved-At: Wed, 21 Dec 2016 01:45:43 -0500 Cc: Norihiro Tanaka , sur-behoffski , "L.A. Walsh" , Jim Meyering , "Bennett, Steve" , JQK , arnold@skeeve.com, 22239@debbugs.gnu.org, Trevor Cordes , Paul Jackson , Bruno Haible , =?UTF-8?B?T25kxZllaiBDw61ma2E=?= , Bruce Dubbs X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) This is a multi-part message in MIME format. --------------4476EE1989198F97E6DF6D47 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable I installed the attached patches into grep master. These fix the performa= nce=20 regressions noted at the start of Bug#22357. I see that the related perfo= rmance=20 problems noted in Bug#21763 seem to be fixed too, I expect because of Nor= ihiro=20 Tanaka's recent changes, so I'll boldly close both bug reports. To some extent the attached patches restore the old behavior for grep -F,= when=20 grep is given two or more patterns. The patch doesn't change the underlyi= ng=20 algorithms; it merely uses a different heuristic to decide whether to use= the -F=20 matcher. Although I wouldn't be surprised if the attached patches hurt=20 performance in some cases, I didn't uncover any such cases in my performa= nce=20 testing, which I admit mostly consisted of running the examples in the=20 abovementioned bug reports. I'll leave Bug#22239 open, as I get the following performance figures=20 (user+system CPU time) for the Bug#22239 benchmark, where list.txt is cre= ated by=20 "aspell dump master | head -n 100000 >list.txt", and the grep commands al= l use=20 the operands "-F -f list.txt /etc/passwd" in the en_US.utf8 locale on Fed= ora 24=20 x86-64. no -i -i grep version 0.25 0.33 2.16 0.26 10.95 2.21 0.11 2.90* current master (including attached patches) In the C locale, the current grep master is always significantly faster t= han=20 grep 2.16 or 2.21 on the benchmark, so the only significant problem is th= e=20 number marked "*". I ran the benchmarks on an AMD Phenom II X4 910e. --------------4476EE1989198F97E6DF6D47 Content-Type: text/x-diff; name="0001-grep-simplify-line-counting-in-patterns.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0001-grep-simplify-line-counting-in-patterns.patch" =46rom 75bca3ca0a6ea87a66531ab7c0f8859e1d6c4300 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Tue, 20 Dec 2016 08:56:06 -0800 Subject: [PATCH 1/3] grep: simplify line counting in patterns * src/grep.c (n_patterns): Rename from patfile_lineno, as it is now origin-zero. Now size_t, not uintmax_t. (count_nl_bytes, fl_add): Simplify to just buffer and size. All callers changed. --- src/grep.c | 49 ++++++++++++++++++++----------------------------- 1 file changed, 20 insertions(+), 29 deletions(-) diff --git a/src/grep.c b/src/grep.c index 30e0f54..76f83d5 100644 --- a/src/grep.c +++ b/src/grep.c @@ -104,46 +104,35 @@ static size_t n_fl_pair_slots; and any command-line argument that serves as a regular expression. *= / static size_t n_pattern_files; =20 -/* Given the concatenation of all patterns, one per line, be they - specified via -e, a lone command-line argument or -f, this is the - number of the first line of each entity, in that concatenation. +/* The number of patterns seen so far. It is advanced by fl_add and, when needed, used in pattern_file_name to derive a file-relative line number. */ -static uintmax_t patfile_lineno =3D 1; +static size_t n_patterns; =20 -/* Return the number of newline bytes in BUF starting at offset BEG - and up to and not including offset END. */ +/* Return the number of newline bytes in BUF with size SIZE. */ static size_t _GL_ATTRIBUTE_PURE -count_nl_bytes (char const *buf, size_t beg, size_t end) +count_nl_bytes (char const *buf, size_t size) { - char const *p =3D buf + beg; - char const *end_p =3D buf + end; - uintmax_t n =3D 0; - while (true) - { - p =3D memchr (p, '\n', end_p - p); - if (!p) - break; - p++; - n++; - } + char const *p =3D buf; + char const *end_p =3D buf + size; + size_t n =3D 0; + while ((p =3D memchr (p, '\n', end_p - p))) + p++, n++; return n; } =20 -/* Append a FILENAME,line-number pair to FL_PAIR. The line number we sa= ve - with FILENAME is the initial value of the global PATFILE_LINENO. - PATFILE_LINENO is then incremented by the number of newlines in BUF - from offset BEG up to but not including offset END. */ +/* Append a FILENAME,line-number pair to FL_PAIR, and update + pattern-related counts from the contents of BUF with SIZE bytes. */ static void -fl_add (char const *buf, size_t beg, size_t end, char const *filename) +fl_add (char const *buf, size_t size, char const *filename) { if (n_fl_pair_slots <=3D n_pattern_files) fl_pair =3D x2nrealloc (fl_pair, &n_fl_pair_slots, sizeof *fl_pair);= =20 - fl_pair[n_pattern_files].lineno =3D patfile_lineno; + fl_pair[n_pattern_files].lineno =3D n_patterns + 1; fl_pair[n_pattern_files].filename =3D filename; n_pattern_files++; - patfile_lineno +=3D count_nl_bytes (buf, beg, end); + n_patterns +=3D count_nl_bytes (buf, size); } =20 /* Map the line number, LINENO, of one of the input patterns to the @@ -2521,10 +2510,11 @@ main (int argc, char **argv) keyalloc =3D keycc + cc + 1; keys =3D x2realloc (keys, &keyalloc); } - memcpy (&keys[keycc], optarg, cc); + oldcc =3D keycc; + memcpy (keys + oldcc, optarg, cc); keycc +=3D cc; keys[keycc++] =3D '\n'; - fl_add (keys, keycc - cc - 1, keycc, ""); + fl_add (keys + oldcc, cc + 1, ""); break; =20 case 'f': @@ -2548,7 +2538,7 @@ main (int argc, char **argv) /* Append final newline if file ended in non-newline. */ if (oldcc !=3D keycc && keys[keycc - 1] !=3D '\n') keys[keycc++] =3D '\n'; - fl_add (keys, oldcc, keycc, xstrdup (optarg)); + fl_add (keys + oldcc, keycc - oldcc, optarg); break; =20 case 'h': @@ -2741,7 +2731,8 @@ main (int argc, char **argv) /* Make a copy so that it can be reallocated or freed later. */ keycc =3D strlen (argv[optind]); keys =3D xmemdup (argv[optind++], keycc + 1); - fl_add (keys, 0, keycc, ""); + fl_add (keys, keycc, ""); + n_patterns++; } else usage (EXIT_TROUBLE); --=20 2.7.4 --------------4476EE1989198F97E6DF6D47 Content-Type: text/x-diff; name="0002-grep-simplify-matcher-configuration.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0002-grep-simplify-matcher-configuration.patch" =46rom 214be13d33a26ab45231afe8e983d9a174aae282 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Tue, 20 Dec 2016 08:56:06 -0800 Subject: [PATCH 2/3] grep: simplify matcher configuration * src/grep.c (matcher, compile): Remove static vars. (compile_fp_t): Now takes a 3rd syntax argument. (Gcomppile, Ecompile, Acompile, GAcompile, PAcompile): Remove. (struct matcher): Now nameless, since it is used only once. Make 'name' a bit shorter. New member 'syntax'. (matchers): Initialize it, and change removed functions to GEAcompile. (F_MATCHER_INDEX, G_MATCHER_INDEX): New constants. (setmatcher): New arg MATCHER, and return new matcher index. Avoid unnecessary call to strcmp. (main): Keep matcher as a local int, not a global pointer. * src/kwsearch.c (Fcompile): * src/pcresearch.c (Pcompile): Ignore the 3rd syntax argument. --- src/grep.c | 111 +++++++++++++++++++------------------------------= ------ src/kwsearch.c | 2 +- src/pcresearch.c | 2 +- src/search.h | 4 +- 4 files changed, 41 insertions(+), 78 deletions(-) diff --git a/src/grep.c b/src/grep.c index 76f83d5..574a380 100644 --- a/src/grep.c +++ b/src/grep.c @@ -491,8 +491,6 @@ bool match_words; bool match_lines; char eolbyte; =20 -static char const *matcher; - /* For error messages. */ /* The input file name, or (if standard input) null or a --label argumen= t. */ static char const *filename; @@ -577,9 +575,8 @@ static bool seek_failed; static bool seek_data_failed; =20 /* Functions we'll use to search. */ -typedef void (*compile_fp_t) (char const *, size_t); +typedef void (*compile_fp_t) (char const *, size_t, reg_syntax_t); typedef size_t (*execute_fp_t) (char const *, size_t, size_t *, char con= st *); -static compile_fp_t compile; static execute_fp_t execute; =20 static char const * @@ -2019,70 +2016,36 @@ if any error occurs and -q is not given, the exit= status is 2.\n")); =20 /* Pattern compilers and matchers. */ =20 -static void -Gcompile (char const *pattern, size_t size) -{ - GEAcompile (pattern, size, RE_SYNTAX_GREP); -} - -static void -Ecompile (char const *pattern, size_t size) -{ - GEAcompile (pattern, size, RE_SYNTAX_EGREP); -} - -static void -Acompile (char const *pattern, size_t size) -{ - GEAcompile (pattern, size, RE_SYNTAX_AWK); -} - -static void -GAcompile (char const *pattern, size_t size) -{ - GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK); -} - -static void -PAcompile (char const *pattern, size_t size) -{ - GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK); -} - -struct matcher +static struct { - char const name[16]; + char const name[12]; + int syntax; /* used if compile =3D=3D GEAcompile */ compile_fp_t compile; execute_fp_t execute; +} const matchers[] =3D { + { "grep", RE_SYNTAX_GREP, GEAcompile, EGexecute }, + { "egrep", RE_SYNTAX_EGREP, GEAcompile, EGexecute }, + { "fgrep", 0, Fcompile, Fexecute, }, + { "awk", RE_SYNTAX_AWK, GEAcompile, EGexecute }, + { "gawk", RE_SYNTAX_GNU_AWK, GEAcompile, EGexecute }, + { "posixawk", RE_SYNTAX_POSIX_AWK, GEAcompile, EGexecute }, + { "perl", 0, Pcompile, Pexecute, }, }; -static struct matcher const matchers[] =3D { - { "grep", Gcompile, EGexecute }, - { "egrep", Ecompile, EGexecute }, - { "fgrep", Fcompile, Fexecute }, - { "awk", Acompile, EGexecute }, - { "gawk", GAcompile, EGexecute }, - { "posixawk", PAcompile, EGexecute }, - { "perl", Pcompile, Pexecute }, - { "", NULL, NULL }, -}; +/* Keep these in sync with the 'matchers' table. */ +enum { F_MATCHER_INDEX =3D 2, G_MATCHER_INDEX =3D 0 }; =20 -/* Set the matcher to M if available. Exit in case of conflicts or if - M is not available. */ -static void -setmatcher (char const *m) +/* Return the index of the matcher corresponding to M if available. + MATCHER is the index of the previous matcher, or -1 if none. + Exit in case of conflicts or if M is not available. */ +static int +setmatcher (char const *m, int matcher) { - struct matcher const *p; - - if (matcher && !STREQ (matcher, m)) - die (EXIT_TROUBLE, 0, _("conflicting matchers specified")); - - for (p =3D matchers; p->compile; p++) - if (STREQ (m, p->name)) + for (int i =3D 0; i < sizeof matchers / sizeof *matchers; i++) + if (STREQ (m, matchers[i].name)) { - matcher =3D p->name; - compile =3D p->compile; - execute =3D p->execute; - return; + if (0 <=3D matcher && matcher !=3D i) + die (EXIT_TROUBLE, 0, _("conflicting matchers specified")); + return i; } =20 die (EXIT_TROUBLE, 0, _("invalid matcher %s"), m); @@ -2367,6 +2330,7 @@ main (int argc, char **argv) { char *keys =3D NULL; size_t keycc =3D 0, oldcc, keyalloc =3D 0; + int matcher =3D -1; bool with_filenames =3D false; size_t cc; int opt, prepended; @@ -2409,9 +2373,6 @@ main (int argc, char **argv) error (0, 0, _("warning: GREP_OPTIONS is deprecated;" " please use an alias or script")); =20 - compile =3D matchers[0].compile; - execute =3D matchers[0].execute; - while (prev_optind =3D optind, (opt =3D get_nondigit_option (argc, argv, &default_context)) !=3D= -1) switch (opt) @@ -2440,23 +2401,23 @@ main (int argc, char **argv) break; =20 case 'E': - setmatcher ("egrep"); + matcher =3D setmatcher ("egrep", matcher); break; =20 case 'F': - setmatcher ("fgrep"); + matcher =3D setmatcher ("fgrep", matcher); break; =20 case 'P': - setmatcher ("perl"); + matcher =3D setmatcher ("perl", matcher); break; =20 case 'G': - setmatcher ("grep"); + matcher =3D setmatcher ("grep", matcher); break; =20 case 'X': /* undocumented on purpose */ - setmatcher (optarg); + matcher =3D setmatcher (optarg, matcher); break; =20 case 'H': @@ -2794,24 +2755,26 @@ main (int argc, char **argv) =20 initialize_unibyte_mask (); =20 + if (matcher < 0) + matcher =3D G_MATCHER_INDEX; + /* In a unibyte locale, switch from fgrep to grep if the pattern matches words (where grep is typically faster). In a multibyte locale, switch from fgrep to grep if either (1) the pattern has an encoding error (where fgrep might not work),= or (2) case is ignored and a fast fgrep ignore-case is not available. = */ - if (compile =3D=3D Fcompile + if (matcher =3D=3D F_MATCHER_INDEX && (MB_CUR_MAX <=3D 1 ? match_words : (contains_encoding_error (keys, keycc) || (match_icase && !fgrep_icase_available (keys, keycc)))))= { fgrep_to_grep_pattern (&keys, &keycc); - matcher =3D "grep"; - compile =3D Gcompile; - execute =3D EGexecute; + matcher =3D G_MATCHER_INDEX; } =20 - compile (keys, keycc); + execute =3D matchers[matcher].execute; + matchers[matcher].compile (keys, keycc, matchers[matcher].syntax); free (keys); /* We need one byte prior and one after. */ char eolbytes[3] =3D { 0, eolbyte, 0 }; diff --git a/src/kwsearch.c b/src/kwsearch.c index c3e69b3..7275973 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -34,7 +34,7 @@ wordchar (wint_t wc) static kwset_t kwset; =20 void -Fcompile (char const *pattern, size_t size) +Fcompile (char const *pattern, size_t size, reg_syntax_t ignored) { size_t total =3D size; =20 diff --git a/src/pcresearch.c b/src/pcresearch.c index 108baff..0e34861 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -88,7 +88,7 @@ static int empty_match[2]; #endif =20 void -Pcompile (char const *pattern, size_t size) +Pcompile (char const *pattern, size_t size, reg_syntax_t ignored) { #if !HAVE_LIBPCRE die (EXIT_TROUBLE, 0, diff --git a/src/search.h b/src/search.h index 4957a63..1ff5be2 100644 --- a/src/search.h +++ b/src/search.h @@ -57,11 +57,11 @@ extern void GEAcompile (char const *, size_t, reg_syn= tax_t); extern size_t EGexecute (char const *, size_t, size_t *, char const *); =20 /* kwsearch.c */ -extern void Fcompile (char const *, size_t); +extern void Fcompile (char const *, size_t, reg_syntax_t); extern size_t Fexecute (char const *, size_t, size_t *, char const *); =20 /* pcresearch.c */ -extern void Pcompile (char const *, size_t); +extern void Pcompile (char const *, size_t, reg_syntax_t); extern size_t Pexecute (char const *, size_t, size_t *, char const *); =20 /* Return the number of bytes in the character at the start of S, which --=20 2.7.4 --------------4476EE1989198F97E6DF6D47 Content-Type: text/x-diff; name="0003-grep-fix-performance-with-multiple-patterns.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0003-grep-fix-performance-with-multiple-patterns.patch" =46rom 290ca116c9172d97b2b026951fac722d3bd3ced9 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Tue, 20 Dec 2016 18:04:39 -0800 Subject: [PATCH 3/3] grep: fix performance with multiple patterns Problem reported by Jaroslav Skarvada (Bug#22357). * NEWS: Document this and other recent performance fixes. * src/grep.c (E_MATCHER_INDEX): New constant. (all_single_byte_after_folding): New function, split out from fgrep_icase_available. (fgrep_icase_available): Use it. (try_fgrep_pattern): New function, which also uses it. (main): With two or more patterns, use try_fgrep_pattern to fix performance regression. The number "two" here is just a heuristic. --- NEWS | 11 ++++++ src/grep.c | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++----= ------ 2 files changed, 123 insertions(+), 20 deletions(-) diff --git a/NEWS b/NEWS index aa0483b..2e7d1ae 100644 --- a/NEWS +++ b/NEWS @@ -8,6 +8,17 @@ GNU grep NEWS -*- out= line -*- /proc file system and standard output is /dev/null. [bug introduced in grep-2.27] =20 +** Bug fixes + + Fix performance regression with multiple patterns, e.g., for -Fi in + a multi-byte locale, or for -Fw in a single-byte locale. + [bugs introduced in grep-2.19, grep-2.22 and grep-2.26] + +** Improvements + + Improve performance for -E or -G pattern lists that are easily + converted to -F format. + =20 * Noteworthy changes in release 2.27 (2016-12-06) [stable] =20 diff --git a/src/grep.c b/src/grep.c index 574a380..f36654c 100644 --- a/src/grep.c +++ b/src/grep.c @@ -2032,7 +2032,7 @@ static struct { "perl", 0, Pcompile, Pexecute, }, }; /* Keep these in sync with the 'matchers' table. */ -enum { F_MATCHER_INDEX =3D 2, G_MATCHER_INDEX =3D 0 }; +enum { E_MATCHER_INDEX =3D 1, F_MATCHER_INDEX =3D 2, G_MATCHER_INDEX =3D= 0 }; =20 /* Return the index of the matcher corresponding to M if available. MATCHER is the index of the previous matcher, or -1 if none. @@ -2245,23 +2245,12 @@ contains_encoding_error (char const *pat, size_t = patlen) return false; } =20 -/* Return true if the fgrep pattern PAT, of size PATLEN, matches only - single-byte characters including case folding, and so is suitable - to be converted to a grep pattern. */ +/* Return true if the set of single-byte characters USED contains only + characters whose case counterparts are also single-byte. */ =20 static bool -fgrep_icase_available (char const *pat, size_t patlen) +all_single_byte_after_folding (bool used[UCHAR_MAX + 1]) { - bool used[UCHAR_MAX + 1] =3D { 0, }; - - for (size_t i =3D 0; i < patlen; i++) - { - unsigned char c =3D pat[i]; - if (localeinfo.sbctowc[c] =3D=3D WEOF) - return false; - used[c] =3D true; - } - for (int c =3D 0; c <=3D UCHAR_MAX; c++) if (used[c]) { @@ -2281,6 +2270,26 @@ fgrep_icase_available (char const *pat, size_t pat= len) return true; } =20 +/* Return true if the -F patterns PAT, of size PATLEN, match only + single-byte characters including case folding, and so can be + processed by the -F matcher even if -i is given. */ + +static bool +fgrep_icase_available (char const *pat, size_t patlen) +{ + bool used[UCHAR_MAX + 1] =3D { 0, }; + + for (size_t i =3D 0; i < patlen; i++) + { + unsigned char c =3D pat[i]; + if (localeinfo.sbctowc[c] =3D=3D WEOF) + return false; + used[c] =3D true; + } + + return all_single_byte_after_folding (used); +} + /* Change the pattern *KEYS_P, of size *LEN_P, from fgrep to grep style.= */ =20 static void @@ -2325,6 +2334,84 @@ fgrep_to_grep_pattern (char **keys_p, size_t *len_= p) *len_p =3D p - new_keys; } =20 +/* If it is easy, convert the MATCHER-style patterns KEYS (of size + *LEN_P) to -F style, update *LEN_P to a possibly-smaller value, and + return F_MATCHER_INDEX. If not, leave KEYS and *LEN_P alone and + return MATCHER. This function is conservative and sometimes misses + conversions, e.g., it does not convert the -E pattern "(a|a|[aa])" + to the -F pattern "a". */ + +static int +try_fgrep_pattern (int matcher, char *keys, size_t *len_p) +{ + int result =3D matcher; + size_t len =3D *len_p; + char *new_keys =3D xmalloc (len + 1); + char *p =3D new_keys; + mbstate_t mb_state =3D { 0 }; + size_t mblen_lim =3D match_icase ? 1 : -3; + bool used[UCHAR_MAX + 1] =3D { 0, }; + + while (len !=3D 0) + { + switch (*keys) + { + case '$': case '*': case '.': case '[': case '^': + goto fail; + + case '(': case '+': case '?': case '{': case '|': + if (matcher !=3D G_MATCHER_INDEX) + goto fail; + break; + + case '\\': + if (1 < len) + switch (keys[1]) + { + case '\n': + case 'B': case 'S': case 'W': case'\'': case '<': + case 'b': case 's': case 'w': case '`': case '>': + case '1': case '2': case '3': case '4': + case '5': case '6': case '7': case '8': case '9': + goto fail; + + case '(': case '+': case '?': case '{': case '|': + if (matcher =3D=3D G_MATCHER_INDEX) + goto fail; + /* Fall through. */ + default: + keys++, len--; + break; + } + break; + } + + { + size_t n =3D mb_clen (keys, len, &mb_state); + if (mblen_lim < n) + goto fail; + used[to_uchar (keys[0])] =3D true; + p =3D mempcpy (p, keys, n); + keys +=3D n; + len -=3D n; + } + } + + if (mblen_lim =3D=3D 1 && !all_single_byte_after_folding (used)) + goto fail; + + if (*len_p !=3D p - new_keys) + { + *len_p =3D p - new_keys; + memcpy (keys, new_keys, p - new_keys); + } + result =3D F_MATCHER_INDEX; + + fail: + free (new_keys); + return result; +} + int main (int argc, char **argv) { @@ -2758,11 +2845,11 @@ main (int argc, char **argv) if (matcher < 0) matcher =3D G_MATCHER_INDEX; =20 - /* In a unibyte locale, switch from fgrep to grep if - the pattern matches words (where grep is typically faster). - In a multibyte locale, switch from fgrep to grep if either - (1) the pattern has an encoding error (where fgrep might not work),= or - (2) case is ignored and a fast fgrep ignore-case is not available. = */ + /* In a single-byte locale, switch from -F to -G if it is a single + pattern that matches words, where -G is typically faster. In a + multi-byte locale, switch if the patterns have an encoding error + (where -F does not work) or if -i and the patterns will not work + for -iF. */ if (matcher =3D=3D F_MATCHER_INDEX && (MB_CUR_MAX <=3D 1 ? match_words @@ -2772,6 +2859,11 @@ main (int argc, char **argv) fgrep_to_grep_pattern (&keys, &keycc); matcher =3D G_MATCHER_INDEX; } + /* With two or more patterns, if -F works then switch from either -E + or -G, as -F is probably faster then. */ + else if ((matcher =3D=3D G_MATCHER_INDEX || matcher =3D=3D E_MATCHER_I= NDEX) + && 1 < n_patterns) + matcher =3D try_fgrep_pattern (matcher, keys, &keycc); =20 execute =3D matchers[matcher].execute; matchers[matcher].compile (keys, keycc, matchers[matcher].syntax); --=20 2.7.4 --------------4476EE1989198F97E6DF6D47-- ------------=_1482302762-26538-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 26 Oct 2015 14:18:49 +0000 Received: from localhost ([127.0.0.1]:39287 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zqibn-00089Z-Tv for submit@debbugs.gnu.org; Mon, 26 Oct 2015 10:18:48 -0400 Received: from eggs.gnu.org ([208.118.235.92]:60697) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1ZqhHu-0005zM-8s for submit@debbugs.gnu.org; Mon, 26 Oct 2015 08:54:29 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZqhHs-0000Ck-Vo for submit@debbugs.gnu.org; Mon, 26 Oct 2015 08:54:09 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:45655) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZqhHs-0000Cg-Sr for submit@debbugs.gnu.org; Mon, 26 Oct 2015 08:54:08 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:42470) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZqhHr-00056c-Ts for bug-grep@gnu.org; Mon, 26 Oct 2015 08:54:08 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZqhHo-0000C8-ND for bug-grep@gnu.org; Mon, 26 Oct 2015 08:54:07 -0400 Received: from sideburn.lancs.ac.uk ([148.88.17.22]:39853) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZqhHo-0000BA-HE for bug-grep@gnu.org; Mon, 26 Oct 2015 08:54:04 -0400 Received: from ex-0-ht0.lancs.ac.uk ([10.42.18.47] helo=EX-0-HT0.lancs.local) by sideburn.lancs.ac.uk with esmtp (Exim 4.72) (envelope-from ) id 1ZqhHm-0007Wk-Bh for bug-grep@gnu.org; Mon, 26 Oct 2015 12:54:02 +0000 Received: from EX-0-MB1.lancs.local ([fe80::1c8:ad8:cba3:7976]) by EX-0-HT0.lancs.local ([fe80::7d10:114a:53b0:7f2f%12]) with mapi id 14.03.0248.002; Mon, 26 Oct 2015 12:54:02 +0000 From: "Bennett, Steve" To: "bug-grep@gnu.org" Subject: poor performance since grep 2.19 when comparing files with grep Thread-Topic: poor performance since grep 2.19 when comparing files with grep Thread-Index: AdEP6Xx378J5j24uQGyB2jN30u7Jfg== Date: Mon, 26 Oct 2015 12:54:01 +0000 Message-ID: <1F9F4161D543984DB0123972858E8344490E5E20@EX-0-MB1.lancs.local> Accept-Language: en-GB, en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-originating-ip: [148.88.134.80] x-iss-local-domain: 1 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.3 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Mon, 26 Oct 2015 10:18:46 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.3 (----) Apologies in advance if this is more of a "discuss" question, but it looks = like a particular use-case shows a marked change in performance between rec= ent versions of grep. A colleague mentioned a performance issue with grep to me, and its puzzling= me a bit. It turns out that he was using "grep -Fvif" to find lines in one file that = are not present in another. Up until grep 2.18 this seems to work with linear performance and it takes = less than 50ms to compare files up to about 20,000 lines. With grep 2.19 and later, ever relatively small files are quite slow, runti= me (and memory use) increases exponentially (e.g. 300ms to compare 200 line= s, 1.5s to compare 400 lines, 5s to compare 600 lines). I've shown my colleague how to use sort and diff (and "comm", which I think= is vastly underrated), but it made me wonder if this is a reasonable thing= to expect grep to be able to do, and whether such a performance drop shoul= d be seen as a bug. The way he was using it, he had two (unsorted) data sets (about 6000 rows i= n each), with most lines being common, and he was just using: grep -Fvif FILE1 FILE2 In his case, the older version of grep took way less than a second to run, = but after he had upgraded his machine it took 20 minutes before running out= of swap and seg faulting.=20 In terms of comparing performance, I've found that the following works to c= ompare performance (vary N to try different sized data files): N=3D600; F=3D/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N >= $F; time grep -Fvif $F $F; rm $F Steve. ------------=_1482302762-26538-1-- From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 21 Dec 2016 23:24:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Paul Eggert Cc: 21763-done@debbugs.gnu.org, Bruce Dubbs , 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , Jim Meyering , "Bennett, Steve" , JQK , arnold@skeeve.com, 22239@debbugs.gnu.org, Trevor Cordes , Paul Jackson , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.148236259913158 (code D ref 21763); Wed, 21 Dec 2016 23:24:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 21 Dec 2016 23:23:19 +0000 Received: from localhost ([127.0.0.1]:50696 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cJqEB-0003Q4-Kg for submit@debbugs.gnu.org; Wed, 21 Dec 2016 18:23:19 -0500 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:45151) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cJpwO-0002xw-Hg for 21763-done@debbugs.gnu.org; Wed, 21 Dec 2016 18:04:57 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id ED775806A9 for <21763-done@debbugs.gnu.org>; Thu, 22 Dec 2016 08:04:48 +0900 (JST) X-matriXscan-loop-detect: 2acf43152a12c75b80d7ac696e1f5be3400fafca Received: from mail04.kcn.ne.jp ([61.86.6.183]) by mxs02-s with ESMTP; Thu, 22 Dec 2016 08:04:46 +0900 (JST) Received: from [10.120.1.68] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail04.kcn.ne.jp (Postfix) with ESMTPA id AFA8C12900BC; Thu, 22 Dec 2016 08:04:45 +0900 (JST) Date: Thu, 22 Dec 2016 08:04:42 +0900 From: Norihiro Tanaka In-Reply-To: <3a0eb353-96a4-80fd-5f3c-7fd77be096a0@cs.ucla.edu> References: <20161209072419.GA30226@pog.tecnopolis.ca> <3a0eb353-96a4-80fd-5f3c-7fd77be096a0@cs.ucla.edu> Message-Id: <20161222080441.6C45.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.73 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.1 (/) X-Mailman-Approved-At: Wed, 21 Dec 2016 18:23:18 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) On Tue, 20 Dec 2016 21:17:01 -0800 Paul Eggert wrote: > I installed the attached patches into grep master. These fix the > performance regressions noted at the start of Bug#22357. I see that > the related performance problems noted in Bug#21763 seem to be fixed > too, I expect because of Norihiro Tanaka's recent changes, so I'll > boldly close both bug reports. > > To some extent the attached patches restore the old behavior for grep > -F, when grep is given two or more patterns. The patch doesn't change > the underlying algorithms; it merely uses a different heuristic to > decide whether to use the -F matcher. Although I wouldn't be > surprised if the attached patches hurt performance in some cases, I > didn't uncover any such cases in my performance testing, which I > admit mostly consisted of running the examples in the abovementioned > bug reports. > > I'll leave Bug#22239 open, as I get the following performance figures > (user+system CPU time) for the Bug#22239 benchmark, where list.txt is > created by "aspell dump master | head -n 100000 >list.txt", and the > grep commands all use the operands "-F -f list.txt /etc/passwd" in > the en_US.utf8 locale on Fedora 24 x86-64. > > no -i -i grep version > 0.25 0.33 2.16 > 0.26 10.95 2.21 > 0.11 2.90* current master (including attached patches) > > In the C locale, the current grep master is always significantly > faster than grep 2.16 or 2.21 on the benchmark, so the only > significant problem is the number marked "*". I ran the benchmarks on > an AMD Phenom II X4 910e. Thanks. BTW, are you aware of extreme slowdown in the following cases after third patch? yes $(printf %040d 0) | head -10000000 >inp printf '0\n1\n' >pat env LC_ALL=C src/grep -w -f pat inp From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Thu, 22 Dec 2016 17:11:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Norihiro Tanaka Cc: Paul Eggert , 21763-done@debbugs.gnu.org, Bruce Dubbs , 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , "Bennett, Steve" , JQK , Aharon Robbins , 22239@debbugs.gnu.org, Trevor Cordes , Paul Jackson , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.14824266335711 (code D ref 21763); Thu, 22 Dec 2016 17:11:03 +0000 Received: (at 21763-done) by debbugs.gnu.org; 22 Dec 2016 17:10:33 +0000 Received: from localhost ([127.0.0.1]:51771 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cK6sz-0001Tw-78 for submit@debbugs.gnu.org; Thu, 22 Dec 2016 12:10:33 -0500 Received: from mail-it0-f67.google.com ([209.85.214.67]:33862) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cK1Kc-0005gR-Id; Thu, 22 Dec 2016 06:14:43 -0500 Received: by mail-it0-f67.google.com with SMTP id 75so20466050ite.1; Thu, 22 Dec 2016 03:14:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=v4PHNBVPCCcfR3sLcwmkMdsRlrXobKPk+KTY1gg7ODg=; b=YG0/WzDzgrbelPTNIe6ewU8AcYY9WTFANx5jG9ca8SgJkaSJhQ6RZeCtakrfEBcMqh Im1Hzn1bnm7laeo77Lg0oKwaduot+a2ZLinyELMUQWDzSf3sB+EVCUQxlCyBQzeEAJCP Eup7mYqhd+Dyh6BDmix1tNS2ZIs3AbPYeliO17+8uzZ1keR9JVjFr8803N+54FmQpK2n i9rFOudYA0oqYVEV3JC/qI56ioxIvDEfyGPzWlZ3oP+1+ygGcnAJYmDkYod4g+DxSDoA Xbky2aDIGC91wrXmjZcJV8BXPpLqaOzEtaA60c39/pEwv6k58FzMzKKqZgvdgUCi3L+g hGKQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=v4PHNBVPCCcfR3sLcwmkMdsRlrXobKPk+KTY1gg7ODg=; b=F7PtYvS3ZSVhhG5jPneMb6lsZBfjxpi9M5C+RulkGYhoO/aYPY19rmoXKt6fVTpqdC 5wDcytGJv3Px5gQtmpTDMzr3XrzvQOqreBPfkLPRIV8rRsJGPa/qhNPZQqFLXRch4LMj iWWF8bkA6NqpwsJAXl4800wjiisYPCFQS6oOJML/8sWhh4u3CBUBEiVhR2HxiHwx4Fjr TymMqaxC2NKzhVkXnM20+WuUjjigUJdLo/mZHOfYmKFJxE+lqVbj3OZWtHA+OfqJ51VK 3sGulo271zkKpB218QyHufoDv7TtG67VyqVqQmyAArI14erZg7/TMeM/cQd5iph9f2v7 aqBg== X-Gm-Message-State: AIkVDXILOFEcBGJb3d5rLKT7QpzZeRhWdexVol2IC84OdQrw4lHlKKUXVM4YqOsBAgfKYW3SBPVW2qpLK9pgpA== X-Received: by 10.36.78.15 with SMTP id r15mr9523534ita.55.1482405276909; Thu, 22 Dec 2016 03:14:36 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.191.5 with HTTP; Thu, 22 Dec 2016 03:14:16 -0800 (PST) In-Reply-To: <20161222080441.6C45.27F6AC2D@kcn.ne.jp> References: <20161209072419.GA30226@pog.tecnopolis.ca> <3a0eb353-96a4-80fd-5f3c-7fd77be096a0@cs.ucla.edu> <20161222080441.6C45.27F6AC2D@kcn.ne.jp> From: Jim Meyering Date: Thu, 22 Dec 2016 03:14:16 -0800 X-Google-Sender-Auth: cW-7xuhDHxZh5UH6IaJ-pNfgnJo Message-ID: Content-Type: text/plain; charset=UTF-8 X-Spam-Score: 3.5 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: On Wed, Dec 21, 2016 at 3:04 PM, Norihiro Tanaka wrote: > yes $(printf %040d 0) | head -10000000 >inp > printf '0\n1\n' >pat > env LC_ALL=C src/grep -w -f pat inp Thanks for mentioning that. Here is some actual data (Fedora 25 on an i7-4770S): [...] Content analysis details: (3.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 3.0 MANY_TO_CC Sent to 10+ recipients 0.5 RCVD_IN_SORBS_SPAM RBL: SORBS: sender is a spam source [209.85.214.67 listed in dnsbl.sorbs.net] -0.0 RCVD_IN_MSPIKE_H3 RBL: Good reputation (+3) [209.85.214.67 listed in wl.mailspike.net] 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (meyering[at]gmail.com) 0.0 HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different -0.0 SPF_PASS SPF: sender matches SPF record -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [209.85.214.67 listed in list.dnswl.org] 0.0 T_DKIM_INVALID DKIM-Signature header exists but is not valid -0.0 RCVD_IN_MSPIKE_WL Mailspike good senders 0.0 FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different X-Mailman-Approved-At: Thu, 22 Dec 2016 12:10:30 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) On Wed, Dec 21, 2016 at 3:04 PM, Norihiro Tanaka wrote: > yes $(printf %040d 0) | head -10000000 >inp > printf '0\n1\n' >pat > env LC_ALL=C src/grep -w -f pat inp Thanks for mentioning that. Here is some actual data (Fedora 25 on an i7-4770S): for i in $(env ls -1dv /p/p/grep-*/bin/grep) src/grep; do env LC_ALL=C time -f "%e $i" $i -w -f pat inp; done 1.05 /p/p/grep-2.3/bin/grep 0.91 /p/p/grep-2.4.1/bin/grep 0.91 /p/p/grep-2.4.2/bin/grep 0.91 /p/p/grep-2.4/bin/grep 0.94 /p/p/grep-2.5.1/bin/grep 0.94 /p/p/grep-2.5.3/bin/grep 0.94 /p/p/grep-2.5.4/bin/grep 0.95 /p/p/grep-2.5/bin/grep 0.90 /p/p/grep-2.6.1/bin/grep 1.03 /p/p/grep-2.6.2/bin/grep 0.90 /p/p/grep-2.6.3/bin/grep 0.90 /p/p/grep-2.6/bin/grep 0.90 /p/p/grep-2.7/bin/grep 0.90 /p/p/grep-2.8/bin/grep 0.90 /p/p/grep-2.9/bin/grep 0.90 /p/p/grep-2.10/bin/grep 0.89 /p/p/grep-2.11/bin/grep 0.90 /p/p/grep-2.12/bin/grep 0.90 /p/p/grep-2.13/bin/grep 0.89 /p/p/grep-2.14/bin/grep 0.90 /p/p/grep-2.15/bin/grep 0.90 /p/p/grep-2.16/bin/grep 0.89 /p/p/grep-2.17/bin/grep 0.90 /p/p/grep-2.18/bin/grep 0.90 /p/p/grep-2.19/bin/grep 0.90 /p/p/grep-2.20/bin/grep 0.86 /p/p/grep-2.21/bin/grep 0.90 /p/p/grep-2.22/bin/grep 0.91 /p/p/grep-2.23/bin/grep 0.91 /p/p/grep-2.24/bin/grep 0.91 /p/p/grep-2.25/bin/grep 0.25 /p/p/grep-2.26/bin/grep 13.07 /p/p/grep-2.27/bin/grep 37.49 src/grep From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 24 Dec 2016 01:39:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Norihiro Tanaka Cc: 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , =?UTF-8?Q?Ond=C5=99ej_?= =?UTF-8?Q?C=C3=ADfka?= Received: via spool by 21763-done@debbugs.gnu.org id=D21763.14825435367073 (code D ref 21763); Sat, 24 Dec 2016 01:39:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 24 Dec 2016 01:38:56 +0000 Received: from localhost ([127.0.0.1]:53074 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cKbIV-0001pz-4D for submit@debbugs.gnu.org; Fri, 23 Dec 2016 20:38:56 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:41450) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cKbIS-0001pc-09; Fri, 23 Dec 2016 20:38:53 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D41611600CB; Fri, 23 Dec 2016 17:38:45 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id bci7v12pNFPn; Fri, 23 Dec 2016 17:38:43 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 0CEB61600CD; Fri, 23 Dec 2016 17:38:43 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id v3s_MA36AOum; Fri, 23 Dec 2016 17:38:42 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id B83EA1600CB; Fri, 23 Dec 2016 17:38:42 -0800 (PST) References: <20161209072419.GA30226@pog.tecnopolis.ca> <3a0eb353-96a4-80fd-5f3c-7fd77be096a0@cs.ucla.edu> <20161222080441.6C45.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <820176a7-a1e4-7b3f-f140-7d2329d25d63@cs.ucla.edu> Date: Fri, 23 Dec 2016 17:38:42 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: <20161222080441.6C45.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------6B4F36F83DA3EA4416933FAA" X-Spam-Score: -3.1 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) This is a multi-part message in MIME format. --------------6B4F36F83DA3EA4416933FAA Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Norihiro Tanaka wrote: > are you aware of extreme slowdown in the following cases after third pa= tch? > > yes $(printf %040d 0) | head -10000000 >inp > printf '0\n1\n' >pat > env LC_ALL=3DC src/grep -w -f pat inp No. Thanks, I hadn't considered that possibility. I looked into the slowd= own and=20 installed the attached patches, which cause 'grep' to run about as fast o= n this=20 test case as grep 2.25 (though not as fast as grep 2.26). The main fix is= in=20 patch 5. On my platform: -------grep version------ v2.25 v2.26 v2.27 master locale command 1.21 0.69 24.95 1.22 C grep -w -f pat inp 207.36 203.15 202.03 1.22 en_US.utf8 grep -w -f pat inp 1.21 0.69 25.95 0.85 C grep -w -f pat inp -F 66.33 68.07 67.21 1.22 en_US.utf8 grep -w -f pat inp -F All numbers are user+system CPU seconds on Fedora 24 x86-64 (AMD Phenom I= I X4=20 910e). "master" means after the attached patches are installed. Perhaps we can fiddle with the heuristics a bit so that v2.26 is not=20 significantly faster than the master in the C locale. --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0001-maint-rewrite-to-avoid-some-macros.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0001-maint-rewrite-to-avoid-some-macros.patch" =46rom 969447fdebabe2046ec4261837d5e8f12f75fba9 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 08:04:13 -0800 Subject: [PATCH 1/8] maint: rewrite to avoid some macros These days, the dangerous powers of C macros are not needed if constants or functions will do just as well. * src/grep.c (SEP_CHAR_SELECTED, SEP_CHAR_REJECTED, SEP_STR_GROUP) (INITIAL_BUFSIZE): * src/kwset.c (DEPTH_SIZE): Now constants, not macros. * src/kwset.c (link): Remove macro. Instead, rename local vars from 'link' to 'cur'. (malloc) [GREP]: Remove macro. All uses of malloc changed to xmalloc. Omit double-inclusion of xalloc.h. Do not depend on 'GREP'. (U): Now a function, not a macro. * src/kwset.c, src/searchutils.c (NCHAR): Move this macro to ... * src/system.h: ... here, and make it a constant. --- src/grep.c | 8 ++--- src/kwset.c | 95 ++++++++++++++++++++++++++-----------------------= ------ src/searchutils.c | 2 -- src/system.h | 1 + 4 files changed, 50 insertions(+), 56 deletions(-) diff --git a/src/grep.c b/src/grep.c index f36654c..3729ae0 100644 --- a/src/grep.c +++ b/src/grep.c @@ -50,9 +50,9 @@ #include "xalloc.h" #include "xstrtol.h" =20 -#define SEP_CHAR_SELECTED ':' -#define SEP_CHAR_REJECTED '-' -#define SEP_STR_GROUP "--" +enum { SEP_CHAR_SELECTED =3D ':' }; +enum { SEP_CHAR_REJECTED =3D '-' }; +char const SEP_STR_GROUP[] =3D "--"; =20 #define AUTHORS \ proper_name ("Mike Haertel"), \ @@ -797,7 +797,7 @@ skipped_file (char const *name, bool command_line, bo= ol is_dir) =20 static char *buffer; /* Base of buffer. */ static size_t bufalloc; /* Allocated buffer size, counting slop. */ -#define INITIAL_BUFSIZE 32768 /* Initial buffer size, not counting slop.= */ +enum { INITIAL_BUFSIZE =3D 32768 }; /* Initial buffer size, not counting= slop. */ static int bufdesc; /* File descriptor. */ static char *bufbeg; /* Beginning of user-visible stuff. */ static char *buflim; /* Limit of user-visible stuff. */ diff --git a/src/kwset.c b/src/kwset.c index 264ef22..506c6cd 100644 --- a/src/kwset.c +++ b/src/kwset.c @@ -42,19 +42,14 @@ #include "obstack.h" #include "xalloc.h" =20 -#define link kwset_link - -#ifdef GREP -# include "xalloc.h" -# undef malloc -# define malloc xmalloc -#endif - -#define NCHAR (UCHAR_MAX + 1) -#define obstack_chunk_alloc malloc +#define obstack_chunk_alloc xmalloc #define obstack_chunk_free free =20 -#define U(c) to_uchar (c) +static unsigned char +U (char ch) +{ + return to_uchar (ch); +} =20 /* Balanced tree of edges and labels leaving a given trie node. */ struct tree @@ -159,7 +154,7 @@ kwsalloc (char const *trans, bool reverse) =20 /* This upper bound is valid for CHAR_BIT >=3D 4 and exact for CHAR_BIT in { 4..11, 13, 15, 17, 19 }. */ -#define DEPTH_SIZE (CHAR_BIT + CHAR_BIT/2) +enum { DEPTH_SIZE =3D CHAR_BIT + CHAR_BIT / 2 }; =20 /* Add the given string to the contents of the keyword set. */ void @@ -181,46 +176,46 @@ kwsincr (kwset_t kwset, char const *text, size_t le= n) /* Descend the tree of outgoing links for this trie node, looking for the current character and keeping track of the path followed. */ - struct tree *link =3D trie->links; + struct tree *cur =3D trie->links; struct tree *links[DEPTH_SIZE]; enum { L, R } dirs[DEPTH_SIZE]; links[0] =3D (struct tree *) &trie->links; dirs[0] =3D L; int depth =3D 1; =20 - while (link && label !=3D link->label) + while (cur && label !=3D cur->label) { - links[depth] =3D link; - if (label < link->label) - dirs[depth++] =3D L, link =3D link->llink; + links[depth] =3D cur; + if (label < cur->label) + dirs[depth++] =3D L, cur =3D cur->llink; else - dirs[depth++] =3D R, link =3D link->rlink; + dirs[depth++] =3D R, cur =3D cur->rlink; } =20 /* The current character doesn't have an outgoing link at this trie node, so build a new trie node and install a link in the current trie node's tree. */ - if (!link) + if (!cur) { - link =3D obstack_alloc (&kwset->obstack, sizeof *link); - link->llink =3D NULL; - link->rlink =3D NULL; - link->trie =3D obstack_alloc (&kwset->obstack, sizeof *link->t= rie); - link->trie->accepting =3D 0; - link->trie->links =3D NULL; - link->trie->parent =3D trie; - link->trie->next =3D NULL; - link->trie->fail =3D NULL; - link->trie->depth =3D trie->depth + 1; - link->trie->shift =3D 0; - link->label =3D label; - link->balance =3D 0; + cur =3D obstack_alloc (&kwset->obstack, sizeof *cur); + cur->llink =3D NULL; + cur->rlink =3D NULL; + cur->trie =3D obstack_alloc (&kwset->obstack, sizeof *cur->tri= e); + cur->trie->accepting =3D 0; + cur->trie->links =3D NULL; + cur->trie->parent =3D trie; + cur->trie->next =3D NULL; + cur->trie->fail =3D NULL; + cur->trie->depth =3D trie->depth + 1; + cur->trie->shift =3D 0; + cur->label =3D label; + cur->balance =3D 0; =20 /* Install the new tree node in its parent. */ if (dirs[--depth] =3D=3D L) - links[depth]->llink =3D link; + links[depth]->llink =3D cur; else - links[depth]->rlink =3D link; + links[depth]->rlink =3D cur; =20 /* Back up the tree fixing the balance flags. */ while (depth && !links[depth]->balance) @@ -291,7 +286,7 @@ kwsincr (kwset_t kwset, char const *text, size_t len)= } } =20 - trie =3D link->trie; + trie =3D cur->trie; } =20 /* Mark the node we finally reached as accepting, encoding the @@ -326,7 +321,7 @@ static void treefails (struct tree const *tree, struct trie const *fail, struct trie *recourse, bool reverse) { - struct tree *link; + struct tree *cur; =20 if (!tree) return; @@ -338,16 +333,16 @@ treefails (struct tree const *tree, struct trie con= st *fail, node that has a descendant on the current label. */ while (fail) { - link =3D fail->links; - while (link && tree->label !=3D link->label) - if (tree->label < link->label) - link =3D link->llink; + cur =3D fail->links; + while (cur && tree->label !=3D cur->label) + if (tree->label < cur->label) + cur =3D cur->llink; else - link =3D link->rlink; - if (link) + cur =3D cur->rlink; + if (cur) { - tree->trie->fail =3D link->trie; - if (!reverse && link->trie->accepting && !tree->trie->acceptin= g) + tree->trie->fail =3D cur->trie; + if (!reverse && cur->trie->accepting && !tree->trie->accepting= ) tree->trie->accepting =3D SIZE_MAX; return; } @@ -641,18 +636,18 @@ static size_t memoff2_kwset (char const *s, size_t n, kwset_t kwset, struct kwsmatch *kwsmatch) { - struct tree const *link =3D kwset->trie->links; - struct tree const *clink =3D link->llink ? link->llink : link->rlink; + struct tree const *cur =3D kwset->trie->links; + struct tree const *clink =3D cur->llink ? cur->llink : cur->rlink; char const *mch =3D (clink - ? memchr2 (s, link->label, clink->label, n) - : memchr (s, link->label, n)); + ? memchr2 (s, cur->label, clink->label, n) + : memchr (s, cur->label, n)); if (! mch) return SIZE_MAX; else { size_t off =3D mch - s; - if (*mch =3D=3D link->label) - kwsmatch->index =3D link->trie->accepting / 2; + if (*mch =3D=3D cur->label) + kwsmatch->index =3D cur->trie->accepting / 2; else kwsmatch->index =3D clink->trie->accepting / 2; kwsmatch->offset[0] =3D off; diff --git a/src/searchutils.c b/src/searchutils.c index 73d6c1c..deaab60 100644 --- a/src/searchutils.c +++ b/src/searchutils.c @@ -22,8 +22,6 @@ #define SYSTEM_INLINE _GL_EXTERN_INLINE #include "search.h" =20 -#define NCHAR (UCHAR_MAX + 1) - kwset_t kwsinit (bool mb_trans) { diff --git a/src/system.h b/src/system.h index 6f4918d..c875275 100644 --- a/src/system.h +++ b/src/system.h @@ -37,6 +37,7 @@ #include =20 enum { EXIT_TROUBLE =3D 2 }; +enum { NCHAR =3D UCHAR_MAX + 1 }; =20 #include #define N_(String) gettext_noop(String) --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0002-grep-remove-C-label.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0002-grep-remove-C-label.patch" =46rom 19227eb98bc586a5ef3c2fb993a23c1182a4dab6 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 10:54:54 -0800 Subject: [PATCH 2/8] grep: remove C label * src/kwsearch.c (Fexecute): Remove label. --- src/kwsearch.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/kwsearch.c b/src/kwsearch.c index 7275973..d04d34c 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -103,7 +103,7 @@ Fexecute (char const *buf, size_t size, size_t *match= _size, buf + size - beg + match_lines, &kwsmatch= , longest); if (offset =3D=3D (size_t) -1) - goto failure; + break; len =3D kwsmatch.size[0] - 2 * match_lines; if (mb_check && mb_goback (&mb_start, beg + offset, buf + size) !=3D= 0) { @@ -157,7 +157,6 @@ Fexecute (char const *buf, size_t size, size_t *match= _size, goto success; } /* for (beg in buf) */ =20 - failure: return -1; =20 success: --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0003-grep-simplify-Fexecute.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0003-grep-simplify-Fexecute.patch" =46rom f139976800435db91032f3b1d9435a166890be38 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 11:10:23 -0800 Subject: [PATCH 3/8] grep: simplify Fexecute * src/kwsearch.c (Fexecute): Avoid the need for a 'try' local or for a 'goto success'. Update mb_start to reflect newline found. --- src/kwsearch.c | 46 ++++++++++++++++++++++++---------------------- 1 file changed, 24 insertions(+), 22 deletions(-) diff --git a/src/kwsearch.c b/src/kwsearch.c index d04d34c..5596ebd 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -81,7 +81,7 @@ size_t Fexecute (char const *buf, size_t size, size_t *match_size, char const *start_ptr) { - char const *beg, *try, *end, *mb_start; + char const *beg, *end, *mb_start; size_t len; char eol =3D eolbyte; struct kwsmatch kwsmatch; @@ -131,30 +131,32 @@ Fexecute (char const *buf, size_t size, size_t *mat= ch_size, len +=3D start_ptr =3D=3D NULL; goto success_in_beg_and_len; } - if (match_words) - for (try =3D beg; ; ) + if (! match_words) + goto success; + + /* Succeed if the preceding and following characters are word + constituents. If the following character is not a word + constituent, keep trying with shorter matches. */ + char const *bol =3D memrchr (mb_start, eol, beg - mb_start); + if (bol) + mb_start =3D bol + 1; + if (! wordchar (mb_prev_wc (mb_start, beg, buf + size))) + for (;;) { - char const *bol =3D memrchr (buf, eol, beg - buf); - bol =3D bol ? bol + 1 : buf; - if (wordchar (mb_prev_wc (bol, try, buf + size))) - break; - if (wordchar (mb_next_wc (try + len, buf + size))) + if (! wordchar (mb_next_wc (beg + len, buf + size))) { - if (!len) - break; - offset =3D kwsexec (kwset, beg, --len, &kwsmatch, true);= - if (offset =3D=3D (size_t) -1) - break; - try =3D beg + offset; - len =3D kwsmatch.size[0]; + if (start_ptr) + goto success_in_beg_and_len; + else + goto success; } - else if (!start_ptr) - goto success; - else - goto success_in_beg_and_len; - } /* for (try) */ - else - goto success; + if (!len) + break; + offset =3D kwsexec (kwset, beg, --len, &kwsmatch, true); + if (offset !=3D 0) + break; + len =3D kwsmatch.size[0]; + } } /* for (beg in buf) */ =20 return -1; --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0004-grep-specialize-word-finding-functions.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0004-grep-specialize-word-finding-functions.patch" =46rom 740048e66e7c55a8e42f4f7e4c24256a61506f70 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 12:25:24 -0800 Subject: [PATCH 4/8] grep: specialize word-finding functions This improves performance a bit. * src/dfasearch.c, src/kwsearch.c (wordchar): Remove; now in searchutils.c. * src/grep.c (main): Call wordinit if -w. * src/search.h: Adjust. * src/searchutils.c: Include verify.h. (word_start): New static var. (wordchar): Move here from dfasearch.c and kwsearch.c. (wordinit, wordchars_count, wordchar_next, wordchar_prev): New functions. (mb_prev_wc, mb_next_wc): Remove. All callers changed to use the new functions instead. --- src/dfasearch.c | 11 ++----- src/grep.c | 1 + src/kwsearch.c | 11 ++----- src/search.h | 5 +-- src/searchutils.c | 91 +++++++++++++++++++++++++++++++++++++++++++------= ------ 5 files changed, 80 insertions(+), 39 deletions(-) diff --git a/src/dfasearch.c b/src/dfasearch.c index 24a36cd..87e1f7e 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -26,13 +26,6 @@ =20 struct localeinfo localeinfo; =20 -/* Whether -w considers WC to be a word constituent. */ -static bool -wordchar (wint_t wc) -{ - return wc =3D=3D L'_' || iswalnum (wc); -} - /* KWset compiled pattern. For Ecompile and Gcompile, we compile a list of strings, at least one of which is known to occur in any string matching the regexp. */ @@ -394,8 +387,8 @@ EGexecute (char const *buf, size_t size, size_t *matc= h_size, while (match <=3D best_match) { regoff_t shorter_len =3D 0; - if (!wordchar (mb_prev_wc (beg, match, end - 1)) - && !wordchar (mb_next_wc (match + len, end - 1))= ) + if (! wordchar_next (match + len, end - 1) + && ! wordchar_prev (beg, match, end - 1)) goto assess_pattern_match; if (len > 0) { diff --git a/src/grep.c b/src/grep.c index 3729ae0..f9d1d86 100644 --- a/src/grep.c +++ b/src/grep.c @@ -2651,6 +2651,7 @@ main (int argc, char **argv) break; =20 case 'w': + wordinit (); match_words =3D true; break; =20 diff --git a/src/kwsearch.c b/src/kwsearch.c index 5596ebd..b30dfd0 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -21,13 +21,6 @@ #include #include "search.h" =20 -/* Whether -w considers WC to be a word constituent. */ -static bool -wordchar (wint_t wc) -{ - return wc =3D=3D L'_' || iswalnum (wc); -} - /* KWset compiled pattern. For Ecompile and Gcompile, we compile a list of strings, at least one of which is known to occur in any string matching the regexp. */ @@ -140,10 +133,10 @@ Fexecute (char const *buf, size_t size, size_t *mat= ch_size, char const *bol =3D memrchr (mb_start, eol, beg - mb_start); if (bol) mb_start =3D bol + 1; - if (! wordchar (mb_prev_wc (mb_start, beg, buf + size))) + if (! wordchar_prev (mb_start, beg, buf + size)) for (;;) { - if (! wordchar (mb_next_wc (beg + len, buf + size))) + if (! wordchar_next (beg + len, buf + size)) { if (start_ptr) goto success_in_beg_and_len; diff --git a/src/search.h b/src/search.h index 1ff5be2..6fe1797 100644 --- a/src/search.h +++ b/src/search.h @@ -46,10 +46,11 @@ _GL_INLINE_HEADER_BEGIN typedef signed char mb_len_map_t; =20 /* searchutils.c */ +extern void wordinit (void); extern kwset_t kwsinit (bool); +extern size_t wordchar_next (char const *, char const *); +extern bool wordchar_prev (char const *, char const *, char const *); extern ptrdiff_t mb_goback (char const **, char const *, char const *); -extern wint_t mb_prev_wc (char const *, char const *, char const *); -extern wint_t mb_next_wc (char const *, char const *); =20 /* dfasearch.c */ extern struct localeinfo localeinfo; diff --git a/src/searchutils.c b/src/searchutils.c index deaab60..e0a1db3 100644 --- a/src/searchutils.c +++ b/src/searchutils.c @@ -22,6 +22,30 @@ #define SYSTEM_INLINE _GL_EXTERN_INLINE #include "search.h" =20 +#include + +/* For each byte B, word_start[B] is 1 if B is a single-byte character + that is a word constituent, 0 if B cannot start a word constituent, + and -1 if B might be or might not be the start of a word + constituent. */ +static wint_t word_start[NCHAR]; +verify (WEOF !=3D 0 && WEOF !=3D 1); + +/* Whether -w considers WC to be a word constituent. */ +static bool +wordchar (wint_t wc) +{ + return wc =3D=3D L'_' || iswalnum (wc); +} + +void +wordinit (void) +{ + for (int i =3D 0; i < NCHAR; i++) + word_start[i] =3D (localeinfo.sbclen[i] =3D=3D -2 ? WEOF + : wordchar (localeinfo.sbctowc[i])); +} + kwset_t kwsinit (bool mb_trans) { @@ -93,27 +117,56 @@ mb_goback (char const **mb_start, char const *cur, c= har const *end) return p =3D=3D cur ? 0 : cur - p0; } =20 -/* In the buffer BUF, return the wide character that is encoded just - before CUR. The buffer ends at END. Return WEOF if there is no - wide character just before CUR. */ -wint_t -mb_prev_wc (char const *buf, char const *cur, char const *end) +/* Examine the start of BUF (of size SIZE) for word constituents. + If COUNTALL, examine as many as possible; otherwise, examine at most = one. + Return the total number of bytes in the examined characters. */ +static size_t +wordchars_count (char const *buf, char const *end, bool countall) { - if (cur =3D=3D buf) - return WEOF; - char const *p =3D buf; - cur--; - cur -=3D mb_goback (&p, cur, end); - return mb_next_wc (cur, end); + size_t n =3D 0; + mbstate_t mbs =3D { 0 }; + while (n < end - buf) + { + wint_t ws =3D word_start[to_uchar (buf[n])]; + if (ws =3D=3D 0) + break; + else if (ws =3D=3D 1) + n++; + else + { + wchar_t wc =3D 0; + size_t wcbytes =3D mbrtowc (&wc, buf + n, end - buf - n, &mbs)= ; + if (!wordchar (wc)) + break; + n +=3D wcbytes + !wcbytes; + } + if (!countall) + break; + } + return n; } =20 -/* Return the wide character that is encoded at CUR. The buffer ends - at END. Return WEOF if there is no wide character encoded at CUR. *= / -wint_t -mb_next_wc (char const *cur, char const *end) +/* If BUF starts with a word constituent, return the number of bytes + used to represent it; otherwise, return zero. The buffer ends at END= =2E */ +size_t +wordchar_next (char const *buf, char const *end) { - wchar_t wc; - mbstate_t mbs =3D { 0 }; - return (end - cur !=3D 0 && mbrtowc (&wc, cur, end - cur, &mbs) < (siz= e_t) -2 - ? wc : WEOF); + return wordchars_count (buf, end, false); +} + +/* In the buffer BUF, return true if the character whose encoding + contains the byte before CUR is a word constituent. The buffer + ends at END. */ +bool +wordchar_prev (char const *buf, char const *cur, char const *end) +{ + if (buf =3D=3D cur) + return false; + cur--; + wint_t ws =3D word_start[to_uchar (*cur)]; + if (! localeinfo.multibyte) + return ws =3D=3D 1; + char const *p =3D buf; + cur -=3D mb_goback (&p, cur, end); + return wordchar_next (cur, end) !=3D 0; } --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0005-grep-speed-up-wf-in-C-locale.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0005-grep-speed-up-wf-in-C-locale.patch" =46rom e89a7e6d4be4669a8a73650c28bb1eb69399d703 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 12:43:46 -0800 Subject: [PATCH 5/8] grep: speed up -wf in C locale Problem reported by Norihiro Tanaka (Bug#22357#100). This patch improves the performance on that benchmark on my platform so that grep is now only about 2x slower than grep 2.26, which means it is considerably faster than grep 2.25 and earlier. * src/kwsearch.c (Fexecute): Use wordchars_size to boost performance for this case. * src/search.h, src/searchutils.c (wordchars_size): New function. --- src/kwsearch.c | 6 ++++++ src/search.h | 1 + src/searchutils.c | 9 +++++++++ 3 files changed, 16 insertions(+) diff --git a/src/kwsearch.c b/src/kwsearch.c index b30dfd0..6005b60 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -150,6 +150,12 @@ Fexecute (char const *buf, size_t size, size_t *matc= h_size, break; len =3D kwsmatch.size[0]; } + + /* No word match was found at BEG. Skip past word constituents, + since they cannot precede the next match and not skipping + them could make things much slower. */ + beg +=3D wordchars_size (beg, buf + size); + mb_start =3D beg; } /* for (beg in buf) */ =20 return -1; diff --git a/src/search.h b/src/search.h index 6fe1797..1def4d6 100644 --- a/src/search.h +++ b/src/search.h @@ -48,6 +48,7 @@ typedef signed char mb_len_map_t; /* searchutils.c */ extern void wordinit (void); extern kwset_t kwsinit (bool); +extern size_t wordchars_size (char const *, char const *); extern size_t wordchar_next (char const *, char const *); extern bool wordchar_prev (char const *, char const *, char const *); extern ptrdiff_t mb_goback (char const **, char const *, char const *); diff --git a/src/searchutils.c b/src/searchutils.c index e0a1db3..6f6ae0b 100644 --- a/src/searchutils.c +++ b/src/searchutils.c @@ -146,6 +146,15 @@ wordchars_count (char const *buf, char const *end, b= ool countall) return n; } =20 +/* Examine the start of BUF for the longest prefix containing just + word constituents. Return the total number of bytes in the prefix. + The buffer ends at END. */ +size_t +wordchars_size (char const *buf, char const *end) +{ + return wordchars_count (buf, end, true); +} + /* If BUF starts with a word constituent, return the number of bytes used to represent it; otherwise, return zero. The buffer ends at END= =2E */ size_t --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0006-grep-standardize-on-localeinfo.multibyte.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0006-grep-standardize-on-localeinfo.multibyte.patch" =46rom 4a9bbf519f3761a70c147597bcf967a27951b376 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 12:57:10 -0800 Subject: [PATCH 6/8] grep: standardize on localeinfo.multibyte * src/dfasearch.c (EGexecute): * src/grep.c (main): * src/kwsearch.c (Fexecute): * src/pcresearch.c (Pcompile): Prefer localeinfo.multibyte to (MB_CUR_MAX > 1). --- src/dfasearch.c | 2 +- src/grep.c | 2 +- src/kwsearch.c | 2 +- src/pcresearch.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/src/dfasearch.c b/src/dfasearch.c index 87e1f7e..7f68907 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -270,7 +270,7 @@ EGexecute (char const *buf, size_t size, size_t *matc= h_size, =20 if (exact_kwset_match) { - if (MB_CUR_MAX =3D=3D 1 || localeinfo.using_utf8) + if (!localeinfo.multibyte | localeinfo.using_utf8) goto success; if (mb_start < beg) mb_start =3D beg; diff --git a/src/grep.c b/src/grep.c index f9d1d86..1c45286 100644 --- a/src/grep.c +++ b/src/grep.c @@ -2852,7 +2852,7 @@ main (int argc, char **argv) (where -F does not work) or if -i and the patterns will not work for -iF. */ if (matcher =3D=3D F_MATCHER_INDEX - && (MB_CUR_MAX <=3D 1 + && (! localeinfo.multibyte ? match_words : (contains_encoding_error (keys, keycc) || (match_icase && !fgrep_icase_available (keys, keycc)))))= diff --git a/src/kwsearch.c b/src/kwsearch.c index 6005b60..7d11230 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -86,7 +86,7 @@ Fexecute (char const *buf, size_t size, size_t *match_s= ize, mb_check =3D longest =3D false; else { - mb_check =3D MB_CUR_MAX > 1 && !localeinfo.using_utf8; + mb_check =3D localeinfo.multibyte & !localeinfo.using_utf8; longest =3D mb_check | !!start_ptr | match_words; } =20 diff --git a/src/pcresearch.c b/src/pcresearch.c index 0e34861..245469c 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -110,7 +110,7 @@ Pcompile (char const *pattern, size_t size, reg_synta= x_t ignored) char const *p; char const *pnul; =20 - if (1 < MB_CUR_MAX) + if (localeinfo.multibyte) { if (! localeinfo.using_utf8) die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 loca= les")); --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0007-grep-improve-word-checking-with-UTF-8.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0007-grep-improve-word-checking-with-UTF-8.patch" =46rom 2c84095a777c2a20e99e92f406398501242ac131 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 16:16:01 -0800 Subject: [PATCH 7/8] grep: improve word checking with UTF-8 * src/searchutils.c: Do not include . (word_start): Remove, replacing with ... (sbwordchar): New static var. All uses changed. (wordchar_prev): Return size_t, not bool, as this generates slightly better code. Go back faster if UTF-8. --- src/search.h | 2 +- src/searchutils.c | 85 +++++++++++++++++++++++++++++++++----------------= ------ 2 files changed, 52 insertions(+), 35 deletions(-) diff --git a/src/search.h b/src/search.h index 1def4d6..b700ed5 100644 --- a/src/search.h +++ b/src/search.h @@ -50,7 +50,7 @@ extern void wordinit (void); extern kwset_t kwsinit (bool); extern size_t wordchars_size (char const *, char const *); extern size_t wordchar_next (char const *, char const *); -extern bool wordchar_prev (char const *, char const *, char const *); +extern size_t wordchar_prev (char const *, char const *, char const *); extern ptrdiff_t mb_goback (char const **, char const *, char const *); =20 /* dfasearch.c */ diff --git a/src/searchutils.c b/src/searchutils.c index 6f6ae0b..3ba3cdb 100644 --- a/src/searchutils.c +++ b/src/searchutils.c @@ -22,14 +22,9 @@ #define SYSTEM_INLINE _GL_EXTERN_INLINE #include "search.h" =20 -#include - -/* For each byte B, word_start[B] is 1 if B is a single-byte character - that is a word constituent, 0 if B cannot start a word constituent, - and -1 if B might be or might not be the start of a word - constituent. */ -static wint_t word_start[NCHAR]; -verify (WEOF !=3D 0 && WEOF !=3D 1); +/* For each byte B, sbwordchar[B] is true if B is a single-byte + character that is a word constituent, and is false otherwise. */ +static bool sbwordchar[NCHAR]; =20 /* Whether -w considers WC to be a word constituent. */ static bool @@ -42,8 +37,7 @@ void wordinit (void) { for (int i =3D 0; i < NCHAR; i++) - word_start[i] =3D (localeinfo.sbclen[i] =3D=3D -2 ? WEOF - : wordchar (localeinfo.sbctowc[i])); + sbwordchar[i] =3D wordchar (localeinfo.sbctowc[i]); } =20 kwset_t @@ -94,23 +88,46 @@ mb_goback (char const **mb_start, char const *cur, ch= ar const *end) { const char *p =3D *mb_start; const char *p0 =3D p; - mbstate_t cur_state; =20 - memset (&cur_state, 0, sizeof cur_state); + if (cur <=3D p) + return cur - p; =20 - while (p < cur) + if (localeinfo.using_utf8) { - size_t clen =3D mb_clen (p, end - p, &cur_state); - - if ((size_t) -2 <=3D clen) + p =3D cur; + + if (cur < end && (*cur & 0xc0) =3D=3D 0x80) + for (int i =3D 1; i <=3D 3; i++) + if ((cur[-i] & 0xc0) !=3D 0x80) + { + mbstate_t mbs =3D { 0 }; + size_t clen =3D mb_clen (cur - i, end - (cur - i), &mbs); + if (i < clen && clen < (size_t) -2) + { + p0 =3D cur - i; + p =3D p0 + clen; + } + break; + } + } + else + { + mbstate_t mbs =3D { 0 }; + do { - /* An invalid sequence, or a truncated multibyte character. - Treat it as a single byte character. */ - clen =3D 1; - memset (&cur_state, 0, sizeof cur_state); + size_t clen =3D mb_clen (p, end - p, &mbs); + + if ((size_t) -2 <=3D clen) + { + /* An invalid sequence, or a truncated multibyte character= =2E + Treat it as a single byte character. */ + clen =3D 1; + memset (&mbs, 0, sizeof mbs); + } + p0 =3D p; + p +=3D clen; } - p0 =3D p; - p +=3D clen; + while (p < cur); } =20 *mb_start =3D p; @@ -127,11 +144,11 @@ wordchars_count (char const *buf, char const *end, = bool countall) mbstate_t mbs =3D { 0 }; while (n < end - buf) { - wint_t ws =3D word_start[to_uchar (buf[n])]; - if (ws =3D=3D 0) - break; - else if (ws =3D=3D 1) + unsigned char b =3D buf[n]; + if (sbwordchar[b]) n++; + else if (localeinfo.sbclen[b] !=3D -2) + break; else { wchar_t wc =3D 0; @@ -163,19 +180,19 @@ wordchar_next (char const *buf, char const *end) return wordchars_count (buf, end, false); } =20 -/* In the buffer BUF, return true if the character whose encoding +/* In the buffer BUF, return nonzero if the character whose encoding contains the byte before CUR is a word constituent. The buffer ends at END. */ -bool +size_t wordchar_prev (char const *buf, char const *cur, char const *end) { if (buf =3D=3D cur) - return false; - cur--; - wint_t ws =3D word_start[to_uchar (*cur)]; - if (! localeinfo.multibyte) - return ws =3D=3D 1; + return 0; + unsigned char b =3D *--cur; + if (! localeinfo.multibyte + || (localeinfo.using_utf8 && localeinfo.sbclen[b] !=3D -2)) + return sbwordchar[b]; char const *p =3D buf; cur -=3D mb_goback (&p, cur, end); - return wordchar_next (cur, end) !=3D 0; + return wordchar_next (cur, end); } --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA Content-Type: text/x-diff; name="0008-grep-fix-comment-in-searchutils.c.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0008-grep-fix-comment-in-searchutils.c.patch" =46rom 2353c63efb1c471c752a1b6575f81dcafb67684b Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Fri, 23 Dec 2016 17:29:54 -0800 Subject: [PATCH 8/8] grep: fix comment in searchutils.c --- src/searchutils.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/searchutils.c b/src/searchutils.c index 3ba3cdb..1552ed7 100644 --- a/src/searchutils.c +++ b/src/searchutils.c @@ -77,7 +77,7 @@ kwsinit (bool mb_trans) start of a multibyte character or is an error-encoding byte. The buffer ends at END (i.e., one past the address of the buffer's last byte). If CUR is already at a boundary, return 0. If *MB_START is - greater than or equal to CUR, return the negative value CUR - *MB_STA= RT. + greater than CUR, return the negative value CUR - *MB_START. =20 When returning zero, set *MB_START to CUR. When returning a positive value, set *MB_START to the next boundary after CUR, or to --=20 2.7.4 --------------6B4F36F83DA3EA4416933FAA-- From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 26 Dec 2016 12:08:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Paul Eggert Cc: 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.148275402522100 (code D ref 21763); Mon, 26 Dec 2016 12:08:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 26 Dec 2016 12:07:05 +0000 Received: from localhost ([127.0.0.1]:55354 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cLU3U-0005kM-RJ for submit@debbugs.gnu.org; Mon, 26 Dec 2016 07:07:05 -0500 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:55884) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cLU3S-0005jd-Ih for 21763-done@debbugs.gnu.org; Mon, 26 Dec 2016 07:07:03 -0500 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id 59F6E4A090E for <21763-done@debbugs.gnu.org>; Mon, 26 Dec 2016 21:06:55 +0900 (JST) X-matriXscan-loop-detect: 254e150c4fb1c825d91bf0f8eba1195456a39152 Received: from mail01.kcn.ne.jp ([61.86.6.180]) by mxs01-s with ESMTP; Mon, 26 Dec 2016 21:06:54 +0900 (JST) Received: from [10.120.1.67] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail01.kcn.ne.jp (Postfix) with ESMTPA id C96BF5A82E8; Mon, 26 Dec 2016 21:06:53 +0900 (JST) Date: Mon, 26 Dec 2016 21:06:55 +0900 From: Norihiro Tanaka In-Reply-To: <820176a7-a1e4-7b3f-f140-7d2329d25d63@cs.ucla.edu> References: <20161222080441.6C45.27F6AC2D@kcn.ne.jp> <820176a7-a1e4-7b3f-f140-7d2329d25d63@cs.ucla.edu> Message-Id: <20161226210654.554A.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -3.1 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) On Fri, 23 Dec 2016 17:38:42 -0800 Paul Eggert wrote: > No. Thanks, I hadn't considered that possibility. I looked into the > slowdown and installed the attached patches, which cause 'grep' to > run about as fast on this test case as grep 2.25 (though not as fast > as grep 2.26). The main fix is in patch 5. On my platform: Hmm, how about the following test cases, although it is extreame? $ cat pat 0 00 0 00 00 0 00 00 00 0 00 00 00 00 0 00 00 00 00 00 0 00 00 00 00 00 00 0 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 00 00 00 0 $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp $ time -p env LC_ALL=C src/grep -w -f pat inp From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 26 Dec 2016 13:02:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Norihiro Tanaka Cc: Paul Eggert , 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.14827573051811 (code D ref 21763); Mon, 26 Dec 2016 13:02:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 26 Dec 2016 13:01:45 +0000 Received: from localhost ([127.0.0.1]:55385 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cLUuP-0000T8-1v for submit@debbugs.gnu.org; Mon, 26 Dec 2016 08:01:45 -0500 Received: from mail-it0-f67.google.com ([209.85.214.67]:33337) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cLUuM-0000Sp-Rl; Mon, 26 Dec 2016 08:01:43 -0500 Received: by mail-it0-f67.google.com with SMTP id c20so30789116itb.0; Mon, 26 Dec 2016 05:01:42 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=guQGwp8QkhhQv1dXrud9cPRky2JCpUA4K9FhWsSBQwE=; b=YWTxxgV8zuyXZQ2ttNlXh4caomFJ7XoJPpnx8DMkAzCfWjg8hx9ObCkSw50HFmAdKU i3/6siocvaRxNJTgzJcct+M1zKB4+weVG/lD6XM32a3aAcHMyYEZkbsSP4Skjll0Ya0c nFoUZwrbVmn1/+gMvQ4Iu+uOzduVSEp3crgBaOSszlFwMHcSAFQlrlLTfAv9wzlh7kde Eis1H2ASkTnGyr4Fzm0hcu7dqJI4sxmBdkYXua0ePZL9WQPMAY25EWOwBChCrtUdaxWZ UitJA47nRFptP8otqNj3cc5Y8zzEDV6ZQehRhrO5QN9bF9NiV+XsnmRlBIexyWVF9V2I FufQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=guQGwp8QkhhQv1dXrud9cPRky2JCpUA4K9FhWsSBQwE=; b=F7zBe7zWXBdwk0QTKIFZRYIF4d7zljFJVwqnRvVQpgbhlH3xtxCkG+fl5++XymsQW+ hjhfwRG5wpWMLUDgu7tZt8IcC89ZCeZfJ68JoEbBPLVMDdn/tEw/A3AsU9hhv4/m2s8Y sTA96+33y4XzOt60d+NPwv6tbjrRRqSRrDNbqTuvB+z1AZTMJZeDhRQEeR9hX4Cr813J LCI+dch1ZZ0UWXoJxfZTZWMrRnXpWiTphcMDjZhVHMzKoi4mX99Cv3Og7Zfjp3I0gVQv N/5UDyHu+olh3DV3CS8orld2N/fbN+C8BSQfYaDh/MPUcbuPN+yv279+zB8yFfYQnt9U /lbA== X-Gm-Message-State: AIkVDXIZ0bFU9AxAcpKFcnMQ9hDImtMtq6iO/wLlHSfFJh4Eo60T9BD/3O1+cdSDXnxpbFYWUtJXHGXUPF1HGg== X-Received: by 10.36.17.133 with SMTP id 127mr25564240itf.31.1482757296852; Mon, 26 Dec 2016 05:01:36 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.191.5 with HTTP; Mon, 26 Dec 2016 05:01:16 -0800 (PST) In-Reply-To: <20161226210654.554A.27F6AC2D@kcn.ne.jp> References: <20161222080441.6C45.27F6AC2D@kcn.ne.jp> <820176a7-a1e4-7b3f-f140-7d2329d25d63@cs.ucla.edu> <20161226210654.554A.27F6AC2D@kcn.ne.jp> From: Jim Meyering Date: Mon, 26 Dec 2016 14:01:16 +0100 X-Google-Sender-Auth: ARghcFI6yA-cCLNGhp7huJd2GoM Message-ID: Content-Type: text/plain; charset=UTF-8 X-Spam-Score: 3.5 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: On Mon, Dec 26, 2016 at 1:06 PM, Norihiro Tanaka wrote: > > On Fri, 23 Dec 2016 17:38:42 -0800 > Paul Eggert wrote: > >> No. Thanks, I hadn't considered that possibility. I looked into the >> slowdown and installed the attached patches, which cause 'grep' to >> run about as fast on this test case as grep 2.25 (though not as fast >> as grep 2.26). The main fix is in patch 5. On my platform: > > Hmm, how about the following test cases, although it is extreame? > > $ cat pat > 0 > 00 0 > 00 00 0 > 00 00 00 0 > 00 00 00 00 0 > 00 00 00 00 00 0 > 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | > head -1000000 >inp > $ time -p env LC_ALL=C src/grep -w -f pat inp [...] Content analysis details: (3.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 3.0 MANY_TO_CC Sent to 10+ recipients 0.5 RCVD_IN_SORBS_SPAM RBL: SORBS: sender is a spam source [209.85.214.67 listed in dnsbl.sorbs.net] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [209.85.214.67 listed in list.dnswl.org] -0.0 SPF_PASS SPF: sender matches SPF record 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (meyering[at]gmail.com) 0.0 HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different -0.0 RCVD_IN_MSPIKE_H3 RBL: Good reputation (+3) [209.85.214.67 listed in wl.mailspike.net] -0.0 RCVD_IN_MSPIKE_WL Mailspike good senders 0.0 T_DKIM_INVALID DKIM-Signature header exists but is not valid 0.0 FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 3.5 (+++) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: On Mon, Dec 26, 2016 at 1:06 PM, Norihiro Tanaka wrote: > > On Fri, 23 Dec 2016 17:38:42 -0800 > Paul Eggert wrote: > >> No. Thanks, I hadn't considered that possibility. I looked into the >> slowdown and installed the attached patches, which cause 'grep' to >> run about as fast on this test case as grep 2.25 (though not as fast >> as grep 2.26). The main fix is in patch 5. On my platform: > > Hmm, how about the following test cases, although it is extreame? > > $ cat pat > 0 > 00 0 > 00 00 0 > 00 00 00 0 > 00 00 00 00 0 > 00 00 00 00 00 0 > 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | > head -1000000 >inp > $ time -p env LC_ALL=C src/grep -w -f pat inp [...] Content analysis details: (3.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.5 RCVD_IN_SORBS_SPAM RBL: SORBS: sender is a spam source [209.85.214.67 listed in dnsbl.sorbs.net] -0.0 RCVD_IN_MSPIKE_H3 RBL: Good reputation (+3) [209.85.214.67 listed in wl.mailspike.net] -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [209.85.214.67 listed in list.dnswl.org] 3.0 MANY_TO_CC Sent to 10+ recipients -0.0 SPF_PASS SPF: sender matches SPF record 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (meyering[at]gmail.com) 0.0 HEADER_FROM_DIFFERENT_DOMAINS From and EnvelopeFrom 2nd level mail domains are different -0.0 RCVD_IN_MSPIKE_WL Mailspike good senders 0.0 T_DKIM_INVALID DKIM-Signature header exists but is not valid 0.0 FREEMAIL_FORGED_FROMDOMAIN 2nd level domains in From and EnvelopeFrom freemail headers are different On Mon, Dec 26, 2016 at 1:06 PM, Norihiro Tanaka wrote: > > On Fri, 23 Dec 2016 17:38:42 -0800 > Paul Eggert wrote: > >> No. Thanks, I hadn't considered that possibility. I looked into the >> slowdown and installed the attached patches, which cause 'grep' to >> run about as fast on this test case as grep 2.25 (though not as fast >> as grep 2.26). The main fix is in patch 5. On my platform: > > Hmm, how about the following test cases, although it is extreame? > > $ cat pat > 0 > 00 0 > 00 00 0 > 00 00 00 0 > 00 00 00 00 0 > 00 00 00 00 00 0 > 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | > head -1000000 >inp > $ time -p env LC_ALL=C src/grep -w -f pat inp That took 7 seconds for me. Here is one that is O(N^2) in the length of the literal search string: (the search strings are sequences of '0's, with lengths from 10k.. 320k): $ n=10000; for i in $(seq 6); do env time -f "$n %e" src/grep -f <(printf %0${n}d 0) <<<1; ((n *= 2)); done 10000 0.44 20000 1.44 40000 5.41 80000 21.27 160000 97.27 320000 426.72 From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 26 Dec 2016 20:08:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Norihiro Tanaka Cc: 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.148278287825947 (code D ref 21763); Mon, 26 Dec 2016 20:08:03 +0000 Received: (at 21763-done) by debbugs.gnu.org; 26 Dec 2016 20:07:58 +0000 Received: from localhost ([127.0.0.1]:55967 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cLbYs-0006kQ-Bl for submit@debbugs.gnu.org; Mon, 26 Dec 2016 15:07:58 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:54010) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cLbYq-0006k8-Ll; Mon, 26 Dec 2016 15:07:57 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id ACAE9160079; Mon, 26 Dec 2016 12:07:50 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id ekOKOZrEI9Td; Mon, 26 Dec 2016 12:07:49 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id C4C2E16007E; Mon, 26 Dec 2016 12:07:49 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id GMSYi0oqeaut; Mon, 26 Dec 2016 12:07:49 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 90DFD160079; Mon, 26 Dec 2016 12:07:49 -0800 (PST) References: <20161222080441.6C45.27F6AC2D@kcn.ne.jp> <820176a7-a1e4-7b3f-f140-7d2329d25d63@cs.ucla.edu> <20161226210654.554A.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <9236dc72-5ec4-20ed-56d5-454876c82bec@cs.ucla.edu> Date: Mon, 26 Dec 2016 12:07:49 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: <20161226210654.554A.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------4C2E43CAACF24E1AF65CFC08" X-Spam-Score: -3.1 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) This is a multi-part message in MIME format. --------------4C2E43CAACF24E1AF65CFC08 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Norihiro Tanaka wrote: > Hmm, how about the following test cases, although it is extreame? I don't think we need to worry about performance for the case when -w is = given,=20 and a pattern matches data that contains non-word characters. In practice= , such=20 cases are rare. I expect that most users would be surprised that -w can m= atch=20 non-word characters, and that users wouldn't object to -w rejecting such = matches=20 (if this wouldn't hurt performance significantly). While looking into this I did find a very small performance tweak for the= test=20 case, and installed the attached. --------------4C2E43CAACF24E1AF65CFC08 Content-Type: text/x-diff; name="0001-grep-minor-performance-tweak-for-pure-functions.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename*0="0001-grep-minor-performance-tweak-for-pure-functions.patch" =46rom 7ad36793e6bc43b98a37b1512b057a7a03ed10b2 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Mon, 26 Dec 2016 09:42:29 -0800 Subject: [PATCH] grep: minor performance tweak for pure functions * src/search.h (wordchars_size, wordchar_next, wordchar_prev): Declare to be pure. --- src/search.h | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/src/search.h b/src/search.h index 469fc9c..1aa2f94 100644 --- a/src/search.h +++ b/src/search.h @@ -48,9 +48,10 @@ typedef signed char mb_len_map_t; /* searchutils.c */ extern void wordinit (void); extern kwset_t kwsinit (bool); -extern size_t wordchars_size (char const *, char const *); -extern size_t wordchar_next (char const *, char const *); -extern size_t wordchar_prev (char const *, char const *, char const *); +extern size_t wordchars_size (char const *, char const *) _GL_ATTRIBUTE_= PURE; +extern size_t wordchar_next (char const *, char const *) _GL_ATTRIBUTE_P= URE; +extern size_t wordchar_prev (char const *, char const *, char const *) + _GL_ATTRIBUTE_PURE; extern ptrdiff_t mb_goback (char const **, char const *, char const *); =20 /* dfasearch.c */ --=20 2.7.4 --------------4C2E43CAACF24E1AF65CFC08-- From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 28 Dec 2016 00:22:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Paul Eggert Cc: 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.14828844726992 (code D ref 21763); Wed, 28 Dec 2016 00:22:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 28 Dec 2016 00:21:12 +0000 Received: from localhost ([127.0.0.1]:57405 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cM1zU-0001od-0G for submit@debbugs.gnu.org; Tue, 27 Dec 2016 19:21:12 -0500 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:53018) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cM1zR-0001o5-Ng for 21763-done@debbugs.gnu.org; Tue, 27 Dec 2016 19:21:10 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id CDF7D4A08C9 for <21763-done@debbugs.gnu.org>; Wed, 28 Dec 2016 09:21:02 +0900 (JST) X-matriXscan-loop-detect: 812ce1e62af099b3b94bb2ebeb20def189b2a5c4 Received: from mail06.kcn.ne.jp ([61.86.6.185]) by mxs02-s with ESMTP; Wed, 28 Dec 2016 09:21:01 +0900 (JST) Received: from [10.120.1.58] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id 2B3A91BF0092; Wed, 28 Dec 2016 09:21:01 +0900 (JST) Date: Wed, 28 Dec 2016 09:21:02 +0900 From: Norihiro Tanaka In-Reply-To: <9236dc72-5ec4-20ed-56d5-454876c82bec@cs.ucla.edu> References: <20161226210654.554A.27F6AC2D@kcn.ne.jp> <9236dc72-5ec4-20ed-56d5-454876c82bec@cs.ucla.edu> Message-Id: <20161228091446.CE2D.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_5863030800000000CE28_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -3.1 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) --------_5863030800000000CE28_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit On Mon, 26 Dec 2016 12:07:49 -0800 Paul Eggert wrote: > Norihiro Tanaka wrote: > > Hmm, how about the following test cases, although it is extreame? > > I don't think we need to worry about performance for the case when -w > is given, and a pattern matches data that contains non-word > characters. In practice, such cases are rare. I expect that most > users would be surprised that -w can match non-word characters, and > that users wouldn't object to -w rejecting such matches (if this > wouldn't hurt performance significantly). > > While looking into this I did find a very small performance tweak for > the test case, and installed the attached. Thanks. BTW, with multiple patterns in current master, former uses fgrep matcher, and later users grep matcher. I think that it is not reasonable. env LC_ALL=C grep -w -f pat inp env LC_ALL=C grep -F -w -f pat inp So I wrote the patch to use fgrep matcher for both. --------_5863030800000000CE28_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-imorove-performance-with-multiple-patterns.patch" Content-Disposition: attachment; filename="0001-grep-imorove-performance-with-multiple-patterns.patch" Content-Transfer-Encoding: base64 RnJvbSAwOGQ0MjZkOGEwYmY4NjgzYjBlMzRlYWQyZmQzNTYxNjcxNDE0NzM1IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBXZWQsIDI4IERlYyAyMDE2IDA4OjU3OjU0ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZ3Jl cDogaW1vcm92ZSBwZXJmb3JtYW5jZSB3aXRoIG11bHRpcGxlIHBhdHRlcm5zCgoqIHNyYy9ncmVw LmMgKG1haW4pOiBBdm9pZCBmZ3JlcC10by1ncmVwIGNvbnZlcnNpb24gZm9yIHdvcmQgbWF0Y2hp bmcKd2l0aCBtdWx0aXBsZSBwYXR0ZXJucyBpbiBzaW5nbGUgYnl0ZSBsb2NhbGVzLgotLS0KIHNy Yy9ncmVwLmMgfCAgICAyICstCiAxIGZpbGVzIGNoYW5nZWQsIDEgaW5zZXJ0aW9ucygrKSwgMSBk ZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5jIGIvc3JjL2dyZXAuYwppbmRleCBh ZWJhYjIwLi44Y2Y1MjJjIDEwMDY0NAotLS0gYS9zcmMvZ3JlcC5jCisrKyBiL3NyYy9ncmVwLmMK QEAgLTI4NjIsNyArMjg2Miw3IEBAIG1haW4gKGludCBhcmdjLCBjaGFyICoqYXJndikKICAgICAg Zm9yIC1pRi4gICovCiAgIGlmIChtYXRjaGVyID09IEZfTUFUQ0hFUl9JTkRFWAogICAgICAgJiYg KCEgbG9jYWxlaW5mby5tdWx0aWJ5dGUKLSAgICAgICAgICA/IG1hdGNoX3dvcmRzCisgICAgICAg ICAgPyAobl9wYXR0ZXJucyA9PSAxICYmIG1hdGNoX3dvcmRzKQogICAgICAgICAgIDogKGNvbnRh aW5zX2VuY29kaW5nX2Vycm9yIChrZXlzLCBrZXljYykKICAgICAgICAgICAgICB8fCAobWF0Y2hf aWNhc2UgJiYgIWZncmVwX2ljYXNlX2F2YWlsYWJsZSAoa2V5cywga2V5Y2MpKSkpKQogICAgIHsK LS0gCjEuNy4xCgo= --------_5863030800000000CE28_MULTIPART_MIXED_-- From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 28 Dec 2016 06:38:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Norihiro Tanaka Cc: 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.148290705410763 (code D ref 21763); Wed, 28 Dec 2016 06:38:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 28 Dec 2016 06:37:34 +0000 Received: from localhost ([127.0.0.1]:57560 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cM7rh-0002nW-Sa for submit@debbugs.gnu.org; Wed, 28 Dec 2016 01:37:34 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:60530) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cM7rg-0002n9-Hj; Wed, 28 Dec 2016 01:37:33 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 6B55D1600A5; Tue, 27 Dec 2016 22:37:26 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id BWnIFIAgBJi5; Tue, 27 Dec 2016 22:37:25 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id B442D1600E8; Tue, 27 Dec 2016 22:37:25 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 7fO7XX8cJpoy; Tue, 27 Dec 2016 22:37:25 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 842F01600A5; Tue, 27 Dec 2016 22:37:25 -0800 (PST) References: <20161226210654.554A.27F6AC2D@kcn.ne.jp> <9236dc72-5ec4-20ed-56d5-454876c82bec@cs.ucla.edu> <20161228091446.CE2D.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <7ab75385-62ba-8834-70df-f94784ab0526@cs.ucla.edu> Date: Tue, 27 Dec 2016 22:37:25 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.5.1 MIME-Version: 1.0 In-Reply-To: <20161228091446.CE2D.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -3.1 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) Norihiro Tanaka wrote: > So I wrote the patch to use fgrep matcher for both. Thanks, I installed that after tweaking the commit message and omitting unnecessary parens. From unknown Sun Jun 22 03:59:00 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21763: bug#22239: bug#22357: grep -f not only huge memory usage, but also huge time cost Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 28 Dec 2016 14:34:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 21763 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Paul Eggert Cc: 21763-done@debbugs.gnu.org, 22357-done@debbugs.gnu.org, sur-behoffski , "L.A. Walsh" , JQK , 22239@debbugs.gnu.org, Trevor Cordes , Bruno Haible , Ond?ej Cifka Received: via spool by 21763-done@debbugs.gnu.org id=D21763.148293562529099 (code D ref 21763); Wed, 28 Dec 2016 14:34:02 +0000 Received: (at 21763-done) by debbugs.gnu.org; 28 Dec 2016 14:33:45 +0000 Received: from localhost ([127.0.0.1]:57692 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cMFIW-0007ZC-Np for submit@debbugs.gnu.org; Wed, 28 Dec 2016 09:33:44 -0500 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:33567) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cMFIU-0007Yd-Qj for 21763-done@debbugs.gnu.org; Wed, 28 Dec 2016 09:33:43 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 1B0F6C8006 for <21763-done@debbugs.gnu.org>; Wed, 28 Dec 2016 23:33:35 +0900 (JST) X-matriXscan-loop-detect: ef7ef3769c2ce2c32e65267f5bcc2fbac092238c Received: from mail05.kcn.ne.jp ([61.86.6.184]) by mxs02-s with ESMTP; Wed, 28 Dec 2016 23:33:33 +0900 (JST) Received: from [10.120.1.74] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail05.kcn.ne.jp (Postfix) with ESMTPA id F1EE67D0098; Wed, 28 Dec 2016 23:33:32 +0900 (JST) Date: Wed, 28 Dec 2016 23:33:34 +0900 From: Norihiro Tanaka In-Reply-To: <7ab75385-62ba-8834-70df-f94784ab0526@cs.ucla.edu> References: <20161228091446.CE2D.27F6AC2D@kcn.ne.jp> <7ab75385-62ba-8834-70df-f94784ab0526@cs.ucla.edu> Message-Id: <20161228233333.5DE8.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -3.1 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.1 (---) On Tue, 27 Dec 2016 22:37:25 -0800 Paul Eggert wrote: > Norihiro Tanaka wrote: > > So I wrote the patch to use fgrep matcher for both. > > Thanks, I installed that after tweaking the commit message and omitting unnecessary parens. Thanks, I confirmed it.