From unknown Sun Jun 22 07:29:47 2025
From: bug#31074 <31074@debbugs.gnu.org>
To: bug#31074 <31074@debbugs.gnu.org>
Subject: Status: Grep -i is slow
Reply-To: bug#31074 <31074@debbugs.gnu.org>
Date: Sun, 22 Jun 2025 14:29:47 +0000

retitle 31074 Grep -i is slow
reassign 31074 grep
submitter 31074 Geoff Kuenning
severity 31074 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Fri Apr 06 01:33:01 2018
From: Geoff Kuenning
To: bug-grep@gnu.org
Subject: Grep -i is slow
Date: Thu, 05 Apr 2018 22:32:41 -0700

The -i switch is slow when searching large files.  I haven't dug into
the code in detail, although it seems that dfa.c is trying to build an
intelligent case-agnostic DFA when -i is specified.  But that doesn't
seem to be working.  Perhaps that's because I'm running the UTF-8
character set?  Although I don't see why that would affect the DFA.

Here's an example of timing several greps of a 151M file named
"rawindex", which has already been read so that it is in the file
system buffer cache.  In each case the grep finds a single match, since
the matched line is actually all lowercase; for privacy, I have omitted
the match lines themselves.

A straightforward match takes only 199 ms even with two .* patterns.
Adding -i blows that up to 6917 ms.  Finally, when I write an explicit
case-agnostic pattern to force how the DFA is built, it does run slower
(532 ms) but it's nowhere near the -i time.

mallet:514> time grep outgoing.*harris.*dcraw rawindex

real    0m0.199s
user    0m0.170s
sys     0m0.029s
mallet:515> time grep -i outgoing.*harris.*dcraw rawindex

real    0m6.917s
user    0m6.879s
sys     0m0.036s
mallet:516> time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex

real    0m0.532s
user    0m0.491s
sys     0m0.040s
--
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

The DMCA criminalizes curiosity.  It would put Susie in jail for taking
her stereo apart to see how it works.

From debbugs-submit-bounces@debbugs.gnu.org Fri Apr 06 15:35:39 2018
Subject: Re: bug#31074: Grep -i is slow
To: Geoff Kuenning, 31074@debbugs.gnu.org
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <48fa7a54-9396-842d-e51d-892693803f8e@cs.ucla.edu>
Date: Fri, 6 Apr 2018 12:35:30 -0700

It sounds like you've run into a bug that was fixed in grep 2.18
(2014-02-20). Please try grep 3.1, the current version. If that doesn't
work, it'd be helpful if you could give us a reproducible test case.
Here's how I tried (and failed) to reproduce the problem on Fedora 27
x86-64, which has grep 3.1:

$ shuf -i 1-20000000 >rawindex
$ ls -l rawindex
-rw-r--r--. 1 eggert eggert 168888897 Apr  6 12:30 rawindex
$ time grep outgoing.*harris.*dcraw rawindex

real    0m0.069s
user    0m0.013s
sys     0m0.055s
$ time grep -i outgoing.*harris.*dcraw rawindex

real    0m0.418s
user    0m0.368s
sys     0m0.048s
$ time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex

real    0m0.416s
user    0m0.357s
sys     0m0.058s
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

From debbugs-submit-bounces@debbugs.gnu.org Mon Apr 09 00:56:29 2018
From: Geoff Kuenning
To: Paul Eggert
Subject: Re: bug#31074: Grep -i is slow
References: <48fa7a54-9396-842d-e51d-892693803f8e@cs.ucla.edu>
Date: Sun, 08 Apr 2018 21:56:25 -0700
In-Reply-To: <48fa7a54-9396-842d-e51d-892693803f8e@cs.ucla.edu> (Paul Eggert's message of "Fri, 6
 Apr 2018 12:35:30 -0700")
Cc: 31074@debbugs.gnu.org

Nice catch!  It looks like I'm running grep 2.16.  No clue why my
distro is more than four years behind.  I'll whine at them.  In the
meantime, let's close this bug; I can reopen it in the unlikely event
that it's still there when I get around to upgrading.

> It sounds like you've run into a bug that was fixed in grep 2.18
> (2014-02-20). Please try grep 3.1, the current version. If that
> doesn't work, it'd be helpful if you could give us a reproducible
> test case. Here's how I tried (and failed) to reproduce the problem
> on Fedora 27 x86-64, which has grep 3.1:
>
> $ shuf -i 1-20000000 >rawindex
> $ ls -l rawindex
> -rw-r--r--. 1 eggert eggert 168888897 Apr  6 12:30 rawindex
> $ time grep outgoing.*harris.*dcraw rawindex
>
> real    0m0.069s
> user    0m0.013s
> sys     0m0.055s
> $ time grep -i outgoing.*harris.*dcraw rawindex
>
> real    0m0.418s
> user    0m0.368s
> sys     0m0.048s
> $ time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex
>
> real    0m0.416s
> user    0m0.357s
> sys     0m0.058s
> $ locale
> LANG=en_US.UTF-8
> LC_CTYPE="en_US.UTF-8"
> LC_NUMERIC="en_US.UTF-8"
> LC_TIME="en_US.UTF-8"
> LC_COLLATE="en_US.UTF-8"
> LC_MONETARY="en_US.UTF-8"
> LC_MESSAGES="en_US.UTF-8"
> LC_PAPER="en_US.UTF-8"
> LC_NAME="en_US.UTF-8"
> LC_ADDRESS="en_US.UTF-8"
> LC_TELEPHONE="en_US.UTF-8"
> LC_MEASUREMENT="en_US.UTF-8"
> LC_IDENTIFICATION="en_US.UTF-8"
> LC_ALL=
>

--
    Geoff Kuenning   geoff@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

I have always wished for my computer to be as easy to use as my
telephone; my wish has come true because I can no longer figure out
how to use my telephone.
    -- Bjarne Stroustrup

From debbugs-submit-bounces@debbugs.gnu.org Mon Apr 09 14:38:46 2018
To: GNU bug control
From: Paul Eggert
Subject: close 31074
Organization: UCLA Computer Science Department
Message-ID: <6893daf3-3445-271e-51ad-18bcfb49adff@cs.ucla.edu>
Date: Mon, 9 Apr 2018 11:38:37 -0700

close 31074

From unknown Sun Jun 22 07:29:47 2025
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Tue, 08 May 2018 11:24:04 +0000

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator