From unknown Fri Jun 20 07:20:37 2025
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.509 (Entity 5.509)
Content-Type: text/plain; charset=utf-8
From: bug#22103 <22103@debbugs.gnu.org>
To: bug#22103 <22103@debbugs.gnu.org>
Subject: Status: [PATCH] grep: improve performance for grep -P in UTF-8
Reply-To: bug#22103 <22103@debbugs.gnu.org>
Date: Fri, 20 Jun 2025 14:20:37 +0000

retitle 22103 [PATCH] grep: improve performance for grep -P in UTF-8
reassign 22103 grep
submitter 22103 Norihiro Tanaka
severity 22103 normal
tag 22103 patch
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 06 18:01:55 2015
Date: Mon, 07 Dec 2015 08:01:23 +0900
From: Norihiro Tanaka
To:
Subject: [PATCH] grep: improve performance for grep -P in UTF-8
Message-Id: <20151207080123.8BBA.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------_5664B90F000000008BAB_MULTIPART_MIXED_"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]

--------_5664B90F000000008BAB_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

After grep -P finds its first match, the TEXTBIN_UNKNOWN optimizations are no longer used. Therefore, if grep -P finds a match early, it is very slow in UTF-8.

$ time -p grep -P ^1$ <(seq 999999)
1
real 14.55
user 13.77
sys 1.12

Likewise, grep -Pa does not use the TEXTBIN_UNKNOWN optimizations. Therefore, it is also very slow in UTF-8.

$ time -p grep -Pa a <(seq 999999)
real 14.53
user 13.65
sys 1.35

This change defers leaving the TEXTBIN_UNKNOWN optimizations until grep -P finds a binary character. It brings a more than 10x speedup.

$ time -p src/grep -P ^1$ <(seq 999999)
1
real 0.97
user 0.79
sys 0.24

$ time -p src/grep -Pa a <(seq 999999)
real 0.98
user 0.23
sys 0.99

BTW, this change conflicts with the proposal in bug#22028.

--------_5664B90F000000008BAB_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-improve-performance-for-grep-P-in-UTF-8.patch"
Content-Disposition: attachment; filename="0001-grep-improve-performance-for-grep-P-in-UTF-8.patch"
Content-Transfer-Encoding: base64

RnJvbSAyY2Y5ODU5NGUxYjdjZTc0OTBkMGI2ZDc1NTFmNTJkNjVjY2Q0NGE0IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBUaHUsIDI2IE5vdiAyMDE1IDE1OjM0OjEzICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZ3Jl cDogaW1wcm92ZSBwZXJmb3JtYW5jZSBmb3IgZ3JlcCAtUCBpbiBVVEYtOAoKZ3JlcCAtUCB1c2Vz IGxpbmUgYnkgbGluZSBzZWFyY2ggYWZ0ZXIgZm91bmQgZmlyc3QgbWF0Y2ggb3Igc3BlY2lmaWVk IC1hCm9wdGlvbiwgYnV0IGl0IGlzIHZlcnkgc2xvdy4gIFRoaXMgY2hhbmdlIGFsc28gdHJpZXMg dG8gdXNlIG11bHRpLWxpbmUKc2VhcmNoIGFmdGVyIHRoZW0gdW50aWwgZm91bmQgbm90IHRleHQg Y2hhcmFjdGVyLgoKKiBzcmMvZ3JlcC5jIChncmVwKTogRG8gaXQuCiogTkVXUzogTWVudGlvbiBp dC4KLS0tCiBORVdTICAgICAgIHwgIDYgKysrKysrCiBzcmMvZ3JlcC5jIHwgMjggKysrKysrKysr KysrKystLS0tLS0tLS0tLS0tLQogMiBmaWxlcyBjaGFuZ2VkLCAyMCBpbnNlcnRpb25zKCspLCAx NCBkZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9ORVdTIGIvTkVXUwppbmRleCBhYzYzMmQ3Li5h OWE3MDQyIDEwMDY0NAotLS0gYS9ORVdTCisrKyBiL05FV1MKQEAgLTIsNiArMiwxMiBAQCBHTlUg Z3JlcCBORVdTICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgLSotIG91dGxpbmUg LSotCiAKICogTm90ZXdvcnRoeSBjaGFuZ2VzIGluIHJlbGVhc2UgPy4/ICg/Pz8/LT8/LT8/KSBb 
P10KIAorKiogSW1wcm92ZW1lbnRzCisKKyAgUGVyZm9ybWFuY2UgaGFzIGltcHJvdmVkIGZvciBn cmVwIC1QIGluIFVURi04LiAgQmVmb3JlLCBjb21tYW5kcworICBsaWtlIHRoZSBmb2xsb3dpbmcg d291bGQgc3BlZWQgdXAgbW9yZSB0aGFuIDEweDoKKyAgICBncmVwIC1QIF4xJCA8KHNlcSA5OTk5 OTkpCisgICAgZ3JlcCAtYVAgYSA8KHNlcSA5OTk5OTkpCiAKICogTm90ZXdvcnRoeSBjaGFuZ2Vz IGluIHJlbGVhc2UgMi4yMiAoMjAxNS0xMS0wMSkgW3N0YWJsZV0KIApkaWZmIC0tZ2l0IGEvc3Jj L2dyZXAuYyBiL3NyYy9ncmVwLmMKaW5kZXggMmM1ZTA5YS4uYTFlZTE4MyAxMDA2NDQKLS0tIGEv c3JjL2dyZXAuYworKysgYi9zcmMvZ3JlcC5jCkBAIC0xMzQ1LDcgKzEzNDUsNyBAQCBncmVwIChp bnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgICAgIHJldHVybiAwOwogICAgIH0KIAot ICBpZiAoYmluYXJ5X2ZpbGVzID09IFRFWFRfQklOQVJZX0ZJTEVTKQorICBpZiAoYmluYXJ5X2Zp bGVzID09IFRFWFRfQklOQVJZX0ZJTEVTICYmIGV4ZWN1dGUgIT0gUGV4ZWN1dGUpCiAgICAgdGV4 dGJpbiA9IFRFWFRCSU5fVEVYVDsKICAgZWxzZQogICAgIHsKQEAgLTE0MTUsMTMgKzE0MTUsOCBA QCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgICAgICAgfQogCiAgICAg ICAvKiBEZXRlY3Qgd2hldGhlciBsZWFkaW5nIGNvbnRleHQgaXMgYWRqYWNlbnQgdG8gcHJldmlv dXMgb3V0cHV0LiAgKi8KLSAgICAgIGlmIChsYXN0b3V0KQotICAgICAgICB7Ci0gICAgICAgICAg aWYgKHRleHRiaW4gPT0gVEVYVEJJTl9VTktOT1dOKQotICAgICAgICAgICAgdGV4dGJpbiA9IFRF WFRCSU5fVEVYVDsKLSAgICAgICAgICBpZiAoYmVnICE9IGxhc3RvdXQpCi0gICAgICAgICAgICBs YXN0b3V0ID0gMDsKLSAgICAgICAgfQorICAgICAgaWYgKGJlZyAhPSBsYXN0b3V0KQorICAgICAg ICBsYXN0b3V0ID0gTlVMTDsKIAogICAgICAgLyogSGFuZGxlIHNvbWUgZGV0YWlscyBhbmQgcmVh ZCBtb3JlIGRhdGEgdG8gc2Nhbi4gICovCiAgICAgICBzYXZlID0gcmVzaWR1ZSArIGxpbSAtIGJl ZzsKQEAgLTE0NDIsMTIgKzE0MzcsMTcgQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25z dCAqc3QpCiAgICAgICAgICAgZW51bSB0ZXh0YmluIHRiID0gYnVmZmVyX3RleHRiaW4gKGJ1ZmJl ZywgYnVmbGltIC0gYnVmYmVnKTsKICAgICAgICAgICBpZiAodGV4dGJpbl9pc19iaW5hcnkgKHRi KSkKICAgICAgICAgICAgIHsKLSAgICAgICAgICAgICAgaWYgKGJpbmFyeV9maWxlcyA9PSBXSVRI T1VUX01BVENIX0JJTkFSWV9GSUxFUykKLSAgICAgICAgICAgICAgICByZXR1cm4gMDsKLSAgICAg ICAgICAgICAgdGV4dGJpbiA9IHRiOwotICAgICAgICAgICAgICBkb25lX29uX21hdGNoID0gb3V0 
X3F1aWV0ID0gdHJ1ZTsKLSAgICAgICAgICAgICAgbnVsX3phcHBlciA9IGVvbDsKLSAgICAgICAg ICAgICAgc2tpcF9udWxzID0gc2tpcF9lbXB0eV9saW5lczsKKyAgICAgICAgICAgICAgaWYgKG5s aW5lcyB8fCBiaW5hcnlfZmlsZXMgPT0gVEVYVF9CSU5BUllfRklMRVMpCisgICAgICAgICAgICAg ICAgdGV4dGJpbiA9IFRFWFRCSU5fVEVYVDsKKyAgICAgICAgICAgICAgZWxzZQorICAgICAgICAg ICAgICAgIHsKKyAgICAgICAgICAgICAgICAgIGlmIChiaW5hcnlfZmlsZXMgPT0gV0lUSE9VVF9N QVRDSF9CSU5BUllfRklMRVMpCisgICAgICAgICAgICAgICAgICAgIHJldHVybiAwOworICAgICAg ICAgICAgICAgICAgdGV4dGJpbiA9IHRiOworICAgICAgICAgICAgICAgICAgZG9uZV9vbl9tYXRj aCA9IG91dF9xdWlldCA9IHRydWU7CisgICAgICAgICAgICAgICAgICBudWxfemFwcGVyID0gZW9s OworICAgICAgICAgICAgICAgICAgc2tpcF9udWxzID0gc2tpcF9lbXB0eV9saW5lczsKKyAgICAg ICAgICAgICAgICB9CiAgICAgICAgICAgICB9CiAgICAgICAgIH0KICAgICB9Ci0tIAoyLjQuNgoK
--------_5664B90F000000008BAB_MULTIPART_MIXED_--

From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 08 08:46:44 2016
Date: Fri, 08 Jan 2016 22:46:33 +0900
From: Norihiro Tanaka
To: 22103-done@debbugs.gnu.org
Subject: Re: bug#20526: grep BUG: text file is detected as binary
In-Reply-To: <568D559A.6050000@cs.ucla.edu>
References: <568CD111.5010801@cs.ucla.edu> <568D559A.6050000@cs.ucla.edu>
Message-Id: <20160108224632.A9BA.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]

On Wed, 6 Jan 2016 09:57:46 -0800
Paul Eggert wrote:

> On 01/06/2016 12:32 AM, Paul Eggert wrote:
> > I installed the attached patch, which fixed this performance bug for me.
> Whoops! I forgot to 'git add src/search.h' before committing. We also need the attached followup patch, which I installed.

Great! Thanks. Many issues, including the output of invalid sequences, are fixed by your patches. bug#22103 is also fixed by them, so I am closing it.

From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 08 16:35:40 2016
MIME-Version: 1.0
In-Reply-To: <20160108224632.A9BA.27F6AC2D@kcn.ne.jp>
References: <568CD111.5010801@cs.ucla.edu> <568D559A.6050000@cs.ucla.edu> <20160108224632.A9BA.27F6AC2D@kcn.ne.jp>
From: Jim Meyering
Date: Fri, 8 Jan 2016 13:35:12 -0800
Message-ID:
Subject: Re: bug#22103: bug#20526: grep BUG: text file is detected as binary
To: 22103@debbugs.gnu.org, Norihiro Tanaka
Cc: 22103-done@debbugs.gnu.org
Content-Type: text/plain; charset=UTF-8

On Fri, Jan 8, 2016 at 5:46 AM, Norihiro Tanaka wrote:
>
> On Wed, 6 Jan 2016 09:57:46 -0800
> Paul Eggert wrote:
>
>> On 01/06/2016 12:32 AM, Paul Eggert wrote:
>> > I installed the attached patch, which fixed this performance bug for me.
>> Whoops! I forgot to 'git add src/search.h' before committing. We also need the attached followup patch, which I installed.
>
> Great! Thanks, many issues including for output of invalid sequence
> are fixed by your patches. bug#22103 is also fixed in them, so I am
> closing it.

Thank you for helping with bug triage.

From unknown Fri Jun 20 07:20:37 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Sat, 06 Feb 2016 12:24:03 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator