From unknown Fri Sep 12 15:16:35 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#33249 <33249@debbugs.gnu.org> To: bug#33249 <33249@debbugs.gnu.org> Subject: Status: [PATCH] grep: grouping of patterns including back reference Reply-To: bug#33249 <33249@debbugs.gnu.org> Date: Fri, 12 Sep 2025 22:16:35 +0000 retitle 33249 [PATCH] grep: grouping of patterns including back reference reassign 33249 grep submitter 33249 Norihiro Tanaka severity 33249 normal tag 33249 patch thanks From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 03 07:42:18 2018 Received: (at submit) by debbugs.gnu.org; 3 Nov 2018 11:42:18 +0000 Received: from localhost ([127.0.0.1]:60663 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gIuJl-0005HS-Mt for submit@debbugs.gnu.org; Sat, 03 Nov 2018 07:42:18 -0400 Received: from eggs.gnu.org ([208.118.235.92]:40379) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gIuJi-0005HD-02 for submit@debbugs.gnu.org; Sat, 03 Nov 2018 07:42:14 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gIuJY-0007gQ-V1 for submit@debbugs.gnu.org; Sat, 03 Nov 2018 07:42:08 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:41125) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1gIuJU-0007bH-01 for submit@debbugs.gnu.org; Sat, 03 Nov 2018 07:42:02 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49920) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gIuJS-0000XI-MM for bug-grep@gnu.org; Sat, 03 Nov 2018 07:41:59 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1gIuJN-0007RG-A4 for bug-grep@gnu.org; Sat, 03 Nov 2018 07:41:56 -0400 Received: from mailgw05.kcn.ne.jp ([61.86.7.212]:54963) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1gIuJM-0007N0-OH for bug-grep@gnu.org; Sat, 03 Nov 2018 07:41:53 -0400 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw05.kcn.ne.jp (Postfix) with ESMTP id CCB5288069C for ; Sat, 3 Nov 2018 20:41:48 +0900 (JST) X-matriXscan-loop-detect: 089dc09e96448a539c23c38f5bfe0935893dffd7 Received: from mail06.kcn.ne.jp ([61.86.6.185]) by mxs01-s with ESMTP; Sat, 03 Nov 2018 20:41:47 +0900 (JST) Received: from [10.120.1.92] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id 9ABFE1BF0091 for ; Sat, 3 Nov 2018 20:41:47 +0900 (JST) Date: Sat, 03 Nov 2018 20:41:46 +0900 From: Norihiro Tanaka To: Subject: [PATCH] grep: grouping of patterns including back reference Message-Id: <20181103204120.8D34.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_5BDD777D000000008D2C_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.73 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x [fuzzy] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) --------_5BDD777D000000008D2C_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Hi, When grep uses regex, it splits a pattern with multiple lines by newline character. Compilation and executution run for each fragment. That causes slowdown. By this change, each fragment is divided into groups by whether the fragment includes back reference in a pattern or not. a frgment which includes back reference constitutes group, and all frgments which include back reference also constitute a group. This change extremely speeds-up following case. - before $ yes 00000000000000000000000000000000000000000x | head -10 >in $ time -p env LC_ALL=C src/grep -f pat in real 1.94 user 1.70 sys 0.24 $ yes 00000000000000000000000000000000000000000x | head -100 >in $ time -p env LC_ALL=C src/grep -f pat in real 13.04 user 12.78 sys 0.25 - after $ yes 00000000000000000000000000000000000000000x | head -10 >in $ time -p env LC_ALL=C src/grep -f pat in real 0.75 user 0.56 sys 0.17 $ yes 00000000000000000000000000000000000000000x | head -100 >in $ time -p env LC_ALL=C src/grep -f pat in real 0.74 user 0.57 sys 0.16 Thanks, Norihiro --------_5BDD777D000000008D2C_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-grouping-of-patterns-including-back-reference.patch" Content-Disposition: attachment; filename="0001-grep-grouping-of-patterns-including-back-reference.patch" Content-Transfer-Encoding: base64 RnJvbSA1M2RkMDNlYjBkZjQ3ZDVmMzkwYTM4NjQ3YWUxNTZmZTVjZDYxODA1IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBTYXQsIDMgTm92IDIwMTggMTg6NTY6MTggKzA5MDAKU3ViamVjdDogW1BBVENIXSBncmVw OiBncm91cGluZyBvZiBhIHBhdHRlcm4gd2l0aCBtdWx0aXBsZSBsaW5lcwoKV2hlbiBncmVwIHVz ZXMgcmVnZXgsIGl0IHNwbGl0cyBhIHBhdHRlcm4gd2l0aCBtdWx0aXBsZSBsaW5lcyBieSBuZXds aW5lCmNoYXJhY3RlciBpbnRvIGZyYWdtZW50cy4gIENvbXBpbGF0aW9uIGFuZCBleGVjdXR1dGlv biBydW4gZm9yIGVhY2gKZnJhZ21lbnQuICBUaGF0IGNhdXNlcyBzbG93ZG93bi4gIEJ5IHRoaXMg Y2hhbmdlLCBlYWNoIGZyYWdtZW50IGlzIGRpdmlkZWQKaW50byBncm91cHMgYnkgd2hldGhlciB0 aGUgZnJhZ21lbnQgaW5jbHVkZXMgYmFjayByZWZlcmVuY2UgaW4gYSBwYXR0ZXJuCm9yIG5vdC4g QSBmcmdtZW50IHdoaWNoIGluY2x1ZGVzIGJhY2sgcmVmZXJlbmNlIGNvbnN0aXR1dGVzIGdyb3Vw LCBhbmQgYWxsCmZyZ21lbnRzIHdoaWNoIGluY2x1ZGUgYmFjayByZWZlcmVuY2UgYWxzbyBjb25z dGl0dXRlIGEgZ3JvdXAuCgpUaGlzIGNoYW5nZSBleHRyZW1lbHkgc3BlZWRzLXVwIGZvbGxvd2lu ZyBjYXNlLgoKICAkIHNlcSAtZiAnJTA0MGcnIDAgOTk5OSB8IHNlZCAnMXMvJC9cXCgwXFwpXFwx LycgPnBhdAogICQgeWVzIDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAwMDAw eCB8IGhlYWQgLTEwMDAwID5pbgogICQgdGltZSAtcCBlbnYgTENfQUxMPUMgc3JjL2dyZXAgLWYg cGF0IGluCi0tLQogc3JjL2RmYXNlYXJjaC5jIHwgIDEyOSArKysrKysrKysrKysrKysrKysrKysr KysrKysrKysrKysrKysrKysrKysrKysrLS0tLS0tLS0tCiAxIGZpbGVzIGNoYW5nZWQsIDEwOCBp bnNlcnRpb25zKCspLCAyMSBkZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9zcmMvZGZhc2VhcmNo LmMgYi9zcmMvZGZhc2VhcmNoLmMKaW5kZXggNDZhMGRiMS4uODdlMzA5ZSAxMDA2NDQKLS0tIGEv c3JjL2RmYXNlYXJjaC5jCisrKyBiL3NyYy9kZmFzZWFyY2guYwpAQCAtMTAzLDYgKzEwMyw1OSBA QCBrd3NtdXN0cyAoc3RydWN0IGRmYV9jb21wICpkYykKICAgZGZhbXVzdGZyZWUgKGRtKTsKIH0K IAorc3RhdGljIGJvb2wKK2ZpbmRfYmFja3JlZl9pbl9wYXR0ZXJuIChjb25zdCBjaGFyICprZXlz LCBzaXplX3QgbGVuKQoreworICBmb3IgKDsgbGVuOyBrZXlzKyssIGxlbi0tKQorICAgIGlmIChr ZXlzWzBdID09ICdcXCcpCisgICAgICB7CisgICAgICAgIGlmICgxIDw9IGtleXNbMV0gJiYga2V5 c1sxXSA8PSAnOScpCisgICAgICAgICAgcmV0dXJuIHRydWU7CisKKyAgICAgICAgaWYgKGtleXNb MV0gPT0gJ1xcJykKKyAgICAgICAgICBrZXlzKys7CisgICAgICB9CisKKyAgcmV0dXJuIGZhbHNl OworfQorCitzdGF0aWMgYm9vbAorcmVnZXhfY29tcGlsZSAoc3RydWN0IGRmYV9jb21wICpkYywg Y29uc3QgY2hhciAqcCwgc2l6ZV90IGxlbiwgc2l6ZV90IHBjb3VudCwKKyAgICAgICAgICAgICAg IHNpemVfdCBsaW5lbm8sIGJvb2wgc3ludGF4X29ubHkpCit7CisgIHN0cnVjdCByZV9wYXR0ZXJu X2J1ZmZlciBwYXQwOworICBzdHJ1Y3QgcmVfcGF0dGVybl9idWZmZXIgKnBhdCA9IHN5bnRheF9v bmx5ID8gJnBhdDAgOiAmZGMtPnBhdHRlcm5zW3Bjb3VudF07CisgIHBhdC0+YnVmZmVyID0gTlVM TDsKKyAgcGF0LT5hbGxvY2F0ZWQgPSAwOworCisgIC8qIERvIG5vdCB1c2UgYSBmYXN0bWFwIHdp dGggLWksIHRvIHdvcmsgYXJvdW5kIGdsaWJjIEJ1ZyMyMDM4MS4gICovCisgIHBhdC0+ZmFzdG1h cCA9IChzeW50YXhfb25seSB8fCBtYXRjaF9pY2FzZSkgPyBOVUxMIDogeG1hbGxvYyAoVUNIQVJf TUFYICsgMSk7CisKKyAgcGF0LT50cmFuc2xhdGUgPSBOVUxMOworCisgIGNoYXIgY29uc3QgKmVy ciA9IHJlX2NvbXBpbGVfcGF0dGVybiAocCwgbGVuLCBwYXQpOworCisgIGlmICghZXJyKQorICAg IHJldHVybiB0cnVlOworCisgIC8qIFdpdGggcGF0dGVybnMgc3BlY2lmaWVkIG9ubHkgb24gdGhl IGNvbW1hbmQgbGluZSwgZW1pdCB0aGUgYmFyZQorICAgICBkaWFnbm9zdGljLiAgT3RoZXJ3aXNl LCBpbmNsdWRlIGEgZmlsZW5hbWU6bGluZW5vOiBwcmVmaXguICAqLworICBzaXplX3QgbmV3X2xp bmVubzsKKyAgY29uc3QgY2hhciAqcGF0X2ZpbGVuYW1lOworCisgIGlmIChsaW5lbm8gIT0gKHNp emVfdCkgLTEpCisgICAgcGF0X2ZpbGVuYW1lID0gcGF0dGVybl9maWxlX25hbWUgKGxpbmVubyAr IDEsICZuZXdfbGluZW5vKTsKKyAgZWxzZQorICAgIHBhdF9maWxlbmFtZSA9ICJcMCI7CisKKyAg aWYgKCpwYXRfZmlsZW5hbWUgPT0gJ1wwJykKKyAgICBlcnJvciAoMCwgMCwgIiVzIiwgZXJyKTsK KyAgZWxzZQorICAgIGVycm9yICgwLCAwLCAiJXM6JXp1OiAlcyIsIHBhdF9maWxlbmFtZSwgbmV3 X2xpbmVubywgZXJyKTsKKworICByZXR1cm4gZmFsc2U7Cit9CisKIHZvaWQgKgogR0VBY29tcGls ZSAoY2hhciAqcGF0dGVybiwgc2l6ZV90IHNpemUsIHJlZ19zeW50YXhfdCBzeW50YXhfYml0cykK IHsKQEAgLTEyNiw2ICsxNzksMTEgQEAgR0VBY29tcGlsZSAoY2hhciAqcGF0dGVybiwgc2l6ZV90 IHNpemUsIHJlZ19zeW50YXhfdCBzeW50YXhfYml0cykKICAgYm9vbCBjb21waWxhdGlvbl9mYWls ZWQgPSBmYWxzZTsKICAgc2l6ZV90IHBhbGxvYyA9IDA7CiAKKyAgY2hhciBjb25zdCAqcHJldiA9 IHBhdHRlcm47CisgIGNoYXIgKmJ1ZiA9IE5VTEw7CisgIHNpemVfdCBidWZsZW4gPSAwOworICBz aXplX3QgbGluZW5vID0gMDsKKwogICBkbwogICAgIHsKICAgICAgIHNpemVfdCBsZW47CkBAIC0x MzgsMzkgKzE5Niw2OCBAQCBHRUFjb21waWxlIChjaGFyICpwYXR0ZXJuLCBzaXplX3Qgc2l6ZSwg cmVnX3N5bnRheF90IHN5bnRheF9iaXRzKQogICAgICAgZWxzZQogICAgICAgICBsZW4gPSBwYXRs aW0gLSBwOwogCisgICAgICBib29sIGJhY2tyZWYgPSBmaW5kX2JhY2tyZWZfaW5fcGF0dGVybiAo cCwgbGVuKTsKKworICAgICAgaWYgKGJhY2tyZWYgJiYgcHJldiA8IHApCisgICAgICAgIHsKKyAg ICAgICAgICBsZW4gPSBwIC0gcHJldjsKKyAgICAgICAgICBidWYgPSB4cmVhbGxvYyAoYnVmLCAo YnVmbGVuICsgbGVuKSAqIHNpemVvZiAqYnVmKTsKKyAgICAgICAgICBtZW1jcHkgKGJ1ZiArIGJ1 ZmxlbiwgcCwgbGVuKTsKKyAgICAgICAgICBidWZsZW4gKz0gbGVuOworICAgICAgICB9CisKICAg ICAgIGlmIChwYWxsb2MgPD0gZGMtPnBjb3VudCkKICAgICAgICAgZGMtPnBhdHRlcm5zID0geDJu cmVhbGxvYyAoZGMtPnBhdHRlcm5zLCAmcGFsbG9jLCBzaXplb2YgKmRjLT5wYXR0ZXJucyk7Ci0g ICAgICBzdHJ1Y3QgcmVfcGF0dGVybl9idWZmZXIgKnBhdCA9ICZkYy0+cGF0dGVybnNbZGMtPnBj b3VudF07Ci0gICAgICBwYXQtPmJ1ZmZlciA9IE5VTEw7Ci0gICAgICBwYXQtPmFsbG9jYXRlZCA9 IDA7CiAKLSAgICAgIC8qIERvIG5vdCB1c2UgYSBmYXN0bWFwIHdpdGggLWksIHRvIHdvcmsgYXJv dW5kIGdsaWJjIEJ1ZyMyMDM4MS4gICovCi0gICAgICBwYXQtPmZhc3RtYXAgPSBtYXRjaF9pY2Fz ZSA/IE5VTEwgOiB4bWFsbG9jIChVQ0hBUl9NQVggKyAxKTsKKyAgICAgIGlmICghcmVnZXhfY29t cGlsZSAoZGMsIHAsIGxlbiwgZGMtPnBjb3VudCwgbGluZW5vLCAhYmFja3JlZikpCisgICAgICAg IGNvbXBpbGF0aW9uX2ZhaWxlZCA9IHRydWU7CiAKLSAgICAgIHBhdC0+dHJhbnNsYXRlID0gTlVM TDsKKyAgICAgIHAgPSBzZXA7CisgICAgICBsaW5lbm8rKzsKIAotICAgICAgY2hhciBjb25zdCAq ZXJyID0gcmVfY29tcGlsZV9wYXR0ZXJuIChwLCBsZW4sIHBhdCk7Ci0gICAgICBpZiAoZXJyKQot ICAgICAgICB7Ci0gICAgICAgICAgLyogV2l0aCBwYXR0ZXJucyBzcGVjaWZpZWQgb25seSBvbiB0 aGUgY29tbWFuZCBsaW5lLCBlbWl0IHRoZSBiYXJlCi0gICAgICAgICAgICAgZGlhZ25vc3RpYy4g IE90aGVyd2lzZSwgaW5jbHVkZSBhIGZpbGVuYW1lOmxpbmVubzogcHJlZml4LiAgKi8KLSAgICAg ICAgICBzaXplX3QgbGluZW5vOwotICAgICAgICAgIGNoYXIgY29uc3QgKnBhdF9maWxlbmFtZSA9 IHBhdHRlcm5fZmlsZV9uYW1lIChkYy0+cGNvdW50ICsgMSwKLSAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgJmxpbmVubyk7Ci0gICAgICAgICAg aWYgKCpwYXRfZmlsZW5hbWUgPT0gJ1wwJykKLSAgICAgICAgICAgIGVycm9yICgwLCAwLCAiJXMi LCBlcnIpOwotICAgICAgICAgIGVsc2UKLSAgICAgICAgICAgIGVycm9yICgwLCAwLCAiJXM6JXp1 OiAlcyIsIHBhdF9maWxlbmFtZSwgbGluZW5vLCBlcnIpOwotICAgICAgICAgIGNvbXBpbGF0aW9u X2ZhaWxlZCA9IHRydWU7CisgICAgICBpZiAoYmFja3JlZikKKyAgICAgICAgeyAKKyAgICAgICAg ICBkYy0+cGNvdW50Kys7CisgICAgICAgICAgcHJldiA9IHA7CiAgICAgICAgIH0KLSAgICAgIGRj LT5wY291bnQrKzsKLSAgICAgIHAgPSBzZXA7CiAgICAgfQogICB3aGlsZSAocCk7CiAKICAgaWYg KGNvbXBpbGF0aW9uX2ZhaWxlZCkKICAgICBleGl0IChFWElUX1RST1VCTEUpOwogCisgIGlmIChw cmV2ICE9IE5VTEwpCisgICAgeworICAgICAgaWYgKHBhdHRlcm4gPCBwcmV2KQorICAgICAgICB7 CisgICAgICAgICAgc2l6ZV90IGxlbiA9IHBhdGxpbSAtIHByZXY7CisgICAgICAgICAgYnVmID0g eHJlYWxsb2MgKGJ1ZiwgKGJ1ZmxlbiArIGxlbikgKiBzaXplb2YgKmJ1Zik7CisgICAgICAgICAg bWVtY3B5IChidWYgKyBidWZsZW4sIHByZXYsIGxlbik7CisgICAgICAgICAgYnVmbGVuICs9IGxl bjsKKyAgICAgICAgfQorICAgICAgZWxzZQorICAgICAgICB7CisgICAgICAgICAgYnVmID0gcGF0 dGVybjsKKyAgICAgICAgICBidWZsZW4gPSBzaXplOworICAgICAgICB9CisgICAgfQorCisgIGlm IChidWYgIT0gTlVMTCkKKyAgICB7CisgICAgICBkYy0+cGF0dGVybnMgPSB4Mm5yZWFsbG9jIChk Yy0+cGF0dGVybnMsICZwYWxsb2MsIHNpemVvZiAqZGMtPnBhdHRlcm5zKTsgICAgIAorCisgICAg ICBmb3IgKHNpemVfdCBpID0gMDsgaSA8IGRjLT5wY291bnQ7IGkrKykKKyAgICAgICAgZGMtPnBh dHRlcm5zW2kgKyAxXSA9IGRjLT5wYXR0ZXJuc1tpXTsKKworICAgICAgaWYgKCFyZWdleF9jb21w aWxlIChkYywgYnVmLCBidWZsZW4sIDAsIC0xLCBmYWxzZSkpCisgICAgICAgIGFib3J0ICgpOwor CisgICAgICBkYy0+cGNvdW50Kys7CisKKyAgICAgIGlmIChidWYgIT0gcGF0dGVybikKKyAgICAg ICAgZnJlZSAoYnVmKTsKKyAgICB9CisKICAgLyogSW4gdGhlIG1hdGNoX3dvcmRzIGFuZCBtYXRj aF9saW5lcyBjYXNlcywgd2UgdXNlIGEgZGlmZmVyZW50IHBhdHRlcm4KICAgICAgZm9yIHRoZSBE RkEgbWF0Y2hlciB0aGF0IHdpbGwgcXVpY2tseSB0aHJvdyBvdXQgY2FzZXMgdGhhdCB3b24ndCB3 b3JrLgogICAgICBUaGVuIGlmIERGQSBzdWNjZWVkcyB3ZSBkbyBzb21lIGhhaXJ5IHN0dWZmIHVz aW5nIHRoZSByZWdleCBtYXRjaGVyCi0tIAoxLjcuMQoK --------_5BDD777D000000008D2C_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 03 11:29:48 2018 Received: (at 33249) by debbugs.gnu.org; 3 Nov 2018 15:29:48 +0000 Received: from localhost ([127.0.0.1]:33086 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gIxrv-0006Pw-TI for submit@debbugs.gnu.org; Sat, 03 Nov 2018 11:29:48 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:39684) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gIxru-0006Pk-Rp for 33249@debbugs.gnu.org; Sat, 03 Nov 2018 11:29:47 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D4446160066; Sat, 3 Nov 2018 08:29:40 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id L3xjWMqcUKgF; Sat, 3 Nov 2018 08:29:40 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1682116006E; Sat, 3 Nov 2018 08:29:40 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id rOT8IwnvxSEJ; Sat, 3 Nov 2018 08:29:39 -0700 (PDT) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id DDD5A160066; Sat, 3 Nov 2018 08:29:39 -0700 (PDT) Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back reference To: Norihiro Tanaka , 33249@debbugs.gnu.org References: <20181103204120.8D34.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Sat, 3 Nov 2018 08:29:39 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 In-Reply-To: <20181103204120.8D34.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 33249 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Norihiro Tanaka wrote: > By this change, each fragment is divided into > groups by whether the fragment includes back reference in a pattern or > not. a frgment which includes back reference constitutes group, and all > frgments which include back reference also constitute a group. Surely this is not sufficient. An invocation of grep like this: grep -E '(a b)' should be an error, but with the proposed patch won't it be equivalent to "grep -E '(a|b)'" since the pattern has no back-references? From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 03 19:21:32 2018 Received: (at 33249) by debbugs.gnu.org; 3 Nov 2018 23:21:32 +0000 Received: from localhost ([127.0.0.1]:33211 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gJ5EO-0003lA-4b for submit@debbugs.gnu.org; Sat, 03 Nov 2018 19:21:30 -0400 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:37267) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gJ5EM-0003kz-1p for 33249@debbugs.gnu.org; Sat, 03 Nov 2018 19:21:26 -0400 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id 833CF4A09BB for <33249@debbugs.gnu.org>; Sun, 4 Nov 2018 08:21:22 +0900 (JST) X-matriXscan-loop-detect: 02aa685b651accae75ce69f763b985b57e24b563 Received: from mail03.kcn.ne.jp ([61.86.6.182]) by mxs01-s with ESMTP; Sun, 04 Nov 2018 08:21:20 +0900 (JST) Received: from [10.120.1.92] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail03.kcn.ne.jp (Postfix) with ESMTPA id DCCC01410099; Sun, 4 Nov 2018 08:21:19 +0900 (JST) Date: Sun, 04 Nov 2018 08:21:18 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back reference In-Reply-To: References: <20181103204120.8D34.27F6AC2D@kcn.ne.jp> Message-Id: <20181104082118.8D3C.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.73 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 33249 Cc: 33249@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) On Sat, 3 Nov 2018 08:29:39 -0700 Paul Eggert wrote: > Norihiro Tanaka wrote: > > By this change, each fragment is divided into > > groups by whether the fragment includes back reference in a pattern or > > not. a frgment which includes back reference constitutes group, and all > > frgments which include back reference also constitute a group. > > Surely this is not sufficient. An invocation of grep like this: > > grep -E '(a > b)' > > should be an error, but with the proposed patch won't it be equivalent to "grep -E '(a|b)'" since the pattern has no back-references? Even the pattern has no back-references, compilation by regex run for each line. So Syntax errors will be detected as even your present. 212: if (!regex_compile (dc, p, len, dc->pcount, lineno, !backref)) 213: compilation_failed = true; $ env LC_ALL=C src/grep -E '(a b)' ~/in src/grep: Unmatched ( or \( From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 04 00:02:28 2018 Received: (at 33249) by debbugs.gnu.org; 4 Nov 2018 04:02:28 +0000 Received: from localhost ([127.0.0.1]:33266 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gJ9cK-0002i9-Kj for submit@debbugs.gnu.org; Sun, 04 Nov 2018 00:02:28 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:50502) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gJ9cJ-0002hv-1G for 33249@debbugs.gnu.org; Sun, 04 Nov 2018 00:02:27 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9755016005B; Sat, 3 Nov 2018 21:02:20 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 2AJ23khIUwmO; Sat, 3 Nov 2018 21:02:19 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id E49D1160061; Sat, 3 Nov 2018 21:02:19 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id vy-tBAXsrkWQ; Sat, 3 Nov 2018 21:02:19 -0700 (PDT) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id C519416005B; Sat, 3 Nov 2018 21:02:19 -0700 (PDT) Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back reference To: Norihiro Tanaka References: <20181103204120.8D34.27F6AC2D@kcn.ne.jp> <20181104082118.8D3C.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Sat, 3 Nov 2018 21:02:19 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.2.1 MIME-Version: 1.0 In-Reply-To: <20181104082118.8D3C.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 33249 Cc: 33249@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Norihiro Tanaka wrote: > Even the pattern has no back-references, compilation by regex run for > each line. So Syntax errors will be detected as even your present. OK, but then I'm afraid I don't understand the motivation for the patch. Perhaps if you gave a self-contained example? The examples in the original bug report are incomplete because they're missing the pattern file; when I tried the first one I observed this: $ yes 00000000000000000000000000000000000000000x | head -10 >in $ time -p env LC_ALL=C grep -f pat in grep: pat: No such file or directory real 0.00 user 0.00 sys 0.00 From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 04 00:26:06 2018 Received: (at 33249) by debbugs.gnu.org; 4 Nov 2018 04:26:06 +0000 Received: from localhost ([127.0.0.1]:33272 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gJ9zC-0003J9-KQ for submit@debbugs.gnu.org; Sun, 04 Nov 2018 00:26:06 -0400 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:35593) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gJ9zB-0003Ie-2S for 33249@debbugs.gnu.org; Sun, 04 Nov 2018 00:26:05 -0400 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id 004614A0AB6 for <33249@debbugs.gnu.org>; Sun, 4 Nov 2018 13:25:57 +0900 (JST) X-matriXscan-loop-detect: 02aa685b651accae75ce69f763b985b57e24b563 Received: from mail03.kcn.ne.jp ([61.86.6.182]) by mxs02-s with ESMTP; Sun, 04 Nov 2018 13:25:56 +0900 (JST) Received: from [10.120.1.92] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail03.kcn.ne.jp (Postfix) with ESMTPA id 9D3281410099; Sun, 4 Nov 2018 13:25:56 +0900 (JST) Date: Sun, 04 Nov 2018 13:25:55 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back reference In-Reply-To: References: <20181104082118.8D3C.27F6AC2D@kcn.ne.jp> Message-Id: <20181104132555.8D48.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.73 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 33249 Cc: 33249@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) On Sat, 3 Nov 2018 21:02:19 -0700 Paul Eggert wrote: > Norihiro Tanaka wrote: > > Even the pattern has no back-references, compilation by regex run for > > each line. So Syntax errors will be detected as even your present. > > OK, but then I'm afraid I don't understand the motivation for the patch. Perhaps if you gave a self-contained example? The examples in the original bug report are incomplete because they're missing the pattern file; when I tried the first one I observed this: > > $ yes 00000000000000000000000000000000000000000x | head -10 >in > $ time -p env LC_ALL=C grep -f pat in > grep: pat: No such file or directory > real 0.00 > user 0.00 > sys 0.00 Sorry, It's missing. The pattern for all cases is as following. $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat Norihiro From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 22 19:57:23 2019 Received: (at 33249-done) by debbugs.gnu.org; 23 Dec 2019 00:57:23 +0000 Received: from localhost ([127.0.0.1]:50387 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ijC2F-0002F6-9J for submit@debbugs.gnu.org; Sun, 22 Dec 2019 19:57:23 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:41588) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ijC2C-0002Er-O9 for 33249-done@debbugs.gnu.org; Sun, 22 Dec 2019 19:57:22 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 5B6791600F1; Sun, 22 Dec 2019 16:57:14 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id sKK__HeVVN-m; Sun, 22 Dec 2019 16:57:13 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 03BFC160158; Sun, 22 Dec 2019 16:57:13 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id ykLnDTBPwxak; Sun, 22 Dec 2019 16:57:12 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id CA1CD1600F1; Sun, 22 Dec 2019 16:57:12 -0800 (PST) Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back reference To: Norihiro Tanaka References: <20181104082118.8D3C.27F6AC2D@kcn.ne.jp> <20181104132555.8D48.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <49e7c978-7da6-fc38-1285-7c7c3e6eb50a@cs.ucla.edu> Date: Sun, 22 Dec 2019 16:57:12 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: <20181104132555.8D48.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------961864418DD207CF032A3E2E" Content-Language: en-US X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 33249-done Cc: 33249-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) This is a multi-part message in MIME format. --------------961864418DD207CF032A3E2E Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit On 11/3/18 9:25 PM, Norihiro Tanaka wrote: > $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat Thanks for the test case and sorry about the delay. And thanks for spotting the speedup opportunity. I found a few problems with the proposed patch, though: > + if (keys[1] == '\\') > + keys++; This mishandles the case where the input byte sequence contains '\', '\', '1' where the first '\' is the last byte of a multibyte character. Such a byte sequence can contain backreferences but the function will say it doesn't. > + if (backref && prev < p) > + { > + len = p - prev; > + buf = xrealloc (buf, (buflen + len) * sizeof *buf); > + memcpy (buf + buflen, p, len); > + buflen += len; > + } This seems to have three problems. First, the memcpy copies from P, but it should copy from PREV. Second, this code assigns to LEN, which breaks a later use of LEN. Third, if there are many patterns with backreferences, repeated use of realloc will have O(N**2) behavior. > + for (size_t i = 0; i < dc->pcount; i++) > + dc->patterns[i + 1] = dc->patterns[i]; This copies dc->patterns[0] to the later values in that array, when a memmove was intended. So, after installing the patch, I immediately installed the attached patch, which should address the abovementioned issues. Thanks again. You did the hard work - I merely proofread it. --------------961864418DD207CF032A3E2E Content-Type: text/x-patch; charset=UTF-8; name="0001-grep-fix-some-bugs-in-pattern-grouping-speedup.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename*0="0001-grep-fix-some-bugs-in-pattern-grouping-speedup.patch" >From c2ec762dbc132d3c4a727c8e2ecab2a7286729d6 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sun, 22 Dec 2019 16:39:09 -0800 Subject: [PATCH] grep: fix some bugs in pattern-grouping speedup This fixes some bugs in the previous commit, and should finish the fix for Bug#33249. * NEWS: Mention fix for Bug#33249. * src/dfasearch.c (possible_backrefs_in_pattern, regex_compile) (GEAcompile): In new code, prefer ptrdiff_t to size_t when either will do, since ptrdiff_t has better error checking. At some point we should adjust the old code too. (possible_backrefs_in_pattern): Rename from find_backref_in_pattern. New arg BS_SAFE. All uses changed. Fix false negative if a multibyte character ends in a single '\\' byte, followed by the two bytes '\\', '1'. (regex_compile): Simplify. (GEAcompile): Avoid quadratic behavior when reallocating growing buffers. Fix a couple of bugs in copying pattern data involving backreferences. Fix another bug in copying pattern metadata involving backreferences, by removing the need to copy it. --- NEWS | 4 ++ src/dfasearch.c | 125 ++++++++++++++++++++++++++++++------------------ 2 files changed, 82 insertions(+), 47 deletions(-) diff --git a/NEWS b/NEWS index 8eda865..718659c 100644 --- a/NEWS +++ b/NEWS @@ -21,6 +21,10 @@ GNU grep NEWS -*- outline -*- output is /dev/null. [Bug#37716 introduced in grep 3.2] + A performance bug has been fixed when grep is given many patterns + each of which lack backreferences. + [Bug#33249 introduced in grep 2.5] + A performance bug has been fixed for patterns like '01.2' that cause grep to reorder tokens internally. [Bug#34951 introduced in grep 3.2] diff --git a/src/dfasearch.c b/src/dfasearch.c index 42d16dc..eb7732b 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -35,7 +35,7 @@ struct dfa_comp struct dfa *dfa; /* Regex compiled regexps. */ - struct re_pattern_buffer* patterns; + struct re_pattern_buffer *patterns; size_t pcount; struct re_registers regs; @@ -103,25 +103,49 @@ kwsmusts (struct dfa_comp *dc) dfamustfree (dm); } -static bool -find_backref_in_pattern (const char *keys, size_t len) +/* Return true if KEYS, of length LEN, might contain a backreference. + Return false if KEYS cannot contain a backreference. + BS_SAFE is true of encodings where a backslash cannot appear as the + last byte of a multibyte character. */ +static bool _GL_ATTRIBUTE_PURE +possible_backrefs_in_pattern (char const *keys, ptrdiff_t len, bool bs_safe) { - for (; len; keys++, len--) - if (keys[0] == '\\') - { - if (1 <= keys[1] && keys[1] <= '9') - return true; - - if (keys[1] == '\\') - keys++; - } - + /* Normally a backslash, but in an unsafe encoding this is a a + non-char value so that the comparison below always fails, because + if there are two adjacent '\' bytes the first might the last byte + of a multibyte character. */ + int second_backslash = bs_safe ? '\\' : CHAR_MAX + 1; + + /* This code can return true even if KEYS lacks a backreference, for + patterns like [\2], or for encodings where '\' appears as + the last byte of a multibyte character. However, false alarms + should be rare and do not affect correctness. */ + + /* Do not look for a backslash in the pattern's last byte, since it + can't be part of a backreference and this streamlines the code. */ + len--; + + if (0 <= len) + { + char const *lim = keys + len; + for (char const *p = keys; (p = memchr (p, '\\', lim - p)); p++) + { + if ('1' <= p[1] && p[1] <= '9') + return true; + if (p[1] == second_backslash) + { + p++; + if (p == lim) + break; + } + } + } return false; } static bool -regex_compile (struct dfa_comp *dc, const char *p, size_t len, size_t pcount, - size_t lineno, bool syntax_only) +regex_compile (struct dfa_comp *dc, char const *p, ptrdiff_t len, + ptrdiff_t pcount, ptrdiff_t lineno, bool syntax_only) { struct re_pattern_buffer pat0; struct re_pattern_buffer *pat = syntax_only ? &pat0 : &dc->patterns[pcount]; @@ -129,29 +153,23 @@ regex_compile (struct dfa_comp *dc, const char *p, size_t len, size_t pcount, pat->allocated = 0; /* Do not use a fastmap with -i, to work around glibc Bug#20381. */ - pat->fastmap = (syntax_only || match_icase) ? NULL : xmalloc (UCHAR_MAX + 1); + pat->fastmap = syntax_only | match_icase ? NULL : xmalloc (UCHAR_MAX + 1); pat->translate = NULL; char const *err = re_compile_pattern (p, len, pat); - if (!err) return true; - /* With patterns specified only on the command line, emit the bare - diagnostic. Otherwise, include a filename:lineno: prefix. */ - size_t new_lineno; - const char *pat_filename; - - if (lineno != (size_t) -1) - pat_filename = pattern_file_name (lineno + 1, &new_lineno); - else - pat_filename = "\0"; + /* Emit a filename:lineno: prefix for patterns taken from files. */ + size_t pat_lineno = lineno; + char const *pat_filename + = lineno < 0 ? "\0" : pattern_file_name (lineno + 1, &pat_lineno); if (*pat_filename == '\0') error (0, 0, "%s", err); else - error (0, 0, "%s:%zu: %s", pat_filename, new_lineno, err); + error (0, 0, "%s:%zu: %s", pat_filename, pat_lineno, err); return false; } @@ -169,6 +187,7 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits) re_set_syntax (syntax_bits); int dfaopts = eolbyte ? 0 : DFA_EOL_NUL; dfasyntax (dc->dfa, &localeinfo, syntax_bits, dfaopts); + bool bs_safe = !localeinfo.multibyte | localeinfo.using_utf8; /* For GNU regex, pass the patterns separately to detect errors like "[\nallo\n]\n", where the patterns are "[", "allo" and "]", and @@ -177,12 +196,20 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits) char const *p = pattern; char const *patlim = pattern + size; bool compilation_failed = false; - size_t palloc = 0; + + dc->patterns = xmalloc (sizeof *dc->patterns); + dc->patterns++; + dc->pcount = 0; + size_t palloc = 1; char const *prev = pattern; + + /* Buffer containing backreference-free patterns. */ char *buf = NULL; - size_t buflen = 0; - size_t lineno = 0; + ptrdiff_t buflen = 0; + size_t bufalloc = 0; + + ptrdiff_t lineno = 0; do { @@ -196,18 +223,26 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits) else len = patlim - p; - bool backref = find_backref_in_pattern (p, len); + bool backref = possible_backrefs_in_pattern (p, len, bs_safe); if (backref && prev < p) { - len = p - prev; - buf = xrealloc (buf, (buflen + len) * sizeof *buf); - memcpy (buf + buflen, p, len); - buflen += len; + ptrdiff_t prevlen = p - prev; + while (bufalloc < buflen + prevlen) + buf = x2realloc (buf, &bufalloc); + memcpy (buf + buflen, prev, prevlen); + buflen += prevlen; } - if (palloc <= dc->pcount) - dc->patterns = x2nrealloc (dc->patterns, &palloc, sizeof *dc->patterns); + /* Ensure room for at least two more patterns. The extra one is + for the regex_compile that may be executed after this loop + exits, and its (unused) slot is patterns[-1] until then. */ + while (palloc <= dc->pcount + 1) + { + dc->patterns = x2nrealloc (dc->patterns - 1, &palloc, + sizeof *dc->patterns); + dc->patterns++; + } if (!regex_compile (dc, p, len, dc->pcount, lineno, !backref)) compilation_failed = true; @@ -230,10 +265,10 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits) { if (pattern < prev) { - size_t len = patlim - prev; - buf = xrealloc (buf, (buflen + len) * sizeof *buf); - memcpy (buf + buflen, prev, len); - buflen += len; + ptrdiff_t prevlen = patlim - prev; + buf = xrealloc (buf, buflen + prevlen); + memcpy (buf + buflen, prev, prevlen); + buflen += prevlen; } else { @@ -244,16 +279,12 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits) if (buf != NULL) { - dc->patterns = x2nrealloc (dc->patterns, &palloc, sizeof *dc->patterns); - - for (size_t i = 0; i < dc->pcount; i++) - dc->patterns[i + 1] = dc->patterns[i]; + dc->patterns--; + dc->pcount++; if (!regex_compile (dc, buf, buflen, 0, -1, false)) abort (); - dc->pcount++; - if (buf != pattern) free (buf); } -- 2.17.1 --------------961864418DD207CF032A3E2E-- From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 23 10:09:14 2019 Received: (at 33249-done) by debbugs.gnu.org; 23 Dec 2019 15:09:14 +0000 Received: from localhost ([127.0.0.1]:51567 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ijPKc-00021H-EE for submit@debbugs.gnu.org; Mon, 23 Dec 2019 10:09:14 -0500 Received: from mailgw05.kcn.ne.jp ([61.86.7.212]:51644) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ijPKa-00020x-OC for 33249-done@debbugs.gnu.org; Mon, 23 Dec 2019 10:09:13 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw05.kcn.ne.jp (Postfix) with ESMTP id 738F1880671 for <33249-done@debbugs.gnu.org>; Tue, 24 Dec 2019 00:09:05 +0900 (JST) X-matriXscan-loop-detect: 37122e158c711b409056f4ca1124a51384c2ef79 Received: from mail10.kcn.ne.jp ([61.86.6.128]) by mxs02-s with ESMTP; Tue, 24 Dec 2019 00:09:05 +0900 (JST) Received: from [10.120.1.131] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail10.kcn.ne.jp (Postfix) with ESMTPA id EA4FA40EC501; Tue, 24 Dec 2019 00:09:04 +0900 (JST) Date: Tue, 24 Dec 2019 00:09:03 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#33249: [PATCH] grep: grouping of patterns including back reference In-Reply-To: <49e7c978-7da6-fc38-1285-7c7c3e6eb50a@cs.ucla.edu> References: <20181104132555.8D48.27F6AC2D@kcn.ne.jp> <49e7c978-7da6-fc38-1285-7c7c3e6eb50a@cs.ucla.edu> Message-Id: <20191224000902.0E88.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.74.02 [ja] X-matriXscan-msec-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 33249-done Cc: 33249-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) On Sun, 22 Dec 2019 16:57:12 -0800 Paul Eggert wrote: > On 11/3/18 9:25 PM, Norihiro Tanaka wrote: > > > $ seq -f '%040g' 0 9999 | sed '1s/$/\\(0\\)\\1/' >pat > > Thanks for the test case and sorry about the delay. And thanks for spotting the > speedup opportunity. I found a few problems with the proposed patch, though: > > > + if (keys[1] == '\\') > > + keys++; > > This mishandles the case where the input byte sequence contains '\', '\', '1' > where the first '\' is the last byte of a multibyte character. Such a byte > sequence can contain backreferences but the function will say it doesn't. > > > + if (backref && prev < p) > > + { > > + len = p - prev; > > + buf = xrealloc (buf, (buflen + len) * sizeof *buf); > > + memcpy (buf + buflen, p, len); > > + buflen += len; > > + } > > This seems to have three problems. First, the memcpy copies from P, but it > should copy from PREV. Second, this code assigns to LEN, which breaks a later > use of LEN. Third, if there are many patterns with backreferences, repeated use > of realloc will have O(N**2) behavior. > > > + for (size_t i = 0; i < dc->pcount; i++) > > + dc->patterns[i + 1] = dc->patterns[i]; > > This copies dc->patterns[0] to the later values in that array, when a memmove > was intended. > > So, after installing the patch, I immediately installed the attached patch, > which should address the abovementioned issues. > > Thanks again. You did the hard work - I merely proofread it. Sorry and thanks for the review and fixing. possible_backrefs_in_pattern is greatful! I can't find any solution better than it. From unknown Fri Sep 12 15:16:35 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Tue, 21 Jan 2020 12:24:03 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator