From unknown Wed Jun 25 00:24:01 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#22655 <22655@debbugs.gnu.org> To: bug#22655 <22655@debbugs.gnu.org> Subject: Status: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) Reply-To: bug#22655 <22655@debbugs.gnu.org> Date: Wed, 25 Jun 2025 07:24:01 +0000 retitle 22655 grep-2.21 (and git master): --null-data and ranges work in an= odd way (-P works fine) reassign 22655 grep submitter 22655 Sergei Trofimovich severity 22655 normal thanks From debbugs-submit-bounces@debbugs.gnu.org Sat Feb 13 18:23:34 2016 Received: (at submit) by debbugs.gnu.org; 13 Feb 2016 23:23:34 +0000 Received: from localhost ([127.0.0.1]:38413 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aUjXK-00089i-EI for submit@debbugs.gnu.org; Sat, 13 Feb 2016 18:23:34 -0500 Received: from eggs.gnu.org ([208.118.235.92]:46492) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aUjV1-000864-MH for submit@debbugs.gnu.org; Sat, 13 Feb 2016 18:21:12 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aUjUv-0000kO-NS for submit@debbugs.gnu.org; Sat, 13 Feb 2016 18:21:06 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:35692) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aUjUv-0000kK-Kf for submit@debbugs.gnu.org; Sat, 13 Feb 2016 18:21:05 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:37462) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aUjUu-0006p0-Re for bug-grep@gnu.org; Sat, 13 Feb 2016 18:21:05 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) 
(envelope-from ) id 1aUjUr-0000k5-GX for bug-grep@gnu.org; Sat, 13 Feb 2016 18:21:04 -0500 Received: from smtp.gentoo.org ([2001:470:ea4a:1:5054:ff:fec7:86e4]:47956) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aUjUr-0000jr-9V for bug-grep@gnu.org; Sat, 13 Feb 2016 18:21:01 -0500 Received: from sf (host81-129-87-168.range81-129.btcentralplus.com [81.129.87.168]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: slyfox) by smtp.gentoo.org (Postfix) with ESMTPSA id 11787340B69; Sat, 13 Feb 2016 23:20:57 +0000 (UTC) Date: Sat, 13 Feb 2016 23:20:52 +0000 From: Sergei Trofimovich To: bug-grep@gnu.org Subject: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) Message-ID: <20160213232052.7a5d14a2@sf> X-Mailer: Claws Mail 3.9.0 (GTK+ 2.24.29; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: multipart/signed; micalg=PGP-SHA1; boundary="Sig_/ZSFjhigyl7xRGfDrHzRyyF."; protocol="application/pgp-signature" X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.4 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Sat, 13 Feb 2016 18:23:32 -0500 Cc: Jim Meyering , skvadrik@gmail.com, Paul Eggert , ulm@gentoo.org, Norihiro Tanaka X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.4 (----) --Sig_/ZSFjhigyl7xRGfDrHzRyyF. 
Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: quoted-printable

The issue is found by Ulrich Mueller:

It seems DFA engine does not understand --null-data:

~/dev/git/grep $ cat a-test.sh
#!/bin/bash
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z '^[1234yz]*$' | wc -c
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234yz]*$' | wc -c
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z '^[1234y-z]*$' | wc -c
printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234y-z]*$' | wc -c
~/dev/git/grep $ ./a-test.sh
0
6
6
6

All 4 should return 6 but first is not correct.
It seems that 'y-z' range disables dfa.c code and works fine.

-- 
Sergei

--Sig_/ZSFjhigyl7xRGfDrHzRyyF. Content-Type: application/pgp-signature; name=signature.asc Content-Disposition: attachment; filename=signature.asc -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEARECAAYFAla/ulQACgkQcaHudmEf86pN1ACePKmX/3E6edjWcTHiLMUM05/g xJMAn0t2DXQhCnZ2hlel0hkOyzfMzBlU =SlSv -----END PGP SIGNATURE----- --Sig_/ZSFjhigyl7xRGfDrHzRyyF.-- From debbugs-submit-bounces@debbugs.gnu.org Sat Feb 13 19:09:27 2016 Received: (at submit) by debbugs.gnu.org; 14 Feb 2016 00:09:27 +0000 Received: from localhost ([127.0.0.1]:38423 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aUkFj-0000pH-Hr for submit@debbugs.gnu.org; Sat, 13 Feb 2016 19:09:27 -0500 Received: from eggs.gnu.org ([208.118.235.92]:55142) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aUkFi-0000p5-JE for submit@debbugs.gnu.org; Sat, 13 Feb 2016 19:09:26 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aUkFc-0003sR-Gm for submit@debbugs.gnu.org; Sat, 13 Feb 2016 19:09:21 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:60288)
by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aUkFc-0003sN-DP for submit@debbugs.gnu.org; Sat, 13 Feb 2016 19:09:20 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:46113) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aUkFb-0005va-Gh for bug-grep@gnu.org; Sat, 13 Feb 2016 19:09:20 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aUkFY-0003r7-99 for bug-grep@gnu.org; Sat, 13 Feb 2016 19:09:19 -0500 Received: from a1www.kph.uni-mainz.de ([134.93.134.1]:36052) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aUkFX-0003qS-W9 for bug-grep@gnu.org; Sat, 13 Feb 2016 19:09:16 -0500 Received: from a1i15.kph.uni-mainz.de (a1i15.kph.uni-mainz.de [134.93.134.92]) by a1www.kph.uni-mainz.de (8.14.9/8.14.7) with ESMTP id u1E08jcD023432; Sun, 14 Feb 2016 01:08:45 +0100 Received: from a1i15.kph.uni-mainz.de (localhost [127.0.0.1]) by a1i15.kph.uni-mainz.de (8.14.8/8.14.2) with ESMTP id u1E08joC027323; Sun, 14 Feb 2016 01:08:45 +0100 Received: (from ulm@localhost) by a1i15.kph.uni-mainz.de (8.14.8/8.14.8/Submit) id u1E08hMF027322; Sun, 14 Feb 2016 01:08:43 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> Date: Sun, 14 Feb 2016 01:08:43 +0100 To: Sergei Trofimovich Subject: Re: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) In-Reply-To: <20160213232052.7a5d14a2@sf> References: <20160213232052.7a5d14a2@sf> X-Mailer: VM 8.2.0b under 24.3.1 (x86_64-pc-linux-gnu) From: Ulrich Mueller X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit Cc: Jim Meyering , skvadrik@gmail.com, bug-grep@gnu.org, Norihiro Tanaka , Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org 
X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----)

>>>>> On Sat, 13 Feb 2016, Sergei Trofimovich wrote:

> The issue is found by Ulrich Mueller:
> It seems DFA engine does not understand --null-data:

> ~/dev/git/grep $ cat a-test.sh
> #!/bin/bash
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z '^[1234yz]*$' | wc -c
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234yz]*$' | wc -c
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -z '^[1234y-z]*$' | wc -c
> printf '12\n34\0' | LC_ALL=en_US.utf-8 src/grep -P -z '^[1234y-z]*$' | wc -c
> ~/dev/git/grep $ ./a-test.sh
> 0
> 6
> 6
> 6

> All 4 should return 6 but first is not correct.
> It seems that 'y-z' range disables dfa.c code and works fine.

Hm, I think it is the other way around. \n is a normal char, so the regex shouldn't match (and all four should return 0). In any case, behaviour should be the same for all of them.
Downstream bug: https://bugs.gentoo.org/show_bug.cgi?id=574662 Ulrich From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 14 16:58:14 2016 Received: (at submit) by debbugs.gnu.org; 14 Feb 2016 21:58:14 +0000 Received: from localhost ([127.0.0.1]:39122 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aV4gI-0002X4-Dc for submit@debbugs.gnu.org; Sun, 14 Feb 2016 16:58:14 -0500 Received: from eggs.gnu.org ([208.118.235.92]:43073) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aV2uB-0008If-Q0 for submit@debbugs.gnu.org; Sun, 14 Feb 2016 15:04:28 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aV2u5-0001zr-T5 for submit@debbugs.gnu.org; Sun, 14 Feb 2016 15:04:22 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=BAYES_20,FREEMAIL_FROM, HTML_MESSAGE,T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:59934) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aV2u5-0001zn-Pz for submit@debbugs.gnu.org; Sun, 14 Feb 2016 15:04:21 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:34025) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aV2u4-0005CY-Lr for bug-grep@gnu.org; Sun, 14 Feb 2016 15:04:21 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aV2u1-0001yw-En for bug-grep@gnu.org; Sun, 14 Feb 2016 15:04:20 -0500 Received: from mail-wm0-x236.google.com ([2a00:1450:400c:c09::236]:35467) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aV2u1-0001y1-33 for bug-grep@gnu.org; Sun, 14 Feb 2016 15:04:17 -0500 Received: by mail-wm0-x236.google.com with SMTP id c200so86640873wme.0 for ; Sun, 14 Feb 2016 12:04:16 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:to:references:cc:from:message-id:date:user-agent 
:mime-version:in-reply-to:content-type; bh=tA0Zk8CZT66p65K2UyvhiFPIAT5LfpE1FkN6WtBuvfM=; b=QLrBrBUzqnWSLRHwmXfRdaZcm8GXmNqqlXE6TbYVSk1UO7PD/aFWhwSAKOLfFqKqU0 OEyk1X6K0VlIzARjK0WggvmEfCDBwUaHpZgaLJGelQ5CMOkLaJSvbNq5Z6bdNeY55aTV YxuX1x0D1O8J41WvbyB2oiR3O7j9OaGopRPqcKScx94+vxMQpOL2A/Pcke5ykDF25eih ei7t2jqE5gD/6R47xsqRFFtMeW1XyjqfhWwpaB6TaVcmWcjAE9l+Mj69cLadZjF/cl4e CYfF926J+Gx3gVYpUI8piu1gp3hwbeN3S8zknLIcLNzj7Fafb6RjYlaX4hDMKLYZENP5 fDFg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:cc:from:message-id:date :user-agent:mime-version:in-reply-to:content-type; bh=tA0Zk8CZT66p65K2UyvhiFPIAT5LfpE1FkN6WtBuvfM=; b=f4v6+VeSdj6Cwq2odvxFY+qZODa+dab89WIc6jsV92+Q1WnJqRL9rGSyxPTl+/KXVN j58iPir4pA2qSnKV0C57JaOQrbtfmgN33a3vKiZ0jUrw9Z6yiuggpXIvLzt9UHT8OXmj DdWWHihy32sw95aPxIP8CbtrSU7mnr0dWLwM8c7Yww/ihEbdIfaCmryvZVP8K3nqq7zi W609vau98lOCqbpO1vszzN2wkzlqj/34ziYy9WfXiFvG8h3z0u4J4XfkWXEYC3Tsm8FY TqkO9PwAKkHzxHGQozVyOze/qA3imzdTxc/QWxXm+rz9jUN9t2F+JPTUI7D3MgMNlF/I O95w== X-Gm-Message-State: AG10YOSHrqbktIdWEe1xQpW2n9zFZCwiAuH+UsmzrWWaSjNK5mXYNDeaNGvGcIDgfWtZUg== X-Received: by 10.28.145.7 with SMTP id t7mr9079194wmd.98.1455480256234; Sun, 14 Feb 2016 12:04:16 -0800 (PST) Received: from [192.168.1.250] (host81-129-87-168.range81-129.btcentralplus.com. 
[81.129.87.168]) by smtp.googlemail.com with ESMTPSA id t3sm22023453wjz.11.2016.02.14.12.04.15 (version=TLSv1/SSLv3 cipher=OTHER); Sun, 14 Feb 2016 12:04:15 -0800 (PST) Subject: Re: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Ulrich Mueller , Sergei Trofimovich References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> From: Ulya Fokanova Message-ID: <56C0DD45.90600@gmail.com> Date: Sun, 14 Feb 2016 20:02:13 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> Content-Type: multipart/alternative; boundary="------------060509030208050405010709" X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Sun, 14 Feb 2016 16:58:13 -0500 Cc: Jim Meyering , bug-grep@gnu.org, Norihiro Tanaka , Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) This is a multi-part message in MIME format. --------------060509030208050405010709 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit I've explored the following case: $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c 6 It's a bug (there should be no match). 
This is what grep does:

* tries to build DFA (as in dfa.c)
* fails to expand character range [1-4] because of multibyte locale en_US.utf-8 and gives up building DFA (marks [1-4] as BACKREF that suppresses all dfa.c-related code), note the difference with [1234] case in which there's no need to expand multibyte range
* falls back to Regex (gnulib extension of regex.h)
* Regex doesn't support '-z' semantics (the closest configuration to '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP set), so '\n' is treated as newline and match erroneously succeeds

I think this should be worked around in grep: before calling 're_search' it should split the input string by 'eolbyte'.

The bug is also present with the PCRE engine:

$ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
6
$ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
6

Ulya

--------------060509030208050405010709 Content-Type: text/html; charset=windows-1252 Content-Transfer-Encoding: 7bit

I've explored the following case:
$ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
6
It's a bug (there should be no match).

This is what grep does:
  • tries to build DFA (as in dfa.c)
  • fails to expand character range [1-4] because of multibyte locale en_US.utf-8 and gives up building DFA (marks [1-4] as BACKREF that suppresses all dfa.c-related code), note the difference with [1234] case in which there's no need to expand multibyte range
  • falls back to Regex (gnulib extension of regex.h)
  • Regex doesn't support '-z' semantics (the closest configuration to '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP set), so '\n' is treated as newline and match erroneously succeeds
I think this should be worked around in grep: before calling 're_search' it should split the input string by 'eolbyte'.
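
A minimal sketch of the "split by eolbyte" idea, done outside grep rather than in its source. Swapping NUL and newline with tr turns each NUL-terminated record into an ordinary line, so ^ and $ anchor at record boundaries, and an embedded newline (now a NUL byte) can no longer satisfy them. The helper name grep_records is hypothetical, LC_ALL=C is an assumption made to keep the match byte-oriented, and none of this is the change that was applied to grep itself.

```shell
#!/bin/sh
# grep_records PATTERN: hypothetical helper, not part of grep.
# Reads NUL-terminated records on stdin and prints, NUL-terminated, the
# records that PATTERN matches, with ^ and $ anchored at record boundaries.
grep_records() {
    # 1. Swap NUL and newline: records become lines, embedded newlines
    #    become NUL bytes.
    # 2. Match line by line; -a makes grep treat the NUL-containing input
    #    as text instead of reporting it as a binary file.
    # 3. Swap back so the output uses the original separators.
    tr '\0\n' '\n\0' | LC_ALL=C grep -a -- "$1" | tr '\0\n' '\n\0'
}

# One record with an embedded newline: the anchored pattern does not match.
printf '12\n34\0' | grep_records '^[1-4]*$' | wc -c

# One record without a newline: it matches and is printed with its NUL.
printf '1234\0' | grep_records '^[1-4]*$' | wc -c
```

The first pipeline should report 0 bytes and the second 5 (four digits plus the terminating NUL), which is the behaviour argued for above.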

The bug is also present with the PCRE engine:
$ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
6
$ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
6
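
To compare the matchers side by side, the commands above can be wrapped in a small loop. This is only a sketch: count_match_bytes is a hypothetical helper, the numbers depend on the grep version and on whether it was built with PCRE support (errors are discarded, so a failing grep counts as 0), and the en_US.utf-8 locale is assumed to be installed. A count of 6 means the whole record was printed; 0 means nothing was printed.

```shell
#!/bin/sh
# count_match_bytes OPT PATTERN: hypothetical helper.
# Feeds one NUL-terminated record containing an embedded newline to
# "grep -z OPT PATTERN" and prints how many bytes grep wrote.
count_match_bytes() {
    # $1 is left unquoted on purpose so that an empty OPT disappears.
    n=$(printf '12\n34\0' |
        LC_ALL=en_US.utf-8 grep -z $1 -- "$2" 2>/dev/null | wc -c)
    # Arithmetic expansion strips any padding that wc adds.
    echo "$((n))"
}

for opt in '' -P; do
    for pat in '^[1234]*$' '^[1-4]*$'; do
        printf '%s\t%s\t%s\n' "${opt:-default}" "$pat" \
            "$(count_match_bytes "$opt" "$pat")"
    done
done
```

If all four rows agree, the matchers are at least consistent with each other, which is the minimum asked for earlier in this thread.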

Ulya

--------------060509030208050405010709-- From debbugs-submit-bounces@debbugs.gnu.org Sat Feb 20 23:19:48 2016 Received: (at 22655) by debbugs.gnu.org; 21 Feb 2016 04:19:48 +0000 Received: from localhost ([127.0.0.1]:35722 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXLUp-0003Wb-Q8 for submit@debbugs.gnu.org; Sat, 20 Feb 2016 23:19:48 -0500 Received: from mail-oi0-f46.google.com ([209.85.218.46]:36337) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXLUo-0003WN-1u for 22655@debbugs.gnu.org; Sat, 20 Feb 2016 23:19:46 -0500 Received: by mail-oi0-f46.google.com with SMTP id w5so36418501oie.3 for <22655@debbugs.gnu.org>; Sat, 20 Feb 2016 20:19:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=oQeqx2ezAHyZFUB7DOgjth7bU2CozplLya/npqiCkf4=; b=v46SgOEBGyx4hnXhGHCIZXB6Wirx2FUWPtvYlj7+somTJatmXyi5clxG/OiN7XUmN5 6YVwMItJWvSUNtJKwjB6MVfkFwLiVZqZlUkmBFI/+XBdX3bhQ2XRcIeK9aqU/wEnqD+U R6KLEngQBJ5STJA9zCZqbqvmKO0Glj78vMO3TXZy4YHrMVw3p22eEhV88goGQpbl0cwa 23hlgRi2z972Fla81ZK6WnSFUE4xwIhrOYT9kdFnvxP8bvlYu5AJ1bmGxnt5JtOqiXjz VnlMbv9CiNU2JzN0WCHIGYbo1c8xtIdA5bCTqfxNc61kIO5j1sgxcECC6RbeUDES/zC5 JhOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-type; bh=oQeqx2ezAHyZFUB7DOgjth7bU2CozplLya/npqiCkf4=; b=dUb84wk3bF4du0OD7/daRscv5fYCB1uCSTqAQpaVHLH2gXy9kWYBCLq6hHfMNlu9fK +MSP6kqmGAuvH590+esTnZOiLYM234FAo5yGSfbFsljPA5aObw7GmH1FGKzcvhmXXhtO 6O2gtnSV9ereFV/QhwHOrdsgzBjxwOuisDeAjMiIPw72bdu8O+Bu6PpNmq0Qs1+Fniz+ 4KzhjohD5uGJnNTm9KjI9RiPa0mF2D4RYFTAxaA9peT21F1Nr+CXMqzWEHiq1rmt9fTs w8rApQdVPzEUhHpb5az34gyYuTzDAxEDVD99Dra+AGOUp54fIVzcE/Z2Bt16mTNljwhV 6dzA== X-Gm-Message-State: 
AG10YOQlaTrPOtP1OIVH03eShVyxdx6jcZr1jhj1VaM9tnqUAYMROxCG9lRzCTPLLfyV4LAz8YVhlv93+qlJWw== X-Received: by 10.202.84.82 with SMTP id i79mr17161835oib.130.1456028380510; Sat, 20 Feb 2016 20:19:40 -0800 (PST) MIME-Version: 1.0 Received: by 10.202.64.134 with HTTP; Sat, 20 Feb 2016 20:19:20 -0800 (PST) In-Reply-To: <56C0DD45.90600@gmail.com> References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> From: Jim Meyering Date: Sat, 20 Feb 2016 20:19:20 -0800 X-Google-Sender-Auth: F2_4849VArOv4y3pC52Wr0S3v3k Message-ID: Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Ulya Fokanova Content-Type: multipart/mixed; boundary=001a113debb0674fd9052c40066f X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 22655 Cc: Jim Meyering , Ulrich Mueller , 22655@debbugs.gnu.org, Sergei Trofimovich , Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) --001a113debb0674fd9052c40066f Content-Type: text/plain; charset=UTF-8 On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova wrote: > I've explored the following case: > > $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c > 6 > > It's a bug (there should be no match). 
> > This is what grep does: > > * triesto build DFA (as indfa.c) > * fails to expand character range [1-4] because of multibyte > localeen_US.utf-8 and gives up building DFA(marks [1-4] as BACKREF > that suppressesall dfa.c-related code), note the difference with > [1234] casein whichthere's no need to expand multibyte range > * falls back to Regex (gnulib extension of regex.h) > * Regex doesn't support '-z'semantics(the closest configuration to > '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP > set), so '\n'is treated as newline and match erroneously succeeds > > I think this should be worked around in grep: before calling 're_search' it > should split the input string by 'eolbyte'. > > The bug also present with PCRE engine: > > $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c > 6 > $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c > 6 Thank you for the analysis and the report. I have fixed the regex-oriented problem with the attached patch, but not yet the case using -P -z (PCRE + --null-data): --001a113debb0674fd9052c40066f Content-Type: text/x-patch; charset=US-ASCII; name="0001-grep-z-avoid-erroneous-match-with-regexp-anchor-and-.patch" Content-Disposition: attachment; filename="0001-grep-z-avoid-erroneous-match-with-regexp-anchor-and-.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_ikw1ocr20 RnJvbSAzY2U4YjM5ZTMxMzdkM2NkY2Y4Y2VjODRkYzg5Nzg4MDM3ZTc2NzQyIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U2F0LCAyMCBGZWIgMjAxNiAxMjo1MDoyNyAtMDgwMApTdWJqZWN0OiBbUEFUQ0hdIGdyZXAgLXo6 IGF2b2lkIGVycm9uZW91cyBtYXRjaCB3aXRoIHJlZ2V4cCBhbmNob3IgYW5kIFxuIGluCiB0ZXh0 CgoqIHNyYy9kZmFzZWFyY2guYyAoRUdleGVjdXRlKTogQ2xlYXIgdGhlIG5ld2xpbmVfYW5jaG9y IGJpdCB3aGVuCmVvbGJ5dGUgaXMgbm90ICdcbicuCiogdGVzdHMvei1hbmNob3ItbmV3bGluZTog TmV3IGZpbGUuCiogdGVzdHMvTWFrZWZpbGUuYW0gKFRFU1RTKTogQWRkIGl0LgoqIE5FV1MgKEJ1 
ZyBmaXhlcyk6IERlc2NyaWJlIGl0LgpPcmlnaW5hbGx5IHJlcG9ydGVkIGJ5IFVscmljaCBNdWVs bGVyIGluCmh0dHBzOi8vYnVncy5nZW50b28ub3JnL3Nob3dfYnVnLmNnaT9pZD01NzQ2NjIKUmVw b3J0ZWQgdG8gdXMgYnkgU2VyZ2VpIFRyb2ZpbW92aWNoIGFzIGh0dHA6Ly9kZWJidWdzLmdudS5v cmcvMjI2NTUKLS0tCiBORVdTICAgICAgICAgICAgICAgICAgIHwgMTMgKysrKysrKysrKysrKwog c3JjL2RmYXNlYXJjaC5jICAgICAgICB8ICAxICsKIHRlc3RzL01ha2VmaWxlLmFtICAgICAgfCAg MyArKy0KIHRlc3RzL3otYW5jaG9yLW5ld2xpbmUgfCA0MyArKysrKysrKysrKysrKysrKysrKysr KysrKysrKysrKysrKysrKysrKysrCiA0IGZpbGVzIGNoYW5nZWQsIDU5IGluc2VydGlvbnMoKyks IDEgZGVsZXRpb24oLSkKIGNyZWF0ZSBtb2RlIDEwMDc1NSB0ZXN0cy96LWFuY2hvci1uZXdsaW5l CgpkaWZmIC0tZ2l0IGEvTkVXUyBiL05FV1MKaW5kZXggZmVjYTVjNS4uYWUyMzhiZSAxMDA2NDQK LS0tIGEvTkVXUworKysgYi9ORVdTCkBAIC0yLDYgKzIsMTkgQEAgR05VIGdyZXAgTkVXUyAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIC0qLSBvdXRsaW5lIC0qLQoKICogTm90ZXdv cnRoeSBjaGFuZ2VzIGluIHJlbGVhc2UgPy4/ICg/Pz8/LT8/LT8/KSBbP10KCisqKiBCdWcgZml4 ZXMKKworICBncmVwIC16IHdvdWxkIG1hdGNoIHN0cmluZ3MgaXQgc2hvdWxkIG5vdC4gIFRvIHRy aWdnZXIgdGhlIGJ1ZywgeW91J2QKKyAgaGF2ZSB0byB1c2UgYSByZWd1bGFyIGV4cHJlc3Npb24g aW5jbHVkaW5nIGFuIGFuY2hvciAoXiBvciAkKSBhbmQgYQorICBmZWF0dXJlIGxpa2UgYSByYW5n ZSBvciBhIGJhY2tyZWZlcmVuY2UsIGNhdXNpbmcgZ3JlcCB0byBmb3JlZ28gaXRzIERGQQorICBt YXRjaGVyIGFuZCByZXNvcnQgdG8gdXNpbmcgcmVfc2VhcmNoLiAgV2l0aCBhIG11bHRpYnl0ZSBs b2NhbGUsIHRoYXQKKyAgbWF0Y2hlciBjb3VsZCBtaXN0YWtlbmx5IG1hdGNoIGEgc3RyaW5nIGNv bnRhaW5pbmcgYSBuZXdsaW5lLgorICBGb3IgZXhhbXBsZSwgdGhpcyBjb21tYW5kOgorICAgIHBy aW50ZiAnYVxuYlwwJyB8IExDX0FMTD1lbl9VUy51dGYtOCBncmVwIC16ICdeW2EtYl0qYicKKyAg d291bGQgbWlzdGFrZW5seSBtYXRjaCBhbmQgcHJpbnQgYWxsIGZvdXIgaW5wdXQgYnl0ZXMuICBB ZnRlciB0aGUgZml4LAorICB0aGVyZSBpcyBubyBtYXRjaCwgYXMgZXhwZWN0ZWQuCisgIFtidWcg aW50cm9kdWNlZCBpbiBncmVwLTIuN10KKwoKICogTm90ZXdvcnRoeSBjaGFuZ2VzIGluIHJlbGVh c2UgMi4yMyAoMjAxNi0wMi0wNCkgW3N0YWJsZV0KCmRpZmYgLS1naXQgYS9zcmMvZGZhc2VhcmNo LmMgYi9zcmMvZGZhc2VhcmNoLmMKaW5kZXggZTA0YTJkZi4uZDM0OGQ0NCAxMDA2NDQKLS0tIGEv 
c3JjL2RmYXNlYXJjaC5jCisrKyBiL3NyYy9kZmFzZWFyY2guYwpAQCAtMzQyLDYgKzM0Miw3IEBA IEVHZXhlY3V0ZSAoY2hhciAqYnVmLCBzaXplX3Qgc2l6ZSwgc2l6ZV90ICptYXRjaF9zaXplLAog ICAgICAgZm9yIChpID0gMDsgaSA8IHBjb3VudDsgaSsrKQogICAgICAgICB7CiAgICAgICAgICAg cGF0dGVybnNbaV0ucmVnZXhidWYubm90X2VvbCA9IDA7CisgICAgICAgICAgcGF0dGVybnNbaV0u cmVnZXhidWYubmV3bGluZV9hbmNob3IgPSBlb2xieXRlID09ICdcbic7CiAgICAgICAgICAgc3Rh cnQgPSByZV9zZWFyY2ggKCYocGF0dGVybnNbaV0ucmVnZXhidWYpLAogICAgICAgICAgICAgICAg ICAgICAgICAgICAgICBiZWcsIGVuZCAtIGJlZyAtIDEsCiAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgIHB0ciAtIGJlZywgZW5kIC0gcHRyIC0gMSwKZGlmZiAtLWdpdCBhL3Rlc3RzL01ha2Vm aWxlLmFtIGIvdGVzdHMvTWFrZWZpbGUuYW0KaW5kZXggYTM4MzAzYy4uNWEyYzBmMCAxMDA2NDQK LS0tIGEvdGVzdHMvTWFrZWZpbGUuYW0KKysrIGIvdGVzdHMvTWFrZWZpbGUuYW0KQEAgLTE0MSw3 ICsxNDEsOCBAQCBURVNUUyA9CQkJCQkJXAogICB3b3JkLWRlbGltLW11bHRpYnl0ZQkJCQlcCiAg IHdvcmQtbXVsdGktZmlsZQkJCQlcCiAgIHdvcmQtbXVsdGlieXRlCQkJCVwKLSAgeWVzbm8KKyAg eWVzbm8JCQkJCQlcCisgIHotYW5jaG9yLW5ld2xpbmUKCiBFWFRSQV9ESVNUID0JCQkJCVwKICAg JChURVNUUykJCQkJCVwKZGlmZiAtLWdpdCBhL3Rlc3RzL3otYW5jaG9yLW5ld2xpbmUgYi90ZXN0 cy96LWFuY2hvci1uZXdsaW5lCm5ldyBmaWxlIG1vZGUgMTAwNzU1CmluZGV4IDAwMDAwMDAuLmI0 ZGZlYmMKLS0tIC9kZXYvbnVsbAorKysgYi90ZXN0cy96LWFuY2hvci1uZXdsaW5lCkBAIC0wLDAg KzEsNDMgQEAKKyMhL2Jpbi9zaAorIyBncmVwIC16IHdpdGggYW4gYW5jaG9yIGluIHRoZSByZWdl eCBjb3VsZCBtaXN0YWtlbmx5IG1hdGNoIHRleHQKKyMgaW5jbHVkaW5nIGEgbmV3bGluZS4KKwor IyBDb3B5cmlnaHQgMjAxNiBGcmVlIFNvZnR3YXJlIEZvdW5kYXRpb24sIEluYy4KKworIyBUaGlz IHByb2dyYW0gaXMgZnJlZSBzb2Z0d2FyZTogeW91IGNhbiByZWRpc3RyaWJ1dGUgaXQgYW5kL29y IG1vZGlmeQorIyBpdCB1bmRlciB0aGUgdGVybXMgb2YgdGhlIEdOVSBHZW5lcmFsIFB1YmxpYyBM aWNlbnNlIGFzIHB1Ymxpc2hlZCBieQorIyB0aGUgRnJlZSBTb2Z0d2FyZSBGb3VuZGF0aW9uLCBl aXRoZXIgdmVyc2lvbiAzIG9mIHRoZSBMaWNlbnNlLCBvcgorIyAoYXQgeW91ciBvcHRpb24pIGFu eSBsYXRlciB2ZXJzaW9uLgorCisjIFRoaXMgcHJvZ3JhbSBpcyBkaXN0cmlidXRlZCBpbiB0aGUg aG9wZSB0aGF0IGl0IHdpbGwgYmUgdXNlZnVsLAorIyBidXQgV0lUSE9VVCBBTlkgV0FSUkFOVFk7 
IHdpdGhvdXQgZXZlbiB0aGUgaW1wbGllZCB3YXJyYW50eSBvZgorIyBNRVJDSEFOVEFCSUxJVFkg b3IgRklUTkVTUyBGT1IgQSBQQVJUSUNVTEFSIFBVUlBPU0UuICBTZWUgdGhlCisjIEdOVSBHZW5l cmFsIFB1YmxpYyBMaWNlbnNlIGZvciBtb3JlIGRldGFpbHMuCisKKyMgWW91IHNob3VsZCBoYXZl IHJlY2VpdmVkIGEgY29weSBvZiB0aGUgR05VIEdlbmVyYWwgUHVibGljIExpY2Vuc2UKKyMgYWxv bmcgd2l0aCB0aGlzIHByb2dyYW0uICBJZiBub3QsIHNlZSA8aHR0cDovL3d3dy5nbnUub3JnL2xp Y2Vuc2VzLz4uCisKKy4gIiR7c3JjZGlyPS59L2luaXQuc2giOyBwYXRoX3ByZXBlbmRfIC4uL3Ny YworCityZXF1aXJlX2VuX3V0ZjhfbG9jYWxlXworcmVxdWlyZV9jb21waWxlZF9pbl9NQl9zdXBw b3J0CitMQ19BTEw9ZW5fVVMuVVRGLTgKKworcHJpbnRmICdhXG5iXDAnID4gaW4gfHwgZnJhbWV3 b3JrX2ZhaWx1cmVfCisKK2ZhaWw9MAorCitlbnYgPiAvdC94CisjIFRoZXNlIHRocmVlIHdvdWxk IGFsbCBtaXN0YWtlbmx5IG1hdGNoLCBiZWNhdXNlIHRoZSBbYS1iXSByYW5nZQorIyBmb3JjZWQg dGhlIG5vbi1ERkEgKHJlZ2V4cC11c2luZykgY29kZSBwYXRoLgorcmV0dXJuc18gMSBncmVwIC16 ICdeW2EtYl0qJCcgaW4gfHwgZmFpbD0xCityZXR1cm5zXyAxIGdyZXAgLXogJ2FbYS1iXSokJyBp biB8fCBmYWlsPTEKK3JldHVybnNfIDEgZ3JlcCAteiAnXlthLWJdKmInIGluIHx8IGZhaWw9MQor CisjIFRlc3QgdGhlc2UgZm9yIGdvb2QgbWVhc3VyZTsgdGhleSBleGVyY2lzZSB0aGUgREZBIGNv ZGUgcGF0aAorIyBhbmQgYWx3YXlzIHdvcmtlZAorcmV0dXJuc18gMSBncmVwIC16ICdeW2FiXSok JyBpbiB8fCBmYWlsPTEKK3JldHVybnNfIDEgZ3JlcCAteiAnYVthYl0qJCcgaW4gfHwgZmFpbD0x CityZXR1cm5zXyAxIGdyZXAgLXogJ15bYWJdKmInIGluIHx8IGZhaWw9MQorCitFeGl0ICRmYWls Ci0tIAoyLjYuNAoK --001a113debb0674fd9052c40066f-- From debbugs-submit-bounces@debbugs.gnu.org Sat Feb 20 23:23:32 2016 Received: (at 22655) by debbugs.gnu.org; 21 Feb 2016 04:23:32 +0000 Received: from localhost ([127.0.0.1]:35726 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXLYS-0003ci-EZ for submit@debbugs.gnu.org; Sat, 20 Feb 2016 23:23:32 -0500 Received: from mail-ob0-f169.google.com ([209.85.214.169]:36524) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXLYQ-0003cR-Pk for 22655@debbugs.gnu.org; Sat, 20 Feb 2016 23:23:31 -0500 Received: by mail-ob0-f169.google.com with SMTP id gc3so139106729obb.3 for 
<22655@debbugs.gnu.org>; Sat, 20 Feb 2016 20:23:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=gLbNX7m03QT7e7pcmkcJ4R6S8zqE1Efpon+zg7EKDb0=; b=u1oNsfQ5lRnQGHOfA/PAUjPuCtCY6d8RisF7VZTxQFhmqy+FlNMXOCLqSMEJ5hC45G W/xgob+l9wAzpKQaA0c7JnH23lFp0IXyRYyDZtaPcITziJhf1JsbrWsTmV8q1YmtprX8 QhQN9+zNOeNJuA/YLkp1Zn/eObDul08mdkUOx2NLgLhupmjHyqBWpt+fQ7a4uuY7ZoPt c2ILmWdYnY3NeqB4zs2UwasL+g4fkl2KE110GOJBX+WWC3zkEpu2oTgckkaDmmopJO5k QaNzAVVZNoOJ0A4YPegMPgGW7g+wZCqoCBOe9BciG1/WoaNHqhoXoJbBCe9J4SQqj9g3 fv5Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-type; bh=gLbNX7m03QT7e7pcmkcJ4R6S8zqE1Efpon+zg7EKDb0=; b=NcKv/Nj2d1Mke0G5pu2NlZNm9Nljyglk3hiblSRmuKY9Rd6cATyfICRtmiLYQbPLzf WwSV5giy0/alIzvoOmeuMQC2VHeU5QIcfnJspXKqRRJZxSmwVrA5EARyUygDhHL/fyHd XRLiIs/5/zmamTcN/3zZQPNZ5XfGyF/Wy3oTpKbtiYeaY+oB+8S/gsBAwiYAZohg7x/h 9MwfcvJwzc+Dm8TUnw3TizqW7wOXohW/8xo2DdHyg1L88nO+VJuBmV2L3KuvgWwSR4pl oLpXFIbgAWbHYtsTEw/FJKBwkj+w4zHtGkCo52xbgixFFTCIczMqUXw3qxaM9G7Oqi60 SHpQ== X-Gm-Message-State: AG10YOS6SYSHNRQXyP2P8wZq4kILCHN1k4ZdtwXAnhAAdu6oLEnfN3Ov8RZ3/uir58Hekv2fSFhiSZ4lDFlklQ== X-Received: by 10.60.246.37 with SMTP id xt5mr18370703oec.72.1456028605040; Sat, 20 Feb 2016 20:23:25 -0800 (PST) MIME-Version: 1.0 Received: by 10.202.64.134 with HTTP; Sat, 20 Feb 2016 20:23:05 -0800 (PST) In-Reply-To: References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> From: Jim Meyering Date: Sat, 20 Feb 2016 20:23:05 -0800 X-Google-Sender-Auth: aRyktzuvoj-4o-dZrIPpGKFet4g Message-ID: Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Ulya Fokanova Content-Type: multipart/mixed; 
boundary=001a1136988cc96681052c4013ad X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 22655 Cc: Jim Meyering , Ulrich Mueller , 22655@debbugs.gnu.org, Sergei Trofimovich , Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) --001a1136988cc96681052c4013ad Content-Type: text/plain; charset=UTF-8 On Sat, Feb 20, 2016 at 8:19 PM, Jim Meyering wrote: ... >> The bug also present with PCRE engine: >> >> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c >> 6 >> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c >> 6 > > Thank you for the analysis and the report. > I have fixed the regex-oriented problem with the attached > patch, but not yet the case using -P -z (PCRE + --null-data): FTR, I've also pushed the attached test-improving patch: --001a1136988cc96681052c4013ad Content-Type: text/x-patch; charset=US-ASCII; name="0001-tests-convert-cmd-fail-1-to-returns_-1-cmd-fail-1.patch" Content-Disposition: attachment; filename="0001-tests-convert-cmd-fail-1-to-returns_-1-cmd-fail-1.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_ikw1t6ll1 RnJvbSAxMjFkNjM2NjZmZWQyOTg3NjlhMTkzNTRlNTE3ZmQ5YTcxOTBhM2EyIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U2F0LCAyMCBGZWIgMjAxNiAxNzoxOTo0NCAtMDgwMApTdWJqZWN0OiBbUEFUQ0ggMS8yXSB0ZXN0 czogY29udmVydCAiY21kICYmIGZhaWw9MSIgdG8gInJldHVybnNfIDEgY21kIHx8CiBmYWlsPTEi CgpUaGUgbGF0dGVyIGlzIHJvYnVzdCwgd2hpbGUgdGhlIGZvcm1lciBjYW4gc2lsZW50bHkgaWdu b3JlCmZhaWx1cmUgZHVlIHRvIHNpZ25hbHMuCiogY2ZnLm1rIChzY19wcm9oaWJpdF9hbmRfZmFp bF8xKTogTmV3IHJ1bGUsIGNvcGllZCBmcm9tIGNvcmV1dGlscy4KKiB0ZXN0cy9sb25nLXBhdHRl cm4tcGVyZjogUGVyZm9ybSB0aGUgYWJvdmUgc3Vic3RpdHV0aW9uLgoqIHRlc3RzL21iLW5vbi1V 
VEY4LXBlcmZvcm1hbmNlOiBMaWtld2lzZS4KKiB0ZXN0cy9oZWxwLXZlcnNpb246IE1lcmdlIGZy b20gY29yZXV0aWxzLgotLS0KIGNmZy5tayAgICAgICAgICAgICAgICAgICAgICAgIHwgMTEgKysr KysrKysrCiB0ZXN0cy9oZWxwLXZlcnNpb24gICAgICAgICAgICB8IDUzICsrKysrKysrKysrKysr KysrKysrLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KIHRlc3RzL2xvbmctcGF0dGVybi1wZXJmICAg ICAgIHwgIDIgKy0KIHRlc3RzL21iLW5vbi1VVEY4LXBlcmZvcm1hbmNlIHwgIDIgKy0KIDQgZmls ZXMgY2hhbmdlZCwgMzggaW5zZXJ0aW9ucygrKSwgMzAgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0 IGEvY2ZnLm1rIGIvY2ZnLm1rCmluZGV4IDljODBiMmEuLjNmNGVmMzIgMTAwNjQ0Ci0tLSBhL2Nm Zy5taworKysgYi9jZmcubWsKQEAgLTEyMCw2ICsxMjAsMTcgQEAgc2NfVEhBTktTX2luX2R1cGxp Y2F0ZXM6CiAJICAgICYmIHsgZWNobyAnJChNRSk6IHJlbW92ZSB0aGUgYWJvdmUgbmFtZXMgZnJv bSBUSEFOS1MuaW4nCVwKIAkJICAxPiYyOyBleGl0IDE7IH0gfHwgOgoKKyMgRW5zdXJlIHRoYXQg dGVzdHMgZG9uJ3QgdXNlIGBjbWQgLi4uICYmIGZhaWw9MWAgYXMgdGhhdCBoaWRlcyBjcmFzaGVz LgorIyBUaGUgImV4Y2x1ZGUiIGV4cHJlc3Npb24gYWxsb3dzIGNvbW1vbiBpZGlvbXMgbGlrZSBg dGVzdCAuLi4gJiYgZmFpbD0xYAorIyBhbmQgdGhlIDI+Li4uIHBvcnRpb24gYWxsb3dzIGNvbW1h bmRzIHRoYXQgcmVkaXJlY3Qgc3RkZXJyIGFuZCBzbyBwcm9iYWJseQorIyBpbmRlcGVuZGVudGx5 IGNoZWNrIGl0cyBjb250ZW50cyBhbmQgdGh1cyBkZXRlY3QgYW55IGNyYXNoIG1lc3NhZ2VzLgor c2NfcHJvaGliaXRfYW5kX2ZhaWxfMToKKwlAcHJvaGliaXQ9JyYmIGZhaWw9MScJCQkJCQlcCisJ ZXhjbHVkZT0nKHN0YXR8a2lsbHx0ZXN0IHxFR1JFUHxncmVwfGNvbXBhcmV8Mj4gKlteL10pJwkJ XAorCWhhbHQ9JyYmIGZhaWw9MSBkZXRlY3RlZC4gUGxlYXNlIHVzZTogcmV0dXJuc18gMSAuLi4g fHwgZmFpbD0xJwlcCisJaW5fdmNfZmlsZXM9J150ZXN0cy8nCQkJCQkJXAorCSAgJChfc2Nfc2Vh cmNoX3JlZ2V4cCkKKwogdXBkYXRlLWNvcHlyaWdodC1lbnYgPSBcCiAgIFVQREFURV9DT1BZUklH SFRfVVNFX0lOVEVSVkFMUz0xIFwKICAgVVBEQVRFX0NPUFlSSUdIVF9NQVhfTElORV9MRU5HVEg9 NzkKZGlmZiAtLWdpdCBhL3Rlc3RzL2hlbHAtdmVyc2lvbiBiL3Rlc3RzL2hlbHAtdmVyc2lvbgpp bmRleCA1OWVjYWViLi45ZDliMTQ3IDEwMDc1NQotLS0gYS90ZXN0cy9oZWxwLXZlcnNpb24KKysr IGIvdGVzdHMvaGVscC12ZXJzaW9uCkBAIC0xLDUgKzEsNSBAQAotIyEgL2Jpbi9zaAotIyBNYWtl IHN1cmUgYWxsIHRoZXNlIHByb2dyYW1zIHdvcmsgcHJvcGVybHkKKyMhL2Jpbi9zaAorIyBNYWtl 
IHN1cmUgYWxsIG9mIHRoZXNlIHByb2dyYW1zIHdvcmsgcHJvcGVybHkKICMgd2hlbiBpbnZva2Vk IHdpdGggLS1oZWxwIG9yIC0tdmVyc2lvbi4KCiAjIENvcHlyaWdodCAoQykgMjAwMC0yMDE2IEZy ZWUgU29mdHdhcmUgRm91bmRhdGlvbiwgSW5jLgpAQCAtMTcsMjAgKzE3LDE2IEBACiAjIFlvdSBz aG91bGQgaGF2ZSByZWNlaXZlZCBhIGNvcHkgb2YgdGhlIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNl bnNlCiAjIGFsb25nIHdpdGggdGhpcyBwcm9ncmFtLiAgSWYgbm90LCBzZWUgPGh0dHA6Ly93d3cu Z251Lm9yZy9saWNlbnNlcy8+LgoKLSMgRW5zdXJlIHRoYXQgJFNIRUxMIGlzIHNldCB0byAqc29t ZSogdmFsdWUgYW5kIGV4cG9ydGVkLgotIyBUaGlzIGlzIHJlcXVpcmVkIGZvciBkaXJjb2xvcnMs IHdoaWNoIHdvdWxkIGZhaWwgZS5nLiwgd2hlbgotIyBpbnZva2VkIHZpYSBkZWJ1aWxkICh3aGlj aCByZW1vdmVzIFNIRUxMIGZyb20gdGhlIGVudmlyb25tZW50KS4KLXRlc3QgIngkU0hFTEwiID0g eCAmJiBTSEVMTD0vYmluL3NoCi1leHBvcnQgU0hFTEwKLQogLiAiJHtzcmNkaXI9Ln0vaW5pdC5z aCI7IHBhdGhfcHJlcGVuZF8gLi4vc3JjCgorIyBUZXJtaW5hdGUgYW55IGJhY2tncm91bmQgcHJv Y2Vzc2VzCitjbGVhbnVwXygpIHsga2lsbCAkcGlkIDI+L2Rldi9udWxsICYmIHdhaXQgJHBpZDsg fQorCiBleHBlY3RlZF9mYWlsdXJlX3N0YXR1c19jaHJvb3Q9MTI1CiBleHBlY3RlZF9mYWlsdXJl X3N0YXR1c19lbnY9MTI1CiBleHBlY3RlZF9mYWlsdXJlX3N0YXR1c19uaWNlPTEyNQogZXhwZWN0 ZWRfZmFpbHVyZV9zdGF0dXNfbm9odXA9MTI1CiBleHBlY3RlZF9mYWlsdXJlX3N0YXR1c19zdGRi dWY9MTI1Ci1leHBlY3RlZF9mYWlsdXJlX3N0YXR1c19zdT0xMjUKIGV4cGVjdGVkX2ZhaWx1cmVf c3RhdHVzX3RpbWVvdXQ9MTI1CiBleHBlY3RlZF9mYWlsdXJlX3N0YXR1c19wcmludGVudj0yCiBl eHBlY3RlZF9mYWlsdXJlX3N0YXR1c190dHk9MwpAQCAtNzUsMTQgKzcxLDExIEBAIGZvciBsYW5n IGluIEMgZnIgZGE7IGRvCiAgIGZvciBpIGluICRidWlsdF9wcm9ncmFtczsgZG8KCiAgICAgIyBT a2lwICd0ZXN0JzsgaXQgZG9lc24ndCBhY2NlcHQgLS1oZWxwIG9yIC0tdmVyc2lvbi4KLSAgICB0 ZXN0ICRpID0gdGVzdCAmJiBjb250aW51ZTsKKyAgICB0ZXN0ICRpID0gdGVzdCAmJiBjb250aW51 ZQoKICAgICAjIGZhbHNlIGZhaWxzIGV2ZW4gd2hlbiBpbnZva2VkIHdpdGggLS1oZWxwIG9yIC0t dmVyc2lvbi4KLSAgICBpZiB0ZXN0ICRpID0gZmFsc2U7IHRoZW4KLSAgICAgIGVudiBMQ19NRVNT QUdFUz0kbGFuZyAkaSAtLWhlbHAgPi9kZXYvbnVsbCAmJiBmYWlsPTEKLSAgICAgIGVudiBMQ19N RVNTQUdFUz0kbGFuZyAkaSAtLXZlcnNpb24gPi9kZXYvbnVsbCAmJiBmYWlsPTEKLSAgICAgIGNv 
bnRpbnVlCi0gICAgZmkKKyAgICAjIHRydWUgYW5kIGZhbHNlIGFyZSB0ZXN0ZWQgd2l0aCB0aGVz ZSBvcHRpb25zIHNlcGFyYXRlbHkuCisgICAgdGVzdCAkaSA9IGZhbHNlIHx8IHRlc3QgJGkgPSB0 cnVlICYmIGNvbnRpbnVlCgogICAgICMgVGhlIGp1c3QtYnVpbHQgaW5zdGFsbCBleGVjdXRhYmxl IGlzIGFsd2F5cyBuYW1lZCAnZ2luc3RhbGwnLgogICAgIHRlc3QgJGkgPSBpbnN0YWxsICYmIGk9 Z2luc3RhbGwKQEAgLTk3LDE5ICs5MCwxOSBAQCBmb3IgbGFuZyBpbiBDIGZyIGRhOyBkbwoKICAg ICAjIE1ha2Ugc3VyZSB0aGV5IGZhaWwgdXBvbiAnZGlzayBmdWxsJyBlcnJvci4KICAgICBpZiB0 ZXN0IC13IC9kZXYvZnVsbCAmJiB0ZXN0IC1jIC9kZXYvZnVsbDsgdGhlbgotICAgICAgZW52ICRp IC0taGVscCAgICA+L2Rldi9mdWxsIDI+L2Rldi9udWxsICYmIGZhaWw9MQotICAgICAgZW52ICRp IC0tdmVyc2lvbiA+L2Rldi9mdWxsIDI+L2Rldi9udWxsICYmIGZhaWw9MQotICAgICAgc3RhdHVz PSQ/Ci0gICAgICB0ZXN0ICRpID0gWyAmJiBwcm9nPWxicmFja2V0IHx8IHByb2c9JGkKKyAgICAg IHRlc3QgJGkgPSBbICYmIHByb2c9bGJyYWNrZXQgfHwgcHJvZz0kKGVjaG8gJGl8c2VkICJzLyRF WEVFWFQkLy8iKQogICAgICAgZXZhbCAiZXhwZWN0ZWQ9XCRleHBlY3RlZF9mYWlsdXJlX3N0YXR1 c18kcHJvZyIKICAgICAgIHRlc3QgeCRleHBlY3RlZCA9IHggJiYgZXhwZWN0ZWQ9MQotICAgICAg aWYgdGVzdCAkc3RhdHVzID0gJGV4cGVjdGVkOyB0aGVuCi0gICAgICAgIDogIyBvawotICAgICAg ZWxzZQorCisgICAgICByZXR1cm5zXyAkZXhwZWN0ZWQgZW52ICRpIC0taGVscCAgICA+L2Rldi9m dWxsIDI+L2Rldi9udWxsICYmCisgICAgICByZXR1cm5zXyAkZXhwZWN0ZWQgZW52ICRpIC0tdmVy c2lvbiA+L2Rldi9mdWxsIDI+L2Rldi9udWxsIHx8CisgICAgICB7CiAgICAgICAgIGZhaWw9MQor ICAgICAgICBlbnYgJGkgLS1oZWxwID4vZGV2L2Z1bGwgMj4vZGV2L251bGwKKyAgICAgICAgc3Rh dHVzPSQ/CiAgICAgICAgIGVjaG8gIioqKiAkaTogYmFkIGV4aXQgc3RhdHVzICckc3RhdHVzJyAo ZXhwZWN0ZWQgJGV4cGVjdGVkKSwiIDE+JjIKICAgICAgICAgZWNobyAiICB3aXRoIC0taGVscCBv ciAtLXZlcnNpb24gb3V0cHV0IHJlZGlyZWN0ZWQgdG8gL2Rldi9mdWxsIiAxPiYyCi0gICAgICBm aQorICAgICAgfQogICAgIGZpCiAgIGRvbmUKIGRvbmUKQEAgLTE3Nyw2ICsxNzAsNyBAQCBsbl9z ZXR1cCAoKSB7IGFyZ3M9IiR0bXBfaW4gbG4tdGFyZ2V0IjsgfQogZ2luc3RhbGxfc2V0dXAgKCkg eyBhcmdzPSIkdG1wX2luICR0bXBfaW4yIjsgfQogbXZfc2V0dXAgKCkgeyBhcmdzPSIkdG1wX2lu ICR0bXBfaW4yIjsgfQogbWtkaXJfc2V0dXAgKCkgeyBhcmdzPSR0bXBfZGlyL3N1YmRpcjsgfQor 
cmVhbHBhdGhfc2V0dXAgKCkgeyBhcmdzPSR0bXBfaW47IH0KIHJtZGlyX3NldHVwICgpIHsgYXJn cz0kdG1wX2RpcjsgfQogcm1fc2V0dXAgKCkgeyBhcmdzPSR0bXBfaW47IH0KIHNocmVkX3NldHVw ICgpIHsgYXJncz0kdG1wX2luOyB9CkBAIC0yMDcsNyArMjAxLDYgQEAgbm9odXBfc2V0dXAgKCkg eyBhcmdzPS0tdmVyc2lvbjsgfQogcHJpbnRmX3NldHVwICgpIHsgYXJncz1mb287IH0KIHNlcV9z ZXR1cCAoKSB7IGFyZ3M9MTA7IH0KIHNsZWVwX3NldHVwICgpIHsgYXJncz0wOyB9Ci1zdV9zZXR1 cCAoKSB7IGFyZ3M9LS12ZXJzaW9uOyB9CiBzdGRidWZfc2V0dXAgKCkgeyBhcmdzPSItb0wgdHJ1 ZSI7IH0KIHRpbWVvdXRfc2V0dXAgKCkgeyBhcmdzPS0tdmVyc2lvbjsgfQoKQEAgLTIyNiw4ICsy MTksOSBAQCBpZF9zZXR1cCAoKSB7IGFyZ3M9LXU7IH0KCiAjIFVzZSBlbnYgdG8gYXZvaWQgaW52 b2tpbmcgYnVpbHQtaW4gc2xlZXAgb2YgU29sYXJpcyAxMSdzIC9iaW4vc2guCiBraWxsX3NldHVw ICgpIHsKLSAgZW52IHNsZWVwIDEwbSAmCi0gIGFyZ3M9JCEKKyAgZXh0ZXJuYWw9ZW52CisgICRl eHRlcm5hbCBzbGVlcCAxMG0gJiBwaWQ9JCEKKyAgYXJncz0kcGlkCiB9CgogbGlua19zZXR1cCAo KSB7IGFyZ3M9IiR0bXBfaW4gbGluay10YXJnZXQiOyB9CkBAIC0yNDIsMTEgKzIzNiwxNCBAQCBz dGF0X3NldHVwICgpIHsgYXJncz0kdG1wX2luOyB9CiB1bmxpbmtfc2V0dXAgKCkgeyBhcmdzPSR0 bXBfaW47IH0KIGxicmFja2V0X3NldHVwICgpIHsgYXJncz0iOiBdIjsgfQoKK3BhcnRlZF9zZXR1 cCAoKSB7IGFyZ3M9Ii1zICR0bXBfaW4gbWtsYWJlbCBncHQiCisgIGRkIGlmPS9kZXYvbnVsbCBv Zj0kdG1wX2luIHNlZWs9MjAwMDsgfQorCiAjIEVuc3VyZSB0aGF0IGVhY2ggcHJvZ3JhbSAid29y a3MiIChleGl0cyBzdWNjZXNzZnVsbHkpIHdoZW4gZG9pbmcKICMgc29tZXRoaW5nIG1vcmUgdGhh biAtLWhlbHAgb3IgLS12ZXJzaW9uLgogZm9yIGkgaW4gJGJ1aWx0X3Byb2dyYW1zOyBkbwogICAj IFNraXAgdGhlc2UuCi0gIGNhc2UgJGkgaW4gY2hyb290fHN0dHl8dHR5fGZhbHNlfGNoY29ufHJ1 bmNvbikgY29udGludWU7OyBlc2FjCisgIGNhc2UgJGkgaW4gY2hyb290fHN0dHl8dHR5fGZhbHNl fGNoY29ufHJ1bmNvbnxjb3JldXRpbHMpIGNvbnRpbnVlOzsgZXNhYwoKICAgcm0gLXJmICR0bXBf aW4gJHRtcF9pbjIgJHRtcF9kaXIgJHRtcF9vdXQgJGJpZ1pfaW4gJHppbiAkemluMgogICBlY2hv IHogfGd6aXAgPiAkemluCkBAIC0yNjEsNyArMjU4LDcgQEAgZm9yIGkgaW4gJGJ1aWx0X3Byb2dy YW1zOyBkbwogICBjcCAkdG1wX2luICR0bXBfaW4yCiAgIG1rZGlyICR0bXBfZGlyCiAgICMgZWNo byA9PT09PT09PT09PT09PT09PT0gJGkKLSAgdGVzdCAkaSA9IFsgJiYgcHJvZz1sYnJhY2tldCB8 
fCBwcm9nPSRpCisgIHRlc3QgJGkgPSBbICYmIHByb2c9bGJyYWNrZXQgfHwgcHJvZz0kKGVjaG8g JGl8c2VkICJzLyRFWEVFWFQkLy8iKQogICBpZiB0eXBlICR7cHJvZ31fc2V0dXAgPiAvZGV2L251 bGwgMj4mMTsgdGhlbgogICAgICR7cHJvZ31fc2V0dXAKICAgZWxzZQpkaWZmIC0tZ2l0IGEvdGVz dHMvbG9uZy1wYXR0ZXJuLXBlcmYgYi90ZXN0cy9sb25nLXBhdHRlcm4tcGVyZgppbmRleCBiODhj MDNjLi4wNTQ4ZjM0IDEwMDc1NQotLS0gYS90ZXN0cy9sb25nLXBhdHRlcm4tcGVyZgorKysgYi90 ZXN0cy9sb25nLXBhdHRlcm4tcGVyZgpAQCAtMzgsNiArMzgsNiBAQCBiMTB4X21zPSQodXNlcl90 aW1lXyAxIGdyZXAgLWYgcmUtMTB4IGluKSB8fCBmYWlsPTEKICMgSW5jcmVhc2luZyB0aGUgbGVu Z3RoIG9mIHRoZSByZWd1bGFyIGV4cHJlc3Npb24gYnkgYSBmYWN0b3IKICMgb2YgMTAgc2hvdWxk IGNhdXNlIG5vIG1vcmUgdGhhbiBhIDEweCBpbmNyZWFzZSBpbiBkdXJhdGlvbi4KICMgSG93ZXZl ciwgd2UnbGwgZHJhdyB0aGUgbGluZSBhdCAyMHggdG8gYXZvaWQgZmFsc2UtcG9zaXRpdmVzLgot ZXhwciAkYmFzZV9tcyAnPCcgJGIxMHhfbXMgLyAyMCAmJiBmYWlsPTEKK3JldHVybnNfIDEgZXhw ciAkYmFzZV9tcyAnPCcgJGIxMHhfbXMgLyAyMCB8fCBmYWlsPTEKCiBFeGl0ICRmYWlsCmRpZmYg LS1naXQgYS90ZXN0cy9tYi1ub24tVVRGOC1wZXJmb3JtYW5jZSBiL3Rlc3RzL21iLW5vbi1VVEY4 LXBlcmZvcm1hbmNlCmluZGV4IDNmOTIwOWMuLjExZDBjZjcgMTAwNzU1Ci0tLSBhL3Rlc3RzL21i LW5vbi1VVEY4LXBlcmZvcm1hbmNlCisrKyBiL3Rlc3RzL21iLW5vbi1VVEY4LXBlcmZvcm1hbmNl CkBAIC00Myw2ICs0Myw2IEBAIG1ieXRlX21zPSQodXNlcl90aW1lXyAxIGdyZXAgLWkgZm9vYmFy IGluKSB8fCBmYWlsPTEKICMgVGhlIGR1cmF0aW9uIG9mIHRoZSBtdWx0aS1ieXRlIHJ1biBtdXN0 IGJlIG5vIG1vcmUgdGhhbiAzMCB0aW1lcwogIyB0aGF0IG9mIHRoZSBzaW5nbGUtYnl0ZSBvbmUu CiAjIEEgbXVsdGlwbGUgb2YgMyBzZWVtcyB0byBiZSBlbm91Z2ggZm9yIGk1LGk3LCBidXQgQU1E IG5lZWRzID4yNS4KLWV4cHIgJHVieXRlX21zICc8JyAkbWJ5dGVfbXMgLyAzMCAmJiBmYWlsPTEK K3JldHVybnNfIDEgZXhwciAkdWJ5dGVfbXMgJzwnICRtYnl0ZV9tcyAvIDMwIHx8IGZhaWw9MQoK IEV4aXQgJGZhaWwKLS0gCjIuNi40Cgo= --001a1136988cc96681052c4013ad-- From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 21 05:46:50 2016 Received: (at 22655) by debbugs.gnu.org; 21 Feb 2016 10:46:51 +0000 Received: from localhost ([127.0.0.1]:35815 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXRXO-00066V-RF for submit@debbugs.gnu.org; Sun, 
21 Feb 2016 05:46:50 -0500 Received: from a1www.kph.uni-mainz.de ([134.93.134.1]:41066) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXRXM-00066F-Oe for 22655@debbugs.gnu.org; Sun, 21 Feb 2016 05:46:49 -0500 Received: from a1i15.kph.uni-mainz.de (a1i15.kph.uni-mainz.de [134.93.134.92]) by a1www.kph.uni-mainz.de (8.14.9/8.14.7) with ESMTP id u1LAkWAL030337; Sun, 21 Feb 2016 11:46:32 +0100 Received: from a1i15.kph.uni-mainz.de (localhost [127.0.0.1]) by a1i15.kph.uni-mainz.de (8.14.8/8.14.2) with ESMTP id u1LAkW3E029248; Sun, 21 Feb 2016 11:46:32 +0100 Received: (from ulm@localhost) by a1i15.kph.uni-mainz.de (8.14.8/8.14.8/Submit) id u1LAkVtv029244; Sun, 21 Feb 2016 11:46:31 +0100 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Message-ID: <22217.38278.854737.217979@a1i15.kph.uni-mainz.de> Date: Sun, 21 Feb 2016 11:46:30 +0100 To: Jim Meyering Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) In-Reply-To: References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> X-Mailer: VM 8.2.0b under 24.3.1 (x86_64-pc-linux-gnu) From: Ulrich Mueller X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 22655 Cc: Jim Meyering , Ulya Fokanova , 22655@debbugs.gnu.org, Sergei Trofimovich , Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) >>>>> On Sat, 20 Feb 2016, Jim Meyering wrote: > I have fixed the regex-oriented problem with the attached patch [...] I confirm that this patch (applied to sys-apps/grep-2.23 in Gentoo) fixes the problem for me. 
From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 21 10:36:55 2016 Received: (at 22655) by debbugs.gnu.org; 21 Feb 2016 15:36:55 +0000 Received: from localhost ([127.0.0.1]:36496 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXW47-0006Md-Id for submit@debbugs.gnu.org; Sun, 21 Feb 2016 10:36:55 -0500 Received: from mail-ob0-f171.google.com ([209.85.214.171]:36096) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXW45-0006MP-Jv for 22655@debbugs.gnu.org; Sun, 21 Feb 2016 10:36:53 -0500 Received: by mail-ob0-f171.google.com with SMTP id gc3so144930455obb.3 for <22655@debbugs.gnu.org>; Sun, 21 Feb 2016 07:36:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=PV5fcCE6ohXxLfnqwRrSm4MzTU/UBsK84++Ob/RK5UI=; b=kNVa3HVb1M1b5qjIwcXdisb0dLIaR+j96FBDNdXyKnb9xbl7NOL2KlEfj+T99x6dUn mIcqibVoqveLWnRIg9L40po9Tjsu/jPjXNg0NuxcFvQIY+IUseMq6uLKEKNWd1eij973 k8K2/ZMpQNwNVyjXgIDH+U4Ksiq2dVmTmub0E9vXo62nxeXT3khQDJX7pCbDLlDkWfnt 71+SPenDUiolXQ14hRc1JWjQ5KTFUD8pF5LChS+Qlm998AiMAlnkafbzgtHH8rpo7GOJ us7j8THmZM3migBSTxzqbie+/QrKGhZbNlo7lbKZWZDB4RJONP2+xtFpt/MsMwCqFGNa fgJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-type; bh=PV5fcCE6ohXxLfnqwRrSm4MzTU/UBsK84++Ob/RK5UI=; b=C7QVkfIYyKd5dJdFLkN2NOGjczYX2OXu1PJTtqU9SlqXV/LpbrKlWJqHh9azdgzE/G 6/HuHz+W/8h+GJBFy1rxZOQ2O11RQxLRf1++APHDCEUWWQkEb+4xUJXE9zz7lFNBTLon WkZWV4ETO7C7k4L2DKYqNoKPvBYe/Xla7eDGmvi6nDntN0nY8KoqA6T3dv6TIuSRYm1D J3XyyLA2B77DMPnN+0MZbPEFL0B5VZblHsVkExgs6z+sEb0JJIgS8zQ6NsKbA9U/Sujp 7YXWDGPs0rtKWf8ZMsW9Pi0uTnQBuaJp8Zjr9SN5CepiUZu6ufxVW/CNrku88TGcEAUa GQ0Q== X-Gm-Message-State: AG10YOTAS5SxCJGbPiyfc97N3TUf5JMTjaRNXEz7xzkmd4Hpn6wxQWCVOtL9Q8BkyocpZEeLtPF2sGyNEL3olA== X-Received: by 10.182.24.104 
with SMTP id t8mr19084050obf.1.1456069007857; Sun, 21 Feb 2016 07:36:47 -0800 (PST) MIME-Version: 1.0 Received: by 10.202.44.194 with HTTP; Sun, 21 Feb 2016 07:36:27 -0800 (PST) In-Reply-To: <22217.38278.854737.217979@a1i15.kph.uni-mainz.de> References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> <22217.38278.854737.217979@a1i15.kph.uni-mainz.de> From: Jim Meyering Date: Sun, 21 Feb 2016 07:36:27 -0800 X-Google-Sender-Auth: dXwDQJpHJmlCKlC9L1CcCoIzqec Message-ID: Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Ulrich Mueller Content-Type: multipart/mixed; boundary=001a11c29c9efb767e052c497b1a X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 22655 Cc: Jim Meyering , Ulya Fokanova , 22655@debbugs.gnu.org, Sergei Trofimovich , Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) --001a11c29c9efb767e052c497b1a Content-Type: text/plain; charset=UTF-8 On Sun, Feb 21, 2016 at 2:46 AM, Ulrich Mueller wrote: >>>>>> On Sat, 20 Feb 2016, Jim Meyering wrote: > >> I have fixed the regex-oriented problem with the attached patch [...] > > I confirm that this patch (applied to sys-apps/grep-2.23 in Gentoo) > fixes the problem for me. Thanks for checking. I reread the patch I pushed and was dismayed to see I'd left in a debugging statement. 
This additional patch removes the offending line (harmless unless you have a writable directory named /t): --001a11c29c9efb767e052c497b1a Content-Type: text/x-patch; charset=US-ASCII; name="0001-tests-test-cleanup.patch" Content-Disposition: attachment; filename="0001-tests-test-cleanup.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_ikwpvilk0 RnJvbSAzNmUwNTg3YTNjZGVlYTI0MzY1MTY4Zjk2MTg3YjkzNzczOWYyYmQzIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U3VuLCAyMSBGZWIgMjAxNiAwNzozMzozOCAtMDgwMApTdWJqZWN0OiBbUEFUQ0hdIHRlc3RzOiB0 ZXN0IGNsZWFudXAKCiogdGVzdHMvei1hbmNob3ItbmV3bGluZTogUmVtb3ZlIHRlc3QgYXJ0aWZh Y3QgdGhhdCB3b3VsZCB3cml0ZQp0byAvdC94LgotLS0KIHRlc3RzL3otYW5jaG9yLW5ld2xpbmUg fCAxIC0KIDEgZmlsZSBjaGFuZ2VkLCAxIGRlbGV0aW9uKC0pCgpkaWZmIC0tZ2l0IGEvdGVzdHMv ei1hbmNob3ItbmV3bGluZSBiL3Rlc3RzL3otYW5jaG9yLW5ld2xpbmUKaW5kZXggYjRkZmViYy4u YzEyZGQxZSAxMDA3NTUKLS0tIGEvdGVzdHMvei1hbmNob3ItbmV3bGluZQorKysgYi90ZXN0cy96 LWFuY2hvci1uZXdsaW5lCkBAIC0yNyw3ICsyNyw2IEBAIHByaW50ZiAnYVxuYlwwJyA+IGluIHx8 IGZyYW1ld29ya19mYWlsdXJlXwoKIGZhaWw9MAoKLWVudiA+IC90L3gKICMgVGhlc2UgdGhyZWUg d291bGQgYWxsIG1pc3Rha2VubHkgbWF0Y2gsIGJlY2F1c2UgdGhlIFthLWJdIHJhbmdlCiAjIGZv cmNlZCB0aGUgbm9uLURGQSAocmVnZXhwLXVzaW5nKSBjb2RlIHBhdGguCiByZXR1cm5zXyAxIGdy ZXAgLXogJ15bYS1iXSokJyBpbiB8fCBmYWlsPTEKLS0gCjIuNi40Cgo= --001a11c29c9efb767e052c497b1a-- From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 21 11:35:01 2016 Received: (at 22655) by debbugs.gnu.org; 21 Feb 2016 16:35:02 +0000 Received: from localhost ([127.0.0.1]:36532 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXWyL-0007mp-Mi for submit@debbugs.gnu.org; Sun, 21 Feb 2016 11:35:01 -0500 Received: from mail-oi0-f42.google.com ([209.85.218.42]:36554) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXWyJ-0007mc-Nd for 22655@debbugs.gnu.org; Sun, 21 Feb 2016 11:35:00 -0500 Received: by mail-oi0-f42.google.com with SMTP id w5so41319204oie.3 for 
<22655@debbugs.gnu.org>; Sun, 21 Feb 2016 08:34:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=vQhEwEe5/Adouwl6Y4Xg7ShPonVLh9yf03V65CRWhlo=; b=R3urGWBddcfww7SJgRWZatedvluplEHk4WyeVWOesh9d2FBkEuu/kCLFMtf99CAgEz i+02r9HK2aACeYyQweGjZwohzws4iKwY82Edx/m4fIVcOwO7DiUDNhPuioELymEEagTi dDevQJQf7eY/u4EFxWNqmjERiY+dsBYGjO4tS4Tt4CCScBxmtqRIgvsIoFXygp25uLqr 380dzDi+Jpn5TfEg77OXWazyB9VSOBFz23DSQvgSqJfqF4ccVRAjuf8z57V1wx2MpfT1 bezwo1bXddvphv1Z+ZqPx7loLvu9svKYeCIKlsTRzkNW+fX0g0IkhrmfcrLd8EWaEdYr pq7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-type; bh=vQhEwEe5/Adouwl6Y4Xg7ShPonVLh9yf03V65CRWhlo=; b=ix6mIIV8tnp2w5svH4bFhlpPG28kB6P/0iZrbeHnD9hDEerpH3HY6DuV4dD1qGIPKa pdW2WZRkLObDzn8Cacc6u/XRDOak4Z15A4f5oIcCMbkl/NJVtmHk00SAV+h9fNy8sdc8 8GGjHlkXOcxsM3rfRnjEQXzqJL2RBXaF7T8GpW+oDcRsxsGRxrFphbcT3MGaOWpGFOkc Zp7L/iBVYGxeacEkFekWk9uCWDObaS3gDfdf21wf/AsWCkUSfe7ve2tuZ2Dr/VAFCPHe Ai086J54FUXg3MDP/Ild9Z1N2semdDkmrjspZJStu6801Szg4s+yBtHPzePng89SISBG B0Xg== X-Gm-Message-State: AG10YORA+Cr+a4+Z4cT7rBbUrDfgha4VGeRnOjJrF73+h+wuHVWDjJjFpU/K2M377cfvf4hPxlrsHNzJmYRH3w== X-Received: by 10.202.200.88 with SMTP id y85mr19056256oif.69.1456072493944; Sun, 21 Feb 2016 08:34:53 -0800 (PST) MIME-Version: 1.0 Received: by 10.202.44.194 with HTTP; Sun, 21 Feb 2016 08:34:34 -0800 (PST) In-Reply-To: References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> From: Jim Meyering Date: Sun, 21 Feb 2016 08:34:34 -0800 X-Google-Sender-Auth: whBdmwG9pB8ClF6u2Ca6tocEaZk Message-ID: Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Ulya Fokanova Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.7 
 (/)
X-Debbugs-Envelope-To: 22655
Cc: Ulrich Mueller , 22655@debbugs.gnu.org, Sergei Trofimovich , Paul Eggert
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.7 (/)

On Sat, Feb 20, 2016 at 8:19 PM, Jim Meyering wrote:
> On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova wrote:
>> I've explored the following case:
>>
>> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
>> 6
...
>> The bug also present with PCRE engine:
>>
>> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>> 6
>> $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>> 6
>
> Thank you for the analysis and the report.
> I have fixed the regex-oriented problem with the attached
> patch, but not yet the case using -P -z (PCRE + --null-data):

The -Pz/PCRE problem is more fundamental, and strikes
even with LC_ALL=C. This shows that with -Pz, anchors still
wrongly match at newlines, rather than at \0 bytes:

  $ printf '\0a\nb\0' | LC_ALL=C src/grep -Plz '^a'
  [Exit 1]
  $ printf '\0a\nb\0' | LC_ALL=C src/grep -Plz '^b'
  (standard input)

Fixing this is on PCRE's maint/README wish list with this item:

  . Line endings:

    * Option to use NUL as a line terminator in subject strings. This could now
      be done relatively easily since the extension to support LF, CR, and CRLF.
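
[Editorial sketch, not part of the original message.] To make the expected
anchor semantics concrete, here is a minimal C model of what a pattern like
'^a' is supposed to mean once the record terminator is configurable. It is not
grep's or PCRE's code; the function name any_record_starts_with and its
parameters are hypothetical, chosen only for this illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of grep or PCRE: return true if any
   record in BUF (LEN bytes), where records are terminated by the byte
   EOL, begins with the byte C.  This models what an anchored pattern
   such as '^a' is expected to mean when EOL is '\0' (grep -z) or
   '\n' (the default).  */
bool
any_record_starts_with (char const *buf, size_t len, char eol, char c)
{
  char const *p = buf;
  char const *end = buf + len;

  while (p < end)
    {
      /* P points at the first byte of a record.  */
      if (*p == c)
        return true;

      /* Advance to the byte after the next terminator, if there is one.  */
      char const *term = memchr (p, eol, end - p);
      if (!term)
        break;
      p = term + 1;
    }
  return false;
}
```

Applied to the five input bytes from the commands above ('\0', 'a', '\n',
'b', '\0'), this model finds a record starting with 'a' but none starting
with 'b' when EOL is '\0', which is the behavior -z is meant to give. With
EOL set to '\n' the answers flip, which happens to coincide with the
-Plz results shown above.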
From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 21 15:06:40 2016 Received: (at 22655) by debbugs.gnu.org; 21 Feb 2016 20:06:40 +0000 Received: from localhost ([127.0.0.1]:36580 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXaHA-0004LT-H8 for submit@debbugs.gnu.org; Sun, 21 Feb 2016 15:06:40 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:43280) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aXaH9-0004LF-6g for 22655@debbugs.gnu.org; Sun, 21 Feb 2016 15:06:39 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 6586E160FC1; Sun, 21 Feb 2016 12:06:33 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 4E3Iqx8Jdyco; Sun, 21 Feb 2016 12:06:32 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 52F97160FEF; Sun, 21 Feb 2016 12:06:32 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id vIgklXE5eLvB; Sun, 21 Feb 2016 12:06:32 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 18621160FC1; Sun, 21 Feb 2016 12:06:31 -0800 (PST) Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Jim Meyering , Ulya Fokanova References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <56CA18C4.9050102@cs.ucla.edu> Date: Sun, 21 Feb 2016 12:06:28 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1 MIME-Version: 1.0 In-Reply-To: Content-Type: 
 multipart/mixed; boundary="------------050102080400000100030801"
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 22655
Cc: Ulrich Mueller , 22655@debbugs.gnu.org, Sergei Trofimovich
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.0 (/)

This is a multi-part message in MIME format.
--------------050102080400000100030801
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Jim Meyering wrote:
> The -Pz/PCRE problem is more fundamental, and strikes
> even with LC_ALL=C.

grep should report an error when this problem occurs, rather than silently
give incorrect answers. I installed the attached patch in a bold attempt to
implement this; please feel free to revert if you think it goes too far.

--------------050102080400000100030801
Content-Type: text/x-diff; name="0001-grep-Pz-is-incompatible-with-and.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="0001-grep-Pz-is-incompatible-with-and.patch"

>From b824e32f5754b93db25b779d2484472120c45ac2 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Sun, 21 Feb 2016 11:38:00 -0800
Subject: [PATCH] grep: -Pz is incompatible with ^ and $

Problem reported by Sergei Trofimovich in: http://bugs.gnu.org/22655
* NEWS: Document this.
* src/pcresearch.c (Pcompile): Warn with -Pz and anchors.
* tests/pcre: Test new behavior.
---
 NEWS             |  7 +++++++
 src/pcresearch.c | 14 ++++++++++++++
 tests/pcre       |  6 ++----
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index ae238be..990118e 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,13 @@ GNU grep NEWS -*- outline -*-
   there is no match, as expected.
   [bug introduced in grep-2.7]

+  grep -Pz now diagnoses attempts to use patterns containing ^ and $,
+  instead of mishandling these patterns. This problem seems to be
+  inherent to the PCRE API; removing this limitation is on PCRE's
+  maint/README wish list. Patterns can continue to match literal ^
+  and $ by escaping them with \ (now needed even inside [...]).
+  [bug introduced in grep-2.5]
+

 * Noteworthy changes in release 2.23 (2016-02-04) [stable]

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 3fee67a..3b8e795 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -124,6 +124,20 @@ Pcompile (char const *pattern, size_t size)
   /* FIXME: Remove these restrictions. */
   if (memchr (pattern, '\n', size))
     error (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
+  if (! eolbyte)
+    {
+      bool escaped = false;
+      for (p = pattern; *p; p++)
+        if (escaped)
+          escaped = false;
+        else
+          {
+            escaped = *p == '\\';
+            if (*p == '^' || *p == '$')
+              error (EXIT_TROUBLE, 0,
+                     _("unescaped ^ or $ not supported with -Pz"));
+          }
+    }

   *n = '\0';
   if (match_words)
diff --git a/tests/pcre b/tests/pcre
index 92e788e..b8b4662 100755
--- a/tests/pcre
+++ b/tests/pcre
@@ -13,9 +13,7 @@ require_pcre_
 fail=0

 echo | grep -P '\s*$' || fail=1
-echo | grep -zP '\s$' || fail=1
-
-echo '.ab' | grep -Pwx ab
-test $? -eq 1 || fail=1
+echo | returns_ 2 grep -zP '\s$' || fail=1
+echo '.ab' | returns_ 1 grep -Pwx ab || fail=1

 Exit $fail
--
2.5.0

--------------050102080400000100030801--

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 22 10:01:23 2016
Received: (at 22655) by debbugs.gnu.org; 22 Feb 2016 15:01:23 +0000
Received: from localhost ([127.0.0.1]:38205 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from )
 id 1aXrzH-0004Mz-Cg for submit@debbugs.gnu.org; Mon, 22 Feb 2016 10:01:23 -0500
Received: from mail-ob0-f175.google.com ([209.85.214.175]:34248)
 by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from )
 id 1aXrzE-0004Ml-Vv for 22655@debbugs.gnu.org; Mon, 22 Feb 2016 10:01:22 -0500
Received: by mail-ob0-f175.google.com with SMTP id ts10so51870165obc.1
 for <22655@debbugs.gnu.org>; Mon, 22 Feb 2016 07:01:20 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type;
 bh=2aQxvLklCOUWLfWAej/k0OkRTc0USntBzPrtUJhY8D8=;
 b=1BCoBKOVtejtaBzSGfm+rXkwq9fVSN2KAP4xgyiXXkEATTLphYpKYht4tXTKT2lC2Z
 VwtqSgvN41ewk+QQgS8SooEqOEC3CpXLalBXcvBaV8eHwIhpAKoRZ6bRfNN7wq2Q7v6o
 yeJgzrcHumB3k4XTZV4XWMxJvxr6KWnUgdNt5wISJCkQ474zusCIVpXmI52Ga6yABpjm
 TDtYJ2WvBiikK7ovK0YRBbfRblWMgDDb+KcCZzrsowVJD5UvV6glcE4Ig7Fgmi2bWIBo
 Pf3yO+59vyBz1czA3wICzpicCipkPhaBKcyONAEYITyfAwti/XNET/4NcLCPUexhBhTx
 mfAg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820;
 h=x-gm-message-state:mime-version:sender:in-reply-to:references:from
 :date:message-id:subject:to:cc:content-type;
 bh=2aQxvLklCOUWLfWAej/k0OkRTc0USntBzPrtUJhY8D8=;
 b=lrb2Y1+jR3XujmiHr8JXzF6pfdJ3bVlEfI7nWHAm9pBQnxBth222eFjV1LsHcFoxJL
 RxW/tuHCV5DzVCuTJ2gPVAg102WadNKt1eFhQlla2cQWjqJ86QB0P6l+s8BsPxGu/7AM
 FeyhNVPBXF/4BuSQrVfhHcJIgx1Js3gc70vC8aK31DO1WmxBMy8HlyVYiWlOYRwepess
 wZ7H7AHHEwiTgBrrVJ2/nQWgeuBrNEScrzhOAeeOFwvzEHlyinoTqOMaeUiNsTguOVYf
g4z+S9CMfRRMIsB2uLrxkQGhG46flq5onP4JF6ab84OBhNQlw0KiSa9ur4BwP99yfi0P oh2w== X-Gm-Message-State: AG10YOSMLuje6oEQIi2k+zOSn0af5n+9WraeNVyaaZKWfnpdbMJ2E2l7D0rsIGJw0EiPZPE5X3uDXOh1+s2Pew== X-Received: by 10.182.24.104 with SMTP id t8mr22788278obf.1.1456153275133; Mon, 22 Feb 2016 07:01:15 -0800 (PST) MIME-Version: 1.0 Received: by 10.202.44.194 with HTTP; Mon, 22 Feb 2016 07:00:54 -0800 (PST) In-Reply-To: <56CA18C4.9050102@cs.ucla.edu> References: <20160213232052.7a5d14a2@sf> <22207.50571.484408.644052@a1i15.kph.uni-mainz.de> <56C0DD45.90600@gmail.com> <56CA18C4.9050102@cs.ucla.edu> From: Jim Meyering Date: Mon, 22 Feb 2016 07:00:54 -0800 X-Google-Sender-Auth: 2IeAw4PAen4oBqgJ8JCoL6DTv4A Message-ID: Subject: Re: bug#22655: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) To: Paul Eggert Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: Ulya Fokanova , 22655@debbugs.gnu.org, Ulrich Mueller , Sergei Trofimovich X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) On Sun, Feb 21, 2016 at 12:06 PM, Paul Eggert wrote: > Jim Meyering wrote: >> >> The -Pz/PCRE problem is more fundamental, and strikes >> even with LC_ALL=C. > > grep should report an error when this problem occurs, rather than silently > give incorrect answers. I installed the attached patch in a bold attempt to > implement this; please feel free to revert if you think it goes too far. Thank you, Paul. That suits me, and I found no fault with the patch. These two bugs are a big enough deal that I now plan to make another bug-fix-only release this week. 
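
[Editorial sketch, not part of the original message.] For readers following
the thread, the check that Paul's patch adds to Pcompile can be restated as a
standalone C function. This is only a reading aid under stated assumptions:
the name has_unescaped_anchor is hypothetical, and the sketch returns a
boolean where the patch instead calls error (EXIT_TROUBLE, ...).

```c
#include <stdbool.h>

/* Standalone restatement of the scan in the patch to Pcompile; the
   function name is hypothetical.  Return true if PATTERN contains a
   '^' or '$' that is not quoted by an immediately preceding unescaped
   backslash.  */
bool
has_unescaped_anchor (char const *pattern)
{
  bool escaped = false;

  for (char const *p = pattern; *p; p++)
    if (escaped)
      escaped = false;   /* This byte was quoted by the previous backslash.  */
    else
      {
        escaped = *p == '\\';
        if (*p == '^' || *p == '$')
          return true;   /* The patch reports an error at this point.  */
      }
  return false;
}
```

With this logic the pattern \s$ is rejected while \$ is accepted. A
pattern such as [^a] is rejected too, since the scan does not track bracket
expressions; that matches the NEWS remark that escaping is "now needed even
inside [...]".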

From debbugs-submit-bounces@debbugs.gnu.org Wed Mar 23 12:03:24 2016
Received: (at 22655) by debbugs.gnu.org; 23 Mar 2016 16:03:24 +0000
Received: from localhost ([127.0.0.1]:34619 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from )
 id 1ailFk-0002yy-0V for submit@debbugs.gnu.org; Wed, 23 Mar 2016 12:03:24 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:36662)
 by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from )
 id 1ailFi-0002ym-Th for 22655@debbugs.gnu.org; Wed, 23 Mar 2016 12:03:23 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id B6772161204
 for <22655@debbugs.gnu.org>; Wed, 23 Mar 2016 09:03:15 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id 4l43ddSCJlG6; Wed, 23 Mar 2016 09:03:14 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 0AB98160E42;
 Wed, 23 Mar 2016 09:03:14 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id sAO5_MM4vYcV; Wed, 23 Mar 2016 09:03:13 -0700 (PDT)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id DFF6B1601A6;
 Wed, 23 Mar 2016 09:03:13 -0700 (PDT)
From: Paul Eggert
To: 22655@debbugs.gnu.org
Subject: [PATCH] grep: -Pz no longer misdiagnoses [^a]
Date: Wed, 23 Mar 2016 09:03:01 -0700
Message-Id: <1458748981-16757-1-git-send-email-eggert@cs.ucla.edu>
X-Mailer: git-send-email 2.5.5
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 22655
Cc: Paul Eggert
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.0 (/)

Problem reported by Michael Jess.
* NEWS: Document this.
* src/pcresearch.c (Pcompile): Do not diagnose [^ when [ is unescaped.
* tests/pcre: Test for the bug.
---
 NEWS             | 3 +++
 src/pcresearch.c | 8 +++++---
 tests/pcre       | 1 +
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index 144f668..69e4a23 100644
--- a/NEWS
+++ b/NEWS
@@ -4,6 +4,9 @@ GNU grep NEWS -*- outline -*-

 ** Bug fixes

+  grep -Pz no longer mistakenly diagnoses patterns like [^a] that use
+  negated character classes. [bug introduced in grep-2.24]
+
   grep -oz now uses null bytes, not newlines, to terminate output lines.
   [bug introduced in grep-2.5]

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 3b8e795..f6e72b0 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -127,15 +127,17 @@ Pcompile (char const *pattern, size_t size)
   if (! eolbyte)
     {
       bool escaped = false;
+      bool after_unescaped_left_bracket = false;
       for (p = pattern; *p; p++)
         if (escaped)
-          escaped = false;
+          escaped = after_unescaped_left_bracket = false;
         else
           {
-            escaped = *p == '\\';
-            if (*p == '^' || *p == '$')
+            if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket))
               error (EXIT_TROUBLE, 0,
                      _("unescaped ^ or $ not supported with -Pz"));
+            escaped = *p == '\\';
+            after_unescaped_left_bracket = *p == '[';
           }
     }

diff --git a/tests/pcre b/tests/pcre
index b8b4662..8f3d9a4 100755
--- a/tests/pcre
+++ b/tests/pcre
@@ -15,5 +15,6 @@ fail=0
 echo | grep -P '\s*$' || fail=1
 echo | returns_ 2 grep -zP '\s$' || fail=1
 echo '.ab' | returns_ 1 grep -Pwx ab || fail=1
+echo x | grep -Pz '[^a]' || fail=1

 Exit $fail
--
2.5.5

From debbugs-submit-bounces@debbugs.gnu.org Mon Apr 11 00:49:30 2016
Received: (at 22655-done) by debbugs.gnu.org; 11 Apr 2016 04:49:31 +0000
Received: from localhost ([127.0.0.1]:57838 helo=debbugs.gnu.org)
 by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from )
 id 1apTn0-0003T5-Qo for submit@debbugs.gnu.org; Mon, 11 Apr 2016 00:49:30 -0400
Received:
from zimbra.cs.ucla.edu ([131.179.128.68]:34182) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1apTn0-0003St-4V for 22655-done@debbugs.gnu.org; Mon, 11 Apr 2016 00:49:30 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9918E160FD3 for <22655-done@debbugs.gnu.org>; Sun, 10 Apr 2016 21:49:24 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id js6L6UF7ARjB for <22655-done@debbugs.gnu.org>; Sun, 10 Apr 2016 21:49:24 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id F0CD8161250 for <22655-done@debbugs.gnu.org>; Sun, 10 Apr 2016 21:49:23 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id wLK2iCKaKcol for <22655-done@debbugs.gnu.org>; Sun, 10 Apr 2016 21:49:23 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D614A160FD3 for <22655-done@debbugs.gnu.org>; Sun, 10 Apr 2016 21:49:23 -0700 (PDT) To: 22655-done@debbugs.gnu.org From: Paul Eggert Subject: Re: grep-2.21 (and git master): --null-data and ranges work in an odd way (-P works fine) Organization: UCLA Computer Science Department Message-ID: <570B2CD3.9040908@cs.ucla.edu> Date: Sun, 10 Apr 2016 21:49:23 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -1.0 (-) X-Debbugs-Envelope-To: 22655-done X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" 
X-Spam-Score: -1.0 (-) As this bug now appears to be fixed in grep master (or at least, as fixed as well as libpcre will allow), I'm closing it. From unknown Wed Jun 25 00:24:01 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Mon, 09 May 2016 11:24:05 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 07:51:57 2016 Received: (at control) by debbugs.gnu.org; 18 Nov 2016 12:51:57 +0000 Received: from localhost ([127.0.0.1]:33196 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7ie4-0000aY-Rs for submit@debbugs.gnu.org; Fri, 18 Nov 2016 07:51:56 -0500 Received: from mail-wm0-f45.google.com ([74.125.82.45]:35539) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7ie3-0000aK-EC for control@debbugs.gnu.org; Fri, 18 Nov 2016 07:51:55 -0500 Received: by mail-wm0-f45.google.com with SMTP id a197so36352368wmd.0 for ; Fri, 18 Nov 2016 04:51:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:subject:message-id:mime-version:content-disposition :user-agent; bh=ch1N9JIFa1/AwhyLQhpuSFFI4dQUQGngKfO82xZf3Zk=; b=W62/rfaJ7uHnk3MSUyh4VWszwnw01DRWI9HtQpkEpWHbLwu3mXcqZNEv/NK54ofIiZ 3rbGcePMU3hqjKNvNnV39lh38F1CP1pDWezyYS/gP7L9BtFPsT1bKEPi4sVfTbWQngHX /a/oQT6e5anrGpvSDgbMurkkPI/dnNbyNzXw9qruDF6dvu23EcHKW3sLVPx7ljnPi5n/ 6hcuDT9nzbBSZe8xG4ROMSzmWmE2jrBAEsb2cPb0l7tqdymTLEsFkjOQdJvzHHotAtKI m5zbUnosxwZKlerwCP6sTdkQzLQCDk5l4MIASD8BxZw+NrdtUCXJEGQsRngeEPL9GyZs Yafw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:subject:message-id:mime-version :content-disposition:user-agent; 
bh=ch1N9JIFa1/AwhyLQhpuSFFI4dQUQGngKfO82xZf3Zk=; b=RijTLXaqHcZRMmkQr/5TvfeWjJ2XCycW7QSPkNe6Mpimb/9SMRgnqQ9IruPJGc/DYY gKj0ORqUzoyC0QwvDg+DISLrmjPvis53LvzCVvRqV4BekxzdBxyjxM0g/+jy//Qo6z88 ZSo3K6z1YGoaUQMQO/M5tUavU4HHoS2w1PEX7wFhJoQEFZBRs4SDeT1aS/kc9RN2umUh z+Vm1rY7ezU5THQ5zwnOrSD5lvJ28cpgkNEXxML61gpImnr00i4F9gnZbKcoOSrh+FXk wm8mRbj0ze9uCVH2ShpGOjjMlHhTI4UFjAPZa6ptuvl4o1EFlvfh0bYmAqQ/QikvHJTo a+SQ== X-Gm-Message-State: AKaTC01eBJRjF/cGpA5rc6qQes1MixupBnypwPQnGcgZH073HMLak38tEzQao8xwN5DeEg== X-Received: by 10.194.191.161 with SMTP id gz1mr5629061wjc.22.1479473509617; Fri, 18 Nov 2016 04:51:49 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id w66sm3256887wme.4.2016.11.18.04.51.48 for (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 04:51:49 -0800 (PST) Date: Fri, 18 Nov 2016 12:51:48 +0000 From: Stephane Chazelas To: control@debbugs.gnu.org Subject: unarchive 22655 Message-ID: <20161118125148.GB5144@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) unarchive 22655 From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 07:52:38 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 12:52:39 +0000 Received: from localhost ([127.0.0.1]:33210 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7iek-0000cM-Mi for submit@debbugs.gnu.org; Fri, 18 Nov 2016 07:52:38 -0500 Received: from mail-qk0-f196.google.com ([209.85.220.196]:32964) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7iei-0000c6-VJ for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 07:52:37 -0500 
Received: by mail-qk0-f196.google.com with SMTP id x190so31818102qkb.0 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 04:52:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:subject:message-id:mime-version:content-disposition :user-agent; bh=lHZakn7lltjrU0WmAUFjxgD736HIqOfg3igvkUe15SM=; b=b+x6fsSRBPEQXLYQOLdJOfxIZkTBHE00lBF6R4V07KfxzWa4l0sjqTC6Z5Dry6+itF 5AymRUwpNeXcDuZBLSFfm+GzF4uaZD1ZvimdcFVjhG0CSKvFClssoDukII7S4etxs5QG MIpo04+KsExh63ObywSD5jRfm4OloYgE3NEDspYGSMurEEFzm7Z/IWaGEPyDn9HLO7NB GgMt5Tj260JTCyZakgoCTkLtJwbZQEujlZKVoz2ff2+TqNPxxe11XFp2ffJy+aKak1lS p5wl5qJ6XoE7TjVRARxuiOO4Ie1p5pePkS7yWmQTsSoOJk8xz35903NE4U9qoYEuJcUP wpwQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:subject:message-id:mime-version :content-disposition:user-agent; bh=lHZakn7lltjrU0WmAUFjxgD736HIqOfg3igvkUe15SM=; b=iPhaDjSrIBZXCS3FQWGIPoGtYOtO76IJaiHYDUJToZOFmq7FVxtVfFN5dXYfUcYyT8 68ZgBxCskWML+SG/dFIhB+nMEd9xR28JZdGgB99wgBlZEVXSSmjUQu+bLnxdwGh3eeDv z9KZ5WOykMcIYp3h5jc8dlOyKBBp9CzFQeeJ6LfLoqajsfHx0fOQGb8myxFQYZc3ZkT+ lTVjNencmGUo3jPNs1PCqk1F7S3L5ycfDubwZhBfb+WyB+ef/t53GaKe7pCsAvuUxtpH zsXSlRADoLcozPE2WL2ewuj7/Mv9CtSLm7JKz1ZM+Rtsq0y5Q8tP5BebRGeHINI5NjyA KCQg== X-Gm-Message-State: AKaTC01zzKviIUxlKmvzKihIk5o8rfKtkrqWPGxGUDAHw7DbCSm/0w294fTCyg5NcmEl/g== X-Received: by 10.194.85.77 with SMTP id f13mr5684620wjz.187.1479473551055; Fri, 18 Nov 2016 04:52:31 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id l187sm3259159wml.6.2016.11.18.04.52.30 for <22655@debbugs.gnu.org> (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 04:52:30 -0800 (PST) Date: Fri, 18 Nov 2016 12:52:29 +0000 From: Stephane Chazelas To: 22655@debbugs.gnu.org Subject: grep -Pz '^' now fails! 
Message-ID: <20161118125229.GA10084@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: -0.2 (/) X-Debbugs-Envelope-To: 22655 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.2 (/)

Hello,

I'm just finding out that ^ and $ no longer work with grep -Pz:

https://unix.stackexchange.com/questions/324263/grep-command-doesnt-support-and-anchors-when-its-with-pz

$ grep -Pz '^'
grep: unescaped ^ or $ not supported with -Pz

Which points to this bug.

Note that it's not that pcre doesn't support NUL-delimited records; it's that grep calls pcre with the wrong flag (PCRE_MULTILINE), which is like the m flag in the /.../m perl RE operator and explicitly tells ^ to match at the beginning of the subject *but also after every newline* (same for $).

As already noted at https://debbugs.gnu.org/cgi/bugreport.cgi?bug=16871#8

printf 'a\nb\0' | grep -Pz '^b'

did match, which was indeed a bug, but only because of that PCRE_MULTILINE flag. If you turned off that flag:

printf 'a\nb\0' | grep -Pz '(?-m)^b'

then it wouldn't match.

With grep 2.10:

$ printf 'a\nb\0c\0' | grep -Poz '^.'
a
b # BUG
c
$ printf 'a\nb\0c\0' | grep -Poz '(?-m)^.'
a
c

Or use \A and \z in place of ^ and $; they match at the beginning and end of the subject regardless of the state of the "m" flag:

$ printf 'a\nb\0c\0' | grep -Poz '\A.'
a
c

Now with the new version, we need to use those \A, \z. Or if we want to match at the beginning of any of the lines in a NUL-delimited record, we need ugly things like:

grep -Pz '(?:\A|(?<=\n))'

instead of

grep -Pz '(?m)^'

Can that bug please be reopened so it can be addressed differently (PCRE_MULTILINE removed, PCRE_DOLLAR_ENDONLY added)?
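The anchor behaviour discussed in this message can be illustrated with a minimal sketch. It uses Python's `re` module as a stand-in for PCRE, which is an assumption of this note and not something grep does: `re.MULTILINE` plays the role of PCRE_MULTILINE, and Python's `\Z` plays the role of PCRE's `\z`. Python has no direct equivalent of PCRE_DOLLAR_ENDONLY, so the last two checks only show why a plain `$` is not enough when a record ends in a newline.

```python
import re

# One NUL-delimited record that happens to contain a newline.
record = "a\nb"

# With MULTILINE (analogous to PCRE_MULTILINE), ^ also matches after a newline.
multiline_hits = re.findall(r"^.", record, flags=re.MULTILINE)

# Without it, ^ matches only at the very start of the subject.
plain_hits = re.findall(r"^.", record)

# \A matches only at the start of the subject, whatever the flag says.
anchored_hits = re.findall(r"\A.", record, flags=re.MULTILINE)

# A plain $ also matches just before a trailing newline; \Z does not.
dollar_matches = re.search(r"a$", "a\n") is not None
end_matches = re.search(r"a\Z", "a\n") is not None

print(multiline_hits, plain_hits, anchored_hits, dollar_matches, end_matches)
```

The same distinction is what the `(?-m)` and `\A` workarounds above rely on.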
-- Stephane From unknown Wed Jun 25 00:24:01 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: Did not alter fixed versions and reopened. Date: Fri, 18 Nov 2016 12:54:01 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # Did not alter fixed versions and reopened. thanks # This fakemail brought to you by your local debbugs # administrator From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 10:54:39 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 15:54:39 +0000 Received: from localhost ([127.0.0.1]:34110 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7lUs-0006rv-QR for submit@debbugs.gnu.org; Fri, 18 Nov 2016 10:54:38 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:44434) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7lUr-0006rh-4w for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 10:54:37 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 39DFE16013C; Fri, 18 Nov 2016 07:54:31 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 75Lzh_PebPZB; Fri, 18 Nov 2016 07:54:30 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 6156C16013D; Fri, 18 Nov 2016 07:54:30 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id krT_AybOB48T; Fri, 18 Nov 2016 07:54:30 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 4520D16013C; Fri, 18 Nov 2016 07:54:30 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas , 22655@debbugs.gnu.org References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> Date: Fri, 18 Nov 2016 07:54:29 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161118125229.GA10084@chaz.gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

Stephane Chazelas wrote:
> Can that bug please be reopened so it can be addressed
> differently (PCRE_MULTILINE removed, PCRE_DOLLAR_ENDONLY added)?

Removing PCRE_MULTILINE will make 'grep' waaaay slower. Can you think of a way of fixing the bug that doesn't involve removing PCRE_MULTILINE?
From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 11:24:11 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 16:24:11 +0000 Received: from localhost ([127.0.0.1]:34120 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7lxT-0007b0-GN for submit@debbugs.gnu.org; Fri, 18 Nov 2016 11:24:11 -0500 Received: from mail-wm0-f66.google.com ([74.125.82.66]:33104) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7lxS-0007ah-EA for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 11:24:10 -0500 Received: by mail-wm0-f66.google.com with SMTP id u144so7774223wmu.0 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 08:24:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=/iCpSRFCXJtQ++nhULcSVkZYCpB/tp421L4lxyt+6zc=; b=MhtlxH7gOHGFUsOYsqgBP8unnIY/RXb7cxGVjmCU1cWR0Ev9gaJwdZbBaI4vrAAgdF 7qkQXdJF0C6UfUZmkPVHrkf2fplQGhJNN+iK3yVN+2PzYKI60xaYfFwZU1Yc+2i+dlRa qEvHANtAwAzEmdOPGjlSr2eBwq3d08ZOg0OOs8zsNQu3sfyGJEq6SjyBqg+jL+THdkB9 jEGr0ikvTio/VgyXBeFQVGw5Bg1s9Cj7AZNL/hvPbyaqtcOGc0O/1xm5G/dAR4xo7MIu Abe80MGijllaEbFnNVixWbWHOljNHaBOIN04MKwM5sYTFapn5Pwp34E4Fsw1BwqDu+vC mcbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=/iCpSRFCXJtQ++nhULcSVkZYCpB/tp421L4lxyt+6zc=; b=Q4Zyv1vTsjh+6yZ36z6dGonMnyyVI4lFuanm4UatqRFeLja+GiIv+hkulaG+G8a77a zaIQmXjYS48oEDUo0TuoudDd1GoocvUDBssB8eQYADPIhyNXOdVhV/2As0pU10fDvE/O XuyExsBzNH13DvOLxZYp+akD7GNVA+l71T8+RdBawpyIYG/rzFvm2+0q0bSlaBvG/XLM sxYOwttDgoJ5ZuEUduJ/jKNeElR9pNHq2OxgYuOfYE0yPHZm+LTc7+RAMIN0kxkc9kKI akErSHN3bdx90TfefFrVM2wWbf/oXoka3ryBqj46yLFpUWWNHNyFdNbK49UmIU0Zc0MG DppA== X-Gm-Message-State: AKaTC02mmvI5jnZnOZmwixHWzxfd2QeAIvr8/a3ktd0hW/7URHQoqoN7SYFuKX/3k9RIMQ== X-Received: by 
10.28.174.194 with SMTP id x185mr1060727wme.4.1479486244627; Fri, 18 Nov 2016 08:24:04 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id l187sm4182097wml.6.2016.11.18.08.24.03 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 08:24:03 -0800 (PST) Date: Fri, 18 Nov 2016 16:24:02 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161118162402.GD5144@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/)

2016-11-18 07:54:29 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> >Can that bug please be reopened so it can be addressed
> >differently (PCRE_MULTILINE removed, PCRE_DOLLAR_ENDONLY added)?
>
> Removing PCRE_MULTILINE will make 'grep' waaaay slower. Can you
> think of a way of fixing the bug that doesn't involve removing
> PCRE_MULTILINE?

Why would it make it slower? AFAICT, PCRE_MULTILINE *adds* some overhead. It is really meant to be only about changing the behaviour of ^ and $. Not-PCRE_MULTILINE is the default just about everywhere else (including less, pcregrep, ssed -R).

Do you have any particular test-case in mind?
-- Stephane From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 11:38:03 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 16:38:03 +0000 Received: from localhost ([127.0.0.1]:34139 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7mAt-0007x0-J2 for submit@debbugs.gnu.org; Fri, 18 Nov 2016 11:38:03 -0500 Received: from mail-wj0-f195.google.com ([209.85.210.195]:35445) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7mAr-0007wW-TJ for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 11:38:02 -0500 Received: by mail-wj0-f195.google.com with SMTP id f8so103094wje.2 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 08:38:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=VUYQJb3v/jqHVicB1svfu+CQZnUSSuYhhNPSTEdV9Qs=; b=QJ3eEXnGK+/9obR1oInW3tiOkLVvt5I48UM7M7YG8HuPIAaAqbxXZtUKNfZOgL1GMS nzd+9PUKBrdhpcfZMj7FXusF7uS3ky0YW/eELeHzZthsN4orQEkCAM1xg5Sn/+1VR2Hz DyzwQLC9tm5xR6sLrEEGe7NUSXc6ryi76NrNx+qab46BhOGA4n2Rt3yg+eldMWZXOfbq g7+Sz/Tce2MNCBS0f5D0x2Y46Jhb1/yNGcovaunBqWfjJRy+VaVmo4qIxuGyj1weaNsA CCv5wAv8zwoefk4/K722eswvwUW+ZK71X/s+6hOzT3LeCrIi484spd6Vxtn/wP8EDaXa WTww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=VUYQJb3v/jqHVicB1svfu+CQZnUSSuYhhNPSTEdV9Qs=; b=PVxBOZ9yLgKCAr9yXRiM3ZpAnsuTFw99/6B4KfyD3r0AF+b7vnHH0DwpRWwfZdCmjK abR1smcNVmKKIDvEUZeJ3DgMFuU2tvLFr0e0jLM3+XUZa84UyM7HRzGw5iiJNmqs1pGN A2u5yzBr3BQe4/hkn4GAwj3OhGKaf7i+9oh99erHZRmDzX3KZaIsUxmtectpOe5wXF9b Xlj9Mx4jTLR3B4IjHTGQmfrFszK6FvpKkwj8owaGMI4nOu8srs7M/Soy6ga0OfcTH0Do ie8PD8f1BqD7udJ4VvIlQ34jiY5Sjn0ZSEvkLnIt5kXn+jn38jirZ8hRNLNVzsN/hkIM 0l4w== X-Gm-Message-State: AKaTC03bpsG7604Iim9YzGHK8QGFCpXLl5nL1ZJMxvD/pabtiYr4RWgHDJHZBgtw3K0ZOQ== X-Received: 
by 10.194.113.2 with SMTP id iu2mr474836wjb.32.1479487076094; Fri, 18 Nov 2016 08:37:56 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id l67sm4404488wmf.0.2016.11.18.08.37.54 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 08:37:55 -0800 (PST) Date: Fri, 18 Nov 2016 16:37:54 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161118163754.GB10084@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161118162402.GD5144@chaz.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) 2016-11-18 16:24:02 +0000, Stephane Chazelas: [...] > Why would it make it slower. AFAICT, PCRE_MULTILINE *adds* > some overhead. It is really meant to be only about changing the > behaviour of ^ and $. [...] Unsurprisingly, where ~/a contains the output of find / -print0 (which is typically what grep -z is used on) 304MiB, 3696401 records, none of which contain a newline character in this instance. with grep 2.10 (Ubuntu 12.04 amd64) in a UTF-8 locale: $ time grep -Pz '(?-m)^/' ~/a > /dev/null grep -Pz '(?-m)^/' ~/a > /dev/null 0.84s user 0.15s system 99% cpu 0.990 total $ time grep -Pz '^/' ~/a > /dev/null grep -Pz '^/' ~/a > /dev/null 0.87s user 0.12s system 99% cpu 0.989 total Not much difference as "/" is found at the beginning of each record. 
$ time grep -Pz '(?-m)^x' ~/a > /dev/null grep -Pz '(?-m)^x' ~/a > /dev/null 0.41s user 0.06s system 99% cpu 0.473 total $ time grep -Pz '^x' ~/a > /dev/null grep -Pz '^x' ~/a > /dev/null 0.81s user 0.04s system 99% cpu 0.854 total PCRE_MULTILINE significantly slows things down as even though "x" is not found at the beginning of the subject, grep still needs to look for extra newline characters in the record. -- Stephane From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 11:48:14 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 16:48:14 +0000 Received: from localhost ([127.0.0.1]:34143 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7mKk-0008CB-IV for submit@debbugs.gnu.org; Fri, 18 Nov 2016 11:48:14 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:54264) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7mKi-0008Bw-0P for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 11:48:12 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 8AC9E16013D; Fri, 18 Nov 2016 08:48:05 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id cu7YIDQP-a6g; Fri, 18 Nov 2016 08:48:04 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id BE22C160148; Fri, 18 Nov 2016 08:48:04 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id uPAwkg9jvyT2; Fri, 18 Nov 2016 08:48:04 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id A270616013D; Fri, 18 Nov 2016 08:48:04 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> Date: Fri, 18 Nov 2016 08:48:04 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161118162402.GD5144@chaz.gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

Stephane Chazelas wrote:
> Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
> some overhead.

As I understand it, PCRE_MULTILINE lets 'grep' apply a pattern to an entire buffer that contains many lines, and this lets PCRE efficiently find the first match in the whole buffer. If grep doesn't use PCRE_MULTILINE, grep would have to apply the pattern to each line separately, which could be significantly slower.

That being said, PCRE matching is pretty slow already, so perhaps we shouldn't worry about efficiency here.
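The two strategies contrasted in this message (one match attempt per line versus one scan over a whole buffer with a multiline flag) can be sketched as follows. This is only an illustration in Python's `re` module, not grep's actual code; the helper name `lines_matching_in_buffer` and the sample buffer are hypothetical, and the sketch says nothing about which strategy is faster in PCRE.

```python
import re

buffer = "alpha\nbeta\nbravo\ngamma\n"
pattern = r"^b"

# Strategy 1: apply the pattern to each line separately (no MULTILINE needed).
per_line = [line for line in buffer.splitlines() if re.search(pattern, line)]


def lines_matching_in_buffer(pattern, buffer):
    """Strategy 2: scan the whole buffer once with MULTILINE, then
    expand every match to the line that contains it."""
    compiled = re.compile(pattern, re.MULTILINE)
    found = []
    pos = 0
    while pos < len(buffer):
        match = compiled.search(buffer, pos)
        if match is None:
            break
        # Locate the boundaries of the line around the match.
        start = buffer.rfind("\n", 0, match.start()) + 1
        end = buffer.find("\n", match.end())
        if end == -1:
            end = len(buffer)
        found.append(buffer[start:end])
        pos = end + 1  # continue with the next line
    return found


whole_buffer = lines_matching_in_buffer(pattern, buffer)
print(per_line, whole_buffer)
```

Both strategies select the same lines here; they differ only in how many times the matcher is invoked.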
From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 12:06:45 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 17:06:45 +0000 Received: from localhost ([127.0.0.1]:34157 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7mcf-0000D0-24 for submit@debbugs.gnu.org; Fri, 18 Nov 2016 12:06:45 -0500 Received: from mail-wm0-f65.google.com ([74.125.82.65]:34243) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7mcd-0000Co-NE for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 12:06:44 -0500 Received: by mail-wm0-f65.google.com with SMTP id g23so8152448wme.1 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 09:06:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=PVG42ZnoQLkL2ApuUq5bvWlEE65Z8RJfzCPX80733hw=; b=wqtFQGmnM8dYOvG6CWAl/uwdxSz9b1BT0s4FIt9c6VeGnikVWjy6V+00lo4kVIxvpI nsUCmbxejisdILmQAr5KYcW4e8WK/h+YuMMF/nckv96T5ANfyBTlzzh45P4ZVMVLllzJ b772RTYkm6Z1fnEkfygG/LH2mekVGXufPxl0IO6SQmkhc2MkYlm24DxY/69SpRoOW1Qj Qiv3PmLB5y6y52xf5AnGvZq84qFuvxWI7jRofsC+rqJ8hzjHAARrxAs3oZ19/bfCkCaO jPxRyWu1jLmEXW1ZCSqk1OleWJqpo3NumZX72icrvSHsodQeStbShPaxiXEDWYNv0FPT +9WQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=PVG42ZnoQLkL2ApuUq5bvWlEE65Z8RJfzCPX80733hw=; b=fqRLeqRVVCyg51D/HEv3C6kIp0ezD6S4e32SfKJZh/+buqiGepcoXQYcMjVVn8nxQI O2ty/RJdjN1Ny8WL1339jKs9c2Od10gUqJ8LumOf7H6WflGNRKDciFq7KsOz4RYdP8rv ZGFE9h9HEHk4khrHdVd5pF7hHb+owVrUCQVAgvVW1V5ifiw4ufkKDl+zFKnr9cQpsJHG N8ldj55r0CIxDDfb4kn6ImtMq58K1Kn36EArARdKOTIci+b9qURhpn38NRx4o/tOedup 2h+yvPBIvH7lIkCnwbnAvA/I4V6BjOsEQSzyuTGjbo7g2h4XwPeQXKFu9ijKZqjFIjy7 GALw== X-Gm-Message-State: AKaTC02MYTBji37AO9NDxeHU8YJgzebxEZLDamIUZDzdO6960wKr3l0vKUGT+NhdnENoPw== X-Received: by 
10.28.143.68 with SMTP id r65mr1262315wmd.95.1479488797977; Fri, 18 Nov 2016 09:06:37 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id dj5sm9817583wjb.34.2016.11.18.09.06.36 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 09:06:37 -0800 (PST) Date: Fri, 18 Nov 2016 17:06:36 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161118170636.GE5144@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/)

2016-11-18 08:48:04 -0800, Paul Eggert:
> Stephane Chazelas wrote:
> >Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
> >some overhead.
>
> As I understand it, PCRE_MULTILINE lets 'grep' apply a pattern to an
> entire buffer that contains many lines, and this lets PCRE
> efficiently find the first match in the whole buffer. If grep
> doesn't use PCRE_MULTILINE, grep would have to apply the pattern to
> each line separately, which could be significantly slower.
[...]

That might have been the case a long time ago, as I remember some discussion about it as it explained some wrong information in the documentation, but as far as I and gdb can tell, grep 2.26 at least calls pcre_exec for every line of the input with grep -P.
If it didn't echo test | grep -P '\n$' would match. I'll try and dig up the old discussions. -- Stephane From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 12:33:09 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 17:33:09 +0000 Received: from localhost ([127.0.0.1]:34166 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7n2D-0000qE-Ep for submit@debbugs.gnu.org; Fri, 18 Nov 2016 12:33:09 -0500 Received: from mail-qk0-f193.google.com ([209.85.220.193]:33597) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7n2B-0000pp-Lh for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 12:33:08 -0500 Received: by mail-qk0-f193.google.com with SMTP id x190so33576286qkb.0 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 09:33:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=NmXEz7Q3N2cUQYhQgGXMGIGepJcLkbW0zxD0B2eT3QY=; b=tvWPr6KN+xStKFLvKjAhxsArzCt2/t5krR22Ge0wNdhfnycwmbfzYopw2j3S55z0Kq jfPUxZEt4um0NZOm6cJD9g0WCt4wnDQIHU9Dsjutzu/deNLk3zjM8bAC8pZ5r4VULtxv bGqgYFY+v6P0/Zeez69xpsFIk5IONDLNcDksY8tKabLKl/bBHvp18Zb3NKfDUQ0IEIu5 iTghgjvVou29aOWX8W+8T0gaklCsNU6fGv+BiH+EdtyaabkD1V6HR7j0a1BzCTkVU4Z0 +XtPiJZg82px+uYjf8O8wAlciJhla2qdrVShKnMst53GMSqBJn6RWG5w078M90nk5rqD EHVQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=NmXEz7Q3N2cUQYhQgGXMGIGepJcLkbW0zxD0B2eT3QY=; b=a2wFzt1oGH6Sffv8pDmUXzUNP5lKvZDgLh/BbE7z6O627fVTZ8PQuPEW1zR3BrLxsi fLXL/cRI3gWn93xML2QELxiR8ZOZwVuHxUyP/pRk40iZy3uhj1jiR7gjRRZSCritPNgC K7NCKSPkiRuX0hG4C64w7jJI7SkvwBBM69Ec9MQhTk6WJcyyllnUCPRQKDY5bj8nZBJT JXTcYhH0SDzfCnC1tEzH+yR3iF1v/FJwmXAq0fbbK26mubQFI0PGXEFRfwo3qNn3WBYe 8obz+RVdhtG6tjxD1gEE2PryVZ0eQyM7NEPvGlQah3uiIKb6ukGgCO8TDSP8c87bsgwo SWog== 
X-Gm-Message-State: AKaTC02aTwRiTsPEAw/A5F4OXzKDQMtYSS/zwJzBMSOsl/9RRvBuU0FwoheNrma1ApDvlg== X-Received: by 10.194.47.242 with SMTP id g18mr535751wjn.203.1479490381922; Fri, 18 Nov 2016 09:33:01 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id g184sm4450652wme.23.2016.11.18.09.33.00 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 09:33:00 -0800 (PST) Date: Fri, 18 Nov 2016 17:33:00 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161118173259.GC10084@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161118170636.GE5144@chaz.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-18 17:06:36 +0000, Stephane Chazelas: [...] > That might have been the case a long time ago, as I remember > some discussion about it as it explained some wrong information > in the documentation, but as far as I and gdb can tell, grep > 2.26 at least call pcre_exec for every line of the input with > grep -P. > > If it didn't > > echo test | grep -P '\n$' > > would match. > > I'll try and dig up the old discussions. [...] 
I can't find the discussions, and they should have been a follow-up on https://debbugs.gnu.org/cgi/bugreport.cgi?bug=16871 but looking at the code, that changed in commit a14685c2833f7c28a427fecfaf146e0a861d94ba in 2010 for https://savannah.gnu.org/bugs/?27460 And shortly after, -Pz was supported (as that was made straightforward to do after the change I guess). Note that we should also be able to support grep -Pe pat1 -e pat2 now as well. And PCRE_MULTILINE now can only improve performance. -- Stephane From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 12:46:56 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 17:46:56 +0000 Received: from localhost ([127.0.0.1]:34187 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7nFY-0001C8-5r for submit@debbugs.gnu.org; Fri, 18 Nov 2016 12:46:56 -0500 Received: from mail-wm0-f65.google.com ([74.125.82.65]:36527) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7nFW-0001Bt-5D for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 12:46:54 -0500 Received: by mail-wm0-f65.google.com with SMTP id m203so8438649wma.3 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 09:46:54 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=80FIXj5RJ8l9JTIse0munVWATC7qZRIb5JCLGRghMLY=; b=RJ5AgoQZGNlYbFpBy7/p9CIzevFaME/4u6ZvT4/7FEAUib+0DwjlZQUPQ9gLnoKeIK Z86I3hcre8i6j02qMLHG0r+yujYv2Byogw6GoglDnzkR2NTzXXlfBFuKulmbsKBZCR3J S2fVue8hr5wgc+f7EMC2XE32qdvkJInsRTEce/hkakWCR/bw3r//WlagUbaMpkGndGe8 o0WXYjGiInzaxUwp6vMpeEUzScgMaba4rxWsgqfxK9u8JrLXITYsPxZpBK7ImTN/mA5b aINURoRY9WsIxYNEx/t0/f4rm7Xf4ULl0Q7v70recferMxKVnRkB67D0lnj70hpEfFvF zQmg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; 
bh=80FIXj5RJ8l9JTIse0munVWATC7qZRIb5JCLGRghMLY=; b=Hidl36Tri7hnuxTVxq/ygjfBD0RoFp+SGZ0Hnb9E/PsUBcJnDpmMg3NJgK3Rxzna0/ JZoMiPY+Dem+3+wb6TyBeOILsZVUhAK62SBHAlhzS6dMK70jA9tpY5TZVvl0hBu3AHUG tMy6NYLbBnxCrSS0qQ8m0Z6dXOMbxk60qDOnnOdhDQytcF0otcYht3JX4kyrnfEZB5aM TP08ZvOW7zp3Vc4szuIgsnPW/YOMi92zDg7+Uoi5dbMtLBQkjHX1FKL7xhdtNLt2793D e0lZrTKbH3lYdD279XaGaw6NfVr+VfC+70B5yxBXpWYygSDILU9LMWNLe7KpKbzQhPvJ EQRQ== X-Gm-Message-State: AKaTC03Ws/2ZSI96AcUep+HLSN6O7Az1R5uhEdts8YRLJ/eXesNu6kFiIci669EhnszKDQ== X-Received: by 10.194.123.201 with SMTP id mc9mr705495wjb.47.1479491208433; Fri, 18 Nov 2016 09:46:48 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id w79sm4540379wmw.0.2016.11.18.09.46.47 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 09:46:47 -0800 (PST) Date: Fri, 18 Nov 2016 17:46:46 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161118174646.GE10084@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161118173259.GC10084@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161118173259.GC10084@chaz.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-18 17:32:59 +0000, Stephane Chazelas: [...] > And PCRE_MULTILINE now can only improve performance. [...] 
Sorry, I meant "and *removing* PCRE_MULTILINE now can only improve performance". (and we need to add PCRE_DOLLAR_ENDONLY or otherwise printf 'a\n\0' | grep -Pz 'a$' would still match). From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 12:48:00 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 17:48:00 +0000 Received: from localhost ([127.0.0.1]:34191 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7nGa-0001E1-EQ for submit@debbugs.gnu.org; Fri, 18 Nov 2016 12:48:00 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:36082) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7nGX-0001Dj-SA for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 12:47:58 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id CE624160029; Fri, 18 Nov 2016 09:47:51 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 8CP1JT_w5RW9; Fri, 18 Nov 2016 09:47:51 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 2B1BF160060; Fri, 18 Nov 2016 09:47:51 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 0zMWIDiskqzr; Fri, 18 Nov 2016 09:47:51 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 12945160029; Fri, 18 Nov 2016 09:47:51 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <20161118163754.GB10084@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <300f5e65-f62c-6887-114c-b3f322eec4bd@cs.ucla.edu> Date: Fri, 18 Nov 2016 09:47:50 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161118163754.GB10084@chaz.gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

Stephane Chazelas wrote:
> $ time grep -Pz '(?-m)^/' ~/a > /dev/null

It looks like you want "^" to stand for a newline character, not the start of a line. That is not how grep -z works. -z causes the null byte to be the line delimiter, and "^" should stand for a position immediately after a null byte (or at start of file).

It might be nice to have a syntax for matching a newline byte with -z (or a null byte without -z, for that matter). But that would be a new feature.
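
As an illustration of the record semantics described above, here is a minimal sketch that splits NUL-delimited input into records and anchors the pattern to each record. It assumes Python's standard `re` module as a stand-in for PCRE and is not grep's implementation; the helper name `matching_records` is hypothetical. Python's `$` (which also matches just before a final newline) versus `\Z` (end of subject only) is used purely as an analogy for PCRE without and with PCRE_DOLLAR_ENDONLY, as in the `a$` example earlier in the thread.

```python
import re


def matching_records(data: str, pattern: str, flags: int = 0) -> list:
    """Split NUL-delimited input into records and return those that match."""
    records = data.split("\0")
    # A trailing NUL terminates the last record; it does not start a new one.
    if records and records[-1] == "":
        records.pop()
    rx = re.compile(pattern, flags)
    return [record for record in records if rx.search(record)]


# "^" anchored to the start of each record: only the second record starts with "b".
print(matching_records("a\nb\0b\nc\0", r"^b"))

# "$" also matches just before a final newline, so the record "a\n" is selected.
print(matching_records("a\n\0", r"a$"))

# "\Z" matches only at the very end of the record, so nothing is selected.
print(matching_records("a\n\0", r"a\Z"))
```

With multiline matching off, the first call selects only the record that really begins with "b", which is the behaviour "-z" is described as having.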
From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 13:08:00 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 18:08:00 +0000 Received: from localhost ([127.0.0.1]:34195 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7nZw-0001hE-2z for submit@debbugs.gnu.org; Fri, 18 Nov 2016 13:08:00 -0500 Received: from mail-wm0-f68.google.com ([74.125.82.68]:36290) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7nZv-0001h2-1h for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 13:07:59 -0500 Received: by mail-wm0-f68.google.com with SMTP id m203so8621230wma.3 for <22655@debbugs.gnu.org>; Fri, 18 Nov 2016 10:07:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=V9Drm/MV7fNYWFkfa+VPDdJW5mGgGV/TU42hT3OahWA=; b=w4cZY0urW9CBo0EgDQ4QhGnSsn7lRe7aB+sW7xmf3Rn0W0CueQOJWr89O3uyPEsugb OUKA8S0t0DVznrD+XqunmUYY0OXoSLo3qhc2QNrROSegkGs1Vsaktoq30hWfN/uF9fq4 M4/vOtsl9pObVEgA2BTqmLR4IpFCGXT0MuJK3oEqqiVcWC1hR4mbXUOKaPthRWPpU38K p732dB7zTUjo73sYgGH+Qc3idPE4kNd6gWL12sb2VCF4jmSsqTH1M5Ml+0C/Lw9eqCmk pCFgpZfqNIYLTzeGrvYdqNbDgIT4pKf4+E4/JxaNFJEVvvXI5PZyJ3L071DiolSkvQLd mfoA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=V9Drm/MV7fNYWFkfa+VPDdJW5mGgGV/TU42hT3OahWA=; b=B7YIFAcB++MKoB+bfx/8uA0+j8LAwvQdquoeAV829BgiQHQWwY+XumGY6c7aJ+FPrW 78wMtjgje2e575X6RrLAA5iBBF+rcGDy0kP7uyEaTxP5I+X972/Jz/Uur9936z2Rqqjk eYCMI1j3dJPEIGa6pigoHxnEGUUexGrcPuyj6dGtwm5zQZWwopilUmM9tsraDZKWKKCs jT2yRm+kG0TC6gpgh5Swz99mJgDhNJevA4c6FNkvxvnLh8dsrjAhIIkfXod58JQ5/f96 R5fdUW/VhZvpUh3Z9HdfEd6fJVsNRL400eDweO1f4BHVeBxcXJ14VhoTWXe5r+qlHYMR JJUw== X-Gm-Message-State: AKaTC02+aEiALMM5b16MBwi+cx7gNo0gXLgZliEHudktcRIRRFNkaTkHki+IJRrxSRSi0Q== X-Received: by 
10.194.172.100 with SMTP id bb4mr631889wjc.53.1479492473285; Fri, 18 Nov 2016 10:07:53 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id d17sm10069127wjr.14.2016.11.18.10.07.51 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Fri, 18 Nov 2016 10:07:52 -0800 (PST) Date: Fri, 18 Nov 2016 18:07:51 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161118180751.GF5144@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <20161118163754.GB10084@chaz.gmail.com> <300f5e65-f62c-6887-114c-b3f322eec4bd@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <300f5e65-f62c-6887-114c-b3f322eec4bd@cs.ucla.edu> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-18 09:47:50 -0800, Paul Eggert: > Stephane Chazelas wrote: > >$ time grep -Pz '(?-m)^/' ~/a > /dev/null > > It looks like you want "^" to stand for a newline character, not the > start of a line. That is not how grep -z works. -z causes the null > byte to be the line delimiter, and "^" should stand for a position > immediately after a null byte (or at start of file). [...] No, sorry if I wasn't very clear, that's the other way round and it's the whole point of this discussion. 
grep had a bug in that it was calling pcre_exec on the content of each null-delimited record with a regex compiled with PCRE_MULTILINE.

That caused

printf 'a\nb\0' | grep -zP '^b'

to match even though the record doesn't start with a "b".

To work around it, you have to disable the PCRE_MULTILINE flag in the regexp syntax with the (?-m) PCRE operator, or use \A instead of ^.

The problem was /fixed/ (and I'm arguing here it's the wrong fix), by disallowing ^ with -Pz while the obvious fix is to remove that PCRE_MULTILINE flag.

As it turns out PCRE_MULTILINE is there because in the old days, before grep -Pz was supported, with grep -P (without -z), grep would pass more than one line to pcre_exec.

If you look at the grep bug history, 90% of the grep pcre related bugs were caused by that. It was fixed/changed in http://git.savannah.gnu.org/cgit/grep.git/commit/?id=a14685c2833f7c28a427fecfaf146e0a861d94ba but Paolo forgot to remove the PCRE_MULTILINE flag when the code was changed to pass one line at a time to pcre_exec and PCRE_MULTILINE was no longer needed (and it later caused problems when grep -Pz was supported).

> It might be nice to have a syntax for matching a newline byte with
> -z (or a null byte without -z, for that matter). But that would be a
> new feature.

That feature is already there. That's the (?m) PCRE operator. That's the whole point. That m flag (PCRE_MULTILINE) is on by default in GNU grep, and that's what is causing all the problems.

Once you turn it off *by default*, that makes ^ match the beginning of the NUL-delimited record as it should and one can use (?m) if he wants ^ to match the beginning of each line in the NUL-delimited record instead of just the beginning of the record.
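
The effect of the multiline flag described above can be reproduced with other Perl-style engines. What follows is a minimal sketch, assuming Python's `re` module as a stand-in for PCRE, with `re.MULTILINE` playing the role of PCRE_MULTILINE; it is not grep's code, and the function name `anchor_behaviour` and the dictionary keys are hypothetical. Note that Python only offers the scoped spelling `(?-m:...)` for switching the flag off inside a pattern.

```python
import re


def anchor_behaviour(record: str) -> dict:
    """Report which spellings of 'starts with b' match inside one record."""
    patterns = {
        # Multiline forced on by the caller: "^" also matches after "\n".
        "forced_multiline": re.compile(r"^b", re.MULTILINE),
        # \A anchors to the start of the subject regardless of the flag.
        "backslash_A": re.compile(r"\Ab", re.MULTILINE),
        # Switching the flag off again from inside the pattern.
        "scoped_off": re.compile(r"(?-m:^b)", re.MULTILINE),
        # Multiline off by default: "^" matches only at the start of the record.
        "default_off": re.compile(r"^b"),
        # Per-line anchoring stays available on request through (?m).
        "opt_in": re.compile(r"(?m)^b"),
    }
    return {name: rx.search(record) is not None for name, rx in patterns.items()}


# The single record produced by: printf 'a\nb\0'
for name, matched in anchor_behaviour("a\nb").items():
    print(f"{name}: {matched}")
```

Only the two multiline variants without an override match the record "a\nb", which mirrors the argument that leaving the flag off by default loses nothing: per-line anchoring can still be requested with (?m).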
-- Stephane From debbugs-submit-bounces@debbugs.gnu.org Fri Nov 18 18:37:25 2016 Received: (at 22655) by debbugs.gnu.org; 18 Nov 2016 23:37:25 +0000 Received: from localhost ([127.0.0.1]:34295 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7sij-0001G6-J1 for submit@debbugs.gnu.org; Fri, 18 Nov 2016 18:37:25 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:46240) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c7sih-0001Fs-Uc for 22655@debbugs.gnu.org; Fri, 18 Nov 2016 18:37:24 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 0A20D16005D; Fri, 18 Nov 2016 15:37:18 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id xefMwAKDMTFx; Fri, 18 Nov 2016 15:37:17 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 2593F16005F; Fri, 18 Nov 2016 15:37:17 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id lBMETq6qjGSA; Fri, 18 Nov 2016 15:37:17 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 096BB16005D; Fri, 18 Nov 2016 15:37:17 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Fri, 18 Nov 2016 15:37:16 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161118170636.GE5144@chaz.gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

Stephane Chazelas wrote:
> 2016-11-18 08:48:04 -0800, Paul Eggert:
>> Stephane Chazelas wrote:
>>> Why would it make it slower. AFAICT, PCRE_MULTILINE *adds*
>>> some overhead.
>>
>> As I understand it, PCRE_MULTILINE lets 'grep' apply a pattern to an
>> entire buffer that contains many lines, and this lets PCRE
>> efficiently find the first match in the whole buffer. If grep
>> doesn't use PCRE_MULTILINE, grep would have to apply the pattern to
>> each line separately, which could be significantly slower.
> [...]
>
> That might have been the case a long time ago, as I remember
> some discussion about it as it explained some wrong information
> in the documentation, but as far as I and gdb can tell, grep
> 2.26 at least call pcre_exec for every line of the input with
> grep -P.
>

Although that was true starting with commit a14685c2833f7c28a427fecfaf146e0a861d94ba (2010-03-04), it became false starting with commit 9fa500407137f49f6edc3c6b4ee6c7096f0190c5 (2014-09-16).
> If it didn't
>
> echo test | grep -P '\n$'
>
> would match.

No, because grep omits the trailing newline in that particular input. And for this example:

printf 'test\n\n' | grep -P '\n$'

grep passes "test\n" to jit_exec, determines that jit_exec returns a match that crosses a line boundary, and rejects the match.

From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 03:09:22 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 08:09:22 +0000 Received: from localhost ([127.0.0.1]:34393 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c80iA-0000Fa-2k for submit@debbugs.gnu.org; Sat, 19 Nov 2016 03:09:22 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:33752) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c80i7-0000FH-Uj for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 03:09:20 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D36A3160067; Sat, 19 Nov 2016 00:09:13 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id gF5Odz1_F-nV; Sat, 19 Nov 2016 00:09:11 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id E41FB160069; Sat, 19 Nov 2016 00:09:11 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id ZGyUYChgrp8H; Sat, 19 Nov 2016 00:09:11 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id C6B4D160067; Sat, 19 Nov 2016 00:09:11 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails!
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <06bdb13d-1db9-cc7f-cae3-84df9d22cb00@cs.ucla.edu> Date: Sat, 19 Nov 2016 00:09:11 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161118162402.GD5144@chaz.gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) Stephane Chazelas wrote: > Why would it make it slower. AFAICT, PCRE_MULTILINE *adds* > some overhead. After looking into it I now remember why PCRE_MULTILINE speeds things up. See: http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef where using PCRE_MULTILINE sped up 'grep -P "z.*a"' by a factor of 220. 
From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 03:36:20 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 08:36:21 +0000 Received: from localhost ([127.0.0.1]:34413 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c818G-0000yd-Jf for submit@debbugs.gnu.org; Sat, 19 Nov 2016 03:36:20 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35874) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c818F-0000yP-BZ for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 03:36:19 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id A86B2160072; Sat, 19 Nov 2016 00:36:13 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id f8O_sRVIXvl1; Sat, 19 Nov 2016 00:36:12 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 96C5B160074; Sat, 19 Nov 2016 00:36:12 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id pwGoo8sJpt1U; Sat, 19 Nov 2016 00:36:12 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 7578D160072; Sat, 19 Nov 2016 00:36:12 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <20161118163754.GB10084@chaz.gmail.com> <300f5e65-f62c-6887-114c-b3f322eec4bd@cs.ucla.edu> <20161118180751.GF5144@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <5cd0b7fc-c81e-325f-192b-dcac5feebd44@cs.ucla.edu> Date: Sat, 19 Nov 2016 00:36:12 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161118180751.GF5144@chaz.gmail.com> Content-Type: multipart/mixed; boundary="------------1AFFCC3279769BAF29532F19" X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

This is a multi-part message in MIME format.
--------------1AFFCC3279769BAF29532F19 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable

Stephane Chazelas wrote:
> one can
> use (?m) if he wants ^ to match the beginning of each line in
> the NUL-delimited record instead of just the beginning of the
> record.

I think the intent is that ^ and $ should match only the line-terminator specified by -z (or by -z's absence). So the sort of usage you describe is unspecified and not supported. That being said, it does make sense to match tricky regular expressions like that line by line, even if this hurts performance.
Otherwise, I suspect there are even trickier regular expressions that could reject a buffer full of lines even though it contains matching lines. When in doubt we should avoid optimization so I installed the attached patch into the master branch. Please give it a try.

--------------1AFFCC3279769BAF29532F19 Content-Type: text/x-diff; name="0001-grep-Pz-no-longer-rejects.patch" Content-Transfer-Encoding: quoted-printable Content-Disposition: attachment; filename="0001-grep-Pz-no-longer-rejects.patch" From 0e00fe0fc34184b1cdcea92a671eb9ffebb4899b Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sat, 19 Nov 2016 00:25:46 -0800 Subject: [PATCH] grep: -Pz no longer rejects ^, $ Problem reported by Stephane Chazelas (Bug#22655). * NEWS: Document this. * doc/grep.texi (grep Programs): Warn about -Pz. * src/pcresearch.c (reflags): New static var. (multibyte_locale): Remove static var; now local to Pcompile. (Pcompile): Check for (? and (* too. Set reflags instead of dying when problematic operators are found. (Pexecute): Use reflags to decide whether searches should be multiline. * tests/pcre: Test new behavior. --- NEWS | 4 ++++ doc/grep.texi | 4 +++- src/pcresearch.c | 34 +++++++++++++++++++++------------- tests/pcre | 3 ++- 4 files changed, 30 insertions(+), 15 deletions(-) diff --git a/NEWS b/NEWS index b3b5049..a95c875 100644 --- a/NEWS +++ b/NEWS @@ -4,6 +4,10 @@ GNU grep NEWS -*- outline -*- ** Bug fixes + grep -Pz no longer rejects patterns containing ^ and $, and is + more cautious about special patterns like (?-m) and (*FAIL). + [bug introduced in grep-2.23] + grep's use of getprogname no longer causes a build failure on HP-UX. diff --git a/doc/grep.texi b/doc/grep.texi index fcfad42..ac821b4 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -1125,8 +1125,10 @@ expressions), separated by newlines, any of which is to be matched.
@opindex --perl-regexp @cindex matching Perl-compatible regular expressions Interpret the pattern as a Perl-compatible regular expression (PCRE). -This is highly experimental and +This is highly experimental, particularly when combined with the +the @option{-z} (@option{--null-data}) option, and @samp{grep@ -P} may warn of unimplemented features. +@xref{Other Options}. @end table diff --git a/src/pcresearch.c b/src/pcresearch.c index 928c22c..9a13d97 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -32,6 +32,9 @@ enum { NSUB = 300 }; /* Compiled internal form of a Perl regular expression. */ static pcre *cre; +/* PCRE options used to compile the pattern. */ +static int reflags; + /* Additional information about the pattern. */ static pcre_extra *extra; @@ -85,8 +88,6 @@ jit_exec (char const *subject, int search_bytes, int search_offset, /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty string matches when that flag is used. */ static int empty_match[2]; - -static bool multibyte_locale; #endif void @@ -112,18 +113,19 @@ Pcompile (char const *pattern, size_t size) char *n = re; char const *p; char const *pnul; + bool multibyte_locale = 1 < MB_CUR_MAX; - if (1 < MB_CUR_MAX) + if (multibyte_locale) { if (! localeinfo.using_utf8) die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales")); - multibyte_locale = true; flags |= PCRE_UTF8; } - /* FIXME: Remove these restrictions. */ + /* FIXME: Remove this restriction. */ if (memchr (pattern, '\n', size)) die (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern")); + if (!
eolbyte) { bool escaped = false; @@ -133,9 +135,12 @@ Pcompile (char const *pattern, size_t size) escaped = after_unescaped_left_bracket = false; else { - if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket)) - die (EXIT_TROUBLE, 0, - _("unescaped ^ or $ not supported with -Pz")); + if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket) + || (*p == '(' && (p[1] == '?' || p[1] == '*'))) + { + flags = (flags & ~ PCRE_MULTILINE) | PCRE_DOLLAR_ENDONLY; + break; + } escaped = *p == '\\'; after_unescaped_left_bracket = *p == '['; } @@ -217,12 +222,15 @@ Pexecute (char *buf, size_t size, size_t *match_size, error. */ char const *subject = buf; - /* If the input is unibyte or is free of encoding errors a multiline search is + /* If the pattern has no problematic operators and the input is + unibyte or is free of encoding errors, a multiline search is typically more efficient. Otherwise, a single-line search is - typically faster, so that pcre_exec doesn't waste time validating - the entire input buffer. */ - bool multiline = true; - if (multibyte_locale) + either less confusing because the problematic operators are + interpreted more naturally, or it is typically faster because + pcre_exec doesn't waste time validating the entire input + buffer. */ + bool multiline = (reflags & PCRE_MULTILINE) != 0; + if (multiline && (reflags & PCRE_UTF8) != 0) { multiline = !
buf_has_encoding_errors (buf, size - 1); buf[size - 1] = eolbyte; diff --git a/tests/pcre b/tests/pcre index 8f3d9a4..653ef22 100755 --- a/tests/pcre +++ b/tests/pcre @@ -13,8 +13,9 @@ require_pcre_ fail=0 echo | grep -P '\s*$' || fail=1 -echo | returns_ 2 grep -zP '\s$' || fail=1 +echo | grep -zP '\s$' || fail=1 echo '.ab' | returns_ 1 grep -Pwx ab || fail=1 echo x | grep -Pz '[^a]' || fail=1 +printf 'x\n\0' | returns_ 1 grep -zP 'x$' || fail=1 Exit $fail -- 2.7.4 --------------1AFFCC3279769BAF29532F19-- From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 04:22:14 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 09:22:14 +0000 Received: from localhost ([127.0.0.1]:34425 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c81qf-00029y-RP for submit@debbugs.gnu.org; Sat, 19 Nov 2016 04:22:14 -0500 Received: from thorn.bewilderbeest.net ([71.19.156.171]:57181) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c81qe-00029o-17 for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 04:22:12 -0500 Received: from hatter.bewilderbeest.net (hatter.bewilderbeest.net [IPv6:2001:470:c3f4:1::1:1]) (using TLSv1.2 with cipher AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) (Authenticated sender: zev) by thorn.bewilderbeest.net (Postfix) with ESMTPSA id 024008049A; Sat, 19 Nov 2016 01:22:09 -0800 (PST) DKIM-Filter: OpenDKIM Filter v2.10.3 thorn.bewilderbeest.net 024008049A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bewilderbeest.net; s=thorn; t=1479547330; bh=4xvnHdFPcTFGkSbR3dM+hV4pVVfN71+NSktqdOvcqE4=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=TmHOdcF7/e7tzyMbI+yTJzpoQ3GJC3LeHX699p3DMnzToAYStkCw+bvxN75hcgnRe k2hR0jEL8P6WjVFJMaOewHhulp9oEDutb+Di4UK8Go76VLya4oh3u0cpRtUaW9/GF5 /H76TJI4LG3PWYLv7tdb3je6zsR14ykR5gI7iNtE= Date: Sat, 19 Nov 2016 03:22:07 -0600 From: Zev Weiss To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails!
Message-ID: <20161119092207.uqh2tcn22gmaltsp@hatter.bewilderbeest.net> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <20161118163754.GB10084@chaz.gmail.com> <300f5e65-f62c-6887-114c-b3f322eec4bd@cs.ucla.edu> <20161118180751.GF5144@chaz.gmail.com> <5cd0b7fc-c81e-325f-192b-dcac5feebd44@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii; format=flowed Content-Disposition: inline In-Reply-To: <5cd0b7fc-c81e-325f-192b-dcac5feebd44@cs.ucla.edu> User-Agent: NeoMutt/20161014 (1.7.1) X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org, Stephane Chazelas X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) On Sat, Nov 19, 2016 at 12:36:12AM -0800, Paul Eggert wrote: >Stephane Chazelas wrote: >>one can >>use (?m) if he wants ^ to match the beginning of each line in >>the NUL-delimited record instead of just the beginning of the >>record. > >I think the intent is that ^ and $ should match only the >line-terminator specified by -z (or by -z's absence). So the sort of >usage you describe is unspecified and not supported. That being said, >it does make sense to match tricky regular expressions like that line >by line, even if this hurts performance. Otherwise, I suspect there >are even trickier regular expressions that could reject a buffer full >of lines even though it contains matching lines. When in doubt we >should avoid optimization so I installed the attached patch into the >master branch. Please give it a try. 
>>From 0e00fe0fc34184b1cdcea92a671eb9ffebb4899b Mon Sep 17 00:00:00 2001 >From: Paul Eggert >Date: Sat, 19 Nov 2016 00:25:46 -0800 >Subject: [PATCH] grep: -Pz no longer rejects ^, $ > >Problem reported by Stephane Chazelas (Bug#22655). >* NEWS: Document this. >* doc/grep.texi (grep Programs): Warn about -Pz. >* src/pcresearch.c (reflags): New static var. >(multibyte_locale): Remove static var; now local to Pcompile. >(Pcompile): Check for (? and (* too. Set reflags instead of >dying when problematic operators are found. >(Pexecute): Use reflags to decide whether searches should >be multiline. >* tests/pcre: Test new behavior. I'm a bit confused by this patch -- I see 'reflags' being tested in Pexecute(), but I don't see it getting set anywhere, just Pcompile()'s local 'flags'...I'm guessing 'flags' was supposed to be replaced by 'reflags'? (Not entirely certain though.) Zev From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 04:37:29 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 09:37:29 +0000 Received: from localhost ([127.0.0.1]:34429 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c825R-0002WA-7O for submit@debbugs.gnu.org; Sat, 19 Nov 2016 04:37:29 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:39298) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c825P-0002Vx-GU for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 04:37:28 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 604BA16007A; Sat, 19 Nov 2016 01:37:21 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id Um6GPQyXFLt6; Sat, 19 Nov 2016 01:37:20 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1B82316007C; Sat, 19 Nov 2016 01:37:20 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu 
([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id Ks5O1uh-cLMx; Sat, 19 Nov 2016 01:37:19 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id CDBFD16007A; Sat, 19 Nov 2016 01:37:19 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! To: Zev Weiss References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <20161118163754.GB10084@chaz.gmail.com> <300f5e65-f62c-6887-114c-b3f322eec4bd@cs.ucla.edu> <20161118180751.GF5144@chaz.gmail.com> <5cd0b7fc-c81e-325f-192b-dcac5feebd44@cs.ucla.edu> <20161119092207.uqh2tcn22gmaltsp@hatter.bewilderbeest.net> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <21c7cbdc-839e-9ed7-8a79-d18f38c7f480@cs.ucla.edu> Date: Sat, 19 Nov 2016 01:37:18 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161119092207.uqh2tcn22gmaltsp@hatter.bewilderbeest.net> Content-Type: multipart/mixed; boundary="------------926B300E36F878235BB48D72" X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org, Stephane Chazelas X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) This is a multi-part message in MIME format. --------------926B300E36F878235BB48D72 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Zev Weiss wrote: > I see 'reflags' being tested in Pexecute(), but I don't see it getting set anywhere Oops, that somehow got lost during merging. Thanks for catching that. 
Omitting the initialization hurt performance due to a failure to use PCRE_MULTILINE but did not cause a correctness bug, so the tests didn't catch it. Fixed with the attached patch.

--------------926B300E36F878235BB48D72 Content-Type: text/x-diff; name="0001-grep-fix-performance-typo-with-P.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-grep-fix-performance-typo-with-P.patch" From 5b01c4c11cca603c165c3166d2a7566a8505074c Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Sat, 19 Nov 2016 01:33:48 -0800 Subject: [PATCH] grep: fix performance typo with -P Reported by Zev Weiss in: http://bugs.gnu.org/22655#88 * src/pcresearch.c (Pcompile): Initialize reflags. --- src/pcresearch.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/pcresearch.c b/src/pcresearch.c index 9a13d97..439945a 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -176,6 +176,7 @@ Pcompile (char const *pattern, size_t size) if (match_lines) strcpy (n, xsuffix); + reflags = flags; cre = pcre_compile (re, flags, &ep, &e, pcre_maketables ()); if (!cre) die (EXIT_TROUBLE, 0, "%s", ep); -- 2.7.4 --------------926B300E36F878235BB48D72-- From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 05:41:37 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 10:41:37 +0000 Received: from localhost ([127.0.0.1]:34472 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c835V-0004AN-Jr for submit@debbugs.gnu.org; Sat, 19 Nov 2016 05:41:37 -0500 Received: from mail-wm0-f45.google.com ([74.125.82.45]:36362) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c835T-0004A6-DS for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 05:41:36 -0500 Received: by mail-wm0-f45.google.com with SMTP id g23so73668858wme.1 for <22655@debbugs.gnu.org>; Sat, 19 Nov 2016 02:41:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version
:content-disposition:in-reply-to:user-agent; bh=qfqvevlsTOafhP/nYKf/WuU+0Gpaa43z442gXUJStVo=; b=zWtTA9jJlji/6u/LDo6omFk76hy9y8aXRyA5YWRhIXSjUze0mVG2vRzPuK+ClUZcS3 +IFpQn4XQOIRc4ifbPrMs/w3O0JUycyM+tX5mBmGubfiIbMblGTK7ujaqNLXDU8vuiqz jpRxdxmNi1hlDMNx0vY8gaz0DdRjV1yerkdnUJF21DC/U2m1fxFNxg6xWHJwPFT6BoVB yupOtL1GuHukUkWMlm2AwXTbA6yKMOrLlbaaH/kaK24sKkROI2+6ilzXfu/JGyF04/Tt Gd8Cp4X0mwboBo8lOj+timYxYSIrt5GXVI3YT83uEHfcZuLJlLt5YIgjcv1V/iQtUkYW T+9A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=qfqvevlsTOafhP/nYKf/WuU+0Gpaa43z442gXUJStVo=; b=S1RxaOs19gdut/aPoInamR6m/nF5FE7kZg57AGXBMJrHQEpo8yQmN3JF9hHYDOCQv4 8TBwAHUO7RX5He+lV1RhoSAoiFP1+YcyTLY26CjSbmNpdD9m8PWvG9avTI7upSdjbqOX pNj6bwUdLiiNXZe/XOARMiXtgOUwpZC7GjNyN5+cDv7HxM/2EPnRj6kC4cH4n5LulHUJ gwyp1jjj1y095QFh0dslX5H+fW+ijMreM+GFDwcOWbqSDaekHY+qGr1ZJNun9ptVrJsQ be7F6vRoLuRqE0pCQCo5kMV9e+hlihgp6O71Xv4uHB3adhO+BTCQyHMmW0iMNfBktqW6 om3Q== X-Gm-Message-State: AKaTC00zyCqFTg0JMDMfnpl+9+LTC+2Jsgx1nA0xWgswfJ7cIPV0qG7LvDeHWfEtk7e4SA== X-Received: by 10.194.14.105 with SMTP id o9mr2545900wjc.66.1479552089584; Sat, 19 Nov 2016 02:41:29 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id v202sm8107248wmv.8.2016.11.19.02.41.28 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sat, 19 Nov 2016 02:41:28 -0800 (PST) Date: Sat, 19 Nov 2016 10:41:27 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! 
Message-ID: <20161119104127.GA4980@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-18 15:37:16 -0800, Paul Eggert: [...] > >That might have been the case a long time ago, as I remember > >some discussion about it as it explained some wrong information > >in the documentation, but as far as I and gdb can tell, grep > >2.26 at least call pcre_exec for every line of the input with > >grep -P. > > > > Although that was true starting with commit > a14685c2833f7c28a427fecfaf146e0a861d94ba (2010-03-04), it became > false starting with commit 9fa500407137f49f6edc3c6b4ee6c7096f0190c5 > (2014-09-16). [...] OK, it looks like I don't have the full story, and my multiple calls to pcre_exec() seems to point to something else: $ seq 10 | ltrace -e '*pcre*' ./src/grep -P . 
grep->pcre_maketables(0x221e2f0, 0x221e240, 1, 2) = 0x221e310 grep->pcre_compile(0x221e2f0, 2050, 0x7ffe943ec6f8, 0x7ffe943ec6f4) = 0x221e760 grep->pcre_study(0x221e760, 1, 0x7ffe943ec6f8, 0x7ffe943eb490) = 0x221e7b0 grep->pcre_fullinfo(0x221e760, 0x221e7b0, 16, 0x7ffe943ec6f4) = 0 grep->pcre_exec(0x221e760, 0x221e7b0, "", 0, 0, 128, 0x7ffe943ec700, 300) = -1 grep->pcre_exec(0x221e760, 0x221e7b0, "", 0, 0, 0, 0x7ffe943ec700, 300) = -1 grep->pcre_exec(0x221e760, 0x221e7b0, "1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n", 20, 0, 8192, 0x7ffe943ec4e0, 300) = 1 1 grep->pcre_exec(0x221e760, 0x221e7b0, "2\n3\n4\n5\n6\n7\n8\n9\n10\n", 18, 0, 8192, 0x7ffe943ec4e0, 300) = 1 2 grep->pcre_exec(0x221e760, 0x221e7b0, "3\n4\n5\n6\n7\n8\n9\n10\n", 16, 0, 8192, 0x7ffe943ec4e0, 300) = 1 3 grep->pcre_exec(0x221e760, 0x221e7b0, "4\n5\n6\n7\n8\n9\n10\n", 14, 0, 8192, 0x7ffe943ec4e0, 300) = 1 4 grep->pcre_exec(0x221e760, 0x221e7b0, "5\n6\n7\n8\n9\n10\n", 12, 0, 8192, 0x7ffe943ec4e0, 300) = 1 5 grep->pcre_exec(0x221e760, 0x221e7b0, "6\n7\n8\n9\n10\n", 10, 0, 8192, 0x7ffe943ec4e0, 300) = 1 6 grep->pcre_exec(0x221e760, 0x221e7b0, "7\n8\n9\n10\n", 8, 0, 8192, 0x7ffe943ec4e0, 300) = 1 7 grep->pcre_exec(0x221e760, 0x221e7b0, "8\n9\n10\n", 6, 0, 8192, 0x7ffe943ec4e0, 300) = 1 8 grep->pcre_exec(0x221e760, 0x221e7b0, "9\n10\n", 4, 0, 8192, 0x7ffe943ec4e0, 300) = 1 9 grep->pcre_exec(0x221e760, 0x221e7b0, "10\n", 2, 0, 8192, 0x7ffe943ec4e0, 300) = 1 10 +++ exited (status 0) +++ I don't know the details of why it's done that way, but I'm not sure I can see how calling pcre_exec that way can be quicker than calling it on each individual line/record. 
Note that this is still wrong: $ printf 'a\nb\0' | ./src/grep -zxP a a b Removing PCRE_MULTILINE (and get back to calling pcre_exec on every record separately) would help except in the cases where the user does: grep -xzP '(?m)a' You'd want to change: static char const xprefix[] = "^(?:"; static char const xsuffix[] = ")$"; To: static char const xprefix[] = "\A(?:"; static char const xsuffix[] = ")\z"; -- Stephane From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 06:22:39 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 11:22:39 +0000 Received: from localhost ([127.0.0.1]:34488 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c83jD-0005AY-Bt for submit@debbugs.gnu.org; Sat, 19 Nov 2016 06:22:39 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:44780) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c83jB-0005AK-PB for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 06:22:38 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 71DED16007D; Sat, 19 Nov 2016 03:22:31 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 8KtFmJTnoDeK; Sat, 19 Nov 2016 03:22:30 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 00BA916007E; Sat, 19 Nov 2016 03:22:29 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id XYWMNebeCsSO; Sat, 19 Nov 2016 03:22:24 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 67F3316007D; Sat, 19 Nov 2016 03:22:24 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas
References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com>
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu>
Date: Sat, 19 Nov 2016 03:22:23 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <20161119104127.GA4980@chaz.gmail.com>
Content-Type: multipart/mixed; boundary="------------D4269EEDB97837326214E9E1"
X-Spam-Score: -2.9 (--)
X-Debbugs-Envelope-To: 22655
Cc: 22655@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -2.9 (--)

This is a multi-part message in MIME format.
--------------D4269EEDB97837326214E9E1
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Stephane Chazelas wrote:

> I don't know the details of why it's done that way, but I'm not
> sure I can see how calling pcre_exec that way can be quicker
> than calling it on each individual line/record.

It can be hundreds of times faster in common cases. See:

http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef

> Note that this is still wrong:
>
> $ printf 'a\nb\0' | ./src/grep -zxP a
> a
> b

Thanks, fixed by installing the attached.

> Removing PCRE_MULTILINE (and get back to calling pcre_exec on
> every record separately) would help except in the cases where the
> user does:
>
> grep -xzP '(?m)a'

I don't think grep can address this problem, as in general that would
require interpreting the PCRE pattern at run-time and grep should not be
delving into PCRE internals. Uses of (?m) lead to unspecified behavior in
grep, and applications should not rely on any particular behavior in this
area. This is firmly in the Perl tradition, as the Perl documentation for
this part of the regular expression syntax says "The stability of these
extensions varies widely. Some ... are experimental and may change without
warning or be completely removed." Also, the grep manual says that -P "is
highly experimental". User beware, that's all.

--------------D4269EEDB97837326214E9E1
Content-Type: text/x-diff; name="0001-grep-fix-zxP-bug.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="0001-grep-fix-zxP-bug.patch"

From 882e652c8988ef9380d043ecfca96953e6c30009 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Sat, 19 Nov 2016 03:12:56 -0800
Subject: [PATCH] grep: fix -zxP bug

* NEWS: Document this.
* src/pcresearch.c (Pcompile): Search a line at a time if -x is used,
since -x uses ^ and $.
* tests/pcre: Test this.
---
 NEWS             |  6 +++---
 src/pcresearch.c | 34 ++++++++++++++++++++--------------
 tests/pcre       |  1 +
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/NEWS b/NEWS
index 978ec55..4972c01 100644
--- a/NEWS
+++ b/NEWS
@@ -10,9 +10,9 @@ GNU grep NEWS -*- outline -*-
   >/dev/null" where PROGRAM dies when writing into a broken pipe.
   [bug introduced in grep-2.26]
 
-  grep -Pz no longer rejects patterns containing ^ and $, and is
-  more cautious about special patterns like (?-m) and (*FAIL).
-  [bug introduced in grep-2.23]
+  grep -Pz no longer rejects patterns containing ^ and $, is more
+  cautious about special patterns like (?-m) and (*FAIL), and works
+  when combined with -x. [bug introduced in grep-2.23]
 
   grep -m0 -L PAT FILE now outputs "FILE". [bug introduced in grep-2.5]
 
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 439945a..01616c2 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -128,22 +128,28 @@ Pcompile (char const *pattern, size_t size)
 
   if (! eolbyte)
     {
-      bool escaped = false;
-      bool after_unescaped_left_bracket = false;
-      for (p = pattern; *p; p++)
-        if (escaped)
-          escaped = after_unescaped_left_bracket = false;
-        else
-          {
-            if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket)
-                || (*p == '(' && (p[1] == '?' || p[1] == '*')))
+      bool line_at_a_time = match_lines;
+      if (! line_at_a_time)
+        {
+          bool escaped = false;
+          bool after_unescaped_left_bracket = false;
+          for (p = pattern; *p; p++)
+            if (escaped)
+              escaped = after_unescaped_left_bracket = false;
+            else
               {
-                flags = (flags & ~ PCRE_MULTILINE) | PCRE_DOLLAR_ENDONLY;
-                break;
+                if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket)
+                    || (*p == '(' && (p[1] == '?'
|| p[1] == '*')))
+                  {
+                    line_at_a_time = true;
+                    break;
+                  }
+                escaped = *p == '\\';
+                after_unescaped_left_bracket = *p == '[';
               }
-          escaped = *p == '\\';
-          after_unescaped_left_bracket = *p == '[';
-        }
+        }
+      if (line_at_a_time)
+        flags = (flags & ~ PCRE_MULTILINE) | PCRE_DOLLAR_ENDONLY;
     }
 
   *n = '\0';
diff --git a/tests/pcre b/tests/pcre
index 653ef22..a290099 100755
--- a/tests/pcre
+++ b/tests/pcre
@@ -17,5 +17,6 @@ echo | grep -zP '\s$' || fail=1
 echo '.ab' | returns_ 1 grep -Pwx ab || fail=1
 echo x | grep -Pz '[^a]' || fail=1
 printf 'x\n\0' | returns_ 1 grep -zP 'x$' || fail=1
+printf 'a\nb\0' | grep -zxP a && fail=1
 
 Exit $fail
-- 
2.7.4

--------------D4269EEDB97837326214E9E1--

From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 09:38:24 2016
Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 14:38:24 +0000
Received: from localhost ([127.0.0.1]:34573 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c86md-0003Ua-Ob for submit@debbugs.gnu.org; Sat, 19 Nov 2016 09:38:24 -0500
Received: from mail-oi0-f46.google.com ([209.85.218.46]:33616) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c86mb-0003UI-2r for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 09:38:21 -0500
Received: by mail-oi0-f46.google.com with SMTP id 128so118206920oih.0 for <22655@debbugs.gnu.org>; Sat, 19 Nov 2016 06:38:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=aaroncrane-co-uk.20150623.gappssmtp.com; s=20150623; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-transfer-encoding; bh=JVUy4fq2kqDwtSRtdRD0qrDOXF8TUI8LCz/KqSJDqh4=; b=nGTdHygUqWHLw/41FavfOXfXR6/lnIf8M3kkvdfjcUG/3OXEXvYIw7ETyKEHCtPjoZ yYJbEoAlct20UKgHMrfCbJJ65uLiRln4AQosHhiR4Xtc+hTWaXx/cAH/eQwi1ZiV2ZWZ yEF0102Q8TCI1uHvJ8F1FMOU7X+G3b4S24oxHhDdGsiTnNcBF3TGyPyV/XaycfqDZdcO AhI1/Yv5521++fKQBGv4hWaezZWkoSjls4lH0Jr6YD96ow1yQ6/9hqQAMyhMc0klSGa0
5l+BdgWOaxZZH5SmMEGjCp7dGNrVHYjcyGFhSmyfmwFl4QQPOIOTKiUk1EBWetrqJNqH Astg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-transfer-encoding; bh=JVUy4fq2kqDwtSRtdRD0qrDOXF8TUI8LCz/KqSJDqh4=; b=Sl4s1MbZI8at3cscXeYRE7k/SgxqbOBRHYsD4RXUZOs/WaAKEQ1CsBdrWaEMrrqBzy D9aX5Pj47Sa31ytpMCWC3EIUXpldVaQC/YX4lQh1AZ/AolOjsxSAahR21wcmL4m/C7s/ WrovSZW/wSCWoP7lqQpXXb3zLTNYqf0gntb/7nDAiiHZIeLFVUZZzl88wNkWFC8stJP8 X8mDAAMELCeY21YX313H4drY3vCAgdePce2H5qOrxvSqCMBhN2sP6mqQModyw2XcIfb0 plTY2JZ75JsA+/b3kqoe1FtRRdFaSOgTV0E9aeIFbEvljLR0VD9t1HdJXsZAskNCXep6 3aJA== X-Gm-Message-State: AKaTC02QD+xEjFYhtqBSOI2YNBuzV6o+lyqJrAmERbZ0NTrlBxXu1+3zmryTBOxPog/XphrtzBHEijH3K3vM3g== X-Received: by 10.157.31.23 with SMTP id x23mr2950606otd.175.1479566295299; Sat, 19 Nov 2016 06:38:15 -0800 (PST) MIME-Version: 1.0 Received: by 10.157.52.187 with HTTP; Sat, 19 Nov 2016 06:37:59 -0800 (PST) X-Originating-IP: [86.142.216.89] In-Reply-To: <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> From: Aaron Crane Date: Sat, 19 Nov 2016 14:37:59 +0000 X-Google-Sender-Auth: aNB1s4lsu1zgFInmctrWa_RbHWc Message-ID: Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Paul Eggert
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 22655
Cc: 22655@debbugs.gnu.org, Stephane Chazelas
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.7 (/)

Paul Eggert wrote:
> Stephane Chazelas wrote:
>> Removing PCRE_MULTILINE (and get back to calling pcre_exec on
>> every record separately) would help except in the cases where the
>> user does:
>>
>> grep -xzP '(?m)a'
>
> I don't think grep can address this problem, as in general that would
> require interpreting the PCRE pattern at run-time and grep should not be
> delving into PCRE internals. Uses of (?m) lead to unspecified behavior in
> grep, and applications should not rely on any particular behavior in this
> area. This is firmly in the Perl tradition, as the Perl documentation for
> this part of the regular expression syntax says "The stability of these
> extensions varies widely. Some ... are experimental and may change without
> warning or be completely removed." Also, the grep manual says that -P "is
> highly experimental". User beware, that's all.

I believe the sense of "stability" in that line of the Perl
documentation is "expected continued availability of that feature in
future versions of Perl". But most of the constructs covered by that
caveat have been unchanged (except for bug fixes) for well over a
decade; and some of them, including (?m) and friends, date to Perl
5.000, released in 1994.
I can therefore say with a high degree of confidence that they
aren't going to be removed or changed in future versions of Perl 5,
and I've just removed that outdated notice from the Perl
documentation:

https://perl5.git.perl.org/perl.git/commitdiff/ff8bb4687895e07f822f5227d573c967aa0a4524

(That change should be part of the next stable release of Perl, namely
5.26, expected in May 2017.)

Beyond that, I'm not sure it's ideal to use the Perl documentation as
a comprehensive guide to PCRE: the two implementations are
independent, with slightly differing sets of features, and with known
differences in behaviour on edge cases even for features they share.
So even if Perl did consider constructs like (?m) to be "unstable" or
"unreliable", PCRE's authors and maintainers might take a different
view.

That said, I do accept that there are limits to what grep can
reasonably do here (I certainly wouldn't advocate that it try to
delve into PCRE internals!) and in particular, I recognise that -P is
documented as experimental.

--=20 Aaron Crane ** http://aaroncrane.co.uk/ From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 11:14:40 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 16:14:40 +0000 Received: from localhost ([127.0.0.1]:35283 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c88Ho-0006Ed-FV for submit@debbugs.gnu.org; Sat, 19 Nov 2016 11:14:40 -0500 Received: from mail-wm0-f65.google.com ([74.125.82.65]:33011) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c88Hl-0006EM-Rd for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 11:14:38 -0500 Received: by mail-wm0-f65.google.com with SMTP id u144so14100512wmu.0 for <22655@debbugs.gnu.org>; Sat, 19 Nov 2016 08:14:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=dr/V/WqCz2RfboAaafAAEIX0W5Kaa7LHWoEnpHW31QM=; b=Edvs0bW2I25dM6uV9EirpdeARGWfL2UdfMioUEbqCuCDCRjpWaM+vqC2+VOim3/kCl fCHHbX9yMy7QaG7+MjjGlfUq2EU6CQTLINjM6sGyknNBgpgybm7MfVkP8ly4P3DgQYJl TyGq/m+YDvjqNU+fL/CFf0hKWazElo8eTF2Rw66BSkSkwNkRad4yz4AFMYO+8rW3fmOK aKiSUKyWo6uTHb0wqgbfmV1AZGeZRpewoEuTVExuBK5YWjzHIwh+9gkL0URYZog2QtgA VP8zShf9ZLnEvQiXUcXUnjnmJXntjjJcgA2nJcqUjGVRFdTZQ2aYTR4jplcPNgstMYwb ra0Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=dr/V/WqCz2RfboAaafAAEIX0W5Kaa7LHWoEnpHW31QM=; b=FAMQGXYRQhMecqgGABGXg4GBKqAxdWmGVVbRR4pCeamGpnNtLqX4MYKs/owzSee9Xj IafmPdpzJC0Cad+740CyIFK1dbmtIUIRMdo4JDps/uDwWZlPmDjXlMOjp8zRHXPohJre Ng+1pjh4xiUqunT/WCiDQxfJXI3jBEXhjPxj92jPR0ii64lvR1Gx03eI1lY6nSSFf+jQ 8QYAMt8Xz10W7TwCoUyfR03ZE9n31lNbJbBXUKWYcmUunZnt8uNiRr4kKkK1QfsYH4Sj S1GIfAAp++Ptg+wct+Yz1Ovy3Bf24c555/QxhzWrMAdSgda17K4NAxjzJdf9/Qg6Z7am RF/Q== X-Gm-Message-State: 
AKaTC03yHakLMAAaBfpABzDfSyO0iwAq9MwsnBtcLIJvqO7kRjnVreJe6+s0vSNUHqBebQ== X-Received: by 10.28.74.133 with SMTP id n5mr3882696wmi.132.1479572072028; Sat, 19 Nov 2016 08:14:32 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id d85sm9458811wmd.17.2016.11.19.08.14.29 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sat, 19 Nov 2016 08:14:31 -0800 (PST) Date: Sat, 19 Nov 2016 16:14:28 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161119161428.GB4980@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-19 03:22:23 -0800, Paul Eggert: > Stephane Chazelas wrote: > > >I don't know the details of why it's done that way, but I'm not > >sure I can see how calling pcre_exec that way can be quicker > >than calling it on each individual line/record. > > It can be hundreds of times faster in common cases. 
> See:
>
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef

On that same sample, compared with pcregrep (an implementation which,
IMO, does the right thing and does what grep is documented to do,
namely match each line against the regexp passed as argument, without
PCRE_MULTILINE), I don't find a x220 factor, more like a x2.5 factor:

$ time pcregrep "z.*a" k
pcregrep "z.*a" k  1.00s user 0.05s system 99% cpu 1.047 total
$ time ./grep -P "z.*a" k
./grep -P "z.*a" k  0.41s user 0.05s system 99% cpu 0.457 total

On the other hand, if you change the pattern to "z[^+]*a", pcregrep
still takes about one second, but GNU grep takes a lot longer (I gave
up after a few minutes).

With a simpler example:

$ time seq 10000 | ./grep -P '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.002 total
./grep -P '[^a]*1000[^a]*2000'  1.99s user 0.00s system 99% cpu 1.991 total
$ time seq 10000 | pcregrep '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.001 total
pcregrep '[^a]*1000[^a]*2000'  0.00s user 0.00s system 0% cpu 0.002 total

That's at least a x1000 factor.

If I understand correctly, you're calling pcre_exec on several lines
and with PCRE_MULTILINE as an optimisation in the cases where the RE
is unlikely to match, so one call to pcre_exec can go through many
lines at a time, and if it doesn't match, that's as many lines you
can safely skip.

The example above is one that shows it's an invalid optimisation.
seq 10000 obviously has no line that matches that pattern, but
because GNU grep tries to match it on a string that contains
multiple lines, it will match in several places, and you need to
call pcre_exec again to double-check each of those false positives,
which is counter-productive.

AFAICT, that optimisation only brings a little optimisation in
some cases (and reduces performance in others) but also reduces
functionality and breaks users' expectations. IMO, it's not
worth it.

[...]
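
The false-positive effect described above can be reproduced in miniature. This is only a sketch, assuming POSIX <regex.h> as a stand-in for libpcre (grep -P itself goes through pcre_exec); the helper names matches and count_matching_lines are made up for illustration. Compiled without REG_NEWLINE, a bracket expression such as [^a] also matches a newline, much as it does in PCRE, so a buffer holding several lines can match even though no single line does:

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return 1 if the ERE `pattern` matches anywhere in
   `text`, 0 if it does not, and -1 if the pattern fails to compile. */
static int matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int found = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return found;
}

/* Hypothetical helper: count the lines of `text` that match when each
   line is handed to the matcher on its own. Returns -1 on allocation
   failure. */
static int count_matching_lines(const char *pattern, const char *text)
{
    int count = 0;
    char *copy = malloc(strlen(text) + 1);
    if (!copy)
        return -1;
    strcpy(copy, text);
    for (char *line = copy; *line; ) {
        char *nl = strchr(line, '\n');
        if (nl)
            *nl = '\0';             /* terminate the current line */
        if (matches(pattern, line) == 1)
            count++;
        if (!nl)
            break;                  /* last line had no trailing newline */
        line = nl + 1;
    }
    free(copy);
    return count;
}
```

For the text "999\n1000\n1001\n2000\n" and the pattern [^a]*1000[^a]*2000, matches() reports a hit on the whole buffer while count_matching_lines() finds no matching line, so every buffer-level hit would have to be rechecked line by line.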
> >grep -xzP '(?m)a'
>
> I don't think grep can address this problem, as in general that
> would require interpreting the PCRE pattern at run-time and grep
> should not be delving into PCRE internals. Uses of (?m) lead to
> unspecified behavior in grep, and applications should not rely on
> any particular behavior in this area.

I agree, grep should not try and outsmart libpcre, it should just
call it on every line (or NUL delimited record with -z), so that
users can expect the regexp they pass to grep to be matched against
those lines/records.

> This is firmly in the Perl
> tradition, as the Perl documentation for this part of the regular
> expression syntax says "The stability of these extensions varies
> widely. Some ... are experimental and may change without warning or
> be completely removed."

No, those s, m, i... flags are at the core of perl regexp
matching, they are certainly not experimental. Most of the ones
that are "experimental" in perl are not available in PCRE.

> Also, the grep manual says that -P "is
> highly experimental". User beware, that's all.

It's highly experimental because grep is not using PCRE the
straightforward way. The simple "grep -P re" could be
implemented in a few lines of code (the equivalent of a
fgets+pcre_exec loop).

PCRE is the modern de-facto regular expression standard. Most
modern languages have regexps that are more or less compatible
with those (when they don't link to libpcre). If you look at the
trends on stackoverflow.com or unix.stackexchange.com, you'll
notice that nowadays, GNU grep is more often called with -P than
not. I don't think you can just dismiss it as "not GNU grep's
core functionality".
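
As a rough sketch of such a record-at-a-time loop, assuming POSIX <regex.h> in place of pcre_exec and an in-memory buffer in place of fgets (count_matching_records is a hypothetical helper, not grep code): each record is handed to the matcher on its own, so ^ and $ can only refer to the record boundaries, whether records end in a newline or, as with -z, in NUL.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the records in buf[0..len) that match the
   ERE `pattern`. `delim` is '\n' for ordinary lines or '\0' for
   NUL-terminated records. Returns -1 if the pattern does not compile or
   memory runs out. A record containing a NUL byte is only examined up
   to that byte, because regexec works on C strings. */
static int count_matching_records(const char *pattern, const char *buf,
                                  size_t len, char delim)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    int count = 0;
    size_t start = 0;
    while (start < len) {
        /* Find the end of the current record. */
        const char *end = memchr(buf + start, delim, len - start);
        size_t reclen = end ? (size_t)(end - (buf + start)) : len - start;

        /* Copy the record so the matcher sees nothing beyond it. */
        char *rec = malloc(reclen + 1);
        if (!rec) {
            regfree(&re);
            return -1;
        }
        memcpy(rec, buf + start, reclen);
        rec[reclen] = '\0';

        if (regexec(&re, rec, 0, NULL, 0) == 0)
            count++;
        free(rec);

        start += reclen + 1;        /* skip past the delimiter */
    }
    regfree(&re);
    return count;
}
```

With the pattern ^(a)$, the single NUL-terminated record a\nb does not match, whereas the same bytes split on newlines yield one matching line.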
-- Stephane From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 11:45:38 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 16:45:38 +0000 Received: from localhost ([127.0.0.1]:35292 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c88lm-000756-Gz for submit@debbugs.gnu.org; Sat, 19 Nov 2016 11:45:38 -0500 Received: from mail-wm0-f67.google.com ([74.125.82.67]:35892) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c88ll-00074o-8r for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 11:45:37 -0500 Received: by mail-wm0-f67.google.com with SMTP id m203so14252078wma.3 for <22655@debbugs.gnu.org>; Sat, 19 Nov 2016 08:45:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=1CArYfr4oECA3qHU7V8MpfcppD2vYii+D3xVX/+1j/k=; b=ndw+RjKJYQDWy9tjSRfUJbFcz9IgEvWvJAp8Eu1w3+g1I61wYwJjWoQbctTproDr0P JEYMtXMfoSJd2MucX94glw/2Pl8lSeA2wArYrGkG2Y2GXjXpOf6vOP2yIIUK4xCTCdGm Jbg9/4tBI8NdvEwUvkc1XW7eStxkza45erpmcWHsnRR24OUZHYs20e1OSRayg9ivERrN LoYpnh4SoG3tlaLaAcy3QP8c7740CznUjFSPm/Efs+vnqEh4emR8TwJgC4OJ+rDfgOgU PIeYvirp9fbCVUf3KeRTtgRkXCAWPZK7M9cm7xUDKjwdjNEBEKzL/YjK+SzU30at5qef zspw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=1CArYfr4oECA3qHU7V8MpfcppD2vYii+D3xVX/+1j/k=; b=A6QE3ZHm9wQ/k5d5pF79fmIAiI7DzvXGOjtpROiRaOh5hpcIlQKW6IKfSHX5zS+Icp 4qHG9RhZnl/E5fTctyoE7DUu48Pfs1mqxCGBrMmQ09ZW5KFni78bz/tuPeqTVTVzebij S5/x8C1H/LecrFEpNa21+tmS3H3z6zNKyqaLC1qJULjEDfmSqdzwifmfuKlrc+XkIssq TSg1GQBHeAOWCutTVxl14eaAGZGjsWyeYlDVzjb6bOkFum/cVfD08Vdrib1dhVpI4DkJ pesSZtUak7NAkuZSkNMOrgicysNJou6ldcToMBnkvaU9vbdHRDgYIjFAKvEGi2FP2ktc YRpg== X-Gm-Message-State: AKaTC01rAXSruQw4li0Fd0H1YQOigJxWMYM9AmI5pTDCnj+cdgBk0v2K2Kv1uOrNNhXd5g== X-Received: 
by 10.28.214.84 with SMTP id n81mr4213314wmg.120.1479573931571; Sat, 19 Nov 2016 08:45:31 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id 191sm9561737wmr.11.2016.11.19.08.45.30 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sat, 19 Nov 2016 08:45:30 -0800 (PST) Date: Sat, 19 Nov 2016 16:45:29 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161119164529.GA28686@chaz.gmail.com> References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <20161119161428.GB4980@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161119161428.GB4980@chaz.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-19 16:14:28 +0000, Stephane Chazelas: [...] > AFAICT, that optimisation only brings a little optimisation in > some cases (and reduces performance in others) but also reduces > functionality and breaks users' expectations. IMO, it's not > worth it. [...] To be fair, that last part about users' expectations is highly reduced with your recent changes. 
Thanks for that and for re-enabling grep -Pz ^/$ I just think that if grep always did the simplest thing of calling pcre_exec on every record in all the cases, that would make for simpler, more easily maintainable and less "experimental" code, and probably wouldn't affect performance so badly on average. -- Stephane From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 16:52:45 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 21:52:45 +0000 Received: from localhost ([127.0.0.1]:35364 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8DYy-0007DK-Ri for submit@debbugs.gnu.org; Sat, 19 Nov 2016 16:52:45 -0500 Received: from mail-wm0-f66.google.com ([74.125.82.66]:33670) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8DYx-0007D5-Ih for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 16:52:44 -0500 Received: by mail-wm0-f66.google.com with SMTP id u144so15947485wmu.0 for <22655@debbugs.gnu.org>; Sat, 19 Nov 2016 13:52:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=G42MKkQ99GdVDkSWsvnOUrsb+oIBr9XhiMTPr9mtG2c=; b=nZki6GPsUk46cqejfV7MF5Obs2jemaidgAvXJ8og0YSAe5AqzASt7Xr35wF1yIwqev Uor1lqlxb4pC+rcrf1+OsNCUEuEa2R3z9huVX9d+dcaJD6XPyUAaWUnqn9qslqaQ82Lm NhBuykmQRzQbKjjGZZ7emWJ0BeQbK9RX0yo3HZwT8QJuao8IWePt0p4zE+yV59T9EjrM OoYuqECTM+nCXWdB+RQHyMHTNbMmth2xvodSwKfwhAKbF6kdGVmrtszqUxijM5VnR47A f40TUwwd9djSRC2Rj3oElYSyvDmWZ/HW5HpC0ZrO9GvMEs7qwLgyR0RdI36gs9CO3VHV 1Blw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=G42MKkQ99GdVDkSWsvnOUrsb+oIBr9XhiMTPr9mtG2c=; b=bvFZY2Stp1/d2H0pNNEZB3KJJExYCQ1URiiLMgDWPC9PYIc7vqIs3nzwQ7FLP7Rpim akvW6vcvi5+pwE2z7nvbVa9GgOOAIfN8KECVL4ZPb4JvqDo5m7ug/H6TPcvM9C9fE9ih 
WNq6m+oOjR8m3wHlOJBj1rPNXoeIJDFaARTVauCMMLKdOyhYglrVtn8T6gr+Nk2ubbCN 7mrtHJaApHJBMD2SuuxeLwSyg4korEk/u7VLUvzsWSSUSKhEbDLWO1KprYy8UGx95AIw w0aLAKHq/GYWDdsn2artHvBGv/Pa6X3O8/fQdQurEuW/Zhyy6Bvc+IF12+TyP6pYSg9L Rrkw== X-Gm-Message-State: AKaTC00U6pfPugvEW0ye7/CT16IU3Y9/4KJ09vxw168jp8mR/yBAbOyM+kB8CrCybm+0/A== X-Received: by 10.28.103.134 with SMTP id b128mr5480782wmc.54.1479592357601; Sat, 19 Nov 2016 13:52:37 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id p144sm6586384wme.23.2016.11.19.13.52.35 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sat, 19 Nov 2016 13:52:36 -0800 (PST) Date: Sat, 19 Nov 2016 21:52:35 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161119215235.GA5948@chaz.gmail.com> References: <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <20161119161428.GB4980@chaz.gmail.com> <20161119164529.GA28686@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161119164529.GA28686@chaz.gmail.com> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-19 16:45:29 +0000, Stephane Chazelas: > 2016-11-19 16:14:28 +0000, Stephane Chazelas: > [...] 
> > AFAICT, that optimisation only brings a little optimisation in
> > some cases (and reduces performance in others) but also reduces
> > functionality and breaks users' expectations. IMO, it's not
> > worth it.
> [...]
>
> To be fair, that last part about users' expectations is highly
> reduced with your recent changes. Thanks for that and for
> re-enabling grep -Pz ^/$
[...]

A few cases left:

$ printf 'a\nb\n' | ./grep -P 'a\z'
$ printf 'a\nb\n' | pcregrep 'a\z'
a

Note that this is https://savannah.gnu.org/bugs/?27460, already
mentioned, whose fix at the time was to stop passing several lines
to pcre_exec(). (A regression test could have helped spot this one.)

$ printf 'a\nb\n' | ./grep -P 'a[^b]*+$'
$ printf 'a\nb\n' | pcregrep 'a[^b]*+$'
a

*+ is the non-backtracking * (same for other repeating operators).
The example is a bit contrived, but it hopefully illustrates that
any extended PCRE operator, current or future, could fool that GNU
grep optimisation.

-- 
Stephane

From debbugs-submit-bounces@debbugs.gnu.org Sat Nov 19 18:12:51 2016 Received: (at 22655) by debbugs.gnu.org; 19 Nov 2016 23:12:51 +0000 Received: from localhost ([127.0.0.1]:35385 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8EoV-0000ph-M8 for submit@debbugs.gnu.org; Sat, 19 Nov 2016 18:12:51 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:44364) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8EoT-0000pS-MV for 22655@debbugs.gnu.org; Sat, 19 Nov 2016 18:12:50 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id B97FB16004F; Sat, 19 Nov 2016 15:12:42 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id jvjq1iPleko4; Sat, 19 Nov 2016 15:12:42 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1292A16005F; Sat, 19 Nov 2016 15:12:42 -0800
(PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 6Lxlc590JSeO; Sat, 19 Nov 2016 15:12:41 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id E6A1616004F; Sat, 19 Nov 2016 15:12:41 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! To: Aaron Crane References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <5d090b4d-17bb-4790-a656-985878eb0c10@cs.ucla.edu> Date: Sat, 19 Nov 2016 15:12:41 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org, Stephane Chazelas X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

Aaron Crane wrote:
> I'm not sure it's ideal to use the Perl documentation as
> a comprehensive guide to PCRE:

Unfortunately libpcre does not document the regular expression syntax it supports. Neither does the grep manual, for 'grep -P'. And as you've mentioned, the Perl manual isn't a reliable source for libpcre either. So users of 'grep -P' cannot rely on any documentation for behavior; it's an unwritten tradition instead.

Long ago a friend told me "When you're telling someone else where you plan to throw the Frisbee, don't say anything more specific than 'Watch this!'" That's 'grep -P' in a nutshell.

From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 20 02:14:37 2016 Received: (at 22655) by debbugs.gnu.org; 20 Nov 2016 07:14:37 +0000 Received: from localhost ([127.0.0.1]:35541 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8MKj-0005tV-EV for submit@debbugs.gnu.org; Sun, 20 Nov 2016 02:14:37 -0500 Received: from mail-wm0-f68.google.com ([74.125.82.68]:33833) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8MKi-0005tG-4p for 22655@debbugs.gnu.org; Sun, 20 Nov 2016 02:14:36 -0500 Received: by mail-wm0-f68.google.com with SMTP id g23so17693307wme.1 for <22655@debbugs.gnu.org>; Sat, 19 Nov 2016 23:14:36 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=7rqVi2D69xpedoTVi4fc6b5G/qFWJRArOUzV+AAV0DM=; b=TVdkwmxh6MjyF9xLNWEumXQW1XBLXy2dkResXsTgj0NunGN4fG25cg3WVPHI9E+6Vg ziZoSLUTDzlBCLBrDs8/hZaAFds/diTi+jA5vRQ2dIyxfrfFJs7xIb61zVRQjRBu+2Uq fFmSpNI7OJSeXgvxiCFMLqMOJptEiOp0sN8P7ao02/PcFwRY/Qr5HfkFc2cHkGli7nHH Hq1gpw/4rux7j46YXHg1c2/e0undpMeIjM1VYypEsUOCrLAkj8GQ3jfvlxQVSX8JIaLq f1aYB9+AC8cSXT5zEEoYH0e5afYC5zaJW8XJmlgiy/mhzGleDjzsuZJamPs/0jQRfPR7 yL1A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=7rqVi2D69xpedoTVi4fc6b5G/qFWJRArOUzV+AAV0DM=; b=bCxuzAXydZdtPAHj2bDMIqMcloq9/8XI5ANGMnTflYV9qCckz0t/Ukm3/IXoBMCQWN IWFF31XlizHslRneosweh6wPQDmTRL/z4AeuNRqQqEblQFld76TBRgwfjwPRsCD1ep3b
l49rA68NawPugdMF5oqXfUDp763+v4yPdZmC+tmux+wSdztawtf6ihxw/ebNuQkHEYZh rOILA0lHXPNdsdfboryG5cqwuQFsr/4MX0sCe2XfammCMqQ0vaOWGThF25bm5J0YrgIA de2uJNi2t0L8NEASa35DqfSCxjvB2m/hzISpCn+JxIJiNqc2kFfljSFHxcheXI9X8aPR dKyg== X-Gm-Message-State: AKaTC02K/CnUqCx3wJbdGZ4uk5LRQ7j6ZG3zZqqgnd9p62BZHH43eceXn+p0ll1YiN0d2g== X-Received: by 10.28.46.144 with SMTP id u138mr6732021wmu.136.1479626070332; Sat, 19 Nov 2016 23:14:30 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id 135sm12512142wmh.14.2016.11.19.23.14.29 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sat, 19 Nov 2016 23:14:29 -0800 (PST) Date: Sun, 20 Nov 2016 07:14:28 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161120071427.GA4814@chaz.gmail.com> References: <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <5d090b4d-17bb-4790-a656-985878eb0c10@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5d090b4d-17bb-4790-a656-985878eb0c10@cs.ucla.edu> User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org, Aaron Crane X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) 2016-11-19 15:12:41 -0800, Paul Eggert: > Aaron Crane wrote: > >I'm not sure it's ideal to use the Perl documentation as > >a comprehensive guide to PCRE: > > Unfortunately libpcre does not document the regular expression > syntax it supports. 
Neither does the grep manual, for 'grep -P'. And
> as you've mentioned, the Perl manual isn't a reliable source for
> libpcre either. So users of 'grep -P' cannot rely on any
> documentation for behavior; it's an unwritten tradition instead.
[...]

You've missed "man pcrepattern". See also "man pcresyntax" for the cheat sheet and "man pcre" for the index of all the PCRE man pages.

If that's any consolation, I had also been using PCRE for years before I discovered that man page (which I find of very high quality, btw) a few years ago.

-- 
Stephane

From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 20 02:16:28 2016 Received: (at 22655) by debbugs.gnu.org; 20 Nov 2016 07:16:29 +0000 Received: from localhost ([127.0.0.1]:35546 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8MMW-0005xK-Q8 for submit@debbugs.gnu.org; Sun, 20 Nov 2016 02:16:28 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:45218) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8MMU-0005wu-DH for 22655@debbugs.gnu.org; Sun, 20 Nov 2016 02:16:26 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 976F6160050; Sat, 19 Nov 2016 23:16:19 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id UONZm95U9oZb; Sat, 19 Nov 2016 23:16:18 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id E4958160059; Sat, 19 Nov 2016 23:16:18 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id q3593Q5bYpGk; Sat, 19 Nov 2016 23:16:18 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id C46DC160050; Sat, 19 Nov 2016 23:16:18 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now
fails! To: Stephane Chazelas References: <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <5d090b4d-17bb-4790-a656-985878eb0c10@cs.ucla.edu> <20161120071427.GA4814@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Sat, 19 Nov 2016 23:16:17 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161120071427.GA4814@chaz.gmail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org, Aaron Crane X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) Stephane Chazelas wrote: > You've missed "man pcrepattern". Yes I did. Thanks. Google was not my friend there. 
From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 20 02:57:34 2016 Received: (at 22655) by debbugs.gnu.org; 20 Nov 2016 07:57:34 +0000 Received: from localhost ([127.0.0.1]:35562 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8N0I-00072N-2l for submit@debbugs.gnu.org; Sun, 20 Nov 2016 02:57:34 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:47630) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8N0F-000729-Fq for 22655@debbugs.gnu.org; Sun, 20 Nov 2016 02:57:32 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 14133160050; Sat, 19 Nov 2016 23:57:25 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 5bP6Fj_dmbc1; Sat, 19 Nov 2016 23:57:23 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1E2E9160059; Sat, 19 Nov 2016 23:57:23 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id MbhPUrgR_bkq; Sat, 19 Nov 2016 23:57:23 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id EC76E160050; Sat, 19 Nov 2016 23:57:22 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails! 
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <20161119161428.GB4980@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Sat, 19 Nov 2016 23:57:22 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161119161428.GB4980@chaz.gmail.com> Content-Type: multipart/mixed; boundary="------------738720204FBD31C258FC1A24" X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655 Cc: 22655@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

This is a multi-part message in MIME format.
--------------738720204FBD31C258FC1A24
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Stephane Chazelas wrote:
> I don't find a x220 factor, more like a x2.5 factor:

I think I found the factor-of-hundreds slowdown, and fixed it in the 2nd attached patch.

When I tried your benchmark with pcregrep (pcre 8.39, configured with --enable-unicode-properties), and with ./grep0 (which has the PCRE_MULTILINE implementation, i.e., commit da94c91a81fc63275371d0580d8688b6abd85346), and with ./grep (which is grep after the attached patches are installed), I got timings like the following:

   user   sys
  1.972 0.072 LC_ALL=en_US.utf8 pcregrep -u "z.*a" k
  0.234 0.076 LC_ALL=en_US.utf8 ./grep0 -P "z.*a" k
  1.280 0.064 LC_ALL=en_US.utf8 ./grep -P "z.*a" k
  1.487 0.077 LC_ALL=C pcregrep "z.*a" k
  0.193 0.067 LC_ALL=C ./grep0 -P "z.*a" k
  0.825 0.096 LC_ALL=C ./grep -P "z.*a" k

All times are CPU seconds. This is Fedora 24 x86-64, AMD Phenom II X4 910e. As before, k was created by the shell command:

  yes 'abcdefg hijklmn opqrstu vwxyz' | head -n 10000000 >k

So, on this benchmark using PCRE_MULTILINE gave a speedup of a factor of ~4.3 in a multibyte locale, and a speedup of ~3.5 in a unibyte locale.

> On the other hand if you change the pattern to "z[^+]*a",
> pcregrep still takes about one second, but GNU grep a lot longer

Yes, that example makes GNU grep -P look really bad. So I installed the 1st attached patch, which mostly just reverts the January multiline patch, i.e., it goes back to the slower "./grep -P" lines measured above.

--------------738720204FBD31C258FC1A24
Content-Type: text/x-diff; name="0001-grep-P-no-longer-uses-PCRE_MULTILINE.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="0001-grep-P-no-longer-uses-PCRE_MULTILINE.patch"

From 1b65e6bbbc75d62e766dc6293ce042cd40fb935d Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Sat, 19 Nov 2016 22:48:37 -0800
Subject: [PATCH 1/2] grep: -P no longer uses PCRE_MULTILINE

This reverts commit f6603c4e1e04dbb87a7232c4b44acc6afdf65fef, as the extra performance is not worth the trouble for PCRE users.
Problem reported by Stephane Chazelas in:
http://bugs.gnu.org/22655#103
* NEWS: Document this and the next patch.
* src/dfasearch.c (EGexecute):
* src/grep.c (execute_fp_t):
* src/kwsearch.c (Fexecute):
* src/pcresearch.c (Pexecute):
First arg is now a const pointer again.
* src/grep.c (buf_has_encoding_errors): Now static.
* src/grep.h (buf_has_encoding_errors): Remove decl.
* src/search.h: Adjust decls.
* src/pcresearch.c (reflags): Remove.  All uses removed.
(Pcompile, Pexecute): Do not use PCRE_MULTILINE.
---
 NEWS             |   8 +++--
 src/dfasearch.c  |   2 +-
 src/grep.c       |   4 +--
 src/grep.h       |   1 -
 src/kwsearch.c   |   2 +-
 src/pcresearch.c | 101 ++++++-------------------------------------------------
 src/search.h     |   6 ++--
 7 files changed, 22 insertions(+), 102 deletions(-)

diff --git a/NEWS b/NEWS
index 4972c01..6138b48 100644
--- a/NEWS
+++ b/NEWS
@@ -10,9 +10,11 @@ GNU grep NEWS                                    -*- outline -*-
   >/dev/null" where PROGRAM dies when writing into a broken pipe.
   [bug introduced in grep-2.26]
 
-  grep -Pz no longer rejects patterns containing ^ and $, is more
-  cautious about special patterns like (?-m) and (*FAIL), and works
-  when combined with -x.  [bug introduced in grep-2.23]
+  grep -P no longer attempts multiline matches.  This works more
+  intuitively with unusual patterns, and means that grep -Pz no longer
+  rejects patterns containing ^ and $ and works when combined with -x.
+  [bugs introduced in grep-2.23]  A downside is that grep -P is now
+  significantly slower, albeit typically still faster than pcregrep.
 
   grep -m0 -L PAT FILE now outputs "FILE".  [bug introduced in grep-2.5]
 
diff --git a/src/dfasearch.c b/src/dfasearch.c
index d41b6fd..ded9917 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -216,7 +216,7 @@ GEAcompile (char const *pattern, size_t size, reg_syntax_t syntax_bits)
 }
 
 size_t
-EGexecute (char *buf, size_t size, size_t *match_size,
+EGexecute (char const *buf, size_t size, size_t *match_size,
            char const *start_ptr)
 {
   char const *buflim, *beg, *end, *ptr, *match, *best_match, *mb_start;
diff --git a/src/grep.c b/src/grep.c
index f120bcf..4538f22 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -589,7 +589,7 @@ static bool seek_data_failed;
 
 /* Functions we'll use to search. */
 typedef void (*compile_fp_t) (char const *, size_t);
-typedef size_t (*execute_fp_t) (char *, size_t, size_t *, char const *);
+typedef size_t (*execute_fp_t) (char const *, size_t, size_t *, char const *);
 static compile_fp_t compile;
 static execute_fp_t execute;
 
@@ -696,7 +696,7 @@ skip_easy_bytes (char const *buf)
 /* Return true if BUF, of size SIZE, has an encoding error.
    BUF must be followed by at least sizeof (uword) bytes,
    the first of which may be modified.  */
-bool
+static bool
 buf_has_encoding_errors (char *buf, size_t size)
 {
   if (! unibyte_mask)
diff --git a/src/grep.h b/src/grep.h
index b45992f..d10145e 100644
--- a/src/grep.h
+++ b/src/grep.h
@@ -29,7 +29,6 @@ extern bool match_words; /* -w */
 extern bool match_lines; /* -x */
 extern char eolbyte; /* -z */
 
-extern bool buf_has_encoding_errors (char *, size_t);
 extern char const *pattern_file_name (size_t, size_t *);
 
 #endif
diff --git a/src/kwsearch.c b/src/kwsearch.c
index 29d140c..c3e69b3 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -78,7 +78,7 @@ Fcompile (char const *pattern, size_t size)
 }
 
 size_t
-Fexecute (char *buf, size_t size, size_t *match_size,
+Fexecute (char const *buf, size_t size, size_t *match_size,
           char const *start_ptr)
 {
   char const *beg, *try, *end, *mb_start;
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 01616c2..1948acf 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -32,9 +32,6 @@ enum { NSUB = 300 };
 /* Compiled internal form of a Perl regular expression.  */
 static pcre *cre;
 
-/* PCRE options used to compile the pattern.  */
-static int reflags;
-
 /* Additional information about the pattern.  */
 static pcre_extra *extra;
 
@@ -107,15 +104,13 @@ Pcompile (char const *pattern, size_t size)
   int fix_len_max = MAX (sizeof wprefix - 1 + sizeof wsuffix - 1,
                          sizeof xprefix - 1 + sizeof xsuffix - 1);
   char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);
-  int flags = (PCRE_MULTILINE
-               | (match_icase ? PCRE_CASELESS : 0));
+  int flags = PCRE_DOLLAR_ENDONLY | (match_icase ? PCRE_CASELESS : 0);
   char const *patlim = pattern + size;
   char *n = re;
   char const *p;
   char const *pnul;
-  bool multibyte_locale = 1 < MB_CUR_MAX;
 
-  if (multibyte_locale)
+  if (1 < MB_CUR_MAX)
     {
       if (! localeinfo.using_utf8)
         die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
@@ -126,32 +121,6 @@ Pcompile (char const *pattern, size_t size)
   if (memchr (pattern, '\n', size))
     die (EXIT_TROUBLE, 0, _("the -P option only supports a single pattern"));
 
-  if (! eolbyte)
-    {
-      bool line_at_a_time = match_lines;
-      if (! line_at_a_time)
-        {
-          bool escaped = false;
-          bool after_unescaped_left_bracket = false;
-          for (p = pattern; *p; p++)
-            if (escaped)
-              escaped = after_unescaped_left_bracket = false;
-            else
-              {
-                if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket)
-                    || (*p == '(' && (p[1] == '?' || p[1] == '*')))
-                  {
-                    line_at_a_time = true;
-                    break;
-                  }
-                escaped = *p == '\\';
-                after_unescaped_left_bracket = *p == '[';
-              }
-        }
-      if (line_at_a_time)
-        flags = (flags & ~ PCRE_MULTILINE) | PCRE_DOLLAR_ENDONLY;
-    }
-
   *n = '\0';
   if (match_words)
     strcpy (n, wprefix);
@@ -182,7 +151,6 @@ Pcompile (char const *pattern, size_t size)
   if (match_lines)
     strcpy (n, xsuffix);
 
-  reflags = flags;
   cre = pcre_compile (re, flags, &ep, &e, pcre_maketables ());
   if (!cre)
     die (EXIT_TROUBLE, 0, "%s", ep);
@@ -210,7 +178,7 @@ Pcompile (char const *pattern, size_t size)
 }
 
 size_t
-Pexecute (char *buf, size_t size, size_t *match_size,
+Pexecute (char const *buf, size_t size, size_t *match_size,
           char const *start_ptr)
 {
 #if !HAVE_LIBPCRE
@@ -229,38 +197,14 @@ Pexecute (char *buf, size_t size, size_t *match_size,
      error.  */
   char const *subject = buf;
 
-  /* If the pattern has no problematic operators and the input is
-     unibyte or is free of encoding errors, a multiline search is
-     typically more efficient.  Otherwise, a single-line search is
-     either less confusing because the problematic operators are
-     interpreted more naturally, or it is typically faster because
-     pcre_exec doesn't waste time validating the entire input
-     buffer.  */
-  bool multiline = (reflags & PCRE_MULTILINE) != 0;
-  if (multiline && (reflags & PCRE_UTF8) != 0)
-    {
-      multiline = ! buf_has_encoding_errors (buf, size - 1);
-      buf[size - 1] = eolbyte;
-    }
-
   for (; p < buf + size; p = line_start = line_end + 1)
     {
-      bool too_big;
-
-      if (multiline)
-        {
-          size_t pcre_size_max = MIN (INT_MAX, SIZE_MAX - 1);
-          size_t scan_size = MIN (pcre_size_max + 1, buf + size - p);
-          line_end = memrchr (p, eolbyte, scan_size);
-          too_big = ! line_end;
-        }
-      else
-        {
-          line_end = memchr (p, eolbyte, buf + size - p);
-          too_big = INT_MAX < line_end - p;
-        }
-
-      if (too_big)
+      /* Use a single_line search.  Although this code formerly used
+         PCRE_MULTILINE for performance, the performance wasn't always
+         better and the correctness issues were too puzzling.  See
+         Bug#22655.  */
+      line_end = memchr (p, eolbyte, buf + size - p);
+      if (INT_MAX < line_end - p)
         die (EXIT_TROUBLE, 0, _("exceeded PCRE's line length limit"));
 
       for (;;)
@@ -289,27 +233,11 @@ Pexecute (char *buf, size_t size, size_t *match_size,
           int options = 0;
           if (!bol)
             options |= PCRE_NOTBOL;
-          if (multiline)
-            options |= PCRE_NO_UTF8_CHECK;
 
           e = jit_exec (subject, line_end - subject, search_offset,
                         options, sub);
           if (e != PCRE_ERROR_BADUTF8)
-            {
-              if (0 < e && multiline && sub[1] - sub[0] != 0)
-                {
-                  char const *nl = memchr (subject + sub[0], eolbyte,
-                                           sub[1] - sub[0]);
-                  if (nl)
-                    {
-                      /* This match crosses a line boundary; reject it.  */
-                      p = subject + sub[0];
-                      line_end = nl;
-                      continue;
-                    }
-                }
-              break;
-            }
+            break;
           int valid_bytes = sub[0];
 
           if (search_offset <= valid_bytes)
@@ -382,15 +310,6 @@ Pexecute (char *buf, size_t size, size_t *match_size,
           beg = matchbeg;
           end = matchend;
         }
-      else if (multiline)
-        {
-          char const *prev_nl = memrchr (line_start - 1, eolbyte,
-                                         matchbeg - (line_start - 1));
-          char const *next_nl = memchr (matchend, eolbyte,
-                                        line_end + 1 - matchend);
-          beg = prev_nl + 1;
-          end = next_nl + 1;
-        }
       else
         {
           beg = line_start;
diff --git a/src/search.h b/src/search.h
index b6c1945..4957a63 100644
--- a/src/search.h
+++ b/src/search.h
@@ -54,15 +54,15 @@ extern wint_t mb_next_wc (char const *, char const *);
 /* dfasearch.c */
 extern struct localeinfo localeinfo;
 extern void GEAcompile (char const *, size_t, reg_syntax_t);
-extern size_t EGexecute (char *, size_t, size_t *, char const *);
+extern size_t EGexecute (char const *, size_t, size_t *, char const *);
 
 /* kwsearch.c */
 extern void Fcompile (char const *, size_t);
-extern size_t Fexecute (char *, size_t, size_t *, char const *);
+extern size_t Fexecute (char const *, size_t, size_t *, char const *);
 
 /* pcresearch.c */
 extern void Pcompile (char const *, size_t);
-extern size_t Pexecute (char *, size_t, size_t *, char const *);
+extern size_t Pexecute (char const *, size_t, size_t *, char const *);
 
 /* Return the number of bytes in the character at the start of S, which
    is of size N.  N must be positive.  MBS is the conversion state.
-- 
2.7.4

--------------738720204FBD31C258FC1A24
Content-Type: text/x-diff; name="0002-grep-further-P-performance-fix.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment; filename="0002-grep-further-P-performance-fix.patch"

From 2a8fd4a7db5b7ce3cca6bc644e0c3643cbc96009 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Sat, 19 Nov 2016 22:48:37 -0800
Subject: [PATCH 2/2] grep: further -P performance fix

Problem reported by Stephane Chazelas in:
http://bugs.gnu.org/22655#103
* src/pcresearch.c (Pexecute): Set the subject to the start of each
line as it is found.
---
 src/pcresearch.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 1948acf..108baff 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -194,12 +194,12 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
 
   /* The search address to pass to pcre_exec.  This is the start of
      the buffer, or just past the most-recently discovered encoding
-     error.  */
+     error or line end.  */
   char const *subject = buf;
 
-  for (; p < buf + size; p = line_start = line_end + 1)
+  do
     {
-      /* Use a single_line search.  Although this code formerly used
+      /* Search line by line.  Although this code formerly used
          PCRE_MULTILINE for performance, the performance wasn't always
          better and the correctness issues were too puzzling.  See
          Bug#22655.  */
@@ -269,7 +269,9 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
       if (e != PCRE_ERROR_NOMATCH)
         break;
       bol = true;
+      p = subject = line_start = line_end + 1;
     }
+  while (p < buf + size);
 
   if (e <= 0)
     {
-- 
2.7.4

--------------738720204FBD31C258FC1A24--

From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 20 03:03:35 2016 Received: (at 22655-done) by debbugs.gnu.org; 20 Nov 2016 08:03:35 +0000 Received: from localhost ([127.0.0.1]:35567 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8N67-0007D1-4C for submit@debbugs.gnu.org; Sun, 20 Nov 2016 03:03:35 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:48010) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8N65-0007Cn-6u for 22655-done@debbugs.gnu.org; Sun, 20 Nov 2016 03:03:33 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 92ED7160050; Sun, 20 Nov 2016 00:03:27 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id AdOYmWcCFWZH; Sun, 20 Nov 2016 00:03:27 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id EC6D3160059; Sun, 20 Nov 2016 00:03:26 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id AcPLd-3unifv; Sun, 20 Nov 2016 00:03:26 -0800 (PST) Received: from [192.168.1.9] (unknown [47.153.178.162]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D0A16160050; Sun, 20 Nov 2016 00:03:26 -0800 (PST) Subject: Re: bug#22655: grep -Pz '^' now fails!
To: Stephane Chazelas References: <20160213232052.7a5d14a2@sf> <20161118125229.GA10084@chaz.gmail.com> <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <20161119161428.GB4980@chaz.gmail.com> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: Date: Sun, 20 Nov 2016 00:03:26 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.9 (--) X-Debbugs-Envelope-To: 22655-done Cc: 22655-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) As all the bugs in this bug report appear to be fixed, I'm closing it now. 
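
The approach the two patches above settle on, matching each record separately rather than handing the matcher a whole multi-line buffer, can be sketched roughly as follows. This is a minimal sketch and not grep's actual Pexecute: the names find_matching_line, match_fn and ends_with_a are hypothetical, and a plain callback stands in for pcre_exec so that the example carries no libpcre dependency.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher type: returns nonzero if the record
   [line, line + len) matches.  In grep this role is played by
   pcre_exec; a callback keeps the sketch free of libpcre.  */
typedef int (*match_fn) (char const *line, size_t len);

/* Search BUF (SIZE bytes, records terminated by EOL) one record at
   a time.  Return a pointer to the first matching record and store
   its length (excluding EOL) in *LEN, or return NULL if none.  */
static char const *
find_matching_line (char const *buf, size_t size, char eol,
                    match_fn match, size_t *len)
{
  char const *p = buf;
  char const *lim = buf + size;

  while (p < lim)
    {
      /* The matcher never sees past the end of the current record,
         so no pattern construct can match across records.  */
      char const *line_end = memchr (p, eol, lim - p);
      if (!line_end)
        line_end = lim;         /* unterminated final record */

      if (match (p, line_end - p))
        {
          *len = line_end - p;
          return p;
        }
      if (line_end == lim)
        break;
      p = line_end + 1;
    }
  return NULL;
}

/* Example matcher: does the record end with 'a'?  This loosely
   mirrors the 'a\z' case discussed earlier in the thread.  */
static int
ends_with_a (char const *line, size_t len)
{
  return len > 0 && line[len - 1] == 'a';
}
```

Because every call to the matcher receives exactly one record, an end-of-subject test such as the one in ends_with_a applies per line, which is the behavior pcregrep shows in the examples earlier in the thread. The cost is one matcher call per record, which is the slowdown the NEWS entry in the first patch acknowledges.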
From debbugs-submit-bounces@debbugs.gnu.org Sun Nov 20 04:16:19 2016 Received: (at 22655-done) by debbugs.gnu.org; 20 Nov 2016 09:16:19 +0000 Received: from localhost ([127.0.0.1]:35602 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8OEV-0000e5-5i for submit@debbugs.gnu.org; Sun, 20 Nov 2016 04:16:19 -0500 Received: from mail-wj0-f196.google.com ([209.85.210.196]:36539) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8OET-0000dr-9F for 22655-done@debbugs.gnu.org; Sun, 20 Nov 2016 04:16:17 -0500 Received: by mail-wj0-f196.google.com with SMTP id jb2so1684701wjb.3 for <22655-done@debbugs.gnu.org>; Sun, 20 Nov 2016 01:16:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=P7gC1CDWz8ig69PORjp4DsaIfqQgHo7JLoLveNRJuOM=; b=DG3XHa24BwFp5cFxGtAfcDYQ/L3A9nEE266ysVwmtEp+vVpyGQbzx24I1pobGXB0fE hoo3oOJ2z+/cwURJAOzd5syfhmkvFM/8LsjzqkZVvuzE+RoUCSeH8eBq1VYJdCeUoGoQ TnMR1+oY8ILTIRscInLAklTKQpCh6q0h9K6hZDgho0/qJx/PYq75RNmUG28haye27Db7 veC/zrE43IFSQ3fvMlAkDeoOoC5qFws7UvijWlyeqKLrhIxG8b61tIgCIY31oKK2w7bW e/1oGUqEq+hF1himaf5w303a8Hxnsh5DHcfymuBPSrJaqiiitI/8vRI7nqt/uAcPY3xJ VdJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=P7gC1CDWz8ig69PORjp4DsaIfqQgHo7JLoLveNRJuOM=; b=GsHyiR7xUTzNekF8jEbLjhwqf141FvYgEtgiWMZ3oAysBsNqZ7fPBLN/wJGEetPVyX NoD8aPSrM2eRCboH+VpzFLPpxqKg1VtUHJcFJb1bhQXkOncJx3u0+NePifnzzvhWjcWU TBdEsxzZlI/ofNf16cdvzLvGK2Q0mJ72thUZCaTBfqBYK3DZgjLI+vPwJAs8G78R0QbE ME5XuxZnONraCTNCG/Q9h2TeKuZZtNene/dCDgFbRDY8YBh/iAUXmmpKi1rjffKSLgx+ 6ZtOevMIpM2um0mwkeuRQ8AmIPWZYwj/cEp9HzeJJrEbPUfvtF3zR03SBxPTHx0bfyYX 4v1A== X-Gm-Message-State: AKaTC03s8wAEHOYfp/60zF1v+7hiQll3AOvCw513U9PbdO5bA/q5a4ztXCERmEEj0vkM5g== 
X-Received: by 10.194.0.229 with SMTP id 5mr6178279wjh.55.1479633371607; Sun, 20 Nov 2016 01:16:11 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id l67sm14101179wmf.0.2016.11.20.01.16.10 (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sun, 20 Nov 2016 01:16:10 -0800 (PST) Date: Sun, 20 Nov 2016 09:16:09 +0000 From: Stephane Chazelas To: Paul Eggert Subject: Re: bug#22655: grep -Pz '^' now fails! Message-ID: <20161120091609.GB4814@chaz.gmail.com> References: <25664206-c26f-1204-c63d-21c52122daef@cs.ucla.edu> <20161118162402.GD5144@chaz.gmail.com> <5dd1091b-5948-46a9-03c0-ec3ee7e93fcb@cs.ucla.edu> <20161118170636.GE5144@chaz.gmail.com> <20161119104127.GA4980@chaz.gmail.com> <9a1bb4e4-2938-701d-4edd-cc5d046aaf7d@cs.ucla.edu> <20161119161428.GB4980@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-Spam-Score: 0.5 (/) X-Debbugs-Envelope-To: 22655-done Cc: 22655-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/)

2016-11-20 00:03:26 -0800, Paul Eggert:
> As all the bugs in this bug report appear to be fixed, I'm closing it now.

Thanks.

BTW, I was wrong about

  static char const xprefix[] = "^(?:";
  static char const xsuffix[] = ")$";

(as opposed to \A, \z) causing a problem with grep -Pxz '(?m)...' as the pattern becomes ^(?:(?m)...)$, so the m flag is only applied within the (?:...).

I can see pcre_compile() supports a PCRE_ANCHORED flag, but it looks like it only makes ^ implicit (not $).
-- Stephane From unknown Wed Jun 25 00:24:01 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Sun, 18 Dec 2016 12:24:04 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator
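
Stephane's closing note about xprefix and xsuffix can be illustrated with a short sketch of how a -x pattern ends up wrapped. The two string constants are the ones quoted in the thread; the helper wrap_for_line_match is a hypothetical name invented for this example and is not a grep function, and grep's real Pcompile builds the string differently (it also handles -w and embedded NUL bytes).

```c
#include <stdlib.h>
#include <string.h>

/* Prefix and suffix as quoted in the thread from pcresearch.c.  */
static char const xprefix[] = "^(?:";
static char const xsuffix[] = ")$";

/* Hypothetical helper (not a grep function): return a newly
   allocated copy of PATTERN wrapped the way -x wraps it, or NULL
   if allocation fails.  The caller frees the result.  */
static char *
wrap_for_line_match (char const *pattern)
{
  size_t plen = strlen (pattern);
  size_t prelen = sizeof xprefix - 1;
  size_t suflen = sizeof xsuffix - 1;
  char *re = malloc (prelen + plen + suflen + 1);
  if (!re)
    return NULL;
  memcpy (re, xprefix, prelen);
  memcpy (re + prelen, pattern, plen);
  /* Copy the suffix together with its terminating NUL.  */
  memcpy (re + prelen + plen, xsuffix, suflen + 1);
  return re;
}
```

For the input "(?m)a.b" this produces ^(?:(?m)a.b)$, in which the (?m) setting sits inside the non-capturing group, so the outer ^ and $ added by the wrapper are outside its scope, as the message above describes.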