From unknown Sun Jun 22 07:57:37 2025
Subject: bug#72246: Possible PCRE bug in grep 3.11
From: Glenn Golden
Reply-To: gdg@zplane.com
To: 72246@debbugs.gnu.org
X-Debbugs-Original-To: bug-grep@gnu.org
Date: Mon, 22 Jul 2024 12:25:36 -0600
X-GNU-PR-Message: report 72246
X-GNU-PR-Package: grep

Grep 3.11 doesn't seem to behave as expected with some range-based PCREs.
See attached minimal example, comparing grep 3.11 to pcregrep 8.45. (The
latter behaves as I had thought 'grep -P' ought to, but maybe I'm wrong on
that.)

This may be related to

    https://lists.gnu.org/archive/html/grep-devel/2023-03/msg00017.html

which references a regression in 3.10. Figured it was worthwhile to report
even if it may be a duplicate.

Version info: Arch64 linux, kernel 6.1.68, commodity x86-64 laptop.

- Glenn Golden

========================== BEGIN INLINE ATTACHMENT =========================
#!/usr/bin/bash

#
# String containing 3 octets >= 0x80:
#
str=$(printf "begin\xe2\x80\x99end")

#
# grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
# and exits with 1, indicating no match.
#
printf "Using grep 3.11:\n"
printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'
printf "exit value = $?\n";

printf "\n"

#
# pcregrep 8.45 behaves as I thought 'grep -P' ought to:
#
printf "Using pcregrep 8.45:\n"
printf "${str}\n" | pcregrep --color=auto -e '[\x80-\xFF]'
printf "exit value = $?\n";
========================== END INLINE ATTACHMENT =========================

From unknown Sun Jun 22 07:57:37 2025
Subject: bug#72246: Possible PCRE bug in grep 3.11
From: Paul Eggert
Organization: UCLA Computer Science Department
To: gdg@zplane.com
Cc: 72246@debbugs.gnu.org
Date: Mon, 22 Jul 2024 12:00:21 -0700
Message-ID: <56533831-8ab6-49b5-aa77-cca71a949203@cs.ucla.edu>
X-GNU-PR-Message: followup 72246
X-GNU-PR-Package: grep

On 2024-07-22 11:25, Glenn Golden wrote:
> str=$(printf "begin\xe2\x80\x99end")
>
> #
> # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> # and exits with 1, indicating no match.
> #
> printf "Using grep 3.11:\n"
> printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'

This asks 'grep' to output all lines containing characters in the range
\x80 through \xFF.
In a single-byte locale this matches any line
containing a byte in that range (i.e., any byte with the top bit set),
and 'grep' will output the line and exit with status zero.

However, in a UTF-8 locale this will match any line containing the
characters U+0080 (a nameless control character) through U+00FF (LATIN
SMALL LETTER Y WITH DIAERESIS, or "ÿ"). Because the bytes E2, 80, 99 in
'str' represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so
grep doesn't output anything and exits with status 1.

In short, to get the behavior you want, set LC_ALL="C" in the environment.

If pcregrep finds a match in a UTF-8 locale then that would appear to be
a bug in pcregrep; you might report it to the pcregrep maintainer.

From unknown Sun Jun 22 07:57:37 2025
Subject: bug#72246: Possible PCRE bug in grep 3.11
From: Glenn Golden
Reply-To: gdg@zplane.com
To: Paul Eggert
Cc: 72246@debbugs.gnu.org
Date: Mon, 22 Jul 2024 13:24:26 -0600
References: <56533831-8ab6-49b5-aa77-cca71a949203@cs.ucla.edu>
In-Reply-To: <56533831-8ab6-49b5-aa77-cca71a949203@cs.ucla.edu>
X-GNU-PR-Message: followup 72246
X-GNU-PR-Package: grep

Paul Eggert [2024-07-22 12:00:21 -0700]:
> On 2024-07-22 11:25, Glenn Golden wrote:
> > str=$(printf "begin\xe2\x80\x99end")
> >
> > #
> > # grep 3.11 using PCRE '[\x80-\xFF]' doesn't find any of them,
> > # and exits with 1, indicating no match.
> > #
> > printf "Using grep 3.11:\n"
> > printf "${str}\n" | grep --color=auto -P -e '[\x80-\xFF]'
>
> This asks 'grep' to output all lines containing characters in the range \x80
> through \xFF. In a single-byte locale this matches any line containing a
> byte in that range (i.e., any byte with the top bit set), and 'grep' will
> output the line and exit with status zero.
>
> However, in a UTF-8 locale this will match any line containing the
> characters U+0080 (a nameless control character) through U+00FF (LATIN SMALL
> LETTER Y WITH DIAERESIS, or "ÿ").
> Because the bytes E2, 80, 99 in 'str'
> represent U+2019 RIGHT SINGLE QUOTATION MARK, there is no match so grep
> doesn't output anything and exits with status 1.
>

Ahhhhhhhhhhh... ok, got it, thanks for the explanation. I had not realized
that even literal octet-like specifications (e.g. \xNN) get 'promoted' (so
to speak) to the underlying code points when interpreted in UTF-8 locales.

>
> If pcregrep finds a match in a UTF-8 locale then that would appear to be a
> bug in pcregrep; you might report it to the pcregrep maintainer.
>

In looking just now at the 'pcre' package (which contains pcregrep) it seems
that it is now listed as 'deprecated' in the Arch package list, so probably
not worth reporting.

In any case, thanks for the explanation, and sorry for the noise.

- Glenn
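
The locale dependence explained in this thread can be checked with a short
script. What follows is only a minimal sketch and not part of the original
report. It assumes a GNU grep built with PCRE support (-P) and forces the
single-byte C locale, as Paul Eggert suggests. The same three octets
(E2 80 99) are written as octal escapes so the script does not rely on
printf understanding \x, and the variable name count_c is purely
illustrative.

```shell
#!/usr/bin/bash
# Same string as in the report; \342\200\231 are the octets E2 80 99 in octal.
str=$(printf 'begin\342\200\231end')

# Force the single-byte C locale for grep only (assumption: grep supports -P).
# In this locale [\x80-\xFF] is a range of bytes, so the high octets can match.
# -c prints the number of matching lines instead of the lines themselves.
count_c=$(printf '%s\n' "$str" | LC_ALL=C grep -c -P -e '[\x80-\xFF]')

printf 'matching lines in C locale: %s\n' "$count_c"
```

In a UTF-8 locale the same pattern is read as the code points U+0080 through
U+00FF, which is why the command in the original report exited with status 1.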