From unknown Sun Jun 22 04:05:56 2025 X-Loop: help-debbugs@gnu.org Subject: bug#60618: unicode characters are not identified as such for \w and \b with -P Resent-From: Carlo Arenas Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 07 Jan 2023 03:49:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 60618 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 60618@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.167306329424127 (code B ref -1); Sat, 07 Jan 2023 03:49:01 +0000 Received: (at submit) by debbugs.gnu.org; 7 Jan 2023 03:48:14 +0000 Received: from localhost ([127.0.0.1]:56229 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE0Bp-0006H3-QW for submit@debbugs.gnu.org; Fri, 06 Jan 2023 22:48:14 -0500 Received: from lists.gnu.org ([209.51.188.17]:49012) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE0Bo-0006Gq-Ap for submit@debbugs.gnu.org; Fri, 06 Jan 2023 22:48:13 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pE0Bo-0000Sj-4b for bug-grep@gnu.org; Fri, 06 Jan 2023 22:48:12 -0500 Received: from mail-oi1-x22e.google.com ([2607:f8b0:4864:20::22e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1pE0Bm-0003jP-9q for bug-grep@gnu.org; Fri, 06 Jan 2023 22:48:11 -0500 Received: by mail-oi1-x22e.google.com with SMTP id n8so2687229oih.0 for ; Fri, 06 Jan 2023 19:48:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=I7vB+rWDcLUILLx4qWal32406Judsg15p93XJvAhsFA=; b=kaOvI5ugZqhOg6gQLjiTHvJqAQVX9cBE9FEU+13oBev5hIbNTfmjp2JQEr6V/FsT1Z 
KS16OYffk8Iuc38s0zKnmdxc68912d66k/vTidyZF86qkgfEiraGwwb5TfDT2UwdJT8p +3sa5dl1V67za4y9ixrWygVc3TNddYy26rcHuYYMpxgcPDTTxiXIip3GK6tY4LZGkKi5 cntGO4itqAcJo/ABEI+iXg01UkbrKbVfY7Ro7ihT8EJKydVjkOlSw11FUvvk00g0V2q3 DU9Jxh7vtYXWdryIkklE3MH3u53usmaAUgaga3fdLwDAlklc4JW/8KyatPnSUtJRehop 7cNA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=I7vB+rWDcLUILLx4qWal32406Judsg15p93XJvAhsFA=; b=1gXqyoZ8knb5HOsgUN9SbAPHMkU5cylhW6jRFG728cwRvgXjxk3Kwcm6K/C1c6ZqeL sIhOqy6FcYGTsdKM6TTnY3aeJl3dmfATeeY8adbVXr7+nMF2djOFDd4Gi9FT88VBt3kK LGEoXD0ll9gQoRO5YpILBM+Ekyk4gADLFXZGwKzY9xyOV0TTdZRlcQJ3jmCZ1JrOaV24 Rzk4l+qe3TLBkGjyD69825tLmamRSinHmZn2kpxZxS605DIKGVr6YRjhIyP12HtOIOhP 4NxSFAAu+0gVrvAnCQBTLUoKUr2LwoKalEiQKEo41CFBfwCcqzX1ksNbOIlwxhtxMTWR 56ZQ== X-Gm-Message-State: AFqh2kpPASkZy9p9Y9UF4qrx2PpLR0L2FBwQ0Q7FtK/U3T0IU64W9fZQ SCCyMgjIc0R6oIZfSxwqY4xOokA8FKCCWoWq0fS1cTB1Zqc= X-Google-Smtp-Source: AMrXdXvNmndCRE2q1DpV4R2Ly17ONHHef6Z81IWjWhwUTgSzWcelILzZNpM0o1e+Coqtf8yFLjongwf4Mu01llcOFMU= X-Received: by 2002:a05:6808:9a8:b0:360:d0f8:2ecd with SMTP id e8-20020a05680809a800b00360d0f82ecdmr2764763oig.59.1673063287094; Fri, 06 Jan 2023 19:48:07 -0800 (PST) MIME-Version: 1.0 From: Carlo Arenas Date: Fri, 6 Jan 2023 19:48:01 -0800 Message-ID: Content-Type: multipart/mixed; boundary="000000000000ea5c7805f1a4662f" Received-SPF: pass client-ip=2607:f8b0:4864:20::22e; envelope-from=carenas@gmail.com; helo=mail-oi1-x22e.google.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list 
List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) --000000000000ea5c7805f1a4662f Content-Type: text/plain; charset="UTF-8" Reported to PCRE[1] with mention of GNU grep being also affected. [1] https://github.com/PCRE2Project/pcre2/issues/185 --000000000000ea5c7805f1a4662f Content-Type: text/x-patch; charset="UTF-8"; name="0001-pcre-use-UCP-in-UTF-mode.patch" Content-Disposition: attachment; filename="0001-pcre-use-UCP-in-UTF-mode.patch" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_lclemlgk0 RnJvbSBjMmQ0YTQzYjViMTVkZjdjODg1M2Q1OTFiZjZhZTg3MmM2MDJlZDE0IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiA9P1VURi04P3E/Q2FybG89MjBNYXJjZWxvPTIwQXJlbmFzPTIw QmVsPUMzPUIzbj89IDxjYXJlbmFzQGdtYWlsLmNvbT4KRGF0ZTogRnJpLCA2IEphbiAyMDIzIDE5 OjM0OjU2IC0wODAwClN1YmplY3Q6IFtQQVRDSF0gcGNyZTogdXNlIFVDUCBpbiBVVEYgbW9kZQoK KiBzcmMvcGNyZXNlYXJjaC5jOiBzZXQgUENSRTJfVUNQIHRvZ2V0aGVyIHdpdGggUENSRTJfVVRG CiogdGVzdHMvcGNyZS11dGY4LXc6IGFkZCB0ZXN0Ci0tLQogc3JjL3BjcmVzZWFyY2guYyAgfCAg MiArLQogdGVzdHMvTWFrZWZpbGUuYW0gfCAgMSArCiB0ZXN0cy9wY3JlLXV0ZjgtdyB8IDI4ICsr KysrKysrKysrKysrKysrKysrKysrKysrKysKIDMgZmlsZXMgY2hhbmdlZCwgMzAgaW5zZXJ0aW9u cygrKSwgMSBkZWxldGlvbigtKQogY3JlYXRlIG1vZGUgMTAwNzU1IHRlc3RzL3BjcmUtdXRmOC13 CgpkaWZmIC0tZ2l0IGEvc3JjL3BjcmVzZWFyY2guYyBiL3NyYy9wY3Jlc2VhcmNoLmMKaW5kZXgg YTEwN2Y0ZC4uNDViNjdlZSAxMDA2NDQKLS0tIGEvc3JjL3BjcmVzZWFyY2guYworKysgYi9zcmMv cGNyZXNlYXJjaC5jCkBAIC0xNDksNyArMTQ5LDcgQEAgUGNvbXBpbGUgKGNoYXIgKnBhdHRlcm4s IGlkeF90IHNpemUsIHJlZ19zeW50YXhfdCBpZ25vcmVkLCBib29sIGV4YWN0KQogICAgIHsKICAg ICAgIGlmICghIGxvY2FsZWluZm8udXNpbmdfdXRmOCkKICAgICAgICAgZGllIChFWElUX1RST1VC TEUsIDAsIF8oIi1QIHN1cHBvcnRzIG9ubHkgdW5pYnl0ZSBhbmQgVVRGLTggbG9jYWxlcyIpKTsK LSAgICAgIGZsYWdzIHw9IFBDUkUyX1VURjsKKyAgICAgIGZsYWdzIHw9IChQQ1JFMl9VVEYgfCBQ Q1JFMl9VQ1ApOwogI2lmIDAKICAgICAgIC8qIERvIG5vdCBtYXRjaCBpbmRpdmlkdWFsIGNvZGUg 
dW5pdHMgYnV0IG9ubHkgVVRGLTguICAqLwogICAgICAgZmxhZ3MgfD0gUENSRTJfTkVWRVJfQkFD S1NMQVNIX0M7CmRpZmYgLS1naXQgYS90ZXN0cy9NYWtlZmlsZS5hbSBiL3Rlc3RzL01ha2VmaWxl LmFtCmluZGV4IGUwYjA1MDMuLmE0N2NmNWMgMTAwNjQ0Ci0tLSBhL3Rlc3RzL01ha2VmaWxlLmFt CisrKyBiL3Rlc3RzL01ha2VmaWxlLmFtCkBAIC0xNDcsNiArMTQ3LDcgQEAgVEVTVFMgPQkJCQkJ CVwKICAgcGNyZS1qaXRzdGFjawkJCQkJXAogICBwY3JlLW8JCQkJCVwKICAgcGNyZS11dGY4CQkJ CQlcCisgIHBjcmUtdXRmOC13CQkJCQlcCiAgIHBjcmUtdwkJCQkJXAogICBwY3JlLXd4LWJhY2ty ZWYJCQkJXAogICBwY3JlLXoJCQkJCVwKZGlmZiAtLWdpdCBhL3Rlc3RzL3BjcmUtdXRmOC13IGIv dGVzdHMvcGNyZS11dGY4LXcKbmV3IGZpbGUgbW9kZSAxMDA3NTUKaW5kZXggMDAwMDAwMC4uNDMx Njg1YwotLS0gL2Rldi9udWxsCisrKyBiL3Rlc3RzL3BjcmUtdXRmOC13CkBAIC0wLDAgKzEsMjgg QEAKKyMhL2Jpbi9zaAorIyBVVEYtOCBjaGFyYWN0ZXJzIGFyZSBjb3JyZWN0bHkgaWRlbnRpZmll ZCBhcyBwYXJ0IG9mIGEgd29yZAorIworIyBDb3B5cmlnaHQgKEMpIDIwMjMtMjAyMyBGcmVlIFNv ZnR3YXJlIEZvdW5kYXRpb24sIEluYy4KKyMKKyMgQ29weWluZyBhbmQgZGlzdHJpYnV0aW9uIG9m IHRoaXMgZmlsZSwgd2l0aCBvciB3aXRob3V0IG1vZGlmaWNhdGlvbiwKKyMgYXJlIHBlcm1pdHRl ZCBpbiBhbnkgbWVkaXVtIHdpdGhvdXQgcm95YWx0eSBwcm92aWRlZCB0aGUgY29weXJpZ2h0Cisj IG5vdGljZSBhbmQgdGhpcyBub3RpY2UgYXJlIHByZXNlcnZlZC4KKworLiAiJHtzcmNkaXI9Ln0v aW5pdC5zaCI7IHBhdGhfcHJlcGVuZF8gLi4vc3JjCityZXF1aXJlX2VuX3V0ZjhfbG9jYWxlXwor TENfQUxMPWVuX1VTLlVURi04CitleHBvcnQgTENfQUxMCityZXF1aXJlX3BjcmVfCisKK2ZhaWw9 MAorCitlY2hvICdQZXLDuic+IGluIHx8IGZyYW1ld29ya19mYWlsdXJlXworCitlY2hvICfDuicg PiBleHAgfHwgZnJhbWV3b3JrX2ZhaWx1cmVfCitncmVwIC1QbyAnLlxiJyBpbiA+IG91dCB8fCBm YWlsPTEKK2NvbXBhcmUgb3V0IGV4cCB8fCBmYWlsPTEKKworZWNobyAncsO6JyA+IGV4cCB8fCBm cmFtZXdvcmtfZmFpbHVyZV8KK2dyZXAgLVBvICdyXHcnIGluID4gb3V0ICYmIGZhaWw9MQorY29t cGFyZSBvdXQgZXhwIHx8IGZhaWw9MQorCitFeGl0ICRmYWlsCi0tIAoyLjMwLjIKCg== --000000000000ea5c7805f1a4662f-- From unknown Sun Jun 22 04:05:56 2025 X-Loop: help-debbugs@gnu.org Subject: bug#60618: unicode characters are not identified as such for \w and \b with -P Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 07 Jan 2023 
07:30:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 60618 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Carlo Arenas Cc: 60618@debbugs.gnu.org Received: via spool by 60618-submit@debbugs.gnu.org id=B60618.167307654524281 (code B ref 60618); Sat, 07 Jan 2023 07:30:03 +0000 Received: (at 60618) by debbugs.gnu.org; 7 Jan 2023 07:29:05 +0000 Received: from localhost ([127.0.0.1]:56351 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE3dZ-0006JZ-HU for submit@debbugs.gnu.org; Sat, 07 Jan 2023 02:29:05 -0500 Received: from mail-lj1-f178.google.com ([209.85.208.178]:42995) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE3dX-0006J0-Aq for 60618@debbugs.gnu.org; Sat, 07 Jan 2023 02:29:03 -0500 Received: by mail-lj1-f178.google.com with SMTP id n5so3115351ljc.9 for <60618@debbugs.gnu.org>; Fri, 06 Jan 2023 23:29:03 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=4fbiwdMXkjt6/twKT32Be7xfS1HxaYqnugRCigFvlJY=; b=XV5xrgxpEqU745UlQLn8zAX5SUiFwGR5IpNSs1TtEWrLUi1cRTf4NBbaRyr1rFazbL 6tmDcLNJJYOYGh5EvzZ54P44oJnIdzJTugKA5o/qeC1xN7kyyQQ3rybc3GG+rRHtRd3H euSRqpl3sOJVKcZ5hrNPcdo/Sxf9TECq7d+NIAC3vmuUD0VuTUcI4mB2QeCGaoNA+Jnh 3rSoCcLTu9yGds9ugVsqJUbIhvWuwDTKILOVbBNgYK80M7DsxZl/mfSBPm57E2Tg1JP1 ttQCUtNTqQSseWR3A4EUK5rfiOQsRvTMHDAe3bCihKP1e6o0wSCrWfLg04VQ/FWljtzv m+SQ== X-Gm-Message-State: AFqh2krWPqMSmposp+mhMKyWA14aen7z8ztjOgH5OQEAfZzZ0GLdoI8r 4qCfmw9lyRlSwg9RPXk2jIclatEaexhRoqFKImw= X-Google-Smtp-Source: AMrXdXtLM1WmiU5nO/DQiK+qgoGTvIbAxpMkZRARnDlpzKvjGHNUr3VgbStXFYlzCY/rJZ06M0+N4AjaUytmWe024LU= X-Received: by 2002:a2e:a908:0:b0:27f:af3a:5e5d with SMTP id j8-20020a2ea908000000b0027faf3a5e5dmr4209603ljq.248.1673076537259; Fri, 06 Jan 2023 23:28:57 -0800 (PST) MIME-Version: 1.0 References: 
In-Reply-To:
From: Jim Meyering
Date: Fri, 6 Jan 2023 23:28:44 -0800
Message-ID:
Content-Type: multipart/mixed; boundary="000000000000b0274d05f1a77c44"
X-Spam-Score: 0.2 (/)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.8 (/)

--000000000000b0274d05f1a77c44
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas wrote:
> Reported to PCRE[1] with mention of GNU grep being also affected.
>
> [1] https://github.com/PCRE2Project/pcre2/issues/185

Yikes. This is a big deal.
Thank you for the patch and added test.
I made a tiny comment tweak and this test logic change that was
required to make the new test pass with the fixed version.

-grep -Po 'r\w' in > out && fail=1
+grep -Po 'r\w' in > out || fail=1

Also, make syntax-check required to change e.g.,

-compare out exp || fail=1
+compare exp out || fail=1

Every bug fix needs a NEWS entry, so I added this:

  With -P, some non-ASCII UTF8 characters were not recognized as
  word-constituent due to our omission of the PCRE_UCP flag. E.g.,
  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
  After the fix, it prints the correct results: "rú:ú".

Finally, I expanded the ChangeLog entry and gave credit where due.

I'll push this tomorrow: --000000000000b0274d05f1a77c44 Content-Type: application/octet-stream; name="grep-pcre-fix.diff" Content-Disposition: attachment; filename="grep-pcre-fix.diff" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_lclmgr3b0 RnJvbSA1MmZiNWE2OGE3YmY4MDYzMDM5MTc2MTYwZjQ1NzhmZTYxNjcwZjA5IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiA9P1VURi04P3E/Q2FybG89MjBNYXJjZWxvPTIwQXJlbmFzPTIw QmVsPUMzPUIzbj89IDxjYXJlbmFzQGdtYWlsLmNvbT4KRGF0ZTogRnJpLCA2IEphbiAyMDIzIDE5 OjM0OjU2IC0wODAwClN1YmplY3Q6IFtQQVRDSF0gcGNyZTogdXNlIFVDUCBpbiBVVEYgbW9kZQoK VGhpcyBmaXhlcyBhIHNlcmlvdXMgYnVnIGFmZmVjdGluZyB3b3JkLWJvdW5kYXJ5IGFuZCB3b3Jk LWNvbnN0aXR1ZW50IHJlZ3VsYXIKZXhwcmVzc2lvbnMgd2hlbiB0aGUgZGVzaXJlZCBtYXRjaCBp bnZvbHZlcyBub24tQVNDSUkgVVRGOCBjaGFyYWN0ZXJzLgoqIHNyYy9wY3Jlc2VhcmNoLmM6IFNl dCBQQ1JFMl9VQ1AgdG9nZXRoZXIgd2l0aCBQQ1JFMl9VVEYKKiB0ZXN0cy9wY3JlLXV0Zjgtdzog TmV3IGZpbGUuCiogdGVzdHMvTWFrZWZpbGUuYW0gKFRFU1RTKTogQWRkIGl0LgoqIE5FV1MgKEJ1 ZyBmaXhlcyk6IE1lbnRpb24gdGhpcy4KUmVwb3J0ZWQgYnkgR3JvLVRzZW4gaHR0cHM6Ly90d2l0 dGVyLmNvbS9ncm9fdHNlbi9zdGF0dXMvMTYxMDk3MjM1Njk3Mjg3NTc3NwpUaGlzIGJ1ZyB3YXMg cHJlc2VudCBmcm9tIGdyZXAtMi41LCB3aGVuIC0tcGVybC1yZWdleHAgKC1QKSBzdXBwb3J0IHdh cyBhZGRlZC4KLS0tCiBORVdTICAgICAgICAgICAgICB8ICA2ICsrKysrKwogc3JjL3BjcmVzZWFy Y2guYyAgfCAgMiArLQogdGVzdHMvTWFrZWZpbGUuYW0gfCAgMSArCiB0ZXN0cy9wY3JlLXV0Zjgt dyB8IDI4ICsrKysrKysrKysrKysrKysrKysrKysrKysrKysKIDQgZmlsZXMgY2hhbmdlZCwgMzYg aW5zZXJ0aW9ucygrKSwgMSBkZWxldGlvbigtKQogY3JlYXRlIG1vZGUgMTAwNzU1IHRlc3RzL3Bj cmUtdXRmOC13CgpkaWZmIC0tZ2l0IGEvTkVXUyBiL05FV1MKaW5kZXggYjQwNDcwOC4uYTg2NTk0 MSAxMDA2NDQKLS0tIGEvTkVXUworKysgYi9ORVdTCkBAIC00LDYgKzQsMTIgQEAgR05VIGdyZXAg TkVXUyAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIC0qLSBvdXRsaW5lIC0qLQoK ICoqIEJ1ZyBmaXhlcwoKKyAgV2l0aCAtUCwgc29tZSBub24tQVNDSUkgVVRGOCBjaGFyYWN0ZXJz IHdlcmUgbm90IHJlY29nbml6ZWQgYXMKKyAgd29yZC1jb25zaXR1ZW50IGR1ZSB0byBsYWNrIG9m IG91ciB1c2Ugb2YgdGhlIFBDUkVfVUNQIGZsYWcuIEUuZy4sCisgIGdpdmVuIGYoKXsgZWNobyBQ 
ZXLDunxMQ19BTEw9ZW5fVVMuVVRGLTggZ3JlcCAtUG8gIiQxIjsgfSBhbmQKKyAgdGhpcyBjb21t YW5kLCBlY2hvICQoZiAnclx3Jyk6JChmICcuXGInKSwgYmVmb3JlIGl0IHdvdWxkIHByaW50ICI6 ciIuCisgIEFmdGVyIHRoZSBmaXgsIGl0IHByaW50cyB0aGUgY29ycmVjdCByZXN1bHRzOiAicsO6 OsO6Ii4KKwogICBXaGVuIGdpdmVuIG11bHRpcGxlIHBhdHRlcm5zIHRoZSBsYXN0IG9mIHdoaWNo IGhhcyBhIGJhY2stcmVmZXJlbmNlLAogICBncmVwIG5vIGxvbmdlciBzb21ldGltZXMgbWlzdGFr ZW5seSBtYXRjaGVzIGxpbmVzIGluIHNvbWUgY2FzZXMuCiAgIFtCdWcjMzYxNDgjMTMgaW50cm9k dWNlZCBpbiBncmVwIDMuNF0KZGlmZiAtLWdpdCBhL3NyYy9wY3Jlc2VhcmNoLmMgYi9zcmMvcGNy ZXNlYXJjaC5jCmluZGV4IGExMDdmNGQuLjQ1YjY3ZWUgMTAwNjQ0Ci0tLSBhL3NyYy9wY3Jlc2Vh cmNoLmMKKysrIGIvc3JjL3BjcmVzZWFyY2guYwpAQCAtMTQ5LDcgKzE0OSw3IEBAIFBjb21waWxl IChjaGFyICpwYXR0ZXJuLCBpZHhfdCBzaXplLCByZWdfc3ludGF4X3QgaWdub3JlZCwgYm9vbCBl eGFjdCkKICAgICB7CiAgICAgICBpZiAoISBsb2NhbGVpbmZvLnVzaW5nX3V0ZjgpCiAgICAgICAg IGRpZSAoRVhJVF9UUk9VQkxFLCAwLCBfKCItUCBzdXBwb3J0cyBvbmx5IHVuaWJ5dGUgYW5kIFVU Ri04IGxvY2FsZXMiKSk7Ci0gICAgICBmbGFncyB8PSBQQ1JFMl9VVEY7CisgICAgICBmbGFncyB8 PSAoUENSRTJfVVRGIHwgUENSRTJfVUNQKTsKICNpZiAwCiAgICAgICAvKiBEbyBub3QgbWF0Y2gg aW5kaXZpZHVhbCBjb2RlIHVuaXRzIGJ1dCBvbmx5IFVURi04LiAgKi8KICAgICAgIGZsYWdzIHw9 IFBDUkUyX05FVkVSX0JBQ0tTTEFTSF9DOwpkaWZmIC0tZ2l0IGEvdGVzdHMvTWFrZWZpbGUuYW0g Yi90ZXN0cy9NYWtlZmlsZS5hbQppbmRleCBlMGIwNTAzLi5hNDdjZjVjIDEwMDY0NAotLS0gYS90 ZXN0cy9NYWtlZmlsZS5hbQorKysgYi90ZXN0cy9NYWtlZmlsZS5hbQpAQCAtMTQ3LDYgKzE0Nyw3 IEBAIFRFU1RTID0JCQkJCQlcCiAgIHBjcmUtaml0c3RhY2sJCQkJCVwKICAgcGNyZS1vCQkJCQlc CiAgIHBjcmUtdXRmOAkJCQkJXAorICBwY3JlLXV0ZjgtdwkJCQkJXAogICBwY3JlLXcJCQkJCVwK ICAgcGNyZS13eC1iYWNrcmVmCQkJCVwKICAgcGNyZS16CQkJCQlcCmRpZmYgLS1naXQgYS90ZXN0 cy9wY3JlLXV0ZjgtdyBiL3Rlc3RzL3BjcmUtdXRmOC13Cm5ldyBmaWxlIG1vZGUgMTAwNzU1Cmlu ZGV4IDAwMDAwMDAuLjRjZDdkYjYKLS0tIC9kZXYvbnVsbAorKysgYi90ZXN0cy9wY3JlLXV0Zjgt dwpAQCAtMCwwICsxLDI4IEBACisjIS9iaW4vc2gKKyMgRW5zdXJlIG5vbi1BU0NJSSBVVEYtOCBj aGFyYWN0ZXJzIGFyZSBjb3JyZWN0bHkgaWRlbnRpZmllZCBhcyB3b3JkLWNvbnNpdHVlbnQKKyMK 
KyMgQ29weXJpZ2h0IChDKSAyMDIzIEZyZWUgU29mdHdhcmUgRm91bmRhdGlvbiwgSW5jLgorIwor IyBDb3B5aW5nIGFuZCBkaXN0cmlidXRpb24gb2YgdGhpcyBmaWxlLCB3aXRoIG9yIHdpdGhvdXQg bW9kaWZpY2F0aW9uLAorIyBhcmUgcGVybWl0dGVkIGluIGFueSBtZWRpdW0gd2l0aG91dCByb3lh bHR5IHByb3ZpZGVkIHRoZSBjb3B5cmlnaHQKKyMgbm90aWNlIGFuZCB0aGlzIG5vdGljZSBhcmUg cHJlc2VydmVkLgorCisuICIke3NyY2Rpcj0ufS9pbml0LnNoIjsgcGF0aF9wcmVwZW5kXyAuLi9z cmMKK3JlcXVpcmVfZW5fdXRmOF9sb2NhbGVfCitMQ19BTEw9ZW5fVVMuVVRGLTgKK2V4cG9ydCBM Q19BTEwKK3JlcXVpcmVfcGNyZV8KKworZmFpbD0wCisKK2VjaG8gJ1BlcsO6Jz4gaW4gfHwgZnJh bWV3b3JrX2ZhaWx1cmVfCisKK2VjaG8gJ8O6JyA+IGV4cCB8fCBmcmFtZXdvcmtfZmFpbHVyZV8K K2dyZXAgLVBvICcuXGInIGluID4gb3V0IHx8IGZhaWw9MQorY29tcGFyZSBleHAgb3V0IHx8IGZh aWw9MQorCitlY2hvICdyw7onID4gZXhwIHx8IGZyYW1ld29ya19mYWlsdXJlXworZ3JlcCAtUG8g J3JcdycgaW4gPiBvdXQgfHwgZmFpbD0xCitjb21wYXJlIGV4cCBvdXQgfHwgZmFpbD0xCisKK0V4 aXQgJGZhaWwKLS0gCjIuMzkuMC4xMzIuZzhhNGU4ZjZhNjcKCg== --000000000000b0274d05f1a77c44-- From unknown Sun Jun 22 04:05:56 2025 X-Loop: help-debbugs@gnu.org Subject: bug#60618: unicode characters are not identified as such for \w and \b with -P Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 07 Jan 2023 07:38:04 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 60618 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Carlo Arenas Cc: 60618@debbugs.gnu.org Received: via spool by 60618-submit@debbugs.gnu.org id=B60618.167307707725310 (code B ref 60618); Sat, 07 Jan 2023 07:38:04 +0000 Received: (at 60618) by debbugs.gnu.org; 7 Jan 2023 07:37:57 +0000 Received: from localhost ([127.0.0.1]:56371 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE3m9-0006aA-Fp for submit@debbugs.gnu.org; Sat, 07 Jan 2023 02:37:57 -0500 Received: from mail-lf1-f46.google.com ([209.85.167.46]:36803) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE3m8-0006Zw-6y for 60618@debbugs.gnu.org; Sat, 07 Jan 2023 02:37:56 
 -0500
Received: by mail-lf1-f46.google.com with SMTP id j17so5176122lfr.3
        for <60618@debbugs.gnu.org>; Fri, 06 Jan 2023 23:37:56 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
        :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
        :subject:date:message-id:reply-to;
        bh=z3X7LNouvIZ4Nx/f17IZOaMFuI1rdBCNe3cBZHL7WGI=;
        b=0ULI+nEXC1m83HpvbtATVfkOw1/8VSlpx0eG5wFu8qkKBeSdRhiBkQWHofDZu4Eavw
        WQhEDotqUws59G9ldfGbve0aGnHTQ95Fw6wen1LFiEgb8WQmvJGTuy6G/YN4cmWL0InG
        tS6nUP/MV0HOnmCV+Tb2I32siV/A6QIjhH9kgFfOIw6Y52VW/fG+GCUOnAkQXPxP1L6J
        MRhgFrJsHscMXur6CwyrFNTkCNSeTXgsVKzNRtEmBXWqZU0DjixdSXi+7MJJgSCEFYQU
        wPFvcHI1Ca/ecE4NLEI5swE1WhSv0+Rw6fT5RDYcCLUWyu0J5PGEjre1GqwFhng84SkM
        pD6Q==
X-Gm-Message-State: AFqh2kqz77SAMIUqejdUpue67ZhfWmxXoNMJBUpgEHaUhyoHOyxgwf9a
        iNUFIHaFLjv43pdNuK/ez9hiz1KPhZbTyrmqVAE=
X-Google-Smtp-Source: AMrXdXvqUMKC8IOzyg4OwMSPzpEKj3Y5SrwwsWZG8xVjiuS/pXz32FYdEJK3/ygszz3VKVuMweYvBbVszj19uG9gwYI=
X-Received: by 2002:ac2:523a:0:b0:4b6:e80b:7e44 with SMTP id
        i26-20020ac2523a000000b004b6e80b7e44mr2525028lfl.508.1673077070248;
        Fri, 06 Jan 2023 23:37:50 -0800 (PST)
MIME-Version: 1.0
References:
In-Reply-To:
From: Jim Meyering
Date: Fri, 6 Jan 2023 23:37:37 -0800
Message-ID:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
X-Spam-Score: 0.2 (/)
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.8 (/)

On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering wrote:
> On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas wrote:
> > Reported to PCRE[1] with mention of GNU grep being also affected.
> >
> > [1] https://github.com/PCRE2Project/pcre2/issues/185
>
> Yikes. This is a big deal.
> Thank you for the patch and added test.
> I made a tiny comment tweak and this test logic change that was
> required to make the new test pass with the fixed version.
>
> -grep -Po 'r\w' in > out && fail=1
> +grep -Po 'r\w' in > out || fail=1
>
> Also, make syntax-check required to change e.g.,
>
> -compare out exp || fail=1
> +compare exp out || fail=1
>
> Every bug fix needs a NEWS entry, so I added this:
>
>   With -P, some non-ASCII UTF8 characters were not recognized as
>   word-constituent due to our omission of the PCRE_UCP flag. E.g.,
>   given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
>   this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
>   After the fix, it prints the correct results: "rú:ú".
>
> Finally, I expanded the ChangeLog entry and gave credit where due.
>
> I'll push this tomorrow:

Must also mention Karl Pettersson in the ChangeLog:

pcre: use UCP in UTF mode

This fixes a serious bug affecting word-boundary and word-constituent regular
expressions when the desired match involves non-ASCII UTF8 characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
This bug was present from grep-2.5, when --perl-regexp (-P) support was added.

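The before/after outputs quoted in the NEWS entry above (":r" versus "rú:ú") can be illustrated outside grep. What follows is only a sketch by analogy, not part of the patch and not PCRE2 itself: it uses Python's standard re module, on the assumption that re.ASCII is a fair stand-in for matching without PCRE2_UCP (ASCII-only \w and \b) and that Python's default Unicode matching for str patterns stands in for PCRE2_UTF | PCRE2_UCP. The helper names first_match and news_output are hypothetical.

```python
import re

TEXT = "Per\u00fa"  # "Perú", the sample input used in the NEWS entry


def first_match(pattern: str, flags: int = 0) -> str:
    """Return the first match of pattern in TEXT, or "" when nothing matches."""
    m = re.search(pattern, TEXT, flags)
    return m.group(0) if m else ""


def news_output(flags: int = 0) -> str:
    # Mirrors the shape of: echo $(f 'r\w'):$(f '.\b')
    return first_match(r"r\w", flags) + ":" + first_match(r".\b", flags)


if __name__ == "__main__":
    # ASCII-only word characters: "ú" is not \w, so 'r\w' finds nothing and
    # the only "character followed by a word boundary" is "r".
    print(news_output(re.ASCII))  # :r
    # Unicode-aware word characters: "ú" is \w and the boundary moves to the
    # end of the word.
    print(news_output())  # rú:ú
```

In both engines the same thing changes: once "ú" counts as a word character, the word boundary sits after it rather than before it, which is why both patterns in the NEWS example flip at once.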
From debbugs-submit-bounces@debbugs.gnu.org Sat Jan 07 17:55:38 2023 Received: (at control) by debbugs.gnu.org; 7 Jan 2023 22:55:38 +0000 Received: from localhost ([127.0.0.1]:59154 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pEI6E-0004c1-Am for submit@debbugs.gnu.org; Sat, 07 Jan 2023 17:55:38 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:44080) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pEI6C-0004bn-Mo for control@debbugs.gnu.org; Sat, 07 Jan 2023 17:55:37 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 05E12160040 for ; Sat, 7 Jan 2023 14:55:29 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id wR1lfVLkjnh7 for ; Sat, 7 Jan 2023 14:55:28 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 49F6D160041 for ; Sat, 7 Jan 2023 14:55:28 -0800 (PST) DKIM-Filter: OpenDKIM Filter v2.9.2 zimbra.cs.ucla.edu 49F6D160041 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cs.ucla.edu; s=78364E5A-2AF3-11ED-87FA-8298ECA2D365; t=1673132128; bh=Zhu4d6F1ZdxZRdxr2KZ24bQwrtVKtsmZTGwZPeTYXEw=; h=Message-ID:Date:MIME-Version:To:From:Subject:Content-Type: Content-Transfer-Encoding; b=U54IVUwUGBI6s5oCX0gaDzhKdlOXD/ccSo6o0/LjqWKJ3n/BeamTweHuSNkYMGSo8 Aqen3s6OJ8LEgldnbeSdwEyfO8XAd927NYASWh8K7B+ohiPSqRVVcdQNl3qyWqUgcd wyYT0vDPWRLvsmzpd9nR1gtPaYvZvASQTz/x4K3A= X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id VnQGQfbzbbME for ; Sat, 7 Jan 2023 14:55:28 -0800 (PST) Received: from [192.168.1.9] (cpe-172-91-119-151.socal.res.rr.com [172.91.119.151]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 1C6EC160040 for ; Sat, 7 Jan 2023 14:55:28 -0800 (PST) Message-ID: Date: Sat, 7 Jan 
 2023 14:54:53 -0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
        Thunderbird/102.4.2
Content-Language: en-US
To: control@debbugs.gnu.org
From: Paul Eggert
Subject: merge 60618 60621
Organization: UCLA Computer Science Department
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -3.3 (---)

merge 60618 60621

From unknown Sun Jun 22 04:05:56 2025
MIME-Version: 1.0
X-Mailer: MIME-tools 5.505 (Entity 5.505)
X-Loop: help-debbugs@gnu.org
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Carlo Arenas
Subject: bug#60618: closed (Re: bug#60618: unicode characters are not
        identified as such for \w and \b with -P)
Message-ID:
References:
X-Gnu-PR-Message: they-closed 60618
X-Gnu-PR-Package: grep
Reply-To: 60618@debbugs.gnu.org
Date: Sun, 08 Jan 2023 02:30:03 +0000
Content-Type: multipart/mixed; boundary="----------=_1673145003-28307-1"

This is a multi-part message in MIME format...

------------=_1673145003-28307-1
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="utf-8"

Your bug report

#60618: unicode characters are not identified as such for \w and \b with -P

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 60618@debbugs.gnu.org.

-- 
60618: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=60618
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1673145003-28307-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Received: (at 60618-done) by debbugs.gnu.org; 8 Jan 2023 02:29:09 +0000
Received: from localhost ([127.0.0.1]:59470 helo=debbugs.gnu.org)
        by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from )
        id 1pELQr-0007Kr-2L for submit@debbugs.gnu.org;
        Sat, 07 Jan 2023 21:29:09 -0500
Received: from mail-lj1-f170.google.com ([209.85.208.170]:34636)
        by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from )
        id 1pELQp-0007Ke-L2 for 60618-done@debbugs.gnu.org;
        Sat, 07 Jan 2023 21:29:08 -0500
Received: by mail-lj1-f170.google.com with SMTP id x37so5416543ljq.1
        for <60618-done@debbugs.gnu.org>; Sat, 07 Jan 2023 18:29:07 -0800 (PST)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
        :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
        :reply-to;
        bh=vVXGj/rGfPQUabWM0xXkv8hMld8MYi+K/zUpExHdOWo=;
        b=yXO6uTeGWPU2Bxc9qlqe8VA3W5S2G6d0puzBJLlTzkxaa0SULNkVXdZsA7w/+/CDVE
        Dd0wGhgVSlf/D5ayLbBSXBaEwypekmsYt6V5K3fgKfI4e0owXpls9J/IWy9y9BwApDnT
        iTRheP2SAC9CUmjTH3KdGhId4OLS1yVnL+3W84HhlKsoZAfgnxD+AD3eCBsnAqNjzWli
        m5AW0vhDHTx4F1skQ6mG5qwlZPLFdpitrPtUXiLFqwdXOk2wLt0jiOwGOvn+0zXlSy2G
        me2qeMcplDNG7zZEsAiWHORdfhHCj32zGDcKxi0rJz4dIKyxo6gkWq8X56/k1FxhT6U3
        jdPA==
X-Gm-Message-State: AFqh2kosocv+BS4D3s6cm3W0jVFKvZzEt4O05zwnm23MPYbW4XiQkBpT
        9OtsgcmpoJV0/rNhmw1M1Jq60vrH+5AOLYxjZbs=
X-Google-Smtp-Source: AMrXdXsc3yoDSaCi76Hp62g7Mdg1jqpOgqeZJLyrWScdATzxT7aXOn1INwGTwIvjhMIk4Q6r16RJPys+sJM6Kzv+0Uo=
X-Received: by 2002:a05:651c:23a2:b0:280:507:d740 with SMTP id
        bk34-20020a05651c23a200b002800507d740mr742174ljb.523.1673144941786;
        Sat, 07 Jan 2023 18:29:01 -0800 (PST)
MIME-Version: 1.0
References:
In-Reply-To:
From: Jim Meyering
Date: Sat, 7 Jan 2023
 18:28:49 -0800
Message-ID:
Subject: Re: bug#60618: unicode characters are not identified as such for \w and \b with -P
To: Carlo Arenas
Content-Type: text/plain; charset="UTF-8"
X-Spam-Score: 0.2 (/)
X-Debbugs-Envelope-To: 60618-done
Cc: 60618-done@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -0.8 (/)

On Fri, Jan 6, 2023 at 11:37 PM Jim Meyering wrote:
> On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering wrote:
> > On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas wrote:
> > > Reported to PCRE[1] with mention of GNU grep being also affected.
> > >
> > > [1] https://github.com/PCRE2Project/pcre2/issues/185
> >
> > Yikes. This is a big deal.
> > Thank you for the patch and added test.

I've also added the new names to THANKS.in and pushed this:

https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3

------------=_1673145003-28307-1--

From unknown Sun Jun 22 04:05:56 2025
MIME-Version: 1.0
X-Mailer: MIME-tools 5.505 (Entity 5.505)
X-Loop: help-debbugs@gnu.org
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Karl Pettersson
Subject: bug#60621: closed (Re: bug#60618: unicode characters are not
        identified as such for \w and \b with -P)
Message-ID:
References:
X-Gnu-PR-Message: they-closed 60621
X-Gnu-PR-Package: grep
Reply-To: 60621@debbugs.gnu.org
Date: Sun, 08 Jan 2023 02:30:03 +0000
Content-Type: multipart/mixed; boundary="----------=_1673145003-28307-3"

This is a multi-part message in MIME format...

------------=_1673145003-28307-3
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Your bug report

#60618: grep -P does not set PCRE2_UCP

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 60621@debbugs.gnu.org.
--=20 60618: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D60618 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1673145003-28307-3 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 60618-done) by debbugs.gnu.org; 8 Jan 2023 02:29:09 +0000 Received: from localhost ([127.0.0.1]:59470 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pELQr-0007Kr-2L for submit@debbugs.gnu.org; Sat, 07 Jan 2023 21:29:09 -0500 Received: from mail-lj1-f170.google.com ([209.85.208.170]:34636) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pELQp-0007Ke-L2 for 60618-done@debbugs.gnu.org; Sat, 07 Jan 2023 21:29:08 -0500 Received: by mail-lj1-f170.google.com with SMTP id x37so5416543ljq.1 for <60618-done@debbugs.gnu.org>; Sat, 07 Jan 2023 18:29:07 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=vVXGj/rGfPQUabWM0xXkv8hMld8MYi+K/zUpExHdOWo=; b=yXO6uTeGWPU2Bxc9qlqe8VA3W5S2G6d0puzBJLlTzkxaa0SULNkVXdZsA7w/+/CDVE Dd0wGhgVSlf/D5ayLbBSXBaEwypekmsYt6V5K3fgKfI4e0owXpls9J/IWy9y9BwApDnT iTRheP2SAC9CUmjTH3KdGhId4OLS1yVnL+3W84HhlKsoZAfgnxD+AD3eCBsnAqNjzWli m5AW0vhDHTx4F1skQ6mG5qwlZPLFdpitrPtUXiLFqwdXOk2wLt0jiOwGOvn+0zXlSy2G me2qeMcplDNG7zZEsAiWHORdfhHCj32zGDcKxi0rJz4dIKyxo6gkWq8X56/k1FxhT6U3 jdPA== X-Gm-Message-State: AFqh2kosocv+BS4D3s6cm3W0jVFKvZzEt4O05zwnm23MPYbW4XiQkBpT 9OtsgcmpoJV0/rNhmw1M1Jq60vrH+5AOLYxjZbs= X-Google-Smtp-Source: AMrXdXsc3yoDSaCi76Hp62g7Mdg1jqpOgqeZJLyrWScdATzxT7aXOn1INwGTwIvjhMIk4Q6r16RJPys+sJM6Kzv+0Uo= X-Received: by 2002:a05:651c:23a2:b0:280:507:d740 with SMTP id bk34-20020a05651c23a200b002800507d740mr742174ljb.523.1673144941786; Sat, 07 Jan 2023 18:29:01 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Jim Meyering Date: Sat, 7 Jan 2023 
18:28:49 -0800 Message-ID: Subject: Re: bug#60618: unicode characters are not identified as such for \w and \b with -P To: Carlo Arenas Content-Type: text/plain; charset="UTF-8" X-Spam-Score: 0.2 (/) X-Debbugs-Envelope-To: 60618-done Cc: 60618-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.8 (/) On Fri, Jan 6, 2023 at 11:37 PM Jim Meyering wrote: > On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering wrote: > > On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas wrote: > > > Reported to PCRE[1] with mention of GNU grep being also affected. > > > > > > [1] https://github.com/PCRE2Project/pcre2/issues/185 > > > > Yikes. This is a big deal. > > Thank you for the patch and added test. I've also added the new names to THANKS.in and pushed this: https://git.savannah.gnu.org/cgit/grep.git/commit/?id=5e3b760f65f13856e5717e5b9d935f5b4a615be3 ------------=_1673145003-28307-3 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 7 Jan 2023 07:37:37 +0000 Received: from localhost ([127.0.0.1]:56368 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pE3lo-0006ZV-Rb for submit@debbugs.gnu.org; Sat, 07 Jan 2023 02:37:37 -0500 Received: from lists.gnu.org ([209.51.188.17]:54704) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pDtXM-0002qX-Kh for submit@debbugs.gnu.org; Fri, 06 Jan 2023 15:42:00 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pDtXM-00039z-Cq for bug-grep@gnu.org; Fri, 06 Jan 2023 15:42:00 -0500 Received: from smtp.outgoing.loopia.se ([93.188.3.37]) by eggs.gnu.org with esmtps 
(TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pDtXK-0001ni-8Y for bug-grep@gnu.org; Fri, 06 Jan 2023 15:42:00 -0500 Received: from s807.loopia.se (localhost [127.0.0.1]) by s807.loopia.se (Postfix) with ESMTP id 3966A2F5F6FF for ; Fri, 6 Jan 2023 21:41:54 +0100 (CET) Received: from s979.loopia.se (unknown [172.22.191.6]) by s807.loopia.se (Postfix) with ESMTP id 2B0D72E2826B for ; Fri, 6 Jan 2023 21:41:54 +0100 (CET) Received: from s476.loopia.se (unknown [172.22.191.6]) by s979.loopia.se (Postfix) with ESMTP id 28DC710BC40B for ; Fri, 6 Jan 2023 21:41:54 +0100 (CET) X-Virus-Scanned: amavisd-new at amavis.loopia.se X-Spam-Flag: NO X-Spam-Score: -1 X-Spam-Level: X-Spam-Status: No, score=-1 tagged_above=-999 required=6.2 tests=[ALL_TRUSTED=-1] autolearn=disabled Received: from s981.loopia.se ([172.22.191.6]) by s476.loopia.se (s476.loopia.se [172.22.190.16]) (amavisd-new, port 10024) with LMTP id aLO3rm-ANgfA for ; Fri, 6 Jan 2023 21:41:53 +0100 (CET) X-Loopia-Auth: user X-Loopia-User: karl.pettersson@klpn.se X-Loopia-Originating-IP: 31.209.52.155 Received: from localhost (31-209-52-155.cust.bredband2.com [31.209.52.155]) (Authenticated sender: karl.pettersson@klpn.se) by s981.loopia.se (Postfix) with ESMTPSA id C1DA922B1765 for ; Fri, 6 Jan 2023 21:41:53 +0100 (CET) Date: Fri, 6 Jan 2023 21:41:53 +0100 From: Karl Pettersson To: bug-grep Subject: grep -P does not set PCRE2_UCP Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit Received-SPF: none client-ip=93.188.3.37; envelope-from=karl.pettersson@klpn.se; helo=smtp.outgoing.loopia.se X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: 
Sat, 07 Jan 2023 02:37:32 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Hi

Using grep -P for boundary matches yields incorrect results with
non-ASCII letters:

$ echo 'Öst' | grep -P '\bs'
Öst

The output should be nothing in this case, and the culprit seems to be
this line in pcresearch.c:

flags |= PCRE2_UTF;

If the PCRE2_UCP flag is added according to this, the program behaves
correctly:

flags |= PCRE2_UTF|PCRE2_UCP;

The pcre2grep test program in the pcre2 has the same problem, and I
filed an issue there too:
https://github.com/PCRE2Project/pcre2/issues/185

A Twitter discussion with more examples:
https://twitter.com/gro_tsen/status/1610972356972875777

Kind regards
-- 
Karl Pettersson
Uppsala, Sverige/Sweden
https://static-dust.klpn.se/

------------=_1673145003-28307-3--
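
The difference the report describes (ASCII-only versus Unicode-aware \w and \b) can be illustrated without grep. The sketch below is only an analogy: it uses Python's re module, not PCRE2, and treats re.ASCII as a stand-in for "PCRE2_UTF without PCRE2_UCP". The helper name boundary_demo is hypothetical. The inputs are the ones from the report ('Öst') and from the test added by the patch ('Perú'), written with escapes so the file encoding does not matter.

```python
import re


def boundary_demo(text, pattern):
    """Return (ascii_only_matches, unicode_aware_matches) for pattern on text.

    re.ASCII restricts \\w and \\b to [A-Za-z0-9_], which is an analogy for
    PCRE2 in UTF mode without PCRE2_UCP; the default str behaviour is an
    analogy for UTF mode with PCRE2_UCP. This is not PCRE2 itself.
    """
    ascii_only = re.findall(pattern, text, flags=re.ASCII)
    unicode_aware = re.findall(pattern, text)
    return ascii_only, unicode_aware


if __name__ == "__main__":
    cases = [
        ("\u00d6st", r"\bs"),   # 'Öst': is there a word boundary before 's'?
        ("Per\u00fa", r".\b"),  # 'Perú': which character precedes a boundary?
        ("Per\u00fa", r"r\w"),  # 'Perú': is 'ú' a word character?
    ]
    for text, pattern in cases:
        ascii_only, unicode_aware = boundary_demo(text, pattern)
        print(f"{text!r} {pattern!r}: ascii={ascii_only} unicode={unicode_aware}")
```

In the ASCII-only column 'Ö' and 'ú' are not word characters, so a boundary appears in the middle of the word. That is the same symptom as the unexpected 'Öst' line printed by grep -P in the report.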