From unknown Wed Jun 18 00:12:09 2025
From: bug#38656 <38656@debbugs.gnu.org>
To: bug#38656 <38656@debbugs.gnu.org>
Subject: Status: [PATCH] grep: do not match invalid UTF-8
Reply-To: bug#38656 <38656@debbugs.gnu.org>
Date: Wed, 18 Jun 2025 07:12:09 +0000

retitle 38656 [PATCH] grep: do not match invalid UTF-8
reassign 38656 grep
submitter 38656 Paul Eggert
severity 38656 normal
tag 38656 patch
thanks

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 18 01:05:36 2019
From: Paul Eggert
To: bug-grep@gnu.org
Subject: [PATCH] grep: do not match invalid UTF-8
Date: Tue, 17 Dec 2019 22:05:19 -0800
Message-Id: <20191218060519.29385-1-eggert@cs.ucla.edu>
Cc: Paul Eggert

Update Gnulib to latest.  Also:
* src/dfasearch.c (EGexecute): Use ptrdiff_t, not size_t,
to match new Gnulib API.
* tests/Makefile.am (TESTS): Add dfa-invalid-utf8.
* tests/dfa-invalid-utf8: New file.
---
 NEWS                   |  5 ++++-
 gnulib                 |  2 +-
 src/dfasearch.c        |  2 +-
 tests/Makefile.am      |  1 +
 tests/dfa-invalid-utf8 | 29 +++++++++++++++++++++++++++++
 5 files changed, 36 insertions(+), 3 deletions(-)
 create mode 100755 tests/dfa-invalid-utf8

diff --git a/NEWS b/NEWS
index b106e2f..b6ff57c 100644
--- a/NEWS
+++ b/NEWS
@@ -9,7 +9,10 @@ GNU grep NEWS -*- outline -*-
 
 ** Bug fixes
 
-  grep -Fw can no longer false match in non-UTF8 multibyte locales
+  '.' no longer matches some invalid byte sequences in UTF-8 locales.
+  [bug introduced in grep 2.7]
+
+  grep -Fw can no longer false match in non-UTF-8 multibyte locales
   For example, this command would erroneously print its input line:
     echo ab | LC_CTYPE=ja_JP.eucjp grep -Fw b
   [Bug#38223 introduced in grep 2.28]
diff --git a/gnulib b/gnulib
index b7bf9f4..1219c34 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit b7bf9f4361c8d78ccfda7a30ff31f7a406ea972e
+Subproject commit 1219c343014ede881069bab554408b40e5455d9c
diff --git a/src/dfasearch.c b/src/dfasearch.c
index 6c95d8c..153281d 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -234,7 +234,7 @@ EGexecute (void *vdc, char const *buf, size_t size, size_t *match_size,
       if (!start_ptr)
         {
           char const *next_beg, *dfa_beg = beg;
-          size_t count = 0;
+          ptrdiff_t count = 0;
           bool exact_kwset_match = false;
           bool backref = false;
 
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 82aebbf..dee6f46 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -86,6 +86,7 @@ TESTS = \
   dfa-coverage \
   dfa-heap-overrun \
   dfa-infloop \
+  dfa-invalid-utf8 \
   dfaexec-multibyte \
   empty \
   empty-line \
diff --git a/tests/dfa-invalid-utf8 b/tests/dfa-invalid-utf8
new file mode 100755
index 0000000..1748043
--- /dev/null
+++ b/tests/dfa-invalid-utf8
@@ -0,0 +1,29 @@
+#! /bin/sh
+# Test whether "grep '.'" matches invalid UTF-8 byte sequences.
+#
+# Copyright 2019 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+require_compiled_in_MB_support
+
+fail=0
+
+printf 'a\360\202\202\254b\n' >in1 || framework_failure_
+LC_ALL=en_US.UTF-8 grep 'a.b' in1 > out1 2> err1
+test $? -eq 1 || fail=1
+compare /dev/null out1 || fail=1
+compare /dev/null err1 || fail=1
+
+printf 'a\360\202\202\254ba\360\202\202\254b\n' >in2 ||
+  framework_failure_
+LC_ALL=en_US.UTF-8 grep -E '(a.b)\1' in2 > out2 2> err2
+test $? -eq 1 || fail=1
+compare /dev/null out2 || fail=1
+compare /dev/null err2 || fail=1
+
+Exit $fail
-- 
2.17.1

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 18 03:40:05 2019
To: control@debbugs.gnu.org
From: Paul Eggert
Subject: 38656 has been fixed
Organization: UCLA Computer Science Department
Date: Wed, 18 Dec 2019 00:39:53 -0800

close 38656

From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 18 12:06:40 2019
Subject: Re: [PATCH 4/4] dfa: do not match invalid UTF-8
To: Bruno Haible
References: <20191218054724.28770-1-eggert@cs.ucla.edu> <20191218054724.28770-4-eggert@cs.ucla.edu> <1969595.3szdDT6rsk@omega>
From: Paul Eggert
Organization: UCLA Computer Science Department
Date: Wed, 18 Dec 2019 09:06:30 -0800
In-Reply-To: <1969595.3szdDT6rsk@omega>
Cc: bug-gnulib@gnu.org, 38656@debbugs.gnu.org

On 12/18/19 12:48 AM, Bruno Haible wrote re my recent Gnulib change , with corresponding Grep change :

> Do I understand it correctly that, as a consequence of this change,
> 'grep' with a regex of '^.*$' will no longer match lines which contains
> an invalid UTF-8 byte sequence?

Yes and no. dfa.c's '^.*$' already rejected some lines with invalid UTF-8 byte
sequences. The change merely made dfa.c reject all such lines.

> - Is this effect on 'grep' intended? (And the workaround is to use the
>   "C" locale.)

Yes.

> - Is it consistent with the behaviour of regex and kwset, which 'grep'
>   also uses, depending on the arguments (as far as I understand)?

No, in the sense that the matchers disagree about what to do with encoding
errors. I think regex '.' matches the first byte of an encoding error (which
would be hard to mimic in that part of dfa.c as this behavior requires
lookahead). I don't know what kwset does. In some sense it doesn't matter, as
neither POSIX nor the grep manual says what to do when the pattern or input
contains encoding errors.

I installed the patch because it seemed "wrong" to me that the "." pattern
matched an invalid byte sequence of length 2 or more, with no characters in
sight. Conversely, I suppose if the change significantly hurts performance,
then it should be reverted (but with a comment explaining why dfa.c accepts
more than just the valid UTF-8 byte sequences) or perhaps redone in a better
way.

I am cc'ing this to 38656@debbugs.gnu.org to give 'grep' lurkers a heads-up
about this.

From unknown Wed Jun 18 00:12:09 2025
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Thu, 16 Jan 2020 12:24:04 +0000

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
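
As an illustration of why the input in tests/dfa-invalid-utf8 counts as an encoding error: the bytes \360\202\202\254 (F0 82 82 AC) are an overlong four-byte encoding of U+20AC, whose well-formed encoding is \342\202\254 (E2 82 AC). The sketch below is not part of the patch and does not call grep or dfa.c. The helper name utf8_valid is hypothetical, and the function is only a rough POSIX sh check of the RFC 3629 byte ranges, assuming an od that accepts -An -v -tu1.

```shell
#!/bin/sh
# utf8_valid (hypothetical helper, not from grep or Gnulib):
# exit status 0 if standard input is well-formed UTF-8, 1 otherwise.
# Note: POSIX sh has no local variables, so need/lo/hi/b are globals.
utf8_valid() {
  need=0          # continuation bytes still expected
  lo=128 hi=191   # allowed range for the next continuation byte
  # od prints every input byte as an unsigned decimal number.
  for b in $(od -An -v -tu1); do
    if [ "$need" -gt 0 ]; then
      # Inside a multibyte sequence: the byte must be in [lo, hi].
      if [ "$b" -lt "$lo" ] || [ "$b" -gt "$hi" ]; then
        return 1
      fi
      need=$((need - 1)); lo=128; hi=191
    elif [ "$b" -le 127 ]; then
      :                               # ASCII
    elif [ "$b" -ge 194 ] && [ "$b" -le 223 ]; then
      need=1                          # C2..DF: two-byte sequence
    elif [ "$b" -ge 224 ] && [ "$b" -le 239 ]; then
      need=2                          # E0..EF: three-byte sequence
      case $b in
        224) lo=160 ;;                # E0: reject overlong forms
        237) hi=159 ;;                # ED: reject surrogates
      esac
    elif [ "$b" -ge 240 ] && [ "$b" -le 244 ]; then
      need=3                          # F0..F4: four-byte sequence
      case $b in
        240) lo=144 ;;                # F0: reject overlong forms
        244) hi=143 ;;                # F4: reject values above U+10FFFF
      esac
    else
      return 1                        # 80..C1 and F5..FF never start a character
    fi
  done
  # A sequence truncated at end of input is also an error.
  [ "$need" -eq 0 ]
}

# The line from the test script: F0 must be followed by 90..BF, not 82.
if printf 'a\360\202\202\254b\n' | utf8_valid; then echo valid; else echo invalid; fi
```

With the test's input the last command reports the line as invalid, since the byte after F0 falls below the permitted range. Replacing the four bytes with \342\202\254 gives a well-formed line, which is the case where 'a.b' is expected to match.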