GNU bug report logs - #38656
[PATCH] grep: do not match invalid UTF-8
Reported by: Paul Eggert <eggert <at> cs.ucla.edu>
Date: Wed, 18 Dec 2019 06:06:02 UTC
Severity: normal
Tags: patch
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 38656 in the body.
You can then email your comments to 38656 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#38656; Package grep. (Wed, 18 Dec 2019 06:06:02 GMT)
Acknowledgement sent to Paul Eggert <eggert <at> cs.ucla.edu>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Wed, 18 Dec 2019 06:06:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Update Gnulib to latest. Also:
* src/dfasearch.c (EGexecute): Use ptrdiff_t, not size_t,
to match new Gnulib API.
* tests/Makefile.am (TESTS): Add dfa-invalid-utf8.
* tests/dfa-invalid-utf8: New file.
---
NEWS | 5 ++++-
gnulib | 2 +-
src/dfasearch.c | 2 +-
tests/Makefile.am | 1 +
tests/dfa-invalid-utf8 | 29 +++++++++++++++++++++++++++++
5 files changed, 36 insertions(+), 3 deletions(-)
create mode 100755 tests/dfa-invalid-utf8
diff --git a/NEWS b/NEWS
index b106e2f..b6ff57c 100644
--- a/NEWS
+++ b/NEWS
@@ -9,7 +9,10 @@ GNU grep NEWS -*- outline -*-
** Bug fixes
- grep -Fw can no longer false match in non-UTF8 multibyte locales
+ '.' no longer matches some invalid byte sequences in UTF-8 locales.
+ [bug introduced in grep 2.7]
+
+ grep -Fw can no longer false match in non-UTF-8 multibyte locales
For example, this command would erroneously print its input line:
echo ab | LC_CTYPE=ja_JP.eucjp grep -Fw b
[Bug#38223 introduced in grep 2.28]
diff --git a/gnulib b/gnulib
index b7bf9f4..1219c34 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit b7bf9f4361c8d78ccfda7a30ff31f7a406ea972e
+Subproject commit 1219c343014ede881069bab554408b40e5455d9c
diff --git a/src/dfasearch.c b/src/dfasearch.c
index 6c95d8c..153281d 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -234,7 +234,7 @@ EGexecute (void *vdc, char const *buf, size_t size, size_t *match_size,
if (!start_ptr)
{
char const *next_beg, *dfa_beg = beg;
- size_t count = 0;
+ ptrdiff_t count = 0;
bool exact_kwset_match = false;
bool backref = false;
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 82aebbf..dee6f46 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -86,6 +86,7 @@ TESTS = \
dfa-coverage \
dfa-heap-overrun \
dfa-infloop \
+ dfa-invalid-utf8 \
dfaexec-multibyte \
empty \
empty-line \
diff --git a/tests/dfa-invalid-utf8 b/tests/dfa-invalid-utf8
new file mode 100755
index 0000000..1748043
--- /dev/null
+++ b/tests/dfa-invalid-utf8
@@ -0,0 +1,29 @@
+#! /bin/sh
+# Test whether "grep '.'" matches invalid UTF-8 byte sequences.
+#
+# Copyright 2019 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_en_utf8_locale_
+require_compiled_in_MB_support
+
+fail=0
+
+printf 'a\360\202\202\254b\n' >in1 || framework_failure_
+LC_ALL=en_US.UTF-8 grep 'a.b' in1 > out1 2> err1
+test $? -eq 1 || fail=1
+compare /dev/null out1 || fail=1
+compare /dev/null err1 || fail=1
+
+printf 'a\360\202\202\254ba\360\202\202\254b\n' >in2 ||
+ framework_failure_
+LC_ALL=en_US.UTF-8 grep -E '(a.b)\1' in2 > out2 2> err2
+test $? -eq 1 || fail=1
+compare /dev/null out2 || fail=1
+compare /dev/null err2 || fail=1
+
+Exit $fail
--
2.17.1
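A note on why the test input counts as invalid: the four bytes \360\202\202\254 (hex F0 82 82 AC) have the shape of a four-byte UTF-8 sequence, but their payload bits decode to U+20AC, which already fits in the three bytes E2 82 AC. UTF-8 forbids such overlong forms. The sketch below is not part of the patch; the function name decode4 is made up for illustration, and it only does the bit arithmetic, with no validation of its own.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch): combine the payload bits of a
# four-byte UTF-8-shaped sequence and print the resulting code point in hex.
# It performs no validation; it only shows what the bytes would decode to.
decode4 () {
  hi=$(( (($1 & 0x07) << 18) | (($2 & 0x3F) << 12) ))
  lo=$(( (($3 & 0x3F) << 6) | ($4 & 0x3F) ))
  printf '%X\n' $(( hi | lo ))
}

# The bytes that printf 'a\360\202\202\254b\n' writes between 'a' and 'b'.
cp=$(decode4 0xF0 0x82 0x82 0xAC)
echo "payload decodes to U+$cp"

# Code points below U+10000 need at most three bytes, so a four-byte form
# whose payload is this small is an overlong (invalid) encoding.
if [ $(( 0x$cp )) -lt $(( 0x10000 )) ]; then
  echo "overlong: invalid UTF-8"
fi
```

Run as a script, this reports U+20AC and flags the sequence as overlong, which is why the test expects 'a.b' not to match it in a UTF-8 locale.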
bug closed, send any further explanations to 38656 <at> debbugs.gnu.org and Paul Eggert <eggert <at> cs.ucla.edu>
Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 18 Dec 2019 08:41:02 GMT)
Information forwarded to bug-grep <at> gnu.org: bug#38656; Package grep. (Wed, 18 Dec 2019 17:07:01 GMT)
Message #10 received at 38656 <at> debbugs.gnu.org (full text, mbox):
On 12/18/19 12:48 AM, Bruno Haible wrote re my recent Gnulib change
<https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=1219c343014ede881069bab554408b40e5455d9c>,
with corresponding Grep change
<https://git.savannah.gnu.org/cgit/grep.git/commit/?id=c9a6e4bf919e1b28970e11b29aa720a7e6144834>:

> Do I understand it correctly that, as a consequence of this change,
> 'grep' with a regex of '^.*$' will no longer match lines which contain
> an invalid UTF-8 byte sequence?

Yes and no. dfa.c's '^.*$' already rejected some lines with invalid UTF-8 byte
sequences. The change merely made dfa.c reject all such lines.

> - Is this effect on 'grep' intended? (And the workaround is to use the
> "C" locale.)

Yes.

> - Is it consistent with the behaviour of regex and kwset, which 'grep'
> also uses, depending on the arguments (as far as I understand)?

No, in the sense that the matchers disagree about what to do with encoding
errors. I think regex '.' matches the first byte of an encoding error (which
would be hard to mimic in that part of dfa.c as this behavior requires
lookahead). I don't know what kwset does.

In some sense it doesn't matter, as neither POSIX nor the grep manual says what
to do when the pattern or input contains encoding errors. I installed the patch
because it seemed "wrong" to me that the "." pattern matched an invalid byte
sequence of length 2 or more, with no characters in sight.

Conversely, I suppose if the change significantly hurts performance, then it
should be reverted (but with a comment explaining why dfa.c accepts more than
just the valid UTF-8 byte sequences) or perhaps redone in a better way.

I am cc'ing this to 38656 <at> debbugs.gnu.org to give 'grep' lurkers a heads-up
about this.
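
As a small illustration of the "C" locale workaround confirmed in the message above: in the C locale each byte is its own character, so there are no encoding errors and '^.*$' can match a line holding the bytes from tests/dfa-invalid-utf8. This is only a sketch. It assumes a grep that treats every byte as a valid character in the C locale, and the function name count_c_locale is hypothetical. The -c option is used so that only a count is printed, whatever the grep implementation decides about binary-looking input.

```shell
#!/bin/sh
# Hypothetical helper: count the lines on standard input that match '^.*$'
# when grep runs in the C locale, where every byte is a character.
count_c_locale () {
  LC_ALL=C grep -c '^.*$'
}

# Same bytes as the first case in tests/dfa-invalid-utf8.
printf 'a\360\202\202\254b\n' | count_c_locale
```

Under the stated assumption the count is 1, meaning the line matches. Note that 'a.b' from the test would still not match in the C locale, because four bytes (hence four characters) sit between 'a' and 'b'.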
bug archived.
Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 16 Jan 2020 12:24:04 GMT)
This bug report was last modified 5 years and 153 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.