Paul Eggert wrote:
>> I was referring to text containing encoding errors without containing NULs
Ah - that makes sense.
The following experiment leads me to conclude that grep entirely suppresses
emitting any portion of a match that would contain an encoding error, rather
than emitting some substring of the match that can be correctly encoded.
That is, when grep is asked to emit a match that would contain an encoding
error, it seems to suppress that output line entirely and continue looking
for matches that it can emit without encoding errors. Then, at the end, if it
saw a match that would have emitted an encoding error, it issues the
"Binary file ... matches" message just before exiting (or before it finishes
processing that particular file).
I demonstrated this by replacing the ELF executable of my previous example with
the output of the following C program, which emits every possible pair of bytes
drawn from the values 1 through 254 (that is, no nul bytes and no 255 bytes):
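#include <stdio.h>

int main(void)
{
    int i, j;

    /* Emit every pair of bytes drawn from 1..254 (no nul, no 255). */
    for (i = 1; i < 255; i++) {
        for (j = 1; j < 255; j++)
            printf("%c%c", i, j);
    }
    puts("");
    return 0;
}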
So I tested on a file (/tmp/pjcc) containing (1) a bunch of ASCII C code,
(2) output from the above program, and (3) another copy of the same ASCII C code.
Then, with the following settings:
LC_COLLATE=C
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
LANG=en_US.UTF-8
I ran the command:
grep "'N'" /tmp/pjcc
I got the following output:
case 'N':
case 'N':
Binary file /tmp/pjcc matches
The "case 'N':" string appears once in the C code used in the file, but
there are two copies of that C code in the file, so grep prints that line twice.
I also double checked that my file /tmp/pjcc did not contain any nul bytes.
The three-character sequence 'N' also appears in the middle section of
all non-nul, non-255 pairs of bytes, as well as in the ASCII C code, and
it was (I presume) the match on that section of the file that caused grep
to issue the "Binary file /tmp/pjcc matches" complaint at the
end of its processing of that file.
If, on the other hand, I ran the command:
grep "'N':" /tmp/pjcc
then I got the output:
case 'N':
case 'N':
without any "Binary file /tmp/pjcc matches" complaint.
The four-character sequence 'N': appears (twice) in the C code,
but zero times in the middle section of all non-nul, non-255 pairs of bytes.
From this I conclude that if grep, in its default mode, is asked to emit a
matching line that would contain encoding errors, it does not trim the output to
what would encode correctly and continue onward. Rather, it emits nothing for that
match, continues onward looking for more matches that it can emit correctly, and
then prints the "Binary file ... matches" message just before it exits or goes to
the next file.
If I were designing grep from scratch, and had infinite resources, I might prefer to
have grep emit some substring of each match that it can encode correctly, rather
than emit nothing in case of an encoding error.
However, I can't imagine that this is worth the effort, and (being a stick
in the mud old fart) I usually recommend against incompatible changes
unless strongly necessary.
So ... whatever ... nevermind ... as they say.
--
Paul Jackson
pj@usa.net