Paul Eggert wrote:

>> I was referring to text containing encoding errors without containing NULs

Ah - that makes sense.

The following experiment leads me to conclude that grep entirely suppresses

emitting any portion of a match that would contain an encoding error, rather

than emitting some substring of the match that can be correctly encoded.

That is, it seems that if grep is asked to emit what it thinks would be a

match with an encoding error, it suppresses that output line

entirely, and continues looking for matches that it can emit without encoding

errors, and then at the end, if it saw a match that would have emitted an

encoding error, it issues the "Binary file ... matches" error, just

before exiting (or ending processing of that particular file.)

I demonstrated this by replacing the ELF executable of my previous example with

the output of the following C program, which issues every possible pair of bytes,

excluding any pair that contains a nul byte or a 255 byte:

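#include <stdio.h>

/* Emit every pair of bytes drawn from 1..254 (no nul, no 255). */
int main(void)
{
    int i, j;

    for (i = 1; i < 255; i++) {
        for (j = 1; j < 255; j++)
            printf("%c%c", i, j);
    }
    puts("");    /* final newline */
    return 0;
}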

So I tested on a file (/tmp/pjcc) containing (1) a bunch of ASCII C code,

(2) output from the above program, and (3) another copy of the same ASCII C code.

Then, with the following settings:

LC_COLLATE=C

LANGUAGE=en_US.UTF-8

LC_ALL=en_US.UTF-8

LANG=en_US.UTF-8

I ran the command:

grep "'N'" /tmp/pjcc

I got the following output:

case 'N':

Binary file /tmp/pjcc matches

The "case 'N':" string appears once in the C code used in the file, but

there are two copies of that C code in the file, so that grep prints that line twice.

I also double-checked that my file /tmp/pjcc did not contain any nul bytes.
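
For anyone repeating the experiment, a check like this can be done in many ways. The following is only a minimal C sketch that works on a buffer already read into memory; the name count_byte is a hypothetical helper, not something taken from grep.

```c
#include <stddef.h>

/* Hypothetical helper: count how many bytes in buf[0..len) equal value.
 * A result of 0 for value 0 means the buffer holds no nul bytes. */
size_t count_byte(const unsigned char *buf, size_t len, unsigned char value)
{
    size_t count = 0;
    size_t k;

    for (k = 0; k < len; k++) {
        if (buf[k] == value)
            count++;
    }
    return count;
}
```

With the file contents loaded into buf, count_byte(buf, len, 0) == 0 corresponds to "no nul bytes", and count_byte(buf, len, 255) can be used the same way for the 255 byte.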

The three character sequence 'N' also appears in the middle section of

all non-nul, non-255 pairs of bytes, as well as in the ASCII C code, and

it was (I presume) the match on that section of the file that caused grep

to issue the "Binary file /tmp/pjcc matches" complaint at the

end of its processing of that file.

If, on the other hand, I ran the command:

grep "'N':" /tmp/pjcc

then I got the output:

case 'N':

without any "Binary file /tmp/pjcc matches" complaint.

The four character sequence 'N': appears (twice) in the C code,

but zero times in the middle section of all non-nul, non-255 pairs of bytes.
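
The two occurrence counts for the middle section can be checked mechanically. The sketch below rebuilds the same byte pairs in memory, in the same order as the generator program but without its final newline, and counts a fixed byte string in them. The names PAIR_STREAM_LEN, fill_pair_stream and count_occurrences are hypothetical, chosen only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* 254 values for the first byte, 254 for the second, 2 bytes per pair. */
#define PAIR_STREAM_LEN (254 * 254 * 2)

/* Fill buf (at least PAIR_STREAM_LEN bytes) with every pair of bytes
 * drawn from 1..254, in the order the generator program prints them. */
void fill_pair_stream(unsigned char *buf)
{
    size_t n = 0;
    int i, j;

    for (i = 1; i < 255; i++) {
        for (j = 1; j < 255; j++) {
            buf[n++] = (unsigned char)i;
            buf[n++] = (unsigned char)j;
        }
    }
}

/* Count occurrences (overlapping ones included) of the string pat
 * in buf[0..len). An empty pattern counts as zero occurrences. */
size_t count_occurrences(const unsigned char *buf, size_t len,
                         const char *pat)
{
    size_t plen = strlen(pat);
    size_t count = 0;
    size_t k;

    if (plen == 0 || plen > len)
        return 0;
    for (k = 0; k + plen <= len; k++) {
        if (memcmp(buf + k, pat, plen) == 0)
            count++;
    }
    return count;
}
```

The three-byte sequence turns up where the pair (39, 78) is immediately followed by the pair (39, 79), which spells 'N'O in ASCII. No two consecutive pairs can spell the four-byte sequence, since consecutive pairs either share their first byte or run from (i, 254) to (i+1, 1).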

From this I conclude that if grep, in its default mode, is asked to emit a matching

pattern that would contain encoding errors, it does not trim the output to what

would encode correctly and continue onward, but rather emits nothing for that match,

continues onward looking for more matches that it can emit correctly, and then

prints the "Binary file ... matches" error just before it exits or goes to the

next file.
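
To make that conclusion concrete, here is a toy model of the behavior as I have inferred it from the experiment. It is emphatically not grep's code: it matches a fixed string rather than a regular expression, it assumes UTF-8 is the locale's encoding, and the names is_valid_utf8 and model_grep are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if s[0..len) is well-formed UTF-8, 0 otherwise. Rejects stray
 * continuation bytes, truncated sequences, overlong forms, surrogates
 * and values above U+10FFFF. */
int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t need, k;

        if (c < 0x80) { i++; continue; }
        if (c >= 0xC2 && c <= 0xDF)      { need = 1; cp = c & 0x1F; min = 0x80; }
        else if (c >= 0xE0 && c <= 0xEF) { need = 2; cp = c & 0x0F; min = 0x800; }
        else if (c >= 0xF0 && c <= 0xF4) { need = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;                   /* invalid lead byte */

        if (len - i - 1 < need)
            return 0;                    /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;                /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}

/* Toy model: print each line containing pat if the line is valid UTF-8;
 * silently skip matching lines with encoding errors, and report them once
 * at the end. Returns 1 if the "Binary file" message was printed. */
int model_grep(const unsigned char *buf, size_t len, const char *pat,
               const char *name, FILE *out)
{
    size_t plen = strlen(pat);
    size_t start = 0;
    int suppressed = 0;

    while (start < len) {
        const unsigned char *nl = memchr(buf + start, '\n', len - start);
        size_t end = nl ? (size_t)(nl - buf) : len;
        size_t linelen = end - start;
        size_t k;
        int matched = 0;

        for (k = 0; plen > 0 && k + plen <= linelen; k++) {
            if (memcmp(buf + start + k, pat, plen) == 0) {
                matched = 1;
                break;
            }
        }
        if (matched) {
            if (is_valid_utf8(buf + start, linelen)) {
                fwrite(buf + start, 1, linelen, out);
                fputc('\n', out);
            } else {
                suppressed = 1;          /* emit nothing for this match */
            }
        }
        start = end + 1;
    }
    if (suppressed)
        fprintf(out, "Binary file %s matches\n", name);
    return suppressed;
}
```

Run over a buffer laid out like /tmp/pjcc, this model prints the clean matching lines from both copies of the C code and then the single "Binary file" line, which is the shape of output I would expect if my conclusion is right.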

If I were designing grep from scratch, and had infinite resources, I might prefer to

have grep emit some substring of each match that it can encode correctly, rather

than emit nothing in case of an encoding error.

However, I can't imagine that this is worth the effort, and (being a stick

in the mud old fart) I usually recommend against incompatible changes

unless strongly necessary.

So ... whatever ... nevermind ... as they say.

Paul Jackson

pj@usa.net