GNU bug report logs - #30326
grep not searching through a text file (thinking it binary)
Reported by: "L. A. Walsh" <gnu <at> tlinx.org>
Date: Fri, 2 Feb 2018 19:31:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
[Message part 1 (text/plain, inline)]
Your bug report
#30326: grep not searching through a text file (thinking it binary)
which was filed against the grep package, has been closed.
The explanation is attached below, along with your original report.
If you require more details, please reply to 30326 <at> debbugs.gnu.org.
--
30326: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=30326
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
[Message part 3 (text/plain, inline)]
tag 30326 notabug
thanks
On 02/02/2018 01:30 PM, L. A. Walsh wrote:
> I've used grep to search through my mbox-format emails for decades, but
> I've run into a case where it seems to be ignoring a text mailbox
> because, I guess, it thinks it is "binary"
Yes, that's correct.
> If I used "-Par" it finds it.
Yes, that's also correct.
>
> It seems that grep believes the file to be binary and ignores it, though
> "file" calls it "text".
The file is conditionally text. The POSIX definition of a text file is
one whose lines consist of valid characters in the current locale - but
note this definition is locale-dependent! So a file that is text under
one locale may be binary under another. When you are grepping a file
encoded correctly for the current locale, you get the output you want;
when you are grepping a file that contains encoding errors for the
current locale, POSIX says behavior is undefined, so GNU grep warns you
that the file is binary (in the current locale); and your use of -a
tells grep to process it as text anyway. As 'file' reported that your
file was using non-ISO extended-ASCII, it probably means the file was
encoded for an 8-bit single-byte locale; and my guess is that you were
running grep under a UTF-8 locale, where a byte with the high bit set
is an encoding error unless it is part of a well-formed multibyte
sequence. Hence the warning that your file is binary under the current
locale.
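
To illustrate how the same bytes can be text under one encoding and
invalid under another, here is a minimal sketch. It does not use the
reporter's mailbox; the file name sample.mbox and its contents are
made up, and iconv stands in for grep's own validity check, which it
only approximates.

```shell
# Work in a scratch directory (sample.mbox is a hypothetical file name).
tmp=$(mktemp -d)

# One line contains the single byte 0xE9 (octal 351), which is "é" in
# ISO-8859-1 but an incomplete sequence in UTF-8.
printf 'Subject: caf\351\nGame: NCSOFT\n' > "$tmp/sample.mbox"

# iconv exits non-zero when the input is not valid in the source encoding.
if iconv -f UTF-8 -t UTF-8 "$tmp/sample.mbox" > /dev/null 2>&1; then
    utf8_status=valid
else
    utf8_status=invalid
fi

if iconv -f ISO-8859-1 -t UTF-8 "$tmp/sample.mbox" > /dev/null 2>&1; then
    latin1_status=valid
else
    latin1_status=invalid
fi

echo "as UTF-8: $utf8_status, as ISO-8859-1: $latin1_status"
rm -rf "$tmp"
```

A file like this is what 'file' tends to describe as extended-ASCII
text, while a tool running under a UTF-8 locale sees an encoding error.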
You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
valid character, and thus where you will never encounter encoding errors
(you may encounter OTHER things that make your file binary, such as
embedded NULs, but that's a different matter).
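
A short sketch of the 'LC_ALL=C grep' approach follows. The file names
and contents are invented for the example, and the exact wording of
grep's binary-file message differs between versions, so the script
does not depend on it.

```shell
tmp=$(mktemp -d)

# A file with one 8-bit byte (0xE9) and a file with an embedded NUL.
printf 'Subject: caf\351\nGame: NCSOFT\n' > "$tmp/latin1.mbox"
printf 'Game: NCSOFT\000\n' > "$tmp/nul.mbox"

# In the C locale every byte is a valid character, so the 8-bit byte
# does not make the file binary and the matching line is printed.
c_locale_out=$(LC_ALL=C grep 'Game:' "$tmp/latin1.mbox")
echo "$c_locale_out"

# The NUL still makes the second file binary, even in the C locale;
# grep reports a binary match instead of printing the line.
LC_ALL=C grep 'Game:' "$tmp/nul.mbox" 2>&1

# -a forces text processing; -o prints only the matched part, which
# keeps the NUL byte out of the captured output.
nul_out=$(LC_ALL=C grep -a -o 'Game: NCSOFT' "$tmp/nul.mbox")
echo "$nul_out"

rm -rf "$tmp"
```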
This behavior is documented and intentional, so I'm closing this as not
a bug in the tracker. However, feel free to add further comments or
questions to the thread.
And perhaps we could tweak the grep diagnostics to clarify whether a
file is binary because NUL bytes were encountered, vs. a file is binary
because encoding errors were encountered.
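
Until the diagnostics make that distinction, the two causes can be
told apart by hand. The helper below is a hypothetical sketch, not
grep's own logic: why_binary is an invented name, and it assumes that
the locale of interest uses UTF-8.

```shell
# why_binary FILE: print a rough guess at why grep would call FILE binary.
# Hypothetical helper; assumes a UTF-8 locale is the one of interest.
why_binary() {
    # Deleting NULs changes the byte stream only if the file has some.
    if ! tr -d '\000' < "$1" | cmp -s - "$1"; then
        echo "NUL bytes"
    # iconv fails on input that is not well-formed UTF-8.
    elif ! iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1; then
        echo "encoding errors (as UTF-8)"
    else
        echo "text"
    fi
}

tmp=$(mktemp -d)
printf 'plain ascii\n' > "$tmp/text.txt"
printf 'a\000b\n'      > "$tmp/nul.bin"
printf 'caf\351\n'     > "$tmp/latin1.txt"

r_text=$(why_binary "$tmp/text.txt")
r_nul=$(why_binary "$tmp/nul.bin")
r_latin1=$(why_binary "$tmp/latin1.txt")
echo "$r_text / $r_nul / $r_latin1"

rm -rf "$tmp"
```

The NUL check runs first because a NUL byte is itself valid UTF-8, so
the iconv test alone would not catch it.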
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
[Message part 5 (message/rfc822, inline)]
I've used grep to search through my mbox-format emails for decades, but
I've run into a case where it seems to be ignoring a text mailbox
because, I guess, it thinks it is "binary" (I think ignoring binary
is a default in my aliases file).
I used:
> grep -Pr 'Game:\s+NCSOFT' *
and it ignored a mailbox named 'Domain' that contained the
string:
" =E2=80=A2=09Game: NCSOFT"
> file Domain
Domain: Non-ISO extended-ASCII text, with very long lines
If I used "-Par" it finds it.
It seems that grep believes the file to be binary and ignores it, though
"file" calls it "text".
Any ideas?
grep -V
grep (GNU grep) 2.21.31-adf9
Maybe grep is being a bit overzealous in calling files 'binary'?
This bug report was last modified 7 years and 35 days ago.