GNU bug report logs - #30326
grep not searching through a text file (thinking it binary)


Package: grep

Reported by: "L. A. Walsh" <gnu <at> tlinx.org>

Date: Fri, 2 Feb 2018 19:31:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Eric Blake <eblake <at> redhat.com>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#30326: closed (grep not searching through a text file
 (thinking it binary))
Date: Fri, 02 Feb 2018 19:56:02 +0000
[Message part 1 (text/plain, inline)]
Your message dated Fri, 2 Feb 2018 13:55:00 -0600
with message-id <2c00563c-9347-c596-4ade-a87bd9262ca1 <at> redhat.com>
and subject line Re: bug#30326: grep not searching through a text file (thinking it binary)
has caused the debbugs.gnu.org bug report #30326,
regarding grep not searching through a text file (thinking it binary)
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
30326: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=30326
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: "L. A. Walsh" <gnu <at> tlinx.org>
To: bug-grep <at> gnu.org
Subject: grep not searching through a text file (thinking it binary)
Date: Fri, 02 Feb 2018 11:30:07 -0800
I've used grep to search through my mbox-format emails for decades, but
I've run into a case where it seems to be ignoring a text mailbox
because, I guess, it thinks it is "binary" (I think ignoring binary
is a default in my aliases file).

I used:

>  grep -Pr 'Game:\s+NCSOFT' *

and it ignored a mailbox named 'Domain' that contained the
string:
"                                    =E2=80=A2=09Game: NCSOFT"

>  file Domain
Domain: Non-ISO extended-ASCII text, with very long lines


If I used "-Par" it finds it.

It seems that grep believes the file to be binary and ignores it, though
"file" calls it "text".

Any ideas?

grep -V
grep (GNU grep) 2.21.31-adf9

Maybe grep is being a bit overzealous in calling files 'binary'?

[Message part 3 (message/rfc822, inline)]
From: Eric Blake <eblake <at> redhat.com>
To: "L. A. Walsh" <gnu <at> tlinx.org>, 30326-done <at> debbugs.gnu.org,
 GNU bug control <control <at> debbugs.gnu.org>
Subject: Re: bug#30326: grep not searching through a text file (thinking it
 binary)
Date: Fri, 2 Feb 2018 13:55:00 -0600
[Message part 4 (text/plain, inline)]
tag 30326 notabug
thanks

On 02/02/2018 01:30 PM, L. A. Walsh wrote:
> I've used grep to search through my mbox-format emails for decades, but
> I've run into a case where it seems to be ignoring a text mailbox
> because, I guess, it thinks it is "binary"

Yes, that's correct.

> If I used "-Par" it finds it.

Yes, that's also correct.

> 
> It seems that grep believes the file to be binary and ignores it, though
> "file" calls it "text".

The file is conditionally text.  The POSIX definition of a text file is
one whose lines consist of valid characters in the current locale - but
note this definition is locale-dependent!  So a file that is text under
one locale may be binary under another.  When you are grepping a file
encoded correctly for the current locale, you get the output you want;
when you are grepping a file that contains encoding errors for the
current locale, POSIX says behavior is undefined, so GNU grep warns you
that the file is binary (in the current locale); and your use of -a
tells grep to process it anyway.  As 'file' reported that your file was
using non-ISO extended-ASCII, it probably means the file was encoded for
an 8-bit single-byte locale; and my guess is that you were running grep
under a UTF-8 locale, where a byte with the high bit set that is not
part of a valid multibyte sequence is an encoding error.  Hence the
warning that your file is binary, under the current locale.
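
A minimal sketch of that locale dependence, under some assumptions: the
sample file and its contents are made up for illustration, and iconv is
used only as a stand-in for the validity check (it validates against a
named encoding, not against the current locale as grep does).  The same
bytes are rejected as UTF-8 but accepted as ISO-8859-1:

```shell
# Hypothetical sample: one 0xE9 byte ("é" in ISO-8859-1), which is not a
# valid UTF-8 sequence on its own, followed by plain ASCII.
sample=$(mktemp)
printf 'caf\351 Game: NCSOFT\n' > "$sample"

# iconv exits non-zero when the input is not valid in the source encoding.
if iconv -f UTF-8 -t UTF-8 "$sample" >/dev/null 2>&1; then
    as_utf8=valid
else
    as_utf8=invalid
fi

if iconv -f ISO-8859-1 -t UTF-8 "$sample" >/dev/null 2>&1; then
    as_latin1=valid
else
    as_latin1=invalid
fi

echo "as UTF-8: $as_utf8, as ISO-8859-1: $as_latin1"
rm -f "$sample"
```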

You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
valid character, and thus where you will never encounter encoding errors
(you may encounter OTHER things that make your file binary, such as
embedded NULs, but that's a different matter).
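
A short sketch of both workarounds, assuming a made-up sample file with
a single non-UTF-8 byte (the file and its contents are illustrative, not
the reporter's mailbox):

```shell
# Hypothetical sample containing a lone 0xE9 byte.
sample=$(mktemp)
printf 'caf\351 Game: NCSOFT\n' > "$sample"

# In the C locale the 0xE9 byte is an ordinary character, so grep prints
# the matching line rather than a "Binary file ... matches" notice.
c_output=$(LC_ALL=C grep 'Game: NCSOFT' "$sample")

# -a (--text) asks grep to treat the file as text whatever the locale.
a_count=$(grep -a -c 'Game: NCSOFT' "$sample")

case "$c_output" in
    *"Game: NCSOFT"*) c_result="line printed" ;;
    *)                c_result="no line printed" ;;
esac

echo "LC_ALL=C: $c_result; -a match count: $a_count"
rm -f "$sample"
```

Note that LC_ALL=C also changes how character classes and case folding
treat non-ASCII bytes, so -a may be the better fit when the pattern
itself depends on the locale.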

This behavior is documented and intentional, so I'm closing this as not
a bug in the tracker.  However, feel free to add further comments or
questions to the thread.

And perhaps we could tweak the grep diagnostics to clarify whether a
file is binary because NUL bytes were encountered, vs. a file is binary
because encoding errors were encountered.
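
Until then, the two causes can be told apart by hand.  The helper below
is a hypothetical sketch, not a grep feature: why_binary is an invented
name, the check assumes a UTF-8 locale, and iconv only approximates
grep's own validity test.

```shell
# why_binary FILE: report which of the two conditions applies
# (hypothetical helper, assuming the file is meant to be UTF-8).
why_binary() {
    # If deleting NUL bytes changes the content, the file contains NULs.
    if ! tr -d '\000' < "$1" | cmp -s - "$1"; then
        echo "NUL bytes"
    # Otherwise check whether the bytes form valid UTF-8.
    elif ! iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo "encoding errors"
    else
        echo "text"
    fi
}

# Made-up samples: one with an embedded NUL, one with a lone 0xE9 byte.
nul_file=$(mktemp)
enc_file=$(mktemp)
printf 'Game: NCSOFT\000\n' > "$nul_file"
printf 'caf\351 Game: NCSOFT\n' > "$enc_file"

nul_reason=$(why_binary "$nul_file")
enc_reason=$(why_binary "$enc_file")
echo "NUL sample: $nul_reason; 0xE9 sample: $enc_reason"
rm -f "$nul_file" "$enc_file"
```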

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org


This bug report was last modified 7 years and 34 days ago.
