GNU bug report logs - #30326
grep not searching through a text file (thinking it binary)


Package: grep

Reported by: "L. A. Walsh" <gnu <at> tlinx.org>

Date: Fri, 2 Feb 2018 19:31:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.



Message #15 received at 30326-done <at> debbugs.gnu.org:

From: L A Walsh <gnu <at> tlinx.org>
Cc: 30326-done <at> debbugs.gnu.org, GNU bug control <control <at> debbugs.gnu.org>
Subject: Re: bug#30326: grep not searching through a text file (thinking it
 binary)
Date: Fri, 02 Feb 2018 12:09:23 -0800
Grep was around long before POSIX, as were most of the unix
utils.

Grep was able to find text strings in mboxes without a POSIX
definition telling it that it was "broken". 

I don't want it displaying random binary that throws my
terminal into weird modes, which is why I skip binary
files. A grep that searches some mailboxes while skipping
others, seemingly at random depending on what email
happens to be in the box at the time, is hardly a useful
utility.

I did not ask for POSIXLY_CORRECT. If you need grep to be
POSIXly correct, then use the existing variable. As it stands,
grep is now broken, since POSIX doesn't define "text" files "out
in the real world", only files that adhere to the POSIX standard.

People don't write emails that adhere to the POSIX standard.

Also, FWIW, grep's manpage doesn't say it is limited to POSIX-only
files.  Its summary says:
      grep, egrep, fgrep - print lines matching a pattern

which it does not do.  It doesn't say "print lines matching
a pattern only from POSIX text files".



Eric Blake wrote:
> tag 30326 notabug
> thanks
>
> On 02/02/2018 01:30 PM, L. A. Walsh wrote:
>   
>> I've used grep to search through my mbox-format emails for decades, but
>> I've run into a case where it seems to ignore a text mailbox
>> because, I guess, it thinks it is "binary".
>>     
>
> Yes, that's correct.
>
>   
>> If I use "-Par", it finds it.
>>     
>
> Yes, that's also correct.
>
>   
>> It seems that grep believes the file to be binary and ignores it, though
>> "file" calls it "text".
>>     
>
> The file is conditionally text.  The POSIX definition of a text file is
> one whose lines consist of valid characters in the current locale - but
> note this definition is locale-dependent!  So a file that is text under
> one locale may be binary under another.  When you are grepping a file
> encoded correctly for the current locale, you get the output you want;
> when you are grepping a file that contains encoding errors for the
> current locale, POSIX says behavior is undefined, so GNU grep warns you
> that the file is binary (in the current locale); and your use of -a
> tells grep to process it anyway.  As 'file' reported that your file was
> using non-ISO extended-ASCII, it probably means the file was encoded for
> an 8-bit single-byte locale; and my guess is that you were running grep
> under a UTF-8 locale, and generally, UTF-8 treats 8-bit single-byte
> inputs as encoding errors.  Hence the warning that your file is binary,
> under the current locale.
>
> You can also use 'LC_ALL=C grep' to force a locale where EVERY byte is a
> valid character, and thus where you will never encounter encoding errors
> (you may encounter OTHER things that make your file binary, such as
> embedded NULs, but that's a different matter).
>
> This behavior is documented and intentional, so I'm closing this as not
> a bug in the tracker.  However, feel free to add further comments or
> questions to the thread.
>
> And perhaps we could tweak the grep diagnostics to clarify whether a
> file is binary because NUL bytes were encountered, vs. a file is binary
> because encoding errors were encountered.
>
>   
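
The distinction drawn in the quoted reply (the same bytes being "text" in
one locale and "binary" in another, with NUL bytes as a separate cause) can
be illustrated with a minimal Python sketch. This is not grep's actual
implementation: the function name `classify` and the sample data are
hypothetical, and Python's "latin-1" codec is used only as a stand-in for a
single-byte locale in which every byte is a valid character.

```python
def classify(data: bytes, encoding: str) -> str:
    """Rough illustration only: label bytes as text or binary for an encoding.

    Hypothetical helper, not how grep is implemented.
    """
    # Embedded NUL bytes mark the data as binary regardless of encoding.
    if b"\x00" in data:
        return "binary (NUL byte)"
    # Bytes that cannot be decoded are encoding errors for this encoding.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return "binary (encoding error)"
    return "text"


# Assumed sample: a mail header written in an 8-bit single-byte encoding,
# where "é" is the single byte 0xE9.
sample = "From: café\n".encode("latin-1")

print(classify(sample, "utf-8"))             # 0xE9 is not valid UTF-8 here
print(classify(sample, "latin-1"))           # every byte is a valid character
print(classify(b"abc\x00def\n", "latin-1"))  # NUL byte, a different matter
```

The second call loosely corresponds to the 'LC_ALL=C grep' suggestion in the
reply: once every byte counts as a valid character, only the NUL check can
still mark the data as binary.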




This bug report was last modified 7 years and 34 days ago.


