GNU bug report logs - #20638
BUG: standard & extended RE's don't find NUL's :-(


Package: grep;

Reported by: "L. A. Walsh" <gnu <at> tlinx.org>

Date: Sun, 24 May 2015 00:06:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: Linda Walsh <gnu <at> tlinx.org>
To: Eric Blake <eblake <at> redhat.com>
Cc: 20638 <at> debbugs.gnu.org
Subject: bug#20638: BUG: standard & extended RE's don't find NUL's :-(
Date: Sun, 24 May 2015 23:48:03 -0700

Eric Blake wrote:
> On 05/23/2015 06:04 PM, L. A. Walsh wrote:
>   
>> the standard & extended RE's don't find NUL's:
>>     
>
> Because NULs imply binary data,
I can think of multiple cases where at least one NUL
would be found in text data -- the prime example
being a Microsoft text file.

While MS usually puts a BOM at the beginning of such
files, since NT's original format was plain little-endian UCS-2, one
still runs into the occasional file -- but just rarely enough that
I never remember the vim command to convert the buffer to a compatible
format, and waste time looking it up.
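For the record, one way to do that conversion from the shell instead, assuming iconv is available (the sample here is built inline; any real filename would do):

```shell
# Make a tiny little-endian UCS-2 sample ("ab" as bytes 61 00 62 00),
# then convert it to UTF-8 with iconv.
printf 'a\0b\0' > msfile.txt          # hypothetical sample file
iconv -f UTF-16LE -t UTF-8 msfile.txt # prints: ab
```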

But more to the point, some unix utilities were designed to
work on files in general -- not just text -- 'strings', for
example.  Right now, it seems grep has lost much in the
'robust' category -- I had one file that it bailed on,
saying it had an invalid UTF-8 encoding -- but the search was
recursive, starting from '.', and it didn't name the file

"-a" doesn't work, BTW:

Ishtar:/tmp> grep -a '\000\000' zeros
Ishtar:/tmp> echo $?
1
Ishtar:/tmp> grep -P '\000\000' zeros 
Binary file zeros matches

But there it is -- if grep wasn't meant to handle binary files,
it wouldn't know to call 'zeros' a binary file.
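As a workaround that sidesteps grep's binary heuristics entirely, here is a sketch of NUL detection using only tr and cmp (the filename 'sample' is made up; the file is created inline):

```shell
# Strip NULs with tr and compare against the original with cmp;
# any difference means the file contained at least one NUL byte.
printf 'abc\0\0def' > sample          # hypothetical test file
if tr -d '\0' < sample | cmp -s - sample; then
    echo "no NULs"
else
    echo "contains NULs"              # this branch runs for 'sample'
fi
```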

Many of the coreutils work equally well on binary
as on text (cat, split, tr, wc, to name a few).  But how
can 'shuf' claim to work on input lines yet allow this:

  -z, --zero-terminated
      line delimiter is NUL, not newline
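And that option does work on NUL-delimited records in practice; a minimal sketch (sort is only there to make the randomized output predictable, tr to make it printable):

```shell
# Feed three NUL-terminated records through shuf -z; the output
# order is random, so sort normalizes it for display here.
printf 'a\0b\0c\0' | shuf -z | tr '\0' '\n' | sort
# prints: a, b, c -- one per line
```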

'nl' claims the file 'zeros' (4k of NULs -- created
by bash, which can write a file of zeros but not read one back)
is 1 line.

'pr' will print it (though not too well).

'xargs': <zeros xargs -0 |wc    
     1       0    4096

POSIX is a least common denominator -- it is not a standard
of quality in any way.  People argue to dumb down POSIX
utils because some corp wants to get a POSIX label but
has a few shortcomings -- so they donate enough money and
POSIX changes its rules.

'less' works with it, but 'more' works faster (it just doesn't
display control chars). --- but one of the files I searched through
was base64 encoded, and in at least 2 places in the file was
a run of ~100-200 zeros (in a 10k-or-larger file).

(That's what I'm looking for -- signs of corruption)...

>  and grepping binary data has unspecified
> results per POSIX.  What's more, the NEWS for 2.21 documents that grep
> is now taking the liberty of treating NUL as a line terminator when -a
> is not in effect, thanks to the behavior being otherwise unspecified by
> POSIX.
>   
----
With a "-0" switch, I presume (not default behavior -- that would
be ungood :^/ )
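For what it's worth, GNU grep already has such a switch -- -z/--null-data makes NUL the line delimiter (not the default, thankfully).  A minimal sketch:

```shell
# With -z, grep reads NUL-terminated records and writes matches
# back NUL-terminated; tr makes the result printable.
printf 'foo\0bar\0' | grep -z 'foo' | tr '\0' '\n'
# prints: foo
```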

> Try using 'grep -a' to force grep to treat the file as non-binary, in
> spite of the NULs.
>   
doesn't work -- as mentioned above.  I'd say it's a bug,
fair and square...






