GNU bug report logs -
#20638
BUG: standard & extended RE's don't find NUL's :-(
Reported by: "L. A. Walsh" <gnu <at> tlinx.org>
Date: Sun, 24 May 2015 00:06:02 UTC
Severity: normal
Tags: notabug
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived.
Message #11 received at 20638 <at> debbugs.gnu.org:
Eric Blake wrote:
> On 05/23/2015 06:04 PM, L. A. Walsh wrote:
>
>> the standard & extended RE's don't find NUL's:
>>
>
> Because NULs imply binary data,
I can think of multiple cases where at least one NUL
would be found in text data -- the most obvious example
being a Microsoft text file.
While MS usually puts a BOM at the beginning of
such files, since NT's original format was only little-endian
(LSB) UCS-2, one still runs into the occasional file without
one -- but just rarely enough that I never remember the vim
command to convert the buffer to a compatible format, and
waste time looking it up.
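To make this concrete -- a minimal sketch (the name 'mstext.txt'
is just an example) of how an ordinary little-endian UTF-16 text
file is half NULs, and what grep makes of it:

# every other byte of UTF-16LE ASCII text is a NUL
printf 'hello world\n' | iconv -f UTF-8 -t UTF-16LE > mstext.txt
grep 'hello' mstext.txt; echo $?  # exit 1: interleaved NULs break the pattern
grep 'h' mstext.txt               # Binary file mstext.txt matches

(The vim incantation I never remember is, I believe,
':e ++enc=utf-16le' and then ':set fenc=utf-8 | w'.)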
But more to the point, some unix utilities were designed to
work on any file -- not just text -- 'strings' for
example. Right now, it seems grep has lost much in the
'robust' category -- I had one file that it bailed on,
saying it had an invalid UTF-8 encoding -- but the search was
recursive starting from '.', and it didn't name the file.
"-a" doesn't work, BTW:
Ishtar:/tmp> grep -a '\000\000' zeros
Ishtar:/tmp> echo $?
1
Ishtar:/tmp> grep -P '\000\000' zeros
Binary file zeros matches
But there it is -- if grep wasn't meant to handle binary files,
it wouldn't know to call 'zeros' a binary file.
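Part of the problem is structural: command-line arguments are
NUL-terminated C strings, so a literal NUL byte can never reach
grep through its pattern argument, and BRE/ERE define no escape
for one; only -P (PCRE) does. A sketch, using a throwaway file 'f':

printf 'abc\0def\n' > f
grep abc f                 # Binary file f matches (NUL trips binary detection)
grep -a abc f              # prints the line, NUL and all
grep -a '\000' f; echo $?  # exit 1: '\000' is no NUL escape in BRE/ERE
grep -P '\x00' f           # Binary file f matches: PCRE's \x00 is a real NUL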
Many of the coreutils have worked equally well on binary
as on text (cat, split, tr, wc, to name a few). But how
can 'shuf' claim to work on input lines yet allow this:
-z, --zero-terminated
line delimiter is NUL, not newline.
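That said, the -z options do work end to end -- a quick sketch
(the output order is random, of course):

printf 'one\0two\0three\0' | shuf -z | tr '\0' '\n'
# three
# one
# two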
'nl' claims the file 'zeros' (4k of NULs -- created
by bash, which can write a file of zeros but not read one back
into a variable) is 1 line.
'pr' will print it (though not too well).
'xargs': <zeros xargs -0 |wc
1 0 4096
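To reproduce the 'zeros' file and the numbers above (one way
among several -- 'dd if=/dev/zero bs=4096 count=1 of=zeros'
works just as well):

printf '%4096s' '' | tr ' ' '\0' > zeros   # 4k of NULs from the shell
<zeros xargs -0 | wc                       # 1 0 4096

xargs -0 turns the 4096 NULs into 4096 empty arguments; echo then
joins them with 4095 spaces plus a newline -- hence one 4096-byte
'line' with no words in it.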
POSIX is a least common denominator -- it is not a standard
of quality in any way. People argue to dumb down POSIX
utils because some corp wants to get a POSIX label but
has a few shortcomings -- so they donate enough money and
POSIX changes its rules.
'less' works with it, but 'more' works faster (it just doesn't
display control chars). But one of the files I searched through
was base64 encoded, and in at least 2 places in the file there was
a run of ~100-200 zeros (in a 10k-or-larger file).
(That's what I'm looking for -- signs of corruption)...
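For that kind of hunt, one workable approach (assuming a GNU grep
built with PCRE support; 'suspect.b64' is a placeholder name):

grep -aoP '\x00{100,}' suspect.b64 | wc -l   # count runs of 100+ NULs

-a keeps the NULs in the line data, -o emits one match per maximal
run, and wc -l counts them (a run straddling a newline would show
up as two).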
> and grepping binary data has unspecified
> results per POSIX. What's more, the NEWS for 2.21 documents that grep
> is now taking the liberty of treating NUL as a line terminator when -a
> is not in effect, thanks to the behavior being otherwise unspecified by
> POSIX.
>
----
With a "-0" switch, I presume (not default behavior -- that would
be ungood :^/ )
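As an aside, GNU grep already has a NUL-record switch, -z /
--null-data, though it treats NUL as the record separator rather
than as matchable data:

printf 'foo\0bar\0' | grep -z foo | tr '\0' '\n'   # prints: foo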
> Try using 'grep -a' to force grep to treat the file as non-binary, in
> spite of the NULs.
>
doesn't work -- as mentioned above. I'd say it's a bug
fair and square...