GNU bug report logs -
#20638
BUG: standard & extended RE's don't find NUL's :-(
Reported by: "L. A. Walsh" <gnu <at> tlinx.org>
Date: Sun, 24 May 2015 00:06:02 UTC
Severity: normal
Tags: notabug
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived.
Message #11 received at 20638 <at> debbugs.gnu.org:
Eric Blake wrote:
> On 05/23/2015 06:04 PM, L. A. Walsh wrote:
>
>> the standard & extended RE's don't find NUL's:
>>
>
> Because NULs imply binary data,
I can think of multiple cases where at least one NUL
would be found in text data -- the most obvious example
being a Microsoft text file.
While MS usually puts a BOM at the beginning of
such files, since NT's original format was only little-endian
(LSB) UCS-2, one still runs into the occasional file without
one -- but just rarely enough that I never remember the vim
command to convert the buffer to a compatible format, and
waste time looking it up.
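To make this concrete -- a minimal sketch (the name 'mstext.txt'
is just an example) of how an ordinary little-endian UTF-16 text
file is half NULs, and what grep makes of it:

# every other byte of UTF-16LE ASCII text is a NUL
printf 'hello world\n' | iconv -f UTF-8 -t UTF-16LE > mstext.txt
grep 'hello' mstext.txt; echo $?  # exit 1: interleaved NULs break the pattern
grep 'h' mstext.txt               # Binary file mstext.txt matches

(The vim incantation I never remember is, I believe,
':e ++enc=utf-16le' and then ':set fenc=utf-8 | w'.)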
But more to the point, some unix utilities were designed to
work on any file -- not just text -- 'strings' for
example. Right now, it seems grep has lost much in the
'robust' category -- I had one file that it bailed on,
saying it had an invalid UTF-8 encoding -- but the search was
recursive starting from '.', and it didn't name the file.
"-a" doesn't work, BTW:
Ishtar:/tmp> grep -a '\000\000' zeros
Ishtar:/tmp> echo $?
1
Ishtar:/tmp> grep -P '\000\000' zeros
Binary file zeros matches
But there it is -- if grep wasn't meant to handle binary files,
it wouldn't know to call 'zeros' a binary file.
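Part of the problem is structural: command-line arguments are
NUL-terminated C strings, so a literal NUL byte can never reach
grep through its pattern argument, and BRE/ERE define no escape
for one; only -P (PCRE) does. A sketch, using a throwaway file 'f':

printf 'abc\0def\n' > f
grep abc f                 # Binary file f matches (NUL trips binary detection)
grep -a abc f              # prints the line, NUL and all
grep -a '\000' f; echo $?  # exit 1: '\000' is no NUL escape in BRE/ERE
grep -P '\x00' f           # Binary file f matches: PCRE's \x00 is a real NUL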
Many of the coreutils have worked equally well on binary
as on text (cat, split, tr, wc, to name a few). But how
can 'shuf' claim to work on input lines yet allow this:
-z, --zero-terminated
line delimiter is NUL, not newline.
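That said, the -z options do work end to end -- a quick sketch
(the output order is random, of course):

printf 'one\0two\0three\0' | shuf -z | tr '\0' '\n'
# three
# one
# two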
'nl' claims the file 'zeros' (4k of NULs -- created
by bash, which can write a file of zeros but not read one back
into a variable) is 1 line.
'pr' will print it (though not too well).
'xargs': <zeros xargs -0 |wc
1 0 4096
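To reproduce the 'zeros' file and the numbers above (one way
among several -- 'dd if=/dev/zero bs=4096 count=1 of=zeros'
works just as well):

printf '%4096s' '' | tr ' ' '\0' > zeros   # 4k of NULs from the shell
<zeros xargs -0 | wc                       # 1 0 4096

xargs -0 turns the 4096 NULs into 4096 empty arguments; echo then
joins them with 4095 spaces plus a newline -- hence one 4096-byte
'line' with no words in it.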
POSIX is a least common denominator -- it is not a standard
of quality in any way. People argue to dumb down POSIX
utils because some corp wants to get a POSIX label but
has a few shortcomings -- so they donate enough money and
POSIX changes its rules.
'less' works with it, but 'more' works faster (it just doesn't
display control chars). But one of the files I searched through
was base64 encoded, and in at least 2 places in the file there was
a run of ~100-200 zeros (in a 10k-or-larger file).
(That's what I'm looking for -- signs of corruption)...
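For that kind of hunt, one workable approach (assuming a GNU grep
built with PCRE support; 'suspect.b64' is a placeholder name):

grep -aoP '\x00{100,}' suspect.b64 | wc -l   # count runs of 100+ NULs

-a keeps the NULs in the line data, -o emits one match per maximal
run, and wc -l counts them (a run straddling a newline would show
up as two).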
> and grepping binary data has unspecified
> results per POSIX. What's more, the NEWS for 2.21 documents that grep
> is now taking the liberty of treating NUL as a line terminator when -a
> is not in effect, thanks to the behavior being otherwise unspecified by
> POSIX.
>
----
With a "-0" switch, I presume (not default behavior -- that would
be ungood :^/ )
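As an aside, GNU grep already has a NUL-record switch, -z /
--null-data, though it treats NUL as the record separator rather
than as matchable data:

printf 'foo\0bar\0' | grep -z foo | tr '\0' '\n'   # prints: foo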
> Try using 'grep -a' to force grep to treat the file as non-binary, in
> spite of the NULs.
>
doesn't work -- as mentioned above. I'd say it's a bug
fair and square...