GNU bug report logs - #20638
BUG: standard & extended RE's don't find NUL's :-(
Reported by: "L. A. Walsh" <gnu <at> tlinx.org>
Date: Sun, 24 May 2015 00:06:02 UTC
Severity: normal
Tags: notabug
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20638 in the body.
You can then email your comments to 20638 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Sun, 24 May 2015 00:06:02 GMT)
Acknowledgement sent to "L. A. Walsh" <gnu <at> tlinx.org>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 24 May 2015 00:06:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
the standard & extended RE's don't find NUL's:
> dd if=/dev/zero of=zeros bs=4k count=1
> command grep -Pq '\000\000' zeros && echo "badness"
badness
> command grep -Eq '\000\000' zeros && echo "badness"
> command grep -Gq '\000\000' zeros && echo "badness"
> command grep -q '\000\000' zeros && echo "badness"
> rpm -q grep
grep-2.20-2.4.1.x86_64
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Sun, 24 May 2015 13:00:09 GMT)
Message #8 received at 20638 <at> debbugs.gnu.org (full text, mbox):
On 05/23/2015 06:04 PM, L. A. Walsh wrote:
> the standard & extended RE's don't find NUL's:
Because NULs imply binary data, and grepping binary data has unspecified
results per POSIX. What's more, the NEWS for 2.21 documents that grep
is now taking the liberty of treating NUL as a line terminator when -a
is not in effect, thanks to the behavior being otherwise unspecified by
POSIX.
Try using 'grep -a' to force grep to treat the file as non-binary, in
spite of the NULs.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
[signature.asc (application/pgp-signature, attachment)]
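The difference between the default binary-file handling and 'grep -a' that Eric describes can be seen on a small file. This is only a minimal sketch, assuming GNU grep; the scratch file name "sample" and the variable names are made up for illustration, and the exact wording and destination (stdout or stderr) of the binary-file notice vary between grep releases.

```shell
# Scratch file: one "line" that contains an embedded NUL byte.
tmpdir=$(mktemp -d)
printf 'foo\000bar\n' > "$tmpdir/sample"

# Default mode: grep classifies the file as binary and reports only a
# "binary file matches" notice. Capture stdout and stderr, because the
# stream used for the notice depends on the grep version.
default_out=$(LC_ALL=C grep foo "$tmpdir/sample" 2>&1)

# With -a the file is processed as text, so the matching line is emitted.
# tr maps the NUL to '.' so that a shell variable can hold the result.
text_out=$(LC_ALL=C grep -a foo "$tmpdir/sample" | tr '\000' '.')

printf 'default: %s\n' "$default_out"
printf 'with -a: %s\n' "$text_out"
rm -rf "$tmpdir"
```

Note that -a only changes how the file is classified; it does not change what the pattern means.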
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Mon, 25 May 2015 06:49:02 GMT)
Message #11 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Eric Blake wrote:
> On 05/23/2015 06:04 PM, L. A. Walsh wrote:
>
>> the standard & extended RE's don't find NUL's:
>>
>
> Because NULs imply binary data,
I can think of multiple cases where at least 1 'nul'
would be found in text data -- the prime example
being a Microsoft text file.
While MS usually uses a BOM at the beginning of
files, since NT's original format was only LSB/UCS-2, one
still runs into the occasional file -- but just rarely enough that
I never remember the vim command to convert the buffer to a compatible
format, so I waste time looking it up.
But more to the point, some unix tools were designed to
work on any file -- not just limited to text -- 'strings' for
example. Right now, it seems grep has lost much in the
'robust' category -- I had one file that it bailed on
saying it has an invalid UTF-8 encoding -- but the line was
recursive starting from '.' -- and it didn't name the file.
"-a" doesn't work, BTW:
Ishtar:/tmp> grep -a '\000\000' zeros
Ishtar:/tmp> echo $?
1
Ishtar:/tmp> grep -P '\000\000' zeros
Binary file zeros matches
But there it is -- if grep wasn't meant to handle binary files,
it wouldn't know to call 'zeroes' a binary file.
Many of the coreutils have worked equally well on binary
as well as txt. (cat, split, tr, wc to name a few). But how
can 'shuf' claim to work on input lines yet have this allowed:
-z, --zero-terminated
line delimiter is NUL, not newline.
'nl' claims the file, 'zeros' (4k of nulls -- created
by bash, that can write a file of zeros, but not read it)
is 1 line.
'pr' will print it (though not too well).
'xargs': <zeros xargs -0 |wc
1 0 4096
POSIX is a least common denominator -- it is not a standard
of quality in any way. People argue to dumb down POSIX
utils, because some corp wants to get a posix label but
has a few shortcomings -- so they donate enough money and
posix changes its rules.
'less' works with it, but 'more' works faster (just doesn't
display ctl chars). --- but one of the files I searched through
was base64 encoded, and in at least 2 places in the file there was
a run of ~100-200 zeros (in a 10k or more file).
(That's what I'm looking for -- signs of corruption)...
> and grepping binary data has unspecified
> results per POSIX. What's more, the NEWS for 2.21 documents that grep
> is now taking the liberty of treating NUL as a line terminator when -a
> is not in effect, thanks to the behavior being otherwise unspecified by
> POSIX.
>
----
With a "-0" switch, I presume (not default behavior -- that would
be ungood :^/ )
> Try using 'grep -a' to force grep to treat the file as non-binary, in
> spite of the NULs.
>
doesn't work -- as mentioned above. I'd say it's a bug
fair and square...
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Mon, 25 May 2015 15:20:03 GMT)
Message #14 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Linda Walsh wrote:
> I had one file that it bailed on
> saying it has an invalid UTF-8 encoding -- but the line was
> recursive starting from '.' -- and it didn't name the file
That's pretty vague. Can you reproduce that problem? I don't observe it:
$ mkdir d
$ printf 'a\200\n' >d/f
$ printf 'b\200\n' >d/g
$ grep -r a d
Binary file d/f matches
> "-a" doesn't work, BTW:
>
> Ishtar:/tmp> grep -a '\000\000' zeros
> Ishtar:/tmp> echo $?
> 1
That's the way 'grep' has always behaved. The regular expression '\0' matches
the string "0", not the NUL byte.
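Paul's point about '\0' can be checked directly. The following is a sketch that assumes GNU grep, where a backslash before an ordinary character such as 0 yields that character (POSIX leaves the case undefined, and newer GNU releases may print a stray-backslash warning, hidden here). The variable names are illustrative only.

```shell
# '\000' is read as the three characters "000", so it matches a line
# that consists of the digits 0 0 0.
digits_hit=$(printf '000\n' | grep -c '\000' 2>/dev/null)

# The same pattern does not match a line made of three NUL bytes,
# even when -a forces the input to be treated as text.
nul_hit=$(printf '\000\000\000\n' | grep -a -c '\000' 2>/dev/null)

printf 'lines matched in digit input: %s\n' "$digits_hit"
printf 'lines matched in NUL input:   %s\n' "$nul_hit"
```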
> Ishtar:/tmp> grep -P '\000\000' zeros Binary file zeros matches
I don't follow this example; perhaps some text was omitted? Anyway, -P has
always treated files containing zeros as binary files too, ever since -P has
been introduced. It's the same as without -P.
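For readers who actually need to detect NUL bytes, two approaches are sketched below. Both are assumptions added for illustration and are not taken from the thread: the first counts NULs with a POSIX tr, and the second relies on GNU grep having been built with -P (PCRE) support, where \x00 denotes the NUL byte. The scratch file mirrors the 4k "zeros" file from the original report.

```shell
# Recreate a file of 4096 NUL bytes, as in the original report.
tmpdir=$(mktemp -d)
dd if=/dev/zero of="$tmpdir/zeros" bs=4096 count=1 2>/dev/null

# Portable check: delete every byte except NUL, then count what is left.
nul_count=$(tr -cd '\000' < "$tmpdir/zeros" | wc -c)
nul_count=$((nul_count))   # normalise away any padding that wc adds
printf 'NUL bytes: %s\n' "$nul_count"

# GNU grep only: -P enables PCRE syntax and -a treats the file as text.
# If this grep lacks -P, the test fails and the fallback text is shown.
if LC_ALL=C grep -a -q -P '\x00\x00' "$tmpdir/zeros" 2>/dev/null; then
  pcre_result='two adjacent NULs found'
else
  pcre_result='no match, or -P is unsupported in this grep'
fi
printf 'grep -P: %s\n' "$pcre_result"
rm -rf "$tmpdir"
```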
> But there it is -- if grep wasn't meant to handle binary files,
> it wouldn't know to call 'zeroes' a binary file.
Obviously, grep *is* meant to handle binary files; it's documented to handle
them in a particular way.
> how can 'shuf' claim to work on input lines yet have this allowed:
>
> -z, --zero-terminated
> line delimiter is NUL, not newline.
I don't follow this point. -z is a nice feature; we don't want to get rid of it.
> People argue to dumb down POSIX
> utils, because some corp wants to get a posix label but
> has a few shortcomings -- so they donate enough money and
> posix changes its rules.
I'm afraid you've gone off the deep end here.
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Mon, 25 May 2015 19:47:02 GMT)
Message #17 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> Linda Walsh wrote:
>
>> I had one file that it bailed on
>> saying it has an invalid UTF-8 encoding -- but the line was
>> recursive starting from '.' -- and it didn't name the file
----
I didn't report that as 'a bug', because when I went back to reproduce
it -- low level physics took over -- i.e. the closer I looked, the more
uncertain the problem became! I did change the grep * into a for i in
*;do echo file;grep file;...but couldn't find the file that gave the
message...Grrr. I will bet it was with the '-P' option, since the
standard Regex in perl complains about such things and since I was only
interested in status (was using -q _because_ I was searching for a
binary pattern -- the '\000\000') I got the warning but nothing else.
If I run into it again, maybe I can find it w/o looking too closely
then that uncertainty principle won't kick in... ;-)
>
> That's pretty vague. Can you reproduce that problem? I don't observe
> it:
>
> $ mkdir d
> $ printf 'a\200\n' >d/f
> $ printf 'b\200\n' >d/g
> $ grep -r a d
> Binary file d/f matches
>
>> "-a" doesn't work, BTW:
>>
>> Ishtar:/tmp> grep -a '\000\000' zeros
>> Ishtar:/tmp> echo $?
>> 1
>
> That's the way 'grep' has always behaved. The regular expression '\0'
> matches the string "0", not the NUL byte.
>
>> Ishtar:/tmp> grep -P '\000\000' zeros Binary file zeros matches
>
> I don't follow this example; perhaps some text was omitted? Anyway,
> -P has always treated files containing zeros as binary files too, ever
> since -P has been introduced. It's the same as without -P.
>
>> But there it is -- if grep wasn't meant to handle binary files,
>> it wouldn't know to call 'zeroes' a binary file.
>
> Obviously, grep *is* meant to handle binary files; it's documented to
> handle them in a particular way.
---
Nevertheless, it is documented, that '\ddd' or '\xHH' can be used
to match a single character of the value specified. '\000\000' is
found in 'zeroes' (as mentioned in the original report -- a file
filled with 4k of nulls), with the -P switch, but not the -a switch.
That behavior violates the documentation.
>
>> how can 'shuf' claim to work on input lines yet have this allowed:
>>
>> -z, --zero-terminated
>> line delimiter is NUL, not newline.
>
> I don't follow this point. -z is a nice feature; we don't want to get
> rid of it.
----
Nice of you to not read the previous notes. The argument was that
a NUL in a file made it non-text -- therefore it wouldn't be a "line".
>
>> People argue to dumb down POSIX
>> utils, because some corp wants to get a posix label but
>> has a few shortcomings -- so they donate enough money and
>> posix changes its rules.
>
> I'm afraid you've gone off the deep end here.
I didn't bring up POSIX, Eric did. Again, nice of you to jump
in the middle of a conversation and not read the earlier notes...
:-)
*Cheers* Paul...(et al).
-linda
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Mon, 25 May 2015 19:55:02 GMT)
Message #20 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Linda Walsh wrote:
> it is documented, that '\ddd' or '\xHH' can be used
> to match a single character of the value specified.
I don't see where it's documented to behave that way. Perhaps you're looking at
the wrong documentation?
> The argument was that
> a NUL in a file made it non-text -- therefore it wouldn't be a "line".
Obviously -z changes the definition of a line. -z is explicitly designed to
operate on files containing NUL bytes. So that argument was not coherent.
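How -z redefines a "line" as a NUL-terminated record can be illustrated as follows. This is a sketch assuming GNU grep and a sort that accepts -z; the helper function make_records and the sample data are hypothetical.

```shell
# Three NUL-terminated records; the second contains an embedded newline.
make_records() { printf 'pear\000apple\npie\000fig\000'; }

# grep -z counts NUL-terminated records, not newline-delimited lines.
record_count=$(make_records | grep -z -c '')

# sort -z orders whole records. tr then maps NUL to '|' and newline to '/'
# so that the result can be displayed on a single line.
sorted=$(make_records | LC_ALL=C sort -z | tr '\000\n' '|/')

printf 'records: %s\n' "$record_count"
printf 'sorted:  %s\n' "$sorted"
```

The embedded newline stays inside its record, which is why such tools can treat NUL-delimited input as "lines" without the data being text in the usual sense.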
>> I'm afraid you've gone off the deep end here.
> I didn't bring up POSIX, Eric did.
Eric's comments didn't incorporate conspiracy theories about corporate payoffs;
yours did.
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Mon, 25 May 2015 22:23:01 GMT)
Message #23 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> Linda Walsh wrote:
>> it is documented, that '\ddd' or '\xHH' can be used
>> to match a single character of the value specified.
>
> I don't see where it's documented to behave that way. Perhaps you're
> looking at the wrong documentation?
Perhaps you want to tell me where the documentation on the
standard and/or extended RE's is that you use?
I think I was referred to a number of different manpages...
it's the first reference under "See Also" at the bottom of the
grep page: awk. From the awk manpage:
   String Constants
       String constants in AWK are sequences of characters enclosed between
       double quotes (like "value"). Within strings, certain escape sequences
       are recognized, as in C. These are:

       \\    A literal backslash.
       \a    The "alert" character; usually the ASCII BEL character.
       \b    Backspace.
       \f    Form-feed.
       \n    Newline.
       \r    Carriage return.
       \t    Horizontal tab.
       \v    Vertical tab.
       \xhex digits
             The character represented by the string of hexadecimal digits
             following the \x. As in ISO C, all following hexadecimal digits
             are considered part of the escape sequence. (This feature should
             tell us something about language design by committee.) E.g.,
             "\x1B" is the ASCII ESC (escape) character.
       \ddd  The character represented by the 1-, 2-, or 3-digit sequence of
             octal digits. E.g., "\033" is the ASCII ESC (escape) character.
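The excerpt above concerns escapes inside awk string constants, which awk itself expands; that is a separate matter from what grep's pattern syntax accepts. A minimal sketch of the awk behaviour, assuming a POSIX awk and od (the variable name esc_hex is illustrative):

```shell
# awk expands "\033" inside a string constant before printing it;
# od then shows the resulting single byte in hexadecimal.
esc_hex=$(awk 'BEGIN { printf "%s", "\033" }' | od -An -tx1 | tr -d ' \n')

printf 'awk "\\033" yields byte 0x%s\n' "$esc_hex"
```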
>> The argument was that
>> a NUL in a file made it non-text -- therefore it wouldn't be a "line".
>
> Obviously -z changes the definition of a line. -z is explicitly
> designed to operate on files containing NUL bytes. So that argument
> was not coherent.
---
That is my opinion, also, but nevertheless, that '\000' implies binary
was said early in this bug discussion -- I was refuting that. The other
thing that trips up some tools is a missing terminating LF at the end
of a page of text (i.e. some editors will alter text-based files by adding
an extra LF at the end, which can cause problems with config files in
some cases).
>
>>> I'm afraid you've gone off the deep end here.
>> I didn't bring up POSIX, Eric did.
>
> Eric's comments didn't incorporate conspiracy theories about corporate
> payoffs; yours did.
---
I am stating facts. The ones who had the most influence on posix in the past
were the largest "gold sponsors". Now, it's fewer of them and more 'silver'....
but they, historically, have had the most influence on such standards
organizations.
I will remind you that POSIX described its initial mission statement as
"descriptive" -- not "prescriptive". That changed ~ 2003 or so when they
started telling implementors what they had to remove to be posix compliant.
The worst violation I can think of is removing the ability for rm to be used
easily and safely to remove everything under a specific directory:
"rm -fr --one-file-system ." -- It might be good to have a 1 char name for
that. For some reason I remember "-x" being a reasonable choice.
"rm" was always described to do a depth-first traversal, which means it
shouldn't
even look at top-paths except to descend into them. That was changed,
making coreutils rm's that follow that standard unreliable for removing
dir contents (w/o removing the dir).
I have good reasons -- not conspiracy, but capitalistic reasons for
what I say, and if you don't believe money and capitalism run this country,
I'd have to say it was you, who had gone off the deep end.
But if you had -- I can probably welcome you -- I think I live in the
deep end... ;-)
linda
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Mon, 25 May 2015 22:59:02 GMT)
Message #26 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Linda Walsh wrote:
> Perhaps you want to tell me where the documentation on the
> standard and/or extended RE's is that you use?
We're talking about grep, so the relevant documentation is the grep manual, not
the awk manual or other random stuff you might find on the Internet. Type 'info
grep'. Or if you're in Emacs, type 'C-h i m grep RET'.
> I have good reasons -- not conspiracy, but capitalistic reasons for
> what I say
Whether you do or not, they're irrelevant to this discussion and to be honest
that tinfoil-hat stuff isn't helping your case.
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Tue, 26 May 2015 01:20:02 GMT)
Message #29 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> Linda Walsh wrote:
>
>> Perhaps you want to tell me where the documentation on the
>> standard and/or extended RE's is that you use?
>
> We're talking about grep, so the relevant documentation is the grep
> manual, not the awk manual or other random stuff you might find on the
> Internet. Type 'info grep'. Or if you're in Emacs, type 'C-h i m
> grep RET'.
----
From the coreutils-5.97 info page:
Backslash escapes
A backslash followed by a character not listed below causes an
error message.
`\a'    Control-G.
`\b'    Control-H.
`\f'    Control-L.
`\n'    Control-J.
`\r'    Control-M.
`\t'    Control-I.
`\v'    Control-K.
`\OOO'  The character with the value given by OOO, which is 1 to 3
        octal digits.
`\\'    A backslash.
----
It didn't have 'hex' back then. But you've broken backward compatibility.
That would normally be a regression. You like to think that I wear a
tinfoil hat -- but I just have a good memory for how grep used to operate.
Maybe you should do some memory strengthening exercises (though I admit my
memory isn't what it always was, it was in this case).
Should I file this as a 2nd bug, that grep broke backward compat?
It *used* to be compatible with 'awk's regex, which is why it is the
first entry in the "See also".
>
> Whether you do or not, they're irrelevant to this discussion and to be
> honest that tinfoil-hat stuff isn't helping your case.
----
no tinfoil hat -- just a good memory, something you might find useful to
work on! ;-)
-linda
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Tue, 26 May 2015 01:30:12 GMT)
Message #32 received at 20638 <at> debbugs.gnu.org (full text, mbox):
> From the coreutils-5.97 info page:
Like I said, we're talking about grep, so you need to look at the grep manual.
grep is not part of coreutils, so you're barking up the wrong tree again.
> It *used* to be compatible with 'awk's regex
No, that's never been true. It wasn't true even back in the late 1970s, when I
first used grep and awk. If you want to blame someone, blame the Bell Labs
hackers who wrote them in the first place.
> no tinfoil hat -- just a good memory, something you might find useful to
> work on! ;-)
In this particular case I'm afraid your memory has played tricks on you.
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Tue, 26 May 2015 02:14:02 GMT)
Message #35 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> Linda Walsh wrote:
>
>> Perhaps you want to tell me where the documentation on the
>> standard and/or extended RE's is that you use?
----
Here is another:
*POSIX Extended Regular Expression Syntax:
(http://www.boost.org/doc/libs/1_43_0/libs/regex/doc/html/boost_regex/syntax/basic_extended.html)
Escapes
The POSIX standard defines no escape sequences for POSIX-Extended
regular expressions, except that:
* Any special character preceded by an escape shall match itself.
* The effect of any ordinary character being preceded by an escape
is undefined.
* An escape inside a character class declaration shall match itself:
in other words the escape character is not "special" inside a character
class declaration; so [\^] will match either a literal '\' or a '^'.
However, that's rather restrictive, so the following standard-compatible
extensions are also supported by Boost.Regex:
Escapes matching a specific character
The following escape sequences are all synonyms for single characters:
Escape     Character
\a         '\a'
\e         0x1B
\f         \f
\n         \n
\r         \r
\t         \t
\v         \v
\b         \b (but only inside a character class declaration).
\cX        An ASCII escape sequence - the character whose code point
           is X % 32
\xdd       A hexadecimal escape sequence - matches the single character
           whose code point is 0xdd.
\x{dddd}   A hexadecimal escape sequence - matches the single character
           whose code point is 0xdddd.
\0ddd      An octal escape sequence - matches the single character whose
           code point is 0ddd.
\N{Name}   Matches the single character which has the symbolic name name.
           For example \\N{newline} matches the single character \n.
*
>
> We're talking about grep, so the relevant documentation is the grep
> manual, not the awk manual or other random stuff you might find on the
> Internet. Type 'info grep'. Or if you're in Emacs, type 'C-h i m
> grep RET'.
-----
Again another example of \000 octal and \x hex.
Most descriptions of the chars grep takes say it was designed so that
awk, sed, tr -- any core linux util that takes regexes -- would be *the
same*, so people didn't have to learn a different syntax for each tool.
Information forwarded to bug-grep <at> gnu.org: bug#20638; Package grep. (Tue, 26 May 2015 02:31:02 GMT)
Message #38 received at 20638 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> In this particular case I'm afraid your memory has played tricks on you
---
You may be right ..;-(
I found these:
http://unix.stackexchange.com/questions/19491/how-to-specify-characters-using-hexadecimal-codes-in-grep
http://stackoverflow.com/questions/6319878/using-grep-to-search-for-hex-strings-in-a-file
and two others which pointed to the '-P' option
as being the only way in newer grep's..
Needless to say, I am scandalized...
However, One could always ask that those be added so as to be compatible
w/sed, awk....etc?
I.e. an RFE??
I'm pretty sure grep didn't have backreferences in it before either.
You going to tell me those date back to Bell labs as well?
I.e. -- if you look at earlier info pages, there wasn't a separate regex
for grep section (that I could find).... many of the regex-taking utils
pointed at each other for more clarification.
I thought it was from those that grep had the same notation. Not exactly
a faulty memory, but improbable logic?
Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 30 May 2015 20:05:05 GMT)
bug closed, send any further explanations to 20638 <at> debbugs.gnu.org and "L. A. Walsh" <gnu <at> tlinx.org>. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 30 May 2015 20:05:06 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 28 Jun 2015 11:24:06 GMT)