GNU bug report logs - #32704
Can grep search for a line feed and a null character at the same time?
To reply to this bug, email your comments to 32704 AT debbugs.gnu.org.
Report forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Tue, 11 Sep 2018 16:27:01 GMT)
Acknowledgement sent to 21naown <at> gmail.com: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 11 Sep 2018 16:27:01 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello,
I found someone who asked the same question on “Stack Overflow”, still
unanswered, but this person did not ask it on the mailing list.
Here are the details of the question which are nearly similar to my case:
https://stackoverflow.com/questions/50295772/can-grep-search-for-a-line-feed-and-a-null-character-at-the-same-time
Thank you for your understanding.
Best regards.
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Tue, 11 Sep 2018 17:04:01 GMT)
Message #8 received at 32704 <at> debbugs.gnu.org (full text, mbox):
On 9/11/18 11:25 AM, 21naown <at> gmail.com wrote:
> Hello,
>
>
> I found someone who asked the same question on “Stack Overflow”, still
> unanswered, but this person did not ask it on the mailing list.
>
> Here are the details of the question which are nearly similar to my case:
> https://stackoverflow.com/questions/50295772/can-grep-search-for-a-line-feed-and-a-null-character-at-the-same-time
Per 'info grep':
15. How can I match across lines?
Standard grep cannot do this, as it is fundamentally line-based.
Therefore, merely using the ‘[:space:]’ character class does not
match newlines in the way you might expect.
With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input and
output “line” is null-terminated; *note Other Options::. Thus, you
can match newlines in the input, but typically if there is a match
the entire input is output, so this usage is often combined with
output-suppressing options like ‘-q’, e.g.:
printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'
If this does not suffice, you can transform the input before giving
it to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many other
utilities that are designed to operate across lines.
Grep does not have the ability to match hex or octal backslash
sequences, and a literal newline in the pattern is taken as a separator
between patterns. Using [:space:] to include newline alongside other
characters sort of works. But maybe we really do have a bug: when -z is
in effect, I'd expect NUL, rather than newline, to be the byte that
separates patterns in the pattern argument (which would make a literal
newline, as written in shells that understand $'\n', viable for
writing a single pattern that matches exactly one newline byte at the
end of a NUL-separated record).
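The two behaviors described above can be sketched as follows. This is a minimal illustration that assumes GNU grep (the -z option is a GNU extension) and reflects the behavior discussed in this report; the variable names are made up for the example.

```shell
# Assumes GNU grep; variable names are illustrative only.

# The manual's approach: with -z the whole input is one record, so a
# bracket class that includes newline can match across the line break.
if printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'; then
    class_result=MATCH
else
    class_result=NO-MATCH
fi

# A literal newline inside the pattern argument separates patterns, even
# with -z: this argument is read as two patterns, "foo" OR "qux".
two_patterns='foo
qux'
split_count=$(printf 'foo\000bar\000' | grep -z -c "$two_patterns")

echo "$class_result $split_count"
```

If the newline were taken literally, no record would match and the count would be 0; a count of 1 shows that "foo" alone was used as a pattern.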
That said, your EASIEST approach is to use iconv to recode your file out
of UTF-16 (which is NOT conducive to multi-byte processing), into
something friendlier like UTF-8, and then use grep on the converted file.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Tue, 11 Sep 2018 17:15:02 GMT)
Message #11 received at 32704 <at> debbugs.gnu.org (full text, mbox):
On 9/11/18 10:03 AM, Eric Blake wrote:
> maybe we really do have a bug - when -z is in effect, I'd expect NUL,
> rather than newline, to be the byte that separates separate patterns
> in the pattern argument
You're right, I think it's a bug that grep -zf FILE uses newline
separators in FILE. It should use NUL separators.
This cannot be done for NUL bytes in command-line patterns, though,
since command-line arguments cannot contain NUL bytes.
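A small sketch of the -f FILE behavior described above, assuming GNU grep as it behaved at the time of this report (a later release could change it). The temporary file and variable names are illustrative.

```shell
# Assumes GNU grep; patfile is a throwaway file created by mktemp.
patfile=$(mktemp)
printf 'foo\nqux\n' > "$patfile"

# Even with -z, FILE is split on newlines: two patterns, "foo" and "qux".
# The input holds three NUL-terminated records: foo, bar, qux.
file_count=$(printf 'foo\000bar\000qux\000' | grep -z -c -f "$patfile")
rm -f "$patfile"

echo "$file_count"
```

Two records match, one per pattern. If FILE were split on NUL instead, its whole content would form a single pattern containing newlines and none of these records would match.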
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Tue, 11 Sep 2018 17:41:02 GMT)
Message #14 received at 32704 <at> debbugs.gnu.org (full text, mbox):
On 9/11/18 12:14 PM, Paul Eggert wrote:
> On 9/11/18 10:03 AM, Eric Blake wrote:
>> maybe we really do have a bug - when -z is in effect, I'd expect NUL,
>> rather than newline, to be the byte that separates separate patterns
>> in the pattern argument
>
> You're right, I think it's a bug that grep -zf FILE uses newline
> separators in FILE. It should use NUL separators.
>
> This cannot be done for NUL bytes in command-line patterns, though,
> since command-line arguments cannot contain NUL bytes.
Indeed. But that merely means that on the command line, when -z is in
effect, you can't specify multiple patterns (but instead have to use -f
FILE if that's what you really want). Meanwhile, the effect on being
able to match a literal newline would be observable from either the
command line or -f FILE.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Sat, 15 Sep 2018 17:07:01 GMT)
Message #17 received at 32704 <at> debbugs.gnu.org (full text, mbox):
On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
> Thank you for your messages.
>
> It is possible I did not understand your messages correctly, because
> grep finds hex sequences with the “-Pa” options at least.
grep -P introduces a completely different regex engine, with its own
quirks. As such, it does introduce different rules on backslash
sequences accepted.
>
> Examples—“input.txt” contains, from the file system, for example
> “\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00”:
>
> grep -Pa '\x00' input.txt
> → found
> grep -Pza '\x0A' input.txt
> → found
> grep -Pa '\x0A\x00' input.txt
This will never match - when you are not using -z, there are no \x0A in
the input stream (they have all been consumed by grep parsing one line
at a time, ending at \x0A). Instead, you'll want to search for '^\x00'
or '\x00$' for a pattern anchored to a line transition, to find patterns
where newline was next to NUL.
> grep -Pza '\x0A\x00' input.txt
> → not found for either
Similarly, when you are using -z, there are no \x00 in the input stream
(they have all been consumed by grep parsing one NUL-terminated record
at a time, ending at \x00). Instead, you'll want to search for '^\x0a'
or '\x0a$' for a pattern anchored to a record transition, to find
patterns where newline was next to NUL.
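To make the record splitting visible, here is a minimal sketch assuming GNU grep and a tr that accepts octal escapes. The terminator byte never reaches the matcher; the other byte is ordinary data. NUL and LF are translated to printable characters after grep has run, purely for display.

```shell
# Input bytes: x NUL LF y LF

# Line mode: LF ends each record, so the NUL stays inside the first line.
line_mode=$(printf 'x\000\ny\n' | grep -a 'x' | tr '\000' '@')

# NUL mode (-z): NUL ends each record, so the LFs stay inside the second one.
nul_mode=$(printf 'x\000\ny\n' | grep -a -z 'y' | tr '\000\n' '@%')

echo "$line_mode"   # NUL shown as @
echo "$nul_mode"    # NUL shown as @, LF shown as %
```

The first result is `x@` (the line is "x NUL"); the second is `%y%@` (the record is "LF y LF", and grep writes a NUL terminator after it).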
>
> But is it at least possible to find “\x0A\x00” with grep?
If you bend the rules by throwing -P into the mix, yes :)
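A hedged sketch of the anchored search suggested above, in line mode. It assumes a GNU grep built with PCRE support and skips itself otherwise; the sample bytes and variable names are invented for the example.

```shell
# Assumes GNU grep with -P (PCRE) support; skip quietly otherwise.
if printf 'x\n' | grep -P 'x' >/dev/null 2>&1; then
    # Bytes: a LF NUL b LF. The NUL directly follows a line feed, so it
    # sits at the start of the second line.
    after_lf=$(printf 'a\n\000b\n' | grep -P -a -c '^\x00' || true)

    # Bytes: a NUL b LF. The NUL is in the middle of a line.
    mid_line=$(printf 'a\000b\n' | grep -P -a -c '^\x00' || true)

    echo "$after_lf $mid_line"
else
    echo 'grep -P not available'
fi
```

Note that '^\x00' also matches a NUL that is the very first byte of the input, where no line feed precedes it.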
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Sat, 15 Sep 2018 17:44:01 GMT)
Message #20 received at 32704 <at> debbugs.gnu.org (full text, mbox):
Thank you for your messages.
It is possible I did not understand your messages correctly, because
grep finds hex sequences with the “-Pa” options at least.
Examples—“input.txt” contains, from the file system, for example
“\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00”:
grep -Pa '\x00' input.txt
→ found
grep -Pza '\x0A' input.txt
→ found
grep -Pa '\x0A\x00' input.txt
grep -Pza '\x0A\x00' input.txt
→ not found for either
But is it at least possible to find “\x0A\x00” with grep?
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Sat, 15 Sep 2018 17:58:02 GMT)
Message #23 received at 32704 <at> debbugs.gnu.org (full text, mbox):
On 15/09/2018 at 19:06, Eric Blake wrote:
> On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
>> Thank you for your messages.
>>
>> It is possible I did not understand your messages correctly, because
>> grep finds hex sequences with the “-Pa” options at least.
>
> grep -P introduces a completely different regex engine, with its own
> quirks. As such, it does introduce different rules on backslash
> sequences accepted.
>
>>
>> Examples—“input.txt” contains, from the file system, for example
>> “\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00”:
>>
>> grep -Pa '\x00' input.txt
>> → found
>> grep -Pza '\x0A' input.txt
>> → found
>> grep -Pa '\x0A\x00' input.txt
>
> This will never match - when you are not using -z, there are no \x0A
> in the input stream (they have all been consumed by grep parsing one
> line at a time, ending at \x0A). Instead, you'll want to search for
> '^\x00' or '\x00$' for a pattern anchored to a line transition, to
> find patterns where newline was next to NUL.
>
>> grep -Pza '\x0A\x00' input.txt
>> → not found for either
>
> Similarly, when you are using -z, there are no \x00 in the input
> stream (they have all been consumed by grep parsing one
> NUL-terminated record at a time, ending at \x00). Instead, you'll
> want to search for '^\x0a' or '\x0a$' for a pattern anchored to a
> record transition, to find patterns where newline was next to NUL.
>
>>
>> But is it at least possible to find “\x0A\x00” with grep?
>
> If you bend the rules by throwing -P into the mix, yes :)
>
So it is possible to find “\x0A\x00” alone, but for example
“\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with the
“-P” option?
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Sat, 15 Sep 2018 20:21:01 GMT)
Message #26 received at 32704 <at> debbugs.gnu.org (full text, mbox):
Hello,
On 15/09/18 11:57 AM, 21naown <at> gmail.com wrote:
> On 15/09/2018 at 19:06, Eric Blake wrote:
>> On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
>>> But is it at least possible to find “\x0A\x00” with grep?
>>
>> If you bend the rules by throwing -P into the mix, yes :)
>>
> So it is possible to find “\x0A\x00” alone, but for example
> “\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with the
> “-P” option?
If I may suggest a different tool, GNU sed can handle such regexes more
easily than grep.
The 'trick' is to accumulate multiple lines into memory, then run the
regex on the entire buffer.
1. If your input is small enough to fit in memory,
you can load the entire file into memory
and run the regex on the buffer:
$ printf '\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00' \
| LC_ALL=C sed -n 'H;$!d ; x ; /\x0a\x00/q0 ; q1' \
&& echo MATCH || echo NO-MATCH
The "H;$!d" commands accumulate lines into the hold buffer.
The "x" command exchanges the hold buffer and the pattern buffer, so the accumulated text becomes the pattern buffer.
Then the regex "/\x0a\x00/" searches in the buffer.
If there was a match, sed quits with exit code 0 ("q0").
Otherwise, sed quits with exit code 1 ("q1").
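The same whole-file technique can be wrapped in a small shell function. This is only a sketch: it assumes GNU sed (for the \xHH escapes in the regex and the exit code given to q), and has_lf_nul is a hypothetical helper name. The test bytes are written with octal escapes, which POSIX printf guarantees, rather than \xHH.

```shell
# has_lf_nul FILE (hypothetical helper): exit status 0 if FILE contains
# byte 0x0A immediately followed by byte 0x00. Assumes GNU sed.
has_lf_nul() {
    LC_ALL=C sed -n 'H;$!d ; x ; /\x0a\x00/q0 ; q1' "$1"
}

sample=$(mktemp)

# UTF-16LE BOM, CR, LF, "t": contains the bytes 0A 00.
printf '\377\376\r\000\n\000t\000' > "$sample"
if has_lf_nul "$sample"; then utf16_result=MATCH; else utf16_result=NO-MATCH; fi

# Plain ASCII text: contains no NUL at all.
printf 'plain text\n' > "$sample"
if has_lf_nul "$sample"; then plain_result=MATCH; else plain_result=NO-MATCH; fi

rm -f "$sample"
echo "$utf16_result $plain_result"
```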
2. If the file is too big to fit in memory,
you can process it line-by-line like so:
$ printf '\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00' \
| LC_ALL=C sed -n 'N;/\x00\x0a/q0;$q1;D;' \
&& echo MATCH || echo NO-MATCH
The N and D commands work in tandem: N appends the next line to the
buffer, then D deletes the oldest (first) line from it (think FIFO).
The regex then operates on the buffer which contains the last two lines.
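To see the two-line window move, the following sketch replaces the regex with sed's "l" command, which prints the pattern buffer unambiguously (an embedded newline appears as \n and the end of the buffer as $). The three-line input is made up for the illustration.

```shell
# Print the pattern buffer after each N, then drop its oldest line with D.
window_trace=$(printf 'a\nb\nc\n' | sed -n 'N;l;D')
printf '%s\n' "$window_trace"
```

This prints `a\nb$` and then `b\nc$`: each buffer holds the previous line and the current one, so a pattern that spans one line break is always visible in some window.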
More details are in the manual:
https://www.gnu.org/software/sed/manual/sed.html#Multiline-techniques
https://www.gnu.org/software/sed/manual/sed.html#Text-search-across-multiple-lines
regards,
- assaf
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Sat, 15 Sep 2018 20:28:01 GMT)
Message #29 received at 32704 <at> debbugs.gnu.org (full text, mbox):
On 9/15/18 12:57 PM, 21naown <at> gmail.com wrote:
>>> But is it at least possible to find “\x0A\x00” with grep?
>>
>> If you bend the rules by throwing -P into the mix, yes :)
>>
> So it is possible to find “\x0A\x00” alone, but for example
> “\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with the
> “-P” option?
Correct. It is impossible to find the record terminator in the middle of
a pattern, whether that terminator is \n (default) or NUL (-z). It is
therefore impossible to find a multi-record match using grep. The
string you listed contains both \x00 and \x0a, so regardless of which of
those two bytes you pick as the record terminator, it is impossible to
use grep to find that substring in your file. You'll have to resort to
a tool that supports multiline matching, since grep is not such a tool.
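One way to keep grep in the loop is to turn the bytes into text first, so that neither NUL nor newline reaches grep as a record terminator. The sketch below is not from this thread: it dumps the file as hex pairs with od and searches that dump. has_bytes is a hypothetical helper name, and the whole dump becomes one long line, so this suits modest file sizes.

```shell
# has_bytes FILE HEX... (hypothetical helper): exit status 0 if FILE contains
# the given byte sequence, written as lowercase two-digit hex pairs.
has_bytes() {
    file=$1
    shift
    # od prints one hex pair per byte; tr joins od's output lines and
    # squeezes the separators, giving " ff fe 0d 00 ..." on a single line.
    od -An -v -tx1 "$file" | tr -s ' \n' ' ' | grep -q -F -- " $*"
}

# The sample bytes quoted in this thread, written with octal escapes.
sample=$(mktemp)
printf '\377\376\r\000\n\000t\000e\000s\000t\000\r\000\n\000t\000e\000s\000t\000_\000t\000w\000o\000\r\000\n\000' > "$sample"

if has_bytes "$sample" 0a 00; then short_result=MATCH; else short_result=NO-MATCH; fi
if has_bytes "$sample" 74 00 0d 00 0a 00 74 00 65 00; then long_result=MATCH; else long_result=NO-MATCH; fi
if has_bytes "$sample" 0a 0d; then absent_result=MATCH; else absent_result=NO-MATCH; fi

rm -f "$sample"
echo "$short_result $long_result $absent_result"
```

Because every byte is exactly two hex digits followed by a space, a match in the dump always lines up with byte boundaries in the file.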
It IS possible, of course, to change your data, for example:
tr '\0' '\377' < file | grep "$modified_pattern" | tr '\377' '\0'
assuming that byte 0xFF (written \377 because tr takes octal, not hex, escapes) didn't appear anywhere else in the file; although it
may make matching harder if you don't have the right record terminators
any longer. Or, if your input data is encoded in UTF-16, it's easiest
to convert it into UTF-8 for the grep:
iconv -f UTF-16 -t UTF-8 < file | grep $modified_pattern \
| iconv -f UTF-8 -t UTF-16
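A minimal sketch of the conversion step on the sample bytes from this thread, assuming an iconv that detects the byte order of UTF-16 input from its BOM (as glibc's does). The search string 'test_two' and the variable names are chosen for the example.

```shell
# The thread's sample: UTF-16LE BOM, CR LF, "test" CR LF, "test_two" CR LF.
sample=$(mktemp)
printf '\377\376\r\000\n\000t\000e\000s\000t\000\r\000\n\000t\000e\000s\000t\000_\000t\000w\000o\000\r\000\n\000' > "$sample"

# After conversion each of these characters is a single byte, so ordinary
# line-based grep applies; -c counts the matching lines.
utf8_hits=$(iconv -f UTF-16 -t UTF-8 < "$sample" | grep -c 'test_two')

rm -f "$sample"
echo "$utf8_hits"
```

The converted text still ends its lines with CR LF, so a pattern anchored with $ would have to account for the trailing CR.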
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#32704; Package grep. (Mon, 17 Sep 2018 15:57:02 GMT)
Message #32 received at 32704 <at> debbugs.gnu.org (full text, mbox):
Hello Assaf.
Thank you Assaf and Eric for your suggestions. I will also look at the
tool “pcregrep”.
--------------------------------------------------------------------------------
Thank you Eric for having answered the question of the subject:
On 15/09/2018 at 22:27, Eric Blake wrote:
> On 9/15/18 12:57 PM, 21naown <at> gmail.com wrote:
>
>> So it is possible to find “\x0A\x00” alone, but for example
>> “\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with
>> the “-P” option?
>
> Correct. It is impossible to find the record terminator in the middle
> of a pattern, whether that terminator is \n (default) or NUL (-z). It
> is therefore impossible to find a multi-record match using grep. The
> string you listed contains both \x00 and \x0a, so regardless of which
> of those two bytes you pick as the record terminator, it is impossible
> to use grep to find that substring in your file. You'll have to
> resort to a tool that supports multiline matching, since grep is not
> such a tool.
Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 21 Sep 2020 19:48:01 GMT)
This bug report was last modified 4 years and 328 days ago.