GNU bug report logs - #32704
Can grep search for a line feed and a null character at the same time?

Package: grep;

Reported by: 21naown <at> gmail.com

Date: Tue, 11 Sep 2018 16:27:01 UTC

Severity: wishlist

To reply to this bug, email your comments to 32704 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Tue, 11 Sep 2018 16:27:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to 21naown <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 11 Sep 2018 16:27:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: 21naown <at> gmail.com
To: bug-grep <at> gnu.org
Subject: Can grep search for a line feed and a null character at the same time?
Date: Tue, 11 Sep 2018 18:25:20 +0200
Hello,


I found someone who asked the same question on “Stack Overflow”, still 
unanswered, but this person did not ask it on the mailing list.

Here are the details of the question, which closely matches my case:
https://stackoverflow.com/questions/50295772/can-grep-search-for-a-line-feed-and-a-null-character-at-the-same-time


Thank you for your understanding.

Best regards.





Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Tue, 11 Sep 2018 17:04:01 GMT) Full text and rfc822 format available.

Message #8 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Tue, 11 Sep 2018 12:03:17 -0500
On 9/11/18 11:25 AM, 21naown <at> gmail.com wrote:
> Hello,
> 
> 
> I found someone who asked the same question on “Stack Overflow”, still 
> unanswered, but this person did not ask it on the mailing list.
> 
> Here are the details of the question which are nearly similar to my case:
> https://stackoverflow.com/questions/50295772/can-grep-search-for-a-line-feed-and-a-null-character-at-the-same-time 

Per 'info grep':

  15. How can I match across lines?

     Standard grep cannot do this, as it is fundamentally line-based.
     Therefore, merely using the ‘[:space:]’ character class does not
     match newlines in the way you might expect.

     With the GNU ‘grep’ option ‘-z’ (‘--null-data’), each input and
     output “line” is null-terminated; *note Other Options::.  Thus, you
     can match newlines in the input, but typically if there is a match
     the entire input is output, so this usage is often combined with
     output-suppressing options like ‘-q’, e.g.:

          printf 'foo\nbar\n' | grep -z -q 'foo[[:space:]]\+bar'

     If this does not suffice, you can transform the input before giving
     it to ‘grep’, or turn to ‘awk’, ‘sed’, ‘perl’, or many other
     utilities that are designed to operate across lines.

Grep does not have the ability to match hex or octal backslash 
sequences, and a literal newline in the pattern is taken as a separation 
of patterns.  Use of [:space:] to include newline alongside other things 
sort of works.  But maybe we really do have a bug - when -z is in 
effect, I'd expect NUL, rather than newline, to be the byte that 
separates separate patterns in the pattern argument (and thus expressing 
a literal newline, as in shells that understand $'\n', to be viable for 
writing a single pattern that matches exactly one newline byte at the 
end of a NUL-separated record).
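
The pattern-splitting behaviour described above can be observed with a small experiment. This is only a sketch: the names sample and pattern are illustrative, and the result assumes a GNU grep in which newlines inside a pattern argument still separate patterns, including under -z.

```shell
# Illustrative input: two lines, "foo" and "bar".
sample() { printf 'foo\nbar\n'; }

# A pattern argument holding "bar", a newline, then "foo".
pattern=$(printf 'bar\nfoo')

# Default mode: the argument is split into the patterns "bar" and "foo",
# so each of the two input lines is matched by one of them.
lines_matched=$(sample | grep -c -e "$pattern" || true)

# -z mode: the whole input is a single record. If the newline were taken
# literally, "bar<newline>foo" would not occur in "foo<newline>bar" and the
# count would be 0; a nonzero count means the argument was split at the newline.
records_matched=$(sample | grep -z -c -e "$pattern" || true)

echo "lines=$lines_matched records=$records_matched"
```

A nonzero records count is the sign that -z did not change how the pattern argument is split.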

That said, your EASIEST approach is to use iconv to recode your file out 
of UTF-16 (which is NOT conducive to multi-byte processing), into 
something friendlier like UTF-8, and then use grep on the converted file.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Tue, 11 Sep 2018 17:15:02 GMT) Full text and rfc822 format available.

Message #11 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Eric Blake <eblake <at> redhat.com>, 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Tue, 11 Sep 2018 10:14:51 -0700
On 9/11/18 10:03 AM, Eric Blake wrote:
> maybe we really do have a bug - when -z is in effect, I'd expect NUL, 
> rather than newline, to be the byte that separates separate patterns 
> in the pattern argument

You're right, I think it's a bug that grep -zf FILE uses newline 
separators in FILE. It should use NUL separators.

This cannot be done for NUL bytes in command-line patterns, though, 
since command-line arguments cannot contain NUL bytes.





Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Tue, 11 Sep 2018 17:41:02 GMT) Full text and rfc822 format available.

Message #14 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Tue, 11 Sep 2018 12:39:54 -0500
On 9/11/18 12:14 PM, Paul Eggert wrote:
> On 9/11/18 10:03 AM, Eric Blake wrote:
>> maybe we really do have a bug - when -z is in effect, I'd expect NUL, 
>> rather than newline, to be the byte that separates separate patterns 
>> in the pattern argument
> 
> You're right, I think it's a bug that grep -zf FILE uses newline 
> separators in FILE. It should use NUL separators.
> 
> This cannot be done for NUL bytes in command-line patterns, though, 
> since command-line arguments cannot contain NUL bytes.

Indeed.  But that merely means that on the command line, when -z is in 
effect, you can't specify multiple patterns (but instead have to use -f 
FILE if that's what you really want).  Meanwhile, the effect on being 
able to match a literal newline would be observable from either the 
command line or -f FILE.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Sat, 15 Sep 2018 17:07:01 GMT) Full text and rfc822 format available.

Message #17 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Sat, 15 Sep 2018 12:06:47 -0500
On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
> Thank you for your messages.
> 
> It is possible I did not understand correctly your messages, because 
> grep finds hex sequences with the “-Pa” options at least.

grep -P introduces a completely different regex engine, with its own 
quirks.  As such, it does introduce different rules on backslash 
sequences accepted.

> 
> Examples—“input.txt” contains, from the file system, for example 
> “\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00”: 
> 
> grep -Pa '\x00' input.txt
> → found
> grep -Pza '\x0A' input.txt
> → found
> grep -Pa '\x0A\x00' input.txt

This will never match - when you are not using -z, there are no \x0A in 
the input stream (they have all been consumed by grep parsing one line 
at a time, ending at \x0A).  Instead, you'll want to search for '^\x00' 
or '\x00$' for a pattern anchored to a line transition, to find patterns 
where newline was next to NUL.

> grep -Pza '\x0A\x00' input.txt
> → not found for both

Similarly, when you are using -z, there are no \x00 in the input stream 
(they  have all been consumed by grep parsing one NUL-terminated record 
at a time, ending at \x00).  Instead, you'll want to search for '^\x0a' 
or '\x0a$' for a pattern anchored to a record transition, to find 
patterns where newline was next to NUL.
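
A small sketch of the non -z case described above, assuming a grep built with -P (PCRE) support. The helper name sample_bytes and its data are illustrative only; the bytes are the UTF-16LE encoding of "t", CR, LF, "t".

```shell
# Emit t\0 \r\0 \n\0 t\0 (no BOM, no trailing newline).
sample_bytes() { printf 't\000\r\000\n\000t\000'; }

# Without -z every \x0A is a line terminator and never part of a line,
# so a pattern containing \x0A has nothing to match against.
lf_nul=$(sample_bytes | LC_ALL=C grep -Pac '\x0A\x00' || true)

# The NUL that followed the newline is now the first byte of the second
# line, so anchoring at the start of a line finds it.
nul_at_start=$(sample_bytes | LC_ALL=C grep -Pac '^\x00' || true)

echo "lf_nul=$lf_nul nul_at_start=$nul_at_start"
```

The `|| true` only keeps a zero count (grep exit status 1) from aborting a script that runs under `set -e`.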

> 
> But is it at least possible to find “\x0A\x00” with grep?

If you bend the rules by throwing -P into the mix, yes :)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Sat, 15 Sep 2018 17:44:01 GMT) Full text and rfc822 format available.

Message #20 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: 21naown <at> gmail.com
To: 32704 <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Sat, 15 Sep 2018 18:43:44 +0200
Thank you for your messages.

It is possible I did not understand your messages correctly, because 
grep finds hex sequences with the “-Pa” options at least.

Examples—“input.txt” contains, from the file system, for example 
“\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00”:
grep -Pa '\x00' input.txt
→ found
grep -Pza '\x0A' input.txt
→ found
grep -Pa '\x0A\x00' input.txt
grep -Pza '\x0A\x00' input.txt
→ not found for both

But is it at least possible to find “\x0A\x00” with grep?




Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Sat, 15 Sep 2018 17:58:02 GMT) Full text and rfc822 format available.

Message #23 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: 21naown <at> gmail.com
To: 32704 <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Sat, 15 Sep 2018 19:57:24 +0200
Le 15/09/2018 à 19:06, Eric Blake a écrit :
> On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
>> Thank you for your messages.
>>
>> It is possible I did not understand correctly your messages, because 
>> grep finds hex sequences with the “-Pa” options at least.
>
> grep -P introduces a completely different regex engine, with its own 
> quirks.  As such, it does introduce different rules on backslash 
> sequences accepted.
>
>>
>> Examples—“input.txt” contains, from the file system, for example 
>> “\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00”: 
>>
>> grep -Pa '\x00' input.txt
>> → found
>> grep -Pza '\x0A' input.txt
>> → found
>> grep -Pa '\x0A\x00' input.txt
>
> This will never match - when you are not using -z, there are no \x0A 
> in the input stream (they have all been consumed by grep parsing one 
> line at a time, ending at \x0A).  Instead, you'll want to search for 
> '^\x00' or '\x00$' for a pattern anchored to a line transition, to 
> find patterns where newline was next to NUL.
>
>> grep -Pza '\x0A\x00' input.txt
>> → not found for both
>
> Similarly, when you are using -z, there are no \x00 in the input 
> stream (they  have all been consumed by grep parsing one 
> NUL-terminated record at a time, ending at \x00).  Instead, you'll 
> want to search for '^\x0a' or '\x0a$' for a pattern anchored to a 
> record transition, to find patterns where newline was next to NUL.
>
>>
>> But is it at least possible to find “\x0A\x00” with grep?
>
> If you bend the rules by throwing -P into the mix, yes :)
>
So it is possible to find “\x0A\x00” alone, but for example 
“\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with the 
“-P” option?




Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Sat, 15 Sep 2018 20:21:01 GMT) Full text and rfc822 format available.

Message #26 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org, Eric Blake <eblake <at> redhat.com>, 
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Sat, 15 Sep 2018 14:20:40 -0600
Hello,

On 15/09/18 11:57 AM, 21naown <at> gmail.com wrote:
> Le 15/09/2018 à 19:06, Eric Blake a écrit :
>> On 9/15/18 11:43 AM, 21naown <at> gmail.com wrote:
>>> But is it at least possible to find “\x0A\x00” with grep?
>>
>> If you bend the rules by throwing -P into the mix, yes :)
>>
> So it is possible to find “\x0A\x00” alone, but for example 
> “\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with the 
> “-P” option?

If I may suggest a different tool, GNU sed can handle such regexes more 
easily than grep.
The 'trick' is to accumulate multiple lines into memory, then run the
regex on the entire buffer.

1.
If your input is small enough to fit in memory,
you can load the entire file into memory,
and run the regex on the buffer:

$ printf '\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00' \
     | LC_ALL=C sed -n 'H;$!d ; x ; /\x0a\x00/q0 ; q1' \
           && echo MATCH || echo NO-MATCH

The "H;$!d" commands accumulate lines into the hold buffer.
The "x" command exchanges the hold buffer and the pattern buffer, so the
accumulated lines end up in the pattern buffer.
Then the regex "/\x0a\x00/" searches in the buffer.
If there was a match, sed quits with exit code 0 ("q0").
Otherwise, sed quits with exit code 1 ("q1").


2.
If the file is too big to fit in memory,
you can process it line-by-line like so:

$ printf '\xFF\xFE\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00\x73\x00\x74\x00\x5F\x00\x74\x00\x77\x00\x6F\x00\x0D\x00\x0A\x00' \
     | LC_ALL=C sed -n 'N;/\x00\x0a/q0;$q1;D;' \
             && echo MATCH || echo NO-MATCH

The N,D commands work in tandem: N appends the next line to the
buffer, then D deletes the oldest line from the buffer (think FIFO).
The regex then operates on the buffer, which contains the last two lines.



More details are in the manual:
 https://www.gnu.org/software/sed/manual/sed.html#Multiline-techniques
 https://www.gnu.org/software/sed/manual/sed.html#Text-search-across-multiple-lines



regards,
 - assaf





Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Sat, 15 Sep 2018 20:28:01 GMT) Full text and rfc822 format available.

Message #29 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: 21naown <at> gmail.com, 32704 <at> debbugs.gnu.org, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Sat, 15 Sep 2018 15:27:08 -0500
On 9/15/18 12:57 PM, 21naown <at> gmail.com wrote:

>>> But is it at least possible to find “\x0A\x00” with grep?
>>
>> If you bend the rules by throwing -P into the mix, yes :)
>>
> So it is possible to find “\x0A\x00” alone, but for example 
> “\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with the 
> “-P” option?

Correct. It is impossible to find the record terminator in the middle of 
a pattern, whether that terminator is \n (default) or NUL (-z).  It is 
therefore impossible to find a multi-record match using grep.  The 
string you listed contains both \x00 and \x0a, so regardless of which of 
those two bytes you pick as the record terminator, it is impossible to 
use grep to find that substring in your file.  You'll have to resort to 
a tool that supports multiline matching, since grep is not such a tool.

It IS possible, of course, to change your data, for example:

tr '\0' '\377' < file | grep "$modified_pattern" | tr '\377' '\0'

assuming that \xff didn't appear anywhere else in the file; although it 
may make matching harder if you don't have the right record terminators 
any longer.  Or, if your input data is encoded in UTF-16, it's easiest 
to convert it into UTF-8 for the grep:

iconv -f UTF-16 -t UTF-8 < file | grep "$modified_pattern" \
  | iconv -f UTF-8 -t UTF-16
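
A sketch of the conversion approach combined with the -z technique from the 'info grep' excerpt, for a match that spans two lines. The function names and sample data are illustrative, and the pattern relies on the GNU extension \+ in basic regular expressions.

```shell
# Illustrative UTF-16 input with a little-endian BOM:
# "test", CR LF, "test_two", CR LF.
sample_utf16() {
  printf '\377\376t\000e\000s\000t\000\r\000\n\000'
  printf 't\000e\000s\000t\000_\000t\000w\000o\000\r\000\n\000'
}

# After conversion to UTF-8 the text contains no NUL bytes, so with -z the
# whole input is one record and [[:space:]] can match the CR LF pair that
# separates the two lines. -q suppresses output; only the exit status is used.
spans_lines() {
  iconv -f UTF-16 -t UTF-8 | grep -z -q 'test[[:space:]]\+test_two'
}

if sample_utf16 | spans_lines; then echo MATCH; else echo NO-MATCH; fi
```

Because the converted text is free of NUL bytes, the record terminator no longer collides with the data being searched.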

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




Information forwarded to bug-grep <at> gnu.org:
bug#32704; Package grep. (Mon, 17 Sep 2018 15:57:02 GMT) Full text and rfc822 format available.

Message #32 received at 32704 <at> debbugs.gnu.org (full text, mbox):

From: 21naown <at> gmail.com
To: 32704 <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>,
 Eric Blake <eblake <at> redhat.com>, Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#32704: Can grep search for a line feed and a null character
 at the same time?
Date: Mon, 17 Sep 2018 17:56:52 +0200
Hello Assaf.

Thank you Assaf and Eric for your suggestions. I will also look at the 
tool “pcregrep”.

--------------------------------------------------------------------------------

Thank you Eric for having answered the question of the subject:

Le 15/09/2018 à 22:27, Eric Blake a écrit :
> On 9/15/18 12:57 PM, 21naown <at> gmail.com wrote:
>
>> So it is possible to find “\x0A\x00” alone, but for example 
>> “\x74\x00\x0D\x00\x0A\x00\x74\x00\x65\x00” is impossible to find with 
>> the “-P” option?
>
> Correct. It is impossible to find the record terminator in the middle 
> of a pattern, whether that terminator is \n (default) or NUL (-z).  It 
> is therefore impossible to find a multi-record match using grep.  The 
> string you listed contains both \x00 and \x0a, so regardless of which 
> of those two bytes you pick as the record terminator, it is impossible 
> to use grep to find that substring in your file.  You'll have to 
> resort to a tool that supports multiline matching, since grep is not 
> such a tool.




Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 21 Sep 2020 19:48:01 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 328 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.