GNU bug report logs - #73360
Error when a long list is provided to grep with "--binary-files=without-match" option


Package: grep;

Reported by: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>

Date: Thu, 19 Sep 2024 14:29:04 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 73360 in the body.
You can then email your comments to 73360 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Thu, 19 Sep 2024 14:29:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 19 Sep 2024 14:29:04 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Thu, 19 Sep 2024 10:49:31 -0300
[Message part 1 (text/plain, inline)]
Hello. I'm trying to use grep to get the list of all non-binary files in a
given folder. I tried with the 2.20 and the 3.11 release.

For some reason, grep is providing 2 false negatives when the list is huge.
This issue does not happen if I break the grep input with "xargs -n X".

Check below:

[opc <at> oradiff-core dbhome_1]$ grep -V
grep (GNU grep) 3.11
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.

[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 -n 100 grep -Il '.' > /tmp/list1.list

[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 grep -Il '.' > /tmp/list2.list

[opc <at> oradiff-core dbhome_1]$ diff /tmp/list1.list /tmp/list2.list
12268,12269d12267
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai

[opc <at> oradiff-core dbhome_1]$ wc -l /tmp/list1.list /tmp/list2.list
  23397 /tmp/list1.list
  23395 /tmp/list2.list
  46792 total

The output should not show any difference.

The same issue was also reproduced in grep 2.20.

Thanks,
Rodrigo

Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Fri, 20 Sep 2024 00:22:02 GMT) Full text and rfc822 format available.

Message #8 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: jackson <at> fastmail.com
To: "Rodrigo Jorge" <rodrigoaraujorge <at> gmail.com>, 73360 <at> debbugs.gnu.org
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Thu, 19 Sep 2024 19:19:26 -0500
I can't reproduce this.  I am running "grep (GNU grep) 3.11" and "xargs (GNU findutils) 4.10.0" on an Artix distribution.

I have a directory that has 52422 regular files in it, over twice your example with some 23397 files.

I get the same result regardless of whether or not I constrain xargs to "-n 100" arguments per exec.

Could you:

1) Check whether xargs invokes grep any differently on the two files that are coming up different, by looking for those two missing filenames in the trace output from the xargs --verbose option.

2) Probably not helpful, but is there anything strange about these two missing files:
    < ./apex/images/apex_ui/psd/apex_5_ui.ai
    < ./apex/images/apex_ui/psd/apex-logo.ai
   Are their sizes and file types, from the 'file' command, similar to some of the other files?

My wild guess would be that you're hitting some unknown limit in xargs when it is invoked with an argument list that is right at the limit of what your system allows.  But I wouldn't bet a cheap beer on that guess being right.

-- 
  Paul Jackson
  jackson <at> fastmail.fm
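
A minimal sketch for checking the argument-length limits speculated about above. It assumes a POSIX getconf and, for the last command, GNU xargs, whose --show-limits option is a findutils extension; the variable name arg_max is only illustrative.

```shell
# Upper bound, in bytes, on the combined size of arguments and environment
# that the system accepts for a single exec call.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max bytes"

# GNU xargs only (assumption): report the buffer size xargs will really use.
# -r avoids running any command on empty input; the report goes to stderr.
xargs -r --show-limits </dev/null 2>&1 || true
```

GNU xargs often reports a command buffer much smaller than ARG_MAX, so it may split a long file list across several grep runs even when no -n option is given.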




Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Fri, 20 Sep 2024 03:45:02 GMT) Full text and rfc822 format available.

Message #11 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
Cc: 73360 <at> debbugs.gnu.org
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Thu, 19 Sep 2024 20:44:01 -0700
I suggest using xargs -t to see how 'grep' is actually being invoked. 
Then run the individual 'grep' commands that xargs -t reports, and see 
which one misbehaves (or possibly you'll find that none of the 
individual 'grep' commands are misbehaving, and the problem lies elsewhere).
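
A small sketch of the xargs -t approach suggested above, run on throwaway files. The scratch directory, the file names and the variable names are hypothetical; the real command would use the find pipeline from the original report.

```shell
# Scratch area: data files go in their own subdirectory so that the log
# files written below are not picked up by find.
tmp=$(mktemp -d)
mkdir "$tmp/data"
printf 'one\n'   > "$tmp/data/a.txt"
printf 'two\n'   > "$tmp/data/b.txt"
printf 'three\n' > "$tmp/data/c.txt"

# -t makes xargs print each command line to stderr before running it.
# Saving stderr keeps the exact grep commands so they can be rerun one by one.
find "$tmp/data" -type f -print0 | xargs -0 -n 2 -t grep -Il '.' \
    > "$tmp/matches.list" 2> "$tmp/trace.log"

# One trace line per grep invocation: 3 files at 2 per batch gives 2 runs.
invocations=$(wc -l < "$tmp/trace.log" | tr -d ' ')
matches=$(wc -l < "$tmp/matches.list" | tr -d ' ')
echo "grep was invoked $invocations time(s) and listed $matches file(s)"
cat "$tmp/trace.log"
```

Each line of trace.log can then be pasted back into the shell to see which individual grep run, if any, omits the files in question.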




Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Fri, 20 Sep 2024 13:33:02 GMT) Full text and rfc822 format available.

Message #14 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: "David G. Pickett" <dgpickett <at> aol.com>
To: "73360 <at> debbugs.gnu.org" <73360 <at> debbugs.gnu.org>, 
 Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Fri, 20 Sep 2024 13:32:03 +0000 (UTC)
[Message part 1 (text/plain, inline)]
While the output may be bulky, on Linux you can try the strace command to see exactly what it is up to.  It will show the execve() call, for instance.  You might need a bigger -s!
$ strace -f -v -s 262144 <YOUR_CMD>


Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Fri, 20 Sep 2024 13:57:01 GMT) Full text and rfc822 format available.

Message #17 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
To: "David G. Pickett" <dgpickett <at> aol.com>
Cc: "73360 <at> debbugs.gnu.org" <73360 <at> debbugs.gnu.org>
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Fri, 20 Sep 2024 10:54:31 -0300
[Message part 1 (text/plain, inline)]
I could reproduce the same issue without xargs, so I think we can take it
out of the picture:

[user <at> server folder]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print > /tmp/file.list
[user <at> server folder]$ wc -l /tmp/file.list
37443 /tmp/file.list

[user <at> server folder]$ cat /tmp/file.list | xargs -n 100 grep -Il '.' > /tmp/list1.list
[user <at> server folder]$ wc -l /tmp/list1.list
23405 /tmp/list1.list

[user <at> server folder]$ grep -Il '.' $(cat /tmp/file.list) > /tmp/list2.list
[user <at> server folder]$ wc -l /tmp/list2.list
23403 /tmp/list2.list

[user <at> server folder]$ diff /tmp/list1.list /tmp/list2.list
12268,12269d12267
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai
[user <at> server folder]$

So we can see that running "grep -Il '.' $(cat /tmp/file.list)" will also
skip those 2 files, unless the real problem is the opposite one and xargs
is somehow adding those 2 files.

Those files are PDFs:

[user <at> server folder]$ file ./apex/images/apex_ui/psd/apex_5_ui.ai
./apex/images/apex_ui/psd/apex_5_ui.ai: PDF document, version 1.5
[user <at> server folder]$ file ./apex/images/apex_ui/psd/apex-logo.ai
./apex/images/apex_ui/psd/apex-logo.ai: PDF document, version 1.5

[user <at> server folder]$ head ./apex/images/apex_ui/psd/apex_5_ui.ai
%����1.5
<</Length 39582/Subtype/XML/Type/Metadata>>stream8 0 R 209 0 R]/ON[6 0 R 7
0 R 210 0 R]/Order 211 0 R/RBGroups[]>>/OCGs[6 0 R 7 0 R 5 0 R 208 0 R 210
0 R 209 0 R]>>/Pages 3 0 R/Type/Catalog>>
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.3-c011
66.145661, 2012/02/06-14:56:27        ">
   <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about=""
            xmlns:dc="http://purl.org/dc/elements/1.1/">
         <dc:format>application/pdf</dc:format>
         <dc:title>
            <rdf:Alt>

I could also find exactly the point it breaks:

[user <at> server folder]$ cat /tmp/file.list | xargs -n 100 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 1000 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 2000 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 2871 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 2872 grep -Il '.' | wc -l
23403

I will reply shortly with the strace findings.
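
The batch-size search above can be scripted. The following is only a sketch on throwaway files, assuming GNU grep and file names without whitespace; the directory layout, the file names and the chosen batch sizes are hypothetical.

```shell
# Build a small test tree: five text files and one file with a NUL byte,
# which GNU grep -I treats as binary and skips.
tmp=$(mktemp -d)
mkdir "$tmp/data"
for i in 1 2 3 4 5; do
    printf 'text %s\n' "$i" > "$tmp/data/file$i.txt"
done
printf 'abc\000def\n' > "$tmp/data/blob.bin"
find "$tmp/data" -type f > "$tmp/file.list"

# Count the files grep reports for several batch sizes.
# If grep's verdict did not depend on batching, every count would agree.
counts=""
for n in 1 2 3 6; do
    count=$(xargs -n "$n" grep -Il '.' < "$tmp/file.list" | wc -l | tr -d ' ')
    echo "batch size $n: $count files"
    counts="$counts $count"
done
```

On the real list, a loop like this over a range of sizes would show the first batch size at which the count changes.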


Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Fri, 20 Sep 2024 14:25:01 GMT) Full text and rfc822 format available.

Message #20 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
To: "David G. Pickett" <dgpickett <at> aol.com>
Cc: "73360 <at> debbugs.gnu.org" <73360 <at> debbugs.gnu.org>
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Fri, 20 Sep 2024 11:22:17 -0300
[Message part 1 (text/plain, inline)]
OK, I found out more. After hitting the problem exactly at
"xargs -n 2872", I ran xargs again with the "-t" flag to get the
command, and noticed that the 2 missing files were exactly the last 2
on the command's file list.

grep -Il . "{ 2870 files }" ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai

Now if I run:

[user <at> server folder]$ cat /tmp/cmd1
grep -Il . ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai ... "{ 2870 files }"

[user <at> server folder]$ wc -c /tmp/cmd1
131049 /tmp/cmd1

[user <at> server folder]$ cat /tmp/cmd2
grep -Il . "{ 2870 files }" ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai
[user <at> server folder]$ wc -c /tmp/cmd2
131049 /tmp/cmd2


[user <at> server folder]$ sh /tmp/cmd1 | wc -l
1072
[user <at> server folder]$ sh /tmp/cmd2 | wc -l
1070

In other words, depending on the location on the command line where those 2
files are provided to grep, we will have a different result.

Can I run those 2 grep commands with some sort of debug flag and send them
back for analysis? The file list is exactly the same, just changing the
file order.

Thanks,
Rodrigo
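
The order test described above can be sketched as a comparison of one grep run against the same file list reversed. This assumes GNU grep and GNU tac, plus file names without whitespace; all paths and variable names are hypothetical.

```shell
# Throwaway data: two text files and one file containing a NUL byte.
tmp=$(mktemp -d)
mkdir "$tmp/data"
printf 'alpha\n' > "$tmp/data/a.txt"
printf 'beta\n'  > "$tmp/data/b.txt"
printf 'abc\000def\n' > "$tmp/data/blob.bin"
find "$tmp/data" -type f | sort > "$tmp/file.list"

# Same files, two orders. The unquoted $(...) relies on word splitting,
# so this only works when no file name contains whitespace.
grep -Il '.' $(cat "$tmp/file.list") | sort > "$tmp/forward.list"
grep -Il '.' $(tac "$tmp/file.list") | sort > "$tmp/reverse.list"

# After sorting, any difference means the verdict depended on file order.
if diff "$tmp/forward.list" "$tmp/reverse.list" > "$tmp/order.diff"; then
    order_result="same"
else
    order_result="different"
fi
echo "results are $order_result"
```

On this tiny data set the two runs agree; the report above describes a case where they did not.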


Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Sat, 21 Sep 2024 03:35:02 GMT) Full text and rfc822 format available.

Message #23 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: jackson <at> fastmail.com
To: "Rodrigo Jorge" <rodrigoaraujorge <at> gmail.com>,
 "David G. Pickett" <dgpickett <at> aol.com>
Cc: "73360 <at> debbugs.gnu.org" <73360 <at> debbugs.gnu.org>
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Fri, 20 Sep 2024 22:31:30 -0500
Rodrigo wrote:

>> [user <at> server folder]$ cat /tmp/file.list | xargs -n 2872 grep -Il '.' | wc -l
>> 23403

Since this problem reproduces with that particular /tmp/file.list,
if that file.list does not contain any confidential information
and you choose to let all of us see it, then any of us should
be able to reproduce this problem easily.

-- 
  Paul Jackson
  jackson <at> fastmail.fm




Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Sat, 21 Sep 2024 05:43:02 GMT) Full text and rfc822 format available.

Message #26 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>,
 "David G. Pickett" <dgpickett <at> aol.com>
Cc: "73360 <at> debbugs.gnu.org" <73360 <at> debbugs.gnu.org>
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Fri, 20 Sep 2024 22:41:40 -0700
On 2024-09-20 07:22, Rodrigo Jorge wrote:
> Can I run those 2 grep commands with some sort of debug flag and send them
> back for analysis? The file list is exactly the same, just changing the
> file order.

Unfortunately there's no debug flag. Of course you can run grep under 
GDB but it will require some expertise to puzzle out why the last two 
files are treated differently.

Do you see the same problem if you run in the C locale? That is, set 
LC_ALL="C" in the environment.

What does 'strace' say about grep's reading of the two files in 
question? Can you give the strace output for just those two files?

I have the sneaking suspicion that the script is assuming properties of 
'grep' that are not documented and that are not guaranteed.  grep -I's 
heuristic for determining whether a file is "binary" is designed for 
that particular grep run, and does not necessarily agree with what other 
programs think are "binary files", or even what other instances of 
'grep' think are "binary files". The strace output might help clear up 
whether this is what is happening.
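
A minimal sketch of the C-locale check asked about above, assuming a recent GNU grep. The sample file and the variable names are hypothetical; the file holds a single Latin-1 byte that is not valid UTF-8.

```shell
# Sample file: "caf" followed by byte 0xE9 (octal 351), then a newline.
tmp=$(mktemp -d)
printf 'caf\351\n' > "$tmp/latin1.txt"

# In the C locale recent GNU grep treats every byte as a valid character,
# so a file without NUL bytes is listed by -I.
c_result=$(LC_ALL=C grep -Il '.' "$tmp/latin1.txt")
echo "C locale: ${c_result:-not listed}"

# In the current locale the verdict may differ, for example under UTF-8,
# where an improperly encoded byte can make grep treat the file as binary.
cur_result=$(grep -Il '.' "$tmp/latin1.txt" || true)
echo "current locale: ${cur_result:-not listed}"
```

If the two lines differ, the locale is part of what decides whether grep considers a file binary.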




Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Sat, 21 Sep 2024 19:14:02 GMT) Full text and rfc822 format available.

Message #29 received at 73360 <at> debbugs.gnu.org (full text, mbox):

From: "David G. Pickett" <dgpickett <at> aol.com>
To: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>, 
 Paul Eggert <eggert <at> cs.ucla.edu>
Cc: "73360 <at> debbugs.gnu.org" <73360 <at> debbugs.gnu.org>
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Sat, 21 Sep 2024 19:13:03 +0000 (UTC)
[Message part 1 (text/plain, inline)]
Linux strace (like Solaris truss) is a bit less confusing than gdb, and does not need a binary that was compiled with the symbol-preserving -g option and left unstripped.  It can even start tracing running processes for which you have no source code.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sun, 22 Sep 2024 06:41:02 GMT) Full text and rfc822 format available.

Notification sent to Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>:
bug acknowledged by developer. (Sun, 22 Sep 2024 06:41:02 GMT) Full text and rfc822 format available.

Message #34 received at 73360-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
Cc: 73360-done <at> debbugs.gnu.org
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Sat, 21 Sep 2024 23:39:38 -0700
[Message part 1 (text/plain, inline)]
On 2024-09-20 22:41, Paul Eggert wrote:
> I have the sneaking suspicion that the script is assuming properties of 
> 'grep' that are not documented and that are not guaranteed.

In looking into the code a bit more, I can see some places where that is 
what is happening.

A couple of things.

First, grep 3.11 uses buffer sizes that depend on earlier files that it 
has scanned, and this affects whether grep decides later files are 
binary. This can lead to the sort of confusion that you mentioned. There 
are performance reasons to think that grep should not grow buffer sizes 
for later files merely because earlier files had very long lines, as 
huge buffers can hurt performance; so I installed onto the development 
repository on Savannah the first attached patch to fix that. As a side 
effect this may fix the symptoms you observed.

Second, 'grep' is not a good tool for determining whether a file is text 
or binary, since the definition of "text" vs "binary" is 
application-specific: grep's definition is suitable for 'grep', and 
it's problematic to use it elsewhere. I installed the second attached 
patch to try to document this better.

Hope this helps.

Boldly closing this bug as fixed; if I'm wrong we can reopen it.
[0001-grep-avoid-huge-reads.patch (text/x-patch, attachment)]
[0002-doc-warn-re-using-grep-to-detect-binary-files.patch (text/x-patch, attachment)]
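
As an illustration of an application-specific definition, the sketch below classifies a file as text when its first 8192 bytes contain no NUL byte. This is not grep's heuristic and not part of the attached patches; the helper name is_text and the 8192-byte window are assumptions made for the example.

```shell
# Hypothetical helper: succeed when the first 8192 bytes of the file
# contain no NUL byte. Both the name and the window size are assumptions.
is_text() {
    total=$(head -c 8192 -- "$1" | wc -c)
    without_nul=$(head -c 8192 -- "$1" | tr -d '\000' | wc -c)
    [ "$total" -eq "$without_nul" ]
}

# Throwaway samples: one plain text file and one file with a NUL byte.
tmp=$(mktemp -d)
printf 'plain text\n' > "$tmp/note.txt"
printf 'abc\000def\n' > "$tmp/blob.bin"

for f in "$tmp/note.txt" "$tmp/blob.bin"; do
    if is_text "$f"; then
        echo "$f: text"
    else
        echo "$f: binary"
    fi
done
```

With a check like this, the answer for each file depends only on that file's own contents, not on which files were scanned before it or on how the list was batched.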

Information forwarded to bug-grep <at> gnu.org:
bug#73360; Package grep. (Mon, 23 Sep 2024 13:00:02 GMT) Full text and rfc822 format available.

Message #37 received at 73360-done <at> debbugs.gnu.org (full text, mbox):

From: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 73360-done <at> debbugs.gnu.org
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Mon, 23 Sep 2024 09:57:41 -0300
[Message part 1 (text/plain, inline)]
Thanks, Paul.

I tried to clone and compile your latest changes from the Savannah repo, but
compiling from the master branch probably needs some extra requirements that
are beyond my knowledge, so I ended up not being able to validate the fix.
Anyway, thanks for the correction and fix implementation!

Regards,
Rodrigo


bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 22 Oct 2024 11:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 297 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.