GNU bug report logs -
#73360
Error when a long list is provided to grep with "--binary-files=without-match" option
Report forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Thu, 19 Sep 2024 14:29:04 GMT)
Acknowledgement sent to Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 19 Sep 2024 14:29:04 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello. I'm trying to use grep to get the list of all non-binary files in a
given folder. I tried with the 2.20 and the 3.11 release.
For some reason, grep is providing 2 false negatives when the list is huge.
This issue does not happen if I break the grep input with "xargs -n X".
Check below:
[opc <at> oradiff-core dbhome_1]$ grep -V
grep (GNU grep) 3.11
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.
[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 -n 100 grep -Il '.' > /tmp/list1.list
[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 grep -Il '.' > /tmp/list2.list
[opc <at> oradiff-core dbhome_1]$ diff /tmp/list1.list /tmp/list2.list
12268,12269d12267
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai
[opc <at> oradiff-core dbhome_1]$ wc -l /tmp/list1.list /tmp/list2.list
23397 /tmp/list1.list
23395 /tmp/list2.list
46792 total
The output should not show any difference.
The same issue was also reproduced in grep 2.20.
Thanks,
Rodrigo
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Fri, 20 Sep 2024 00:22:02 GMT)
Message #8 received at 73360 <at> debbugs.gnu.org (full text, mbox):
I can't reproduce this. I am running "grep (GNU grep) 3.11" and "xargs (GNU findutils) 4.10.0" on an Artix distribution.
I have a directory that has 52422 regular files in it, over twice your example with some 23397 files.
I get the same result regardless of whether or not I constrain xargs to "-n 100" arguments per exec.
Could you:
1) Check whether xargs invokes grep differently for the two files that come up different, by looking for those two missing filenames in the trace output of the xargs --verbose option.
2) Probably not helpful, but is there anything strange about these two missing files:
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai
Are their sizes and file types, from the 'file' command, similar to some of the other files?
My wild guess would be that you're hitting some unknown limit in xargs when it is invoked with an argument list that is right at the limit of what your system allows. But I wouldn't bet a cheap beer on that guess being right.
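The limits this guess refers to can be inspected directly. The following is a minimal sketch, assuming a Linux system with GNU findutils; `--show-limits` is a GNU xargs option and may not exist in other xargs implementations, so the script falls back to a message in that case.

```shell
#!/bin/sh
# Kernel limit on the combined size of arguments and environment for exec().
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX: $arg_max bytes"

# GNU xargs can report the command-line size it will actually use.
# -r keeps xargs from running a command when there is no input;
# the report itself is written to stderr, hence the 2>&1.
xargs -r --show-limits </dev/null 2>&1 ||
    echo "xargs --show-limits is not supported here"
```

Comparing the reported buffer size with the byte length of a generated command line shows how close a given batch is to the limit.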
--
Paul Jackson
jackson <at> fastmail.fm
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Fri, 20 Sep 2024 03:45:02 GMT)
Message #11 received at 73360 <at> debbugs.gnu.org (full text, mbox):
I suggest using xargs -t to see how 'grep' is actually being invoked.
Then run the individual 'grep' commands that xargs -t reports, and see
which one misbehaves (or possibly you'll find that none of the
individual 'grep' commands are misbehaving, and the problem lies elsewhere).
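As an illustration of this suggestion, here is a small sketch on throwaway files. The file names and the scratch directory are hypothetical; only `xargs -t`, `xargs -n`, and `grep -Il` come from the thread.

```shell
#!/bin/sh
# Work in a scratch directory with three small text files.
tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a.txt"
printf 'world\n' > "$tmp/b.txt"
printf 'again\n' > "$tmp/c.txt"

# -t makes xargs print each command line to stderr before running it;
# -n 2 forces more than one grep invocation so the batching is visible.
# stderr is captured into $trace and grep's own output is discarded.
trace=$(cd "$tmp" && printf '%s\n' a.txt b.txt c.txt |
    xargs -t -n 2 grep -Il . 2>&1 >/dev/null)
echo "$trace"
rm -rf "$tmp"
```

Each line of the trace is one grep invocation, which can then be copied and rerun on its own to find the one that behaves differently.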
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Fri, 20 Sep 2024 13:33:02 GMT)
Message #14 received at 73360 <at> debbugs.gnu.org (full text, mbox):
While the output may be bulky, on Linux you can try the strace command to see exactly what it is up to. It will show the execvp() call, for instance. You might need a bigger -s!
$ strace -f -v -s 262144 <YOUR_CMD>
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Fri, 20 Sep 2024 13:57:01 GMT)
Message #17 received at 73360 <at> debbugs.gnu.org (full text, mbox):
I could reproduce the same issue without xargs, so I think we can take it
out of the picture:
[user <at> server folder]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print > /tmp/file.list
[user <at> server folder]$ wc -l /tmp/file.list
37443 /tmp/file.list
[user <at> server folder]$ cat /tmp/file.list | xargs -n 100 grep -Il '.' > /tmp/list1.list
[user <at> server folder]$ wc -l /tmp/list1.list
23405 /tmp/list1.list
[user <at> server folder]$ grep -Il '.' $(cat /tmp/file.list) > /tmp/list2.list
[user <at> server folder]$ wc -l /tmp/list2.list
23403 /tmp/list2.list
[user <at> server folder]$ diff /tmp/list1.list /tmp/list2.list
12268,12269d12267
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai
[user <at> server folder]$
So we can see that running "grep -Il '.' $(cat /tmp/file.list)" also
skips those 2 files, unless the real problem is the opposite one and
xargs is somehow adding those 2 files.
Those files are PDFs:
[user <at> server folder]$ file ./apex/images/apex_ui/psd/apex_5_ui.ai
./apex/images/apex_ui/psd/apex_5_ui.ai: PDF document, version 1.5
[user <at> server folder]$ file ./apex/images/apex_ui/psd/apex-logo.ai
./apex/images/apex_ui/psd/apex-logo.ai: PDF document, version 1.5
[user <at> server folder]$ head ./apex/images/apex_ui/psd/apex_5_ui.ai
%����1.5
<</Length 39582/Subtype/XML/Type/Metadata>>stream8 0 R 209 0 R]/ON[6 0 R 7
0 R 210 0 R]/Order 211 0 R/RBGroups[]>>/OCGs[6 0 R 7 0 R 5 0 R 208 0 R 210
0 R 209 0 R]>>/Pages 3 0 R/Type/Catalog>>
<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.3-c011
66.145661, 2012/02/06-14:56:27 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:format>application/pdf</dc:format>
<dc:title>
<rdf:Alt>
I could also find exactly the point it breaks:
[user <at> server folder]$ cat /tmp/file.list | xargs -n 100 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 1000 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 2000 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 2871 grep -Il '.' | wc -l
23405
[user <at> server folder]$ cat /tmp/file.list | xargs -n 2872 grep -Il '.' | wc -l
23403
I will reply shortly with the strace findings.
On Fri, Sep 20, 2024 at 10:32 AM David G. Pickett <dgpickett <at> aol.com> wrote:
> While the output may be bulky, on Linux you can try the strace command to
> see exactly what it is up to. It will show the execvp() call, for
> instance. You might need a bigger -s!
>
> $ strace -f -v -s 262144 <YOUR_CMD>
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Fri, 20 Sep 2024 14:25:01 GMT)
Message #20 received at 73360 <at> debbugs.gnu.org (full text, mbox):
Ok, more things were discovered. After I had a problem exactly at the
"xargs -n 2872", I ran the xargs again with the "-t" flag to get the
command, and noticed that the 2 missing files were exactly the 2 last ones
on the command file list.
grep -Il . "{ 2870 files }" ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai
Now if I run:
[user <at> server folder]$ cat /tmp/cmd1
grep -Il . ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai ... "{ 2870 files }"
[user <at> server folder]$ wc -c /tmp/cmd1
131049 /tmp/cmd1
[user <at> server folder]$ cat /tmp/cmd2
grep -Il . "{ 2870 files }" ./apex/images/apex_ui/psd/apex_5_ui.ai ./apex/images/apex_ui/psd/apex-logo.ai
[user <at> server folder]$ wc -c /tmp/cmd2
131049 /tmp/cmd2
[user <at> server folder]$ sh /tmp/cmd1 | wc -l
1072
[user <at> server folder]$ sh /tmp/cmd2 | wc -l
1070
In other words, depending on the location on the command line where those 2
files are provided to grep, we will have a different result.
Can I run those 2 grep commands with some sort of debug flag and send them
back for analysis? The file list is exactly the same, just changing the
file order.
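One way to narrow down an order-dependent result like this without a debug flag is to compare what grep reports when all files are given to a single invocation against what it reports when each file gets its own grep process. The sketch below uses hypothetical scratch files in place of /tmp/file.list and assumes file names without whitespace, as in the commands above.

```shell
#!/bin/sh
# Hypothetical scratch data standing in for /tmp/file.list and its files.
tmp=$(mktemp -d)
printf 'plain text\n' > "$tmp/text.txt"
printf 'a\0b\n' > "$tmp/binary.dat"
printf '%s\n' "$tmp/text.txt" "$tmp/binary.dat" > "$tmp/file.list"

# One grep process for the whole list (names must not contain whitespace).
grep -Il . $(cat "$tmp/file.list") | sort > "$tmp/batch.out"

# One grep process per file, so no file can influence another.
while IFS= read -r f; do
    grep -Il . "$f"
done < "$tmp/file.list" | sort > "$tmp/single.out"

# Any line reported here is a file classified differently by the two runs.
differences=$(diff "$tmp/batch.out" "$tmp/single.out")
echo "differences: ${differences:-none}"
rm -rf "$tmp"
```

On the real file list, a non-empty difference would point at the files whose classification depends on what grep scanned before them.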
Thanks,
Rodrigo
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Sat, 21 Sep 2024 03:35:02 GMT)
Message #23 received at 73360 <at> debbugs.gnu.org (full text, mbox):
Rodrigo wrote:
>> [user <at> server folder]$ cat /tmp/file.list | xargs -n 2872 grep -Il '.' | wc -l
>> 23403
Since the problem reproduces with that particular /tmp/file.list, if the
list contains no confidential information and you are willing to share it,
any of us should be able to reproduce the problem easily.
--
Paul Jackson
jackson <at> fastmail.fm
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Sat, 21 Sep 2024 05:43:02 GMT)
Message #26 received at 73360 <at> debbugs.gnu.org (full text, mbox):
On 2024-09-20 07:22, Rodrigo Jorge wrote:
> Can I run those 2 grep commands with some sort of debug flag and send them
> back for analysis? The file list is exactly the same, just changing the
> file order.
Unfortunately there's no debug flag. Of course you can run grep under
GDB but it will require some expertise to puzzle out why the last two
files are treated differently.
Do you see the same problem if you run in the C locale? That is, set
LC_ALL="C" in the environment.
What does 'strace' say about grep's reading of the two files in
question? Can you give the strace output for just those two files?
I have the sneaking suspicion that the script is assuming properties of
'grep' that are not documented and that are not guaranteed. grep -I's
heuristic for determining whether a file is "binary" is designed for
that particular grep run, and does not necessarily agree with what other
programs think are "binary files", or even what other instances of
'grep' think are "binary files". The strace output might help clear up
whether this is what is happening.
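The two checks requested here could look roughly like the following. This is only a sketch: the two files are hypothetical stand-ins for the .ai files in the report, and the strace step runs only if strace is installed and allowed to trace.

```shell
#!/bin/sh
# Hypothetical stand-ins for the two .ai files from the report.
tmp=$(mktemp -d)
printf 'first file\n' > "$tmp/one.ai"
printf 'second file\n' > "$tmp/two.ai"

# Same grep command, forced into the C locale for this one invocation.
c_locale_out=$(LC_ALL=C grep -Il . "$tmp/one.ai" "$tmp/two.ai")
echo "$c_locale_out"

# If strace is usable, record only the open/read/close calls grep makes
# and show the trace lines that mention the two files.
if command -v strace >/dev/null 2>&1 &&
   strace -f -e trace=openat,read,close -o "$tmp/grep.trace" \
       grep -Il . "$tmp/one.ai" "$tmp/two.ai" >/dev/null 2>&1; then
    grep -E 'one\.ai|two\.ai' "$tmp/grep.trace"
else
    echo "strace unavailable; skipping trace"
fi
rm -rf "$tmp"
```

For the real case, the grep arguments would be the full reordered file lists, and the read calls following the open of each .ai file are the part of the trace worth comparing.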
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Sat, 21 Sep 2024 19:14:02 GMT)
Message #29 received at 73360 <at> debbugs.gnu.org (full text, mbox):
Linux strace (like Solaris truss) is a bit less confusing than gdb, and it does not require a binary that was compiled with -g and left unstripped. It can even start tracing running processes for which you have no source code.
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Sun, 22 Sep 2024 06:41:02 GMT)
Notification sent to Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>: bug acknowledged by developer. (Sun, 22 Sep 2024 06:41:02 GMT)
Message #34 received at 73360-done <at> debbugs.gnu.org (full text, mbox):
On 2024-09-20 22:41, Paul Eggert wrote:
> I have the sneaking suspicion that the script is assuming properties of
> 'grep' that are not documented and that are not guaranteed.
In looking into the code a bit more, I can see some places where that is
what is happening.
A couple of things.
First, grep 3.11 uses buffer sizes that depend on earlier files that it
has scanned, and this affects whether grep decides later files are
binary. This can lead to the sort of confusion that you mentioned. There
are performance reasons to think that grep should not grow buffer sizes
for later files merely because earlier files had very long lines, as
huge buffers can hurt performance; so I installed onto the development
repository on Savannah the first attached patch to fix that. As a side
effect this may fix the symptoms you observed.
Second, 'grep' is not a good tool for determining whether a file is text
or binary, since the definition of "text" vs "binary" is
application-specific; grep's definition is suitable for 'grep', and
it is problematic to use it elsewhere. I installed the second attached
patch to try to document this better.
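Since the distinction is application-specific, a script that needs a stable answer can state its own rule and apply it to each file independently of any other file. The following is a minimal sketch of one such rule, not the heuristic grep uses: a file counts as text if its first 8000 bytes contain no NUL byte. The function name `is_text` and the 8000-byte window are assumptions chosen for illustration.

```shell
#!/bin/sh
# is_text FILE: succeed if the first 8000 bytes of FILE contain no NUL byte.
# This rule is an illustrative assumption, not the heuristic grep -I uses.
is_text() {
    total=$(head -c 8000 -- "$1" | wc -c)
    without_nul=$(head -c 8000 -- "$1" | tr -d '\000' | wc -c)
    [ "$total" -eq "$without_nul" ]
}

# Hypothetical scratch files to exercise the rule.
tmp=$(mktemp -d)
printf 'plain text\n' > "$tmp/text.txt"
printf 'a\0b\n' > "$tmp/binary.dat"

for f in "$tmp/text.txt" "$tmp/binary.dat"; do
    if is_text "$f"; then
        echo "text: ${f##*/}"
    else
        echo "binary: ${f##*/}"
    fi
done
rm -rf "$tmp"
```

Because each file is judged only on its own bytes, the result cannot change with the order or number of files on the command line.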
Hope this helps.
Boldly closing this bug as fixed; if I'm wrong we can reopen it.
[0001-grep-avoid-huge-reads.patch (text/x-patch, attachment)]
[0002-doc-warn-re-using-grep-to-detect-binary-files.patch (text/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#73360; Package grep. (Mon, 23 Sep 2024 13:00:02 GMT)
Message #37 received at 73360-done <at> debbugs.gnu.org (full text, mbox):
Thanks, Paul.
I tried to clone and build your latest changes from the Savannah repo, but
building from the master branch seems to need extra prerequisites that are
beyond my knowledge, so I was not able to validate the fix. Anyway, thanks
for the correction and fix implementation!
Regards,
Rodrigo
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 22 Oct 2024 11:24:09 GMT)