GNU bug report logs - #73360
Error when a long list is provided to grep with "--binary-files=without-match" option
[Message part 1 (text/plain, inline)]
Your message dated Sat, 21 Sep 2024 23:39:38 -0700
with message-id <3acf1f78-7ac4-4391-8d68-f8683730b085 <at> cs.ucla.edu>
and subject line Re: bug#73360: Error when a long list is provided to grep with "--binary-files=without-match" option
has caused the debbugs.gnu.org bug report #73360,
regarding Error when a long list is provided to grep with "--binary-files=without-match" option
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
73360: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=73360
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
[Message part 3 (text/plain, inline)]
Hello. I'm trying to use grep to list all non-binary files in a
given folder. I tried both the 2.20 and the 3.11 releases.
For some reason, grep produces 2 false negatives when the file list is very long.
The issue does not happen if I split the grep input with "xargs -n X".
Check below:
[opc <at> oradiff-core dbhome_1]$ grep -V
grep (GNU grep) 3.11
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.
[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 -n 100 grep -Il '.' > /tmp/list1.list
[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" -not -name "tfa_setup" -print0 2>> /tmp/error.list | xargs -0 grep -Il '.' > /tmp/list2.list
[opc <at> oradiff-core dbhome_1]$ diff /tmp/list1.list /tmp/list2.list
12268,12269d12267
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai
[opc <at> oradiff-core dbhome_1]$ wc -l /tmp/list1.list /tmp/list2.list
23397 /tmp/list1.list
23395 /tmp/list2.list
46792 total
The output should not show any difference.
The same issue was also reproduced in grep 2.20.
Thanks,
Rodrigo
[Message part 5 (message/rfc822, inline)]
[Message part 6 (text/plain, inline)]
On 2024-09-20 22:41, Paul Eggert wrote:
> I have the sneaking suspicion that the script is assuming properties of
> 'grep' that are not documented and that are not guaranteed.
In looking into the code a bit more, I can see some places where that is
what is happening.
A couple of things.
First, grep 3.11 uses buffer sizes that depend on earlier files that it
has scanned, and this affects whether grep decides later files are
binary. This can lead to the sort of confusion that you mentioned. There
are performance reasons to think that grep should not grow buffer sizes
for later files merely because earlier files had very long lines, as
huge buffers can hurt performance; so I installed onto the development
repository on Savannah the first attached patch to fix that. As a side
effect this may fix the symptoms you observed.
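As a rough sketch of how a script could sidestep that dependence on earlier files, one option is to start a separate grep process for each file, so that no buffer state carries over from one file to the next. The function name list_text_files is hypothetical, and this assumes a grep that supports -I (as GNU grep does). It is slower than batching and still relies on grep's own notion of "binary".

```shell
# Hypothetical helper: list files under a directory that grep -I
# treats as text, using one grep process per file so that earlier
# files cannot influence how later files are classified.
list_text_files() {
  find "${1:-.}" -type f -exec sh -c '
    for f do
      # -I: treat binary files as non-matching; -l: print the name only.
      # "|| :" ignores the non-zero status grep returns on no match.
      grep -Il "." -- "$f" || :
    done
  ' sh {} +
}
```

For example, "list_text_files . | sort > /tmp/list3.list" could then be compared against the batched lists with diff.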
Second, 'grep' is not a good tool for determining whether a file is text
or binary, since the definition of "text" vs "binary" is
application-specific: grep's definition is suitable for 'grep', but
it is problematic to use it elsewhere. I installed the second attached
patch to try to document this better.
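To illustrate what an application-specific definition might look like, here is a minimal sketch in which a file counts as "text" if it contains no NUL byte. That rule and the names is_nul_free and print_nul_free are assumptions made for this example, not anything grep itself does or documents.

```shell
# Hypothetical predicate: succeed if the file contains no NUL byte.
# It deletes NULs with tr and compares the result to the original;
# the two are identical only if there were no NULs to delete.
is_nul_free() {
  LC_ALL=C tr -d '\000' < "$1" | cmp -s - "$1"
}

# Print each argument that passes the predicate, one per line.
print_nul_free() {
  for f do
    if is_nul_free "$f"; then
      printf '%s\n' "$f"
    fi
  done
}
```

Because the result depends only on the contents of each file, it does not change with how many files are checked or in what order. Note that under this rule an empty file counts as text.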
Hope this helps.
Boldly closing this bug as fixed; if I'm wrong we can reopen it.
[0001-grep-avoid-huge-reads.patch (text/x-patch, attachment)]
[0002-doc-warn-re-using-grep-to-detect-binary-files.patch (text/x-patch, attachment)]
This bug report was last modified 297 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.