GNU bug report logs - #73360
Error when a long list is provided to grep with "--binary-files=without-match" option

Package: grep

Reported by: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>

Date: Thu, 19 Sep 2024 14:29:04 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#73360: closed (Error when a long list is provided to grep
 with "--binary-files=without-match" option)
Date: Sun, 22 Sep 2024 06:41:02 +0000
[Message part 1 (text/plain, inline)]
Your message dated Sat, 21 Sep 2024 23:39:38 -0700
with message-id <3acf1f78-7ac4-4391-8d68-f8683730b085 <at> cs.ucla.edu>
and subject line Re: bug#73360: Error when a long list is provided to grep with "--binary-files=without-match" option
has caused the debbugs.gnu.org bug report #73360,
regarding Error when a long list is provided to grep with "--binary-files=without-match" option
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
73360: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=73360
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Thu, 19 Sep 2024 10:49:31 -0300
[Message part 3 (text/plain, inline)]
Hello. I'm trying to use grep to get the list of all non-binary files in a
given folder. I tried with both the 2.20 and the 3.11 releases.

For some reason, grep produces two false negatives when the file list is huge.
The issue does not occur if I split the grep input with "xargs -n X".

Check below:

[opc <at> oradiff-core dbhome_1]$ grep -V
grep (GNU grep) 3.11
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.

[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" \
    -not -name "tfa_setup" -print0 2>> /tmp/error.list |
    xargs -0 -n 100 grep -Il '.' > /tmp/list1.list

[opc <at> oradiff-core dbhome_1]$ find -type f -not -path "./.patch_storage/*" \
    -not -name "tfa_setup" -print0 2>> /tmp/error.list |
    xargs -0 grep -Il '.' > /tmp/list2.list

[opc <at> oradiff-core dbhome_1]$ diff /tmp/list1.list /tmp/list2.list
12268,12269d12267
< ./apex/images/apex_ui/psd/apex_5_ui.ai
< ./apex/images/apex_ui/psd/apex-logo.ai

[opc <at> oradiff-core dbhome_1]$ wc -l /tmp/list1.list /tmp/list2.list
  23397 /tmp/list1.list
  23395 /tmp/list2.list
  46792 total

The output should not show any difference.

The same issue was also reproduced in grep 2.20.

Thanks,
Rodrigo
[Message part 5 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Rodrigo Jorge <rodrigoaraujorge <at> gmail.com>
Cc: 73360-done <at> debbugs.gnu.org
Subject: Re: bug#73360: Error when a long list is provided to grep with
 "--binary-files=without-match" option
Date: Sat, 21 Sep 2024 23:39:38 -0700
[Message part 6 (text/plain, inline)]
On 2024-09-20 22:41, Paul Eggert wrote:
> I have the sneaking suspicion that the script is assuming properties of 
> 'grep' that are not documented and that are not guaranteed.

Looking into the code a bit more, I can see some places where that is 
what is happening.

A couple of things.

First, grep 3.11 uses buffer sizes that depend on earlier files that it 
has scanned, and this affects whether grep decides later files are 
binary. This can lead to the sort of confusion that you mentioned. There 
are performance reasons to think that grep should not grow buffer sizes 
for later files merely because earlier files had very long lines, as 
huge buffers can hurt performance; so I installed onto the development 
repository on Savannah the first attached patch to fix that. As a side 
effect this may fix the symptoms you observed.
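
The dependence on earlier files can be checked with a small comparison, sketched here under assumptions: the helper names list_text_isolated, list_text_batched and compare_runs are hypothetical, and only the find, xargs and "grep -Il" invocations come from the report above. Running one grep process per file means no buffer state can carry over between files, so a line that appears in only one of the two lists points at a file whose classification depended on what was scanned before it.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of grep.
# File names containing newlines are not handled by the sort/diff steps.

# One grep process per file: no buffer state is shared between files.
list_text_isolated() {
  find "$1" -type f -print0 | xargs -0 -n 1 grep -Il '.' | sort
}

# As many files per grep process as xargs allows (the failing case in the report).
list_text_batched() {
  find "$1" -type f -print0 | xargs -0 grep -Il '.' | sort
}

# Print the files that the two runs classify differently.
# Exit status is that of diff: 0 when both lists agree.
compare_runs() {
  a=$(mktemp) && b=$(mktemp) || return 1
  list_text_isolated "$1" > "$a"
  list_text_batched "$1" > "$b"
  diff "$a" "$b"
  status=$?
  rm -f "$a" "$b"
  return "$status"
}
```

Calling compare_runs on the directory in question prints nothing when both runs agree. The isolated run starts one process per file, so it is much slower on large trees.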

Second, 'grep' is not a good tool for determining whether a file is text 
or binary. The definition of "text" vs "binary" is application-specific: 
grep's definition suits 'grep' itself, and using it elsewhere is 
problematic. I installed the second attached patch to try to document 
this better.
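
Since the definition is application-specific, a script that needs a text/binary split can state its own rule explicitly instead of borrowing grep's. A minimal sketch, assuming the application's rule is simply "a file is binary if it contains a NUL byte"; that rule and the helper name has_nul are illustrative assumptions, not something grep or the patch defines.

```shell
#!/bin/sh
# has_nul FILE: succeed if FILE contains at least one NUL byte.
# Hypothetical helper; the "NUL byte means binary" rule is an assumption.
has_nul() {
  # Refuse unreadable files so they are not misreported as binary.
  [ -r "$1" ] || return 2
  # Deleting NULs changes the byte stream only if some were present,
  # so cmp reports a difference exactly when the file has a NUL.
  if tr -d '\000' < "$1" | cmp -s - "$1"; then
    return 1   # identical after deletion: no NUL bytes
  else
    return 0   # differs: at least one NUL byte
  fi
}
```

For example, has_nul "$f" || printf '%s\n' "$f" prints the name of a readable file only when it has no NUL bytes. An application with a stricter notion of text, such as valid UTF-8 only, would need a different test.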

Hope this helps.

Boldly closing this bug as fixed; if I'm wrong we can reopen it.
[0001-grep-avoid-huge-reads.patch (text/x-patch, attachment)]
[0002-doc-warn-re-using-grep-to-detect-binary-files.patch (text/x-patch, attachment)]
