GNU bug report logs - #21558
checking for a binary file is not deterministic


Package: grep

Reported by: Benno Schulenberg <bensberg <at> justemail.net>

Date: Fri, 25 Sep 2015 09:12:01 UTC

Severity: normal

Merged with 19230, 19985, 20526

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 21558 in the body.
You can then email your comments to 21558 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#21558; Package grep. (Fri, 25 Sep 2015 09:12:01 GMT)

Acknowledgement sent to Benno Schulenberg <bensberg <at> justemail.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 25 Sep 2015 09:12:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Benno Schulenberg <bensberg <at> justemail.net>
To: Grep <bug-grep <at> gnu.org>
Subject: checking for a binary file is not deterministic
Date: Fri, 25 Sep 2015 11:11:06 +0200

Hi,

When piping a certain diff into grep-2.21, it sometimes thinks
it is a binary file, and sometimes treats it as text.  The latter
behaviour is expected and desired.  I think grep should never
consider standard input to be binary.

For lack of a simple recipe, here is the actual use case:

  wget http://http.debian.net/debian/pool/main/g/gtkorphan/gtkorphan_0.4.4.orig.tar.gz
  tar -xf gtkorphan_0.4.4.orig.tar.gz
  cd gtkorphan-0.4.4/
  mkdir fresh
  # the command rsync does not work at this location:
  for lang in pt_BR bg zh_CN hr cs da nl eo fi fr de hu id it lv pl ru sr sv vi;  do \
    wget http://translationproject.org/PO-files/$lang/gtkorphan-0.4.3.$lang.po -O fresh/$lang.po; \
  done

  diff -ur po fresh | /usr/local/bin/grep "Only in" | grep "fi"

That last command sometimes outputs:

  Only in fresh: fi.po
  Only in po: Makefile.in.in

and sometimes:

  Binary file (standard input) matches

(If you can't get the second output, try hitting Enter a few times
and then running the command again, and again, and again.  If you
still can't get both outputs, try using the en_US.utf8 locale.)


What seems to be happening is that sometimes grep will look
far enough to see the diff between po/fr.po and fresh/fr.po
(which contains some ISO8859-1 codes), and sometimes
not.  When deleting fresh/bg.po and fresh/de.po, grep will
always see those codes and will always consider the input
to be binary.

I can of course use -a to force grep to treat standard input
as text, but still... I think that determining whether a file
is text or binary should be deterministic: it should always
yield the same result when the input is the same.


$ /usr/local/bin/grep --version | head -1
/usr/local/bin/grep (GNU grep) 2.21

$ grep --version | head -1
grep (GNU grep) 2.21

$ diff --version | head -1
diff (GNU diffutils) 2.8.1

$ locale
LANG=eo.utf8
LANGUAGE=en
LC_CTYPE="eo.utf8"
LC_NUMERIC="eo.utf8"
LC_TIME="eo.utf8"
LC_COLLATE="eo.utf8"
LC_MONETARY="eo.utf8"
LC_MESSAGES="eo.utf8"
LC_PAPER="eo.utf8"
LC_NAME="eo.utf8"
LC_ADDRESS="eo.utf8"
LC_TELEPHONE="eo.utf8"
LC_MEASUREMENT="eo.utf8"
LC_IDENTIFICATION="eo.utf8"
LC_ALL=

Benno






Information forwarded to bug-grep <at> gnu.org:
bug#21558; Package grep. (Fri, 25 Sep 2015 18:03:02 GMT)

Message #8 received at 21558 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Benno Schulenberg <bensberg <at> justemail.net>, 21558 <at> debbugs.gnu.org
Subject: Re: bug#21558: checking for a binary file is not deterministic
Date: Fri, 25 Sep 2015 11:02:21 -0700

Thanks for the bug report.  This appears to be basically the same as 
Bug#20526.  An idea to fix it in a deterministic way was proposed here:

http://bugs.gnu.org/20526#35

and this seems to have been received positively, but nobody has had the 
time to implement it yet.  In the meantime I'll merge the two bug reports.




Merged 19230 19985 20526 21558. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 25 Sep 2015 18:05:03 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#21558; Package grep. (Fri, 25 Sep 2015 18:55:01 GMT)

Message #13 received at 21558 <at> debbugs.gnu.org:

From: Benno Schulenberg <bensberg <at> justemail.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 21558 <at> debbugs.gnu.org
Subject: Re: bug#21558: checking for a binary file is not deterministic
Date: Fri, 25 Sep 2015 20:54:16 +0200

On Fri, Sep 25, 2015, at 20:02, Paul Eggert wrote:
> Thanks for the bug report.  This appears to be basically the same as 
> Bug#20526.

Well, not quite.  That grep will see misencoded files as binary data,
I understand.  But what perplexed me is that grep would *sometimes*
see the piped data as binary, and sometimes not.  How is this possible?

When I pipe the data to a file and then grep the file, it is *always*
seen as binary.  Why then not the input stream?  Do you understand
how it can differ from one run to the other?

>  An idea to fix it in a deterministic way was proposed here:
> 
> http://bugs.gnu.org/20526#35

If I understand it correctly, it would mean that in my example the
piped data would never be classified as binary because the first
grep will never output any of the misencoded bytes.  Right?
If so, then that would be a pretty good change.

Benno






Information forwarded to bug-grep <at> gnu.org:
bug#21558; Package grep. (Fri, 25 Sep 2015 19:18:02 GMT)

Message #16 received at 21558 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Benno Schulenberg <bensberg <at> justemail.net>
Cc: 21558 <at> debbugs.gnu.org
Subject: Re: bug#21558: checking for a binary file is not deterministic
Date: Fri, 25 Sep 2015 12:17:57 -0700

On 09/25/2015 11:54 AM, Benno Schulenberg wrote:
> On Fri, Sep 25, 2015, at 20:02, Paul Eggert wrote:
>> Thanks for the bug report.  This appears to be basically the same as
>> Bug#20526.
> Well, not quite.  That grep will see misencoded files as binary data,
> I understand.  But what perplexed me is that grep would *sometimes*
> see the piped data as binary, and sometimes not.  How is this possible?

Grep reads the first buffer out of the pipe and decides based on that 
buffer whether the input is binary.  Due to timing issues the pipe's 
first buffer may contain more or fewer bytes, depending on the run.  The 
change proposed for Bug#20526 would change grep so that it uses a 
deterministic algorithm, independent of the number of bytes it happens 
to get in the first input buffer.
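
The dependence on the size of the first buffer can be modelled in a few
lines of Python.  This is only a sketch under stated assumptions, not
grep's code: first_buffer_looks_binary is a hypothetical helper, "binary"
is assumed to mean "contains a NUL byte or is not valid UTF-8", and the
sample bytes are made up to resemble the diff in the report (an ASCII
line followed by a line holding one ISO 8859-1 byte).

```python
import codecs

# Made-up sample: an ASCII "Only in" line, then a line with the single
# ISO 8859-1 byte 0xE9, which is not valid UTF-8 in this position.
SAMPLE = b"Only in fresh: fi.po\n" + b'+msgstr "caf\xe9"\n'


def first_buffer_looks_binary(stream: bytes, first_read_size: int) -> bool:
    """Classify the input using only the bytes of a simulated first read.

    Assumption for illustration: a buffer counts as binary if it holds a
    NUL byte or an invalid UTF-8 sequence.  The real check in grep 2.21
    may differ in detail.
    """
    first_buffer = stream[:first_read_size]
    if b"\x00" in first_buffer:
        return True
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multibyte character that is merely cut
        # off by the end of the buffer; truly invalid bytes still raise.
        decoder.decode(first_buffer, final=False)
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # Same input, two simulated first-read sizes: a short read that ends
    # before the 0xE9 byte, and a read that delivers the whole input.
    for size in (21, len(SAMPLE)):
        print(size, first_buffer_looks_binary(SAMPLE, size))
```

With the short read the sample is classified as text, and with the full
read it is classified as binary, although the input bytes are identical.
On a real pipe the read size is set by timing rather than by a parameter,
which is where the run-to-run variation comes from.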

> If I understand it correctly, it would mean that in my example the
> piped data would never be classified as binary because the first
> grep will never output any of the misencoded bytes.  Right?
>

Yes, that's the idea.
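
As a rough illustration of that idea, the following Python sketch decides
per selected line instead of per input buffer.  It is an assumption-laden
model of the behaviour described in this exchange, not the change
proposed in Bug#20526 itself: select_lines and line_has_encoding_error
are hypothetical names, matching is a plain substring test, and an
"encoding error" is taken to mean a NUL byte or invalid UTF-8.

```python
from typing import List

BINARY_NOTICE = b"Binary file (standard input) matches"

# Made-up sample resembling the diff in the report: two ASCII "Only in"
# lines around a line that carries one ISO 8859-1 byte (0xE9).
SAMPLE = (
    b"Only in fresh: fi.po\n"
    b'+msgstr "caf\xe9"\n'
    b"Only in po: Makefile.in.in\n"
)


def line_has_encoding_error(line: bytes) -> bool:
    """Assumed check: NUL byte or invalid UTF-8 within one complete line."""
    if b"\x00" in line:
        return True
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def select_lines(stream: bytes, pattern: bytes) -> List[bytes]:
    """Return what would be printed for a substring search of the stream."""
    output = []
    for line in stream.splitlines():
        if pattern not in line:
            continue  # unselected lines never influence the classification
        if line_has_encoding_error(line):
            # Only a line that is about to be output can trigger the notice.
            output.append(BINARY_NOTICE)
            break
        output.append(line)
    return output


if __name__ == "__main__":
    for shown in select_lines(SAMPLE, b"Only in"):
        print(shown.decode("ascii"))
```

Because the decision depends only on the lines that would be output, the
result is the same however the input happens to be split across reads.
In this model, searching the sample for "Only in" yields the two ASCII
lines, while searching for "msgstr" yields the binary notice.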




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 06 Feb 2016 12:24:04 GMT)

This bug report was last modified 9 years and 138 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.