GNU bug report logs - #21558: checking for a binary file is not deterministic
Report forwarded to bug-grep <at> gnu.org: bug#21558; Package grep. (Fri, 25 Sep 2015 09:12:01 GMT)
Acknowledgement sent to Benno Schulenberg <bensberg <at> justemail.net>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 25 Sep 2015 09:12:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi,
When piping a certain diff into grep-2.21, it sometimes thinks
it is a binary file, and sometimes treats it as text. The latter
behaviour is expected and desired. I think grep should never
consider standard input to be binary.
For lack of a simple recipe, here is the actual use case:
wget http://http.debian.net/debian/pool/main/g/gtkorphan/gtkorphan_0.4.4.orig.tar.gz
tar -xf gtkorphan_0.4.4.orig.tar.gz
cd gtkorphan-0.4.4/
mkdir fresh
# rsync does not work for this location, so fetch each file with wget:
for lang in pt_BR bg zh_CN hr cs da nl eo fi fr de hu id it lv pl ru sr sv vi; do
    wget "http://translationproject.org/PO-files/$lang/gtkorphan-0.4.3.$lang.po" -O "fresh/$lang.po"
done
diff -ur po fresh | /usr/local/bin/grep "Only in" | grep "fi"
That last command sometimes outputs:
Only in fresh: fi.po
Only in po: Makefile.in.in
and sometimes:
Binary file (standard input) matches
(If you can't get the second output, try hitting Enter a few times
and then running the command again, and again, and again. If you
still can't get both outputs, try using the en_US.utf8 locale.)
What seems to be happening is that sometimes grep will look
far enough to see the diff between po/fr.po and fresh/fr.po
(which contains some ISO8859-1 codes), and sometimes
not. When deleting fresh/bg.po and fresh/de.po, grep will
always see those codes and will always consider the input
to be binary.
I can of course use -a to force grep to see standard input
as text, but still... I think that determining whether a file
is text or binary should be deterministic: it should always
yield the same result when the input is the same.
$ /usr/local/bin/grep --version | head -1
/usr/local/bin/grep (GNU grep) 2.21
$ grep --version | head -1
grep (GNU grep) 2.21
$ diff --version | head -1
diff (GNU diffutils) 2.8.1
$ locale
LANG=eo.utf8
LANGUAGE=en
LC_CTYPE="eo.utf8"
LC_NUMERIC="eo.utf8"
LC_TIME="eo.utf8"
LC_COLLATE="eo.utf8"
LC_MONETARY="eo.utf8"
LC_MESSAGES="eo.utf8"
LC_PAPER="eo.utf8"
LC_NAME="eo.utf8"
LC_ADDRESS="eo.utf8"
LC_TELEPHONE="eo.utf8"
LC_MEASUREMENT="eo.utf8"
LC_IDENTIFICATION="eo.utf8"
LC_ALL=
Benno
Information forwarded to bug-grep <at> gnu.org: bug#21558; Package grep. (Fri, 25 Sep 2015 18:03:02 GMT)
Message #8 received at 21558 <at> debbugs.gnu.org (full text, mbox):
Thanks for the bug report. This appears to be basically the same as
Bug#20526. An idea to fix it in a deterministic way was proposed here:
http://bugs.gnu.org/20526#35
and this seems to have been received positively, but nobody has had the
time to implement it yet. In the meantime I'll merge the two bug reports.
Information forwarded to bug-grep <at> gnu.org: bug#21558; Package grep. (Fri, 25 Sep 2015 18:55:01 GMT)
Message #13 received at 21558 <at> debbugs.gnu.org (full text, mbox):
On Fri, Sep 25, 2015, at 20:02, Paul Eggert wrote:
> Thanks for the bug report. This appears to be basically the same as
> Bug#20526.
Well, not quite. That grep will see misencoded files as binary data,
I understand. But what perplexed me is that grep would *sometimes*
see the piped data as binary, and sometimes not. How is this possible?
When I pipe the data to a file and then grep the file, it is *always*
seen as binary. Why then not the input stream? Do you understand
how it can differ from one run to the other?
> An idea to fix it in a deterministic way was proposed here:
>
> http://bugs.gnu.org/20526#35
If I understand it correctly, it would mean that in my example the
piped data would never be classified as binary because the first
grep will never output any of the misencoded bytes. Right?
If so, then that would be a pretty good change.
Benno
Information forwarded to bug-grep <at> gnu.org: bug#21558; Package grep. (Fri, 25 Sep 2015 19:18:02 GMT)
Message #16 received at 21558 <at> debbugs.gnu.org (full text, mbox):
On 09/25/2015 11:54 AM, Benno Schulenberg wrote:
> On Fri, Sep 25, 2015, at 20:02, Paul Eggert wrote:
>> Thanks for the bug report. This appears to be basically the same as
>> Bug#20526.
> Well, not quite. That grep will see misencoded files as binary data,
> I understand. But what perplexed me is that grep would *sometimes*
> see the piped data as binary, and sometimes not. How is this possible?
Grep reads the first buffer out of the pipe and decides based on that
buffer whether the input is binary. Due to timing issues the pipe's
first buffer may contain more or fewer bytes, depending on the run. The
change proposed for Bug#20526 would change grep so that it uses a
deterministic algorithm, independent of the number of bytes it happens
to get in the first input buffer.
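
To illustrate the explanation above, here is a minimal Python sketch of a first-buffer heuristic. It is a simplified model, not grep's actual code: the function names are hypothetical, and it assumes that "binary" means a NUL byte or a byte sequence that is invalid in a UTF-8 locale. The same byte stream gets a different verdict depending only on how many bytes the first read happens to return.

```python
import codecs


def chunk_looks_binary(chunk: bytes) -> bool:
    """Hypothetical check: a NUL byte or invalid UTF-8 marks the chunk as binary."""
    if b"\x00" in chunk:
        return True
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multibyte sequence cut off at the chunk end.
        decoder.decode(chunk, final=False)
    except UnicodeDecodeError:
        return True
    return False


def classify_by_first_read(stream: bytes, first_read_size: int) -> str:
    """Model a reader that judges the whole input by its first buffer only."""
    first_buffer = stream[:first_read_size]
    return "binary" if chunk_looks_binary(first_buffer) else "text"


# Three clean ASCII lines followed by one line with ISO 8859-1 bytes.
stream = b"Only in fresh: fi.po\n" * 3 + b'-msgstr "d\xe9j\xe0"\n'

print(classify_by_first_read(stream, 64))           # short first read: text
print(classify_by_first_read(stream, len(stream)))  # long first read: binary
```

With a pipe, the size of the first read depends on how much the writer has produced by the time the reader runs, which is why the verdict can change between runs. A regular file delivers the same first buffer every time, which would account for the file case always giving the same result.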
> If I understand it correctly, it would mean that in my example the
> piped data would never be classified as binary because the first
> grep will never output any of the misencoded bytes. Right?
>
Yes, that's the idea.
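
The idea confirmed above can be sketched as follows. This is only a rough model of the behaviour as summarized in this thread, and the proposal in Bug#20526 may differ in detail. The function names are hypothetical, and "binary" is again assumed to mean a NUL byte or invalid UTF-8. The verdict depends only on the lines that would be output, so the result is the same however the input is split into buffers.

```python
from typing import List


def line_is_binary(line: bytes) -> bool:
    """Hypothetical check: a NUL byte or invalid UTF-8 marks the line as binary."""
    if b"\x00" in line:
        return True
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def grep_lines(data: bytes, pattern: bytes) -> List[bytes]:
    """Emit matching lines; stop with a notice at the first binary matching line."""
    output = []
    for line in data.splitlines():
        if pattern not in line:
            continue  # non-matching lines never influence the verdict
        if line_is_binary(line):
            output.append(b"Binary file (standard input) matches")
            break
        output.append(line)
    return output


data = (
    b"Only in fresh: fi.po\n"
    b'-msgstr "d\xe9j\xe0"\n'
    b"Only in po: Makefile.in.in\n"
)

for out_line in grep_lines(data, b"Only in"):
    print(out_line.decode("utf-8"))
```

In this model a search for "Only in" never touches the misencoded line, so the input is treated as text on every run, which matches the expectation stated in the quoted question.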
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 06 Feb 2016 12:24:04 GMT)