GNU bug report logs - #22838
New 'Binary file' detection considered harmful

Package: grep

Reported by: Marcello Perathoner <marcello <at> perathoner.de>

Date: Sun, 28 Feb 2016 18:13:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


Message #38 received at 22838 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Marcello Perathoner <marcello <at> perathoner.de>,
 Paul Eggert <eggert <at> cs.ucla.edu>, 22838 <at> debbugs.gnu.org
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
Date: Mon, 29 Feb 2016 15:37:55 -0700
On 02/29/2016 01:11 PM, Marcello Perathoner wrote:

>> Yes, locale dependencies on standard behavior can be annoying.
>>
> 
> You assume that a user will only ever want to grep text files encoded in
> the machine's locale. That is not so.

You've been relying on undefined behavior, and it caught up with you.
It's the same as asking us to keep use-after-free "working" in a
multithreaded program because it has always "worked" in your older
single-threaded program when nothing was perturbing the memory between
free() and its latent use.  A latent bug in your usage is still a bug in
your usage, even if it took a change in grep's defaults to expose your
problem.

And meanwhile, newer grep 2.23 has improved the heuristics to only
complain about a binary file if it would otherwise be outputting
encoding errors (rather than blindly complaining about the encoding
error up front and stopping processing immediately), which does
alleviate some of the worst of the change caused by your undefined usage
(that is, you can still grep for valid encodings, and get reasonable
results so long as the valid text doesn't mix with lines with invalid
encodings).
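
The heuristic described above can be modeled with a short Python sketch. This is a simplified illustration and not grep's actual implementation: the function name `grep_sketch`, the default `(input)` label, and the use of UTF-8 as a stand-in for the locale's encoding are all assumptions, and it matches raw bytes rather than locale-aware characters. It only shows the idea that matching lines are printed until one of them would require outputting an encoding error.

```python
import re


def grep_sketch(data: bytes, pattern: str, name: str = "(input)",
                encoding: str = "utf-8") -> list[str]:
    """Hypothetical model of the grep 2.23 behavior described in the message."""
    regex = re.compile(pattern.encode(encoding))
    output = []
    for raw_line in data.split(b"\n"):
        if not regex.search(raw_line):
            # Non-matching lines are never printed, so their encoding
            # errors cannot trigger the diagnostic.
            continue
        try:
            output.append(raw_line.decode(encoding))
        except UnicodeDecodeError:
            # Printing this line would emit an encoding error: report the
            # file as binary and suppress all further output.
            output.append(f"Binary file {name} matches")
            break
    return output


if __name__ == "__main__":
    sample = b"gr\xc3\xbcn ok\nplain match\ngr\xfcn match\nlater match\n"
    for line in grep_sketch(sample, "match"):
        print(line)
```

In this model, valid matching lines that precede the first invalid one are still reported, which is the improvement over stopping at the first encoding error anywhere in the file.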

> 
> As a German user I have on my disk files in many encodings: utf-8,
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
> old WordStar files that used control characters inside.
> 
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.

Yes, but then you are no longer relying on undefined behavior, and
therefore have a leg to stand on if we break that behavior.
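
As an illustration of making that choice explicit in a script, here is a minimal Python sketch that builds one of the two invocations mentioned above: `grep -a`, or grep run with `LC_ALL=C`. The helper names `grep_invocation` and `run_grep` are hypothetical, and the sketch assumes a GNU grep is available on `PATH`.

```python
import os
import subprocess


def grep_invocation(pattern: str, path: str, use_c_locale: bool = False):
    """Return (argv, environment overrides) for an explicit grep call."""
    if use_c_locale:
        # Ask grep to run in the C locale instead of the user's locale.
        return ["grep", "-e", pattern, "--", path], {"LC_ALL": "C"}
    # -a (--text) tells grep to process the file as if it were text.
    return ["grep", "-a", "-e", pattern, "--", path], {}


def run_grep(pattern: str, path: str, use_c_locale: bool = False):
    """Run grep with the chosen option; requires grep on PATH."""
    argv, overrides = grep_invocation(pattern, path, use_c_locale)
    env = {**os.environ, **overrides}
    return subprocess.run(argv, env=env, capture_output=True)
```

The output is captured as bytes on purpose, since the files in question may not be valid in any single encoding.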

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

This bug report was last modified 8 years and 256 days ago.
