From unknown Fri Jun 20 07:21:42 2025
From: bug#22838 <22838@debbugs.gnu.org>
To: bug#22838 <22838@debbugs.gnu.org>
Subject: Status: New 'Binary file' detection considered harmful
Reply-To: bug#22838 <22838@debbugs.gnu.org>
Date: Fri, 20 Jun 2025 14:21:42 +0000

retitle 22838 New 'Binary file' detection considered harmful
reassign 22838 grep
submitter 22838 Marcello Perathoner
severity 22838 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 28 13:12:37 2016
To: bug-grep@gnu.org
From: Marcello Perathoner
Subject: New 'Binary file' detection considered harmful
Message-ID: <56D2D733.60506@perathoner.de>
Date: Sun, 28 Feb 2016 12:17:07 +0100

The new heuristic for detecting 'Binary files' should be reverted to the
old one (before 2.20), as the new one has too great a potential to make
important tasks fail silently.

One of the most important use cases of grep is processing file lists,
e.g. in the pipe: find | grep | tar. This is often done by backup
software, e.g. in the Debian package 'backup2l'.

The new behaviour of grep -- to output 'Binary file matches' after output
started -- has silently broken the 'backup2l' script and has the
potential of silently breaking many other backup scripts as well.

Test case:

   $ find /etc/ssl/certs/ | LANG= grep pem

Outcome: grep will stop with 'Binary file (standard input) matches'
after outputting a small percentage of the existing .pem files.

Expected behaviour: grep should list all .pem files.

This behaviour is particularly insidious because users may not notice
that their backup archives are a bit smaller than before or that their
backups complete a bit faster, while many thousands of files may be
missing.

Q: Why do you use LANG= ?
A: To illustrate the problem and because 'backup2l' does that.

Q: Why don't people use the -a switch?
A: People may not notice anything wrong with their backups until they
   need them.

Q: Why don't you file a bug against 'backup2l'?
A: I will. But this is such a common use case that I suspect that many
   of the backup scripts that people wrote just for themselves are now
   broken.

Q: Why don't you just set the correct locale?
A: Even then it suffices to have one bogus-encoded filename somewhere to
   break your whole backup. It is easy to catch such a file from the
   internet or from song or picture metadata.
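
The failure mode described above can be sketched in a few lines of shell. This is a hypothetical reproduction, not code taken from backup2l: the file names and the temporary list file are invented for illustration. How many lines the plain grep emits depends on the grep version and on the locales installed, so the sketch only reports that number; the -a count should cover every matching line.

```shell
#!/bin/sh
# Hypothetical file list: the second name contains byte 0xE9, which is
# valid ISO-8859-1 but is not a valid UTF-8 sequence.
list=$(mktemp)
printf 'certs/a.pem\ncerts/caf\351.pem\ncerts/b.pem\ncerts/readme.txt\n' > "$list"

# Plain grep: depending on version and locale, output may stop at the
# badly encoded line. The result is only reported, never relied upon.
plain=$(grep pem "$list" 2>/dev/null | wc -l | tr -d ' ')

# grep -a treats the input as text and prints every matching line.
safe=$(LC_ALL=C grep -a pem "$list" | wc -l | tr -d ' ')

echo "plain grep: $plain line(s); grep -a: $safe line(s)"
rm -f "$list"
```

If the two numbers differ on a given system, a pipeline built on the plain invocation would be dropping file names there.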

Regards

--
Marcello Perathoner

From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 28 17:13:53 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Marcello Perathoner, 22838@debbugs.gnu.org
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <56D37117.5060007@cs.ucla.edu>
Date: Sun, 28 Feb 2016 14:13:43 -0800
In-Reply-To: <56D2D733.60506@perathoner.de>

Marcello Perathoner wrote:
> The new behaviour of grep -- to output 'Binary file matches' after output
> started

I assume that the "new behavior" you're talking about is for grep 2.21
(2014-11-23) and later, as that's the version of grep that started
outputting "Binary file matches" due to input encoding errors. For
example, on my platform (Ubuntu 15.10), the shell command:

   LC_ALL=C awk 'BEGIN {for(i=1; i<256; i++) printf "%c %d\n", i, i}' | LC_ALL=en_US.utf8 grep 126

outputs "Binary file (standard input) matches" in grep 2.21.

These changes were put in partly due to security issues, not only having
to do with grep's internals (the old 'grep' would dump core sometimes
when given encoding errors), but also for the benefit of invokers
expecting properly encoded text.

To some extent we were stuck between a rock and a hard place here. No
matter what 'grep' does, it will do the wrong thing for some usages. But
overall we thought it better for grep's output to be valid text.

I think you can work around the problem for unfixed backup2l by setting
your system's locale to a unibyte locale where all bytes are valid. The
en_US.ISO-8859-15 locale, say. Of course backup2l should get fixed,
regardless of what we do with 'grep' or with your system locale.

> $ find /etc/ssl/certs/ | LANG= grep pem

Wouldn't the following be better?

   find /etc/ssl/certs/ -name '*.pem'

This avoids false matches like '/etc/ssl/certs/pemmican'. Alternatively:

   find /etc/ssl/certs/ -print | grep -a '\.pem$'

> It is easy to catch such a file from the internet or from song or picture metadata.

None of the above approaches will work for arbitrary file names ("off the
Internet"), because they all mishandle file names containing newlines.
backup2l needs to do something like this:

   find /etc/ssl/certs/ -name '*.pem' -print0

or like this:

   find /etc/ssl/certs/ -print0 | grep -az '\.pem$'

with remaining code using null bytes instead of newlines to terminate
file names. This is the sort of thing that backup2l should be doing.

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 12:14:14 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Paul Eggert, 22838@debbugs.gnu.org
From: Marcello Perathoner
Message-ID: <56D47C5C.8000509@perathoner.de>
Date: Mon, 29 Feb 2016 18:14:04 +0100
In-Reply-To: <56D37117.5060007@cs.ucla.edu>

On 02/28/2016 11:13 PM, Paul Eggert wrote:

> These changes were put in partly due to security issues, not only having
> to do with grep's internals (the old 'grep' would dump core sometimes
> when given encoding errors), but also for the benefit of invokers
> expecting properly encoded text.
>
> To some extent we were stuck between a rock and a hard place here. No
> matter what 'grep' does, it will do the wrong thing for some usages. But
> overall we thought it better for grep's output to be valid text.

You are driving out demons by Beelzebub.

grep is a core component of every Unix system. You cannot change the
behaviour or interface of such a fundamental tool without incurring
substantial breakage.

Keeping the old bug is far wiser than to fix it and introduce a new bug.

Copying faulty input to the output is a preferable failure mode to
dropping part of the expected output. People do not expect grep to
validate their input but they do expect grep to produce a complete
result set.

A text file with encoding problems is a text file and not a binary file.

>> $ find /etc/ssl/certs/ | LANG= grep pem
>
> Wouldn't the following be better?
>
> find /etc/ssl/certs/ -name '*.pem'

I'm not doing that.
That was just an example to show how grep now gives incorrect results.

Many more cases can be made: any process that feeds tainted
(user-provided) strings to grep can now be made to fail. Eg. a process
that greps apache logs for known exploit signatures will now fail if the
attacker sends a bogus user-agent string.

Regards

--
Marcello Perathoner

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 12:22:26 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Marcello Perathoner, Paul Eggert, 22838@debbugs.gnu.org
From: Eric Blake
Organization: Red Hat, Inc.
Message-ID: <56D47E4E.4060409@redhat.com>
Date: Mon, 29 Feb 2016 10:22:22 -0700
In-Reply-To: <56D47C5C.8000509@perathoner.de>

On 02/29/2016 10:14 AM, Marcello Perathoner wrote:
>
> A text file with encoding problems is a text file and not a binary file.

Wrong, at least according to the POSIX definition of text file. A text
file is one with no encoding errors.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 12:40:44 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Eric Blake, Paul Eggert, 22838@debbugs.gnu.org
From: Marcello Perathoner
Message-ID: <56D48298.40503@perathoner.de>
Date: Mon, 29 Feb 2016 18:40:40 +0100
In-Reply-To: <56D47E4E.4060409@redhat.com>

On 02/29/2016 06:22 PM, Eric Blake wrote:
> On 02/29/2016 10:14 AM, Marcello Perathoner wrote:
>>
>> A text file with encoding problems is a text file and not a binary file.
>
> Wrong, at least according to the POSIX definition of text file. A text
> file is one with no encoding errors.

"""
3.397 Text File

A file that contains characters organized into zero or more lines.
The lines do not contain NUL characters and none can exceed {LINE_MAX}
bytes in length, including the <newline> character. Although
POSIX.1-2008 does not distinguish between text files and binary files
(see the ISO C standard), many utilities only produce predictable or
meaningful output when operating on text files. The standard utilities
that have such restrictions always specify "text files" in their STDIN
or INPUT FILES sections.
"""

-- The Open Group Base Specifications Issue 7
   IEEE Std 1003.1, 2013 Edition
   Copyright © 2001-2013 The IEEE and The Open Group

Regards

--
Marcello Perathoner

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 12:54:58 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Marcello Perathoner, Paul Eggert, 22838@debbugs.gnu.org
From: Eric Blake
Organization: Red Hat, Inc.
Message-ID: <56D485EC.6040008@redhat.com>
Date: Mon, 29 Feb 2016 10:54:52 -0700
In-Reply-To: <56D48298.40503@perathoner.de>

On 02/29/2016 10:40 AM, Marcello Perathoner wrote:

>> Wrong, at least according to the POSIX definition of text file. A text
>> file is one with no encoding errors.
>
>
> """
> 3.397 Text File
>
> A file that contains characters organized into zero or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline> character. Although POSIX.1-2008 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html

>
> 3.206 Line
>
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
>
> 3.87 Character
>
> A sequence of one or more bytes representing a single graphic symbol or control code.
>
> Note:
> This term corresponds to the ISO C standard term multi-byte character, where a single-byte character is a special case of a multi-byte character. Unlike the usage in the ISO C standard, character here has no necessary relationship with storage space, and byte is used when storage space is discussed.
>
> See the definition of the portable character set in Portable Character Set for a further explanation of the graphical representations of (abstract) characters, as opposed to character encodings.
>

Encoding errors are not characters, but bytes. A line cannot contain
encoding errors. Therefore, a file with encoding errors is not a text
file.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 12:56:17 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Marcello Perathoner, Paul Eggert, 22838@debbugs.gnu.org
From: Eric Blake
Organization: Red Hat, Inc.
Message-ID: <56D4863E.5040205@redhat.com>
Date: Mon, 29 Feb 2016 10:56:14 -0700
In-Reply-To: <56D485EC.6040008@redhat.com>

On 02/29/2016 10:54 AM, Eric Blake wrote:

> Encoding errors are not characters, but bytes. A line cannot contain
> encoding errors. Therefore, a file with encoding errors is not a text file.

Corollary - there exist files which are text files in some locales, but
binary files in others (based on whether the locale interprets the bytes
as an encoding error or as valid characters).

Yes, locale dependencies on standard behavior can be annoying.
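
The locale dependence can be illustrated with a small sketch, assuming an iconv(1) utility is installed. Converting from an encoding to itself is used here as a validity check, relying on iconv exiting non-zero when it meets an invalid input sequence. The helper name valid_in and the sample bytes are made up for illustration; this is not how grep itself decides.

```shell
#!/bin/sh
# valid_in ENCODING: succeed if stdin decodes cleanly as ENCODING.
# (Hypothetical helper built on iconv's exit status.)
valid_in() {
    iconv -f "$1" -t "$1" >/dev/null 2>&1
}

# "cafe" with an e-acute, encoded as ISO-8859-1: the last letter is the
# single byte 0xE9, which cannot stand alone in UTF-8.
sample='caf\351\n'

if printf "$sample" | valid_in ISO-8859-1; then latin1=text; else latin1=binary; fi
if printf "$sample" | valid_in UTF-8; then utf8=text; else utf8=binary; fi

echo "ISO-8859-1: $latin1, UTF-8: $utf8"
```

The same four letters are thus "text" under one encoding and contain an encoding error under the other.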

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 14:29:32 2016
Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Marcello Perathoner, 22838@debbugs.gnu.org
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <56D49C14.2020506@cs.ucla.edu>
Date: Mon, 29 Feb 2016 11:29:24 -0800
In-Reply-To: <56D47C5C.8000509@perathoner.de>

On 02/29/2016 09:14 AM, Marcello Perathoner wrote:
> Keeping the old bug is far wiser than to fix it and introduce a new bug.

That depends on the bugs in question. The old bugs were pretty bad.

> Copying faulty input to the output is a preferable failure mode

Again, we cannot satisfy everybody. There are reasonable complaints from
users if 'grep' blasts improperly-encoded data to their terminals, or
more generally if grep's improperly-encoded output trashes other
programs that read the output. This is why grep has the -a option. It
sounds like you need grep's -a option for your application, and it
should be easy to use -a. It's not clear that -a should be the default.
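
Combining -a with the NUL-terminated file lists suggested earlier in this thread, a backup-style filter could look like the sketch below. The scratch directory and file names are hypothetical, and -print0 and -z are GNU extensions, so this assumes GNU find and GNU grep or compatible tools. The -type f test is an addition for the sketch so that the directory itself is not listed.

```shell
#!/bin/sh
# Scratch directory with a few hypothetical file names, one of which
# contains a newline.
dir=$(mktemp -d)
touch "$dir/a.pem" "$dir/b.crt" "$dir/pemmican"
touch "$dir/new
line.pem"

# -print0 and -z make NUL the record terminator, -a forces text mode,
# and the anchored pattern avoids false matches such as 'pemmican'.
# Counting the NUL bytes counts the selected file names.
count=$(find "$dir" -type f -print0 | LC_ALL=C grep -az '\.pem$' | tr -cd '\0' | wc -c | tr -d ' ')

echo "selected $count file(s)"
rm -rf "$dir"
```

In a real pipeline the filtered list would go to a consumer that accepts NUL-terminated names, for example GNU tar with --null -T -, rather than being counted.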
> any process that feeds tainted (user-provided) strings to grep can now > be made to fail. Eg. a process that greps apache logs for known > exploit signatures will now fail if the attacker sends a bogus > user-agent string. Such a process won't fail if it uses grep's -a option, or if it treats the "Binary file matches" diagnostic as an indication that there are possible attacks, or if it is run in a unibyte locale where all bytes are valid characters, or if it looks at grep's exit status. Granted, slapdash approaches that don't do any of these things will be vulnerable, but they'll be vulnerable even with older grep versions. From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 15:11:08 2016 Received: (at 22838) by debbugs.gnu.org; 29 Feb 2016 20:11:08 +0000 Received: from localhost ([127.0.0.1]:54100 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaU9s-0005Nc-EI for submit@debbugs.gnu.org; Mon, 29 Feb 2016 15:11:08 -0500 Received: from larissa.perathoner.de ([85.10.209.172]:48785) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaU9r-0005NT-3z for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 15:11:07 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=perathoner.de; s=2; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:MIME-Version:Date:Message-ID:From:References:To:Subject; bh=P0cQR9KIWGwFKfurZs6iN+pQwOBMFz9f3pCvxQRokP4=; b=MrnUp8wVVQH4r5Fjs+RvNJVJKGvjamwzqSG+9prJ5nbWova8+7C374OjZRogxacK+DjgHsI0jOZ2sDFLh6JCgiWRgLA9CLFvAcOBXREybOqaHyrvqIRFtmv/oHxv8cuEHXta9dYqYtP0x3pPfzU+blxJ2gOlWDh2RBJXQZCFQxlOieB+pA2BRWJLFaOQz6TPxtAb4gP0aXOYbXp2DUK/Km6h/qqGGpjEqkZy/9cIfGuCzjpQmwAD+p1ijhnbTuomrjRFoimFzCpRQjWhDL4Frd5OmNe62nCAVkzERQAAMCpRoiwVCMVYpya7Vv4+B/zOOGO2UEUdF7Y1PMHp; Received: from 2001-4dd0-425d-0-77e6-94a9-296f-d2f0.ipv6dyn.netcologne.de ([2001:4dd0:425d:0:77e6:94a9:296f:d2f0]) by larissa.perathoner.de with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.84) (envelope-from ) id 
1aaU9o-0000Tf-Rv; Mon, 29 Feb 2016 21:11:05 +0100 Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Eric Blake , Paul Eggert , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> From: Marcello Perathoner Message-ID: <56D4A5D6.5040709@perathoner.de> Date: Mon, 29 Feb 2016 21:11:02 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4863E.5040205@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Scanner: Spamassassin on larissa X-Spam-Level: -- X-Spam-Score: -2.9 X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) On 02/29/2016 06:56 PM, Eric Blake wrote: > On 02/29/2016 10:54 AM, Eric Blake wrote: >> Encoding errors are not characters, but bytes. A line cannot contain >> encoding errors. Therefore, a file with encoding errors is not a text file. > > Corollary - there exist files which are text files in some locales, but > binary files in others (based on whether the locale interprets the bytes > as an encoding error or as valid characters). > > Yes, locale dependencies on standard behavior can be annoying. > You assume that a user will only ever want to grep text files encoded in the machine's locale. That is not so. 
As a German user I have on my disk files in many encodings: utf-8, iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts, old WordStar files that used control characters inside. Since 2.21 I will now have to always specify -a or LC_ALL=C when grepping my files. Regards -- Marcello Perathoner From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 15:34:50 2016 Received: (at 22838) by debbugs.gnu.org; 29 Feb 2016 20:34:50 +0000 Received: from localhost ([127.0.0.1]:54129 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaUWn-0005xS-Sh for submit@debbugs.gnu.org; Mon, 29 Feb 2016 15:34:50 -0500 Received: from larissa.perathoner.de ([85.10.209.172]:49087) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaUWm-0005xK-Hl for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 15:34:48 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=perathoner.de; s=2; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:MIME-Version:Date:Message-ID:From:References:To:Subject; bh=fx+b4FMKUGUfVPjIko/+ZJx0Ij1N+fT6pisQMbQxoAc=; b=AjvexU36vvv1aSfRp9PgrsYfJ5U/NKv3ojz9XcLpCWHj3tWURyXRilDT5R0m77Afrdr/koSztqR+QX+/NqnY814jBQmQYNio0UVngOjHxQ0exr82eteAUDROU5dngrs8d9VwuefyFH2iXyHBcet2+QivwhV7iv3H/HxHBTfJkXy9XosAGO09NjBBwJMz+9otxj+NGinIGf1VJYEyXEPKWHK3nwYLw8Xjjobd3rV5m55o9OWb7AP47m+OoWIBo22ZjDGxGN4zdz12do3UWLf2Wbm3TpRMZxDT0iGZDdUtqhprIQMFYzbB4LO8SIXDrdYmHjX1+Y0S3YyjWxfY; Received: from 2001-4dd0-425d-0-77e6-94a9-296f-d2f0.ipv6dyn.netcologne.de ([2001:4dd0:425d:0:77e6:94a9:296f:d2f0]) by larissa.perathoner.de with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.84) (envelope-from ) id 1aaUWl-0001Cs-5E; Mon, 29 Feb 2016 21:34:47 +0100 Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Paul Eggert , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> 
<56D47C5C.8000509@perathoner.de> <56D49C14.2020506@cs.ucla.edu> From: Marcello Perathoner Message-ID: <56D4AB66.4010203@perathoner.de> Date: Mon, 29 Feb 2016 21:34:46 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D49C14.2020506@cs.ucla.edu> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Scanner: Spamassassin on larissa X-Spam-Level: -- X-Spam-Score: -2.9 X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) On 02/29/2016 08:29 PM, Paul Eggert wrote: > On 02/29/2016 09:14 AM, Marcello Perathoner wrote: >> Keeping the old bug is far wiser than to fix it and introduce a new bug. > > That depends on the bugs in question. The old bugs were pretty bad. > >> Copying faulty input to the output is a preferable failure mode > > Again, we cannot satisfy everybody. There are reasonable complaints from > users if 'grep' blasts improperly-encoded data to their terminals, or > more generally if grep's improperly-encoded output trashes other > programs that read the output. They would 'blast' their terminals without grep too. I don't see any grounds for a complaint like that. Grep is not a sanitizer. > This is why grep has the -a option. It > sounds like you need grep's -a option for your application, and it > should be easy to use -a. It's not clear that -a should be the default. I was lucky in that I noticed that a 17GB tar file could not be a complete backup of a 500GB drive. I was lucky because the now offending filename (the same filename that didn't bother grep for over 10 years) was early in the file list. 
If it had been late in the file list I wouldn't have noticed that a 400GB tar file was missing a few thousand files. Other people may not be that lucky and they could get understandably angry at losing their data. At least, if you must turn grep into a text file sanitizer, make the new behaviour optional. You can then tell people who complain about 'blasted' terminals to turn on that option, while other people would not blindly incur into the new bug. Regards -- Marcello Perathoner From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 17:08:24 2016 Received: (at submit) by debbugs.gnu.org; 29 Feb 2016 22:08:25 +0000 Received: from localhost ([127.0.0.1]:54216 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaVzM-0001hT-Kz for submit@debbugs.gnu.org; Mon, 29 Feb 2016 17:08:24 -0500 Received: from eggs.gnu.org ([208.118.235.92]:38508) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaUSS-0005qJ-KF for submit@debbugs.gnu.org; Mon, 29 Feb 2016 15:30:20 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aaUSM-0003X2-Nn for submit@debbugs.gnu.org; Mon, 29 Feb 2016 15:30:15 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,FREEMAIL_FROM autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:36190) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aaUSM-0003Wy-KW for submit@debbugs.gnu.org; Mon, 29 Feb 2016 15:30:14 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:57696) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aaUSL-0007XN-MX for bug-grep@gnu.org; Mon, 29 Feb 2016 15:30:14 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aaUSI-0003Uy-34 for bug-grep@gnu.org; Mon, 29 Feb 2016 15:30:13 -0500 Received: from mout.gmx.net ([212.227.15.18]:50719) by eggs.gnu.org 
with esmtp (Exim 4.71) (envelope-from ) id 1aaUSH-0003Tv-OA for bug-grep@gnu.org; Mon, 29 Feb 2016 15:30:09 -0500 Received: from mail.bru.lan ([195.225.201.123]) by mail.gmx.com (mrgmx002) with ESMTPSA (Nemesis) id 0LynHb-1Znjh10fkl-016Bfr for ; Mon, 29 Feb 2016 21:30:08 +0100 Received: from li12.bru.lan ([192.168.2.37]) by mail.bru.lan with esmtp (Exim 4.86) (envelope-from ) id 1aaUS1-0001Bf-DO for bug-grep@gnu.org; Mon, 29 Feb 2016 21:30:06 +0100 Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: bug-grep@gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> From: "Holger Bruenjes" Message-ID: <56D4AA3A.1060103@bru.lan> Date: Mon, 29 Feb 2016 21:29:46 +0100 User-Agent: Mozilla/5.0 (X11; Linux i686; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4A5D6.5040709@perathoner.de> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="DgbmpuRGUEAo4ACKfvSfbT7bIwNNPR7Qi" X-Scan-Signature: 6b24595edb5b2cc27354e14addbc81ca X-Provags-ID: V03:K0:FgUx5qkx0snV5LLS8wPfnqsa4UPiXXa9GKAy6b+RD+SsIxERYYf cRPBIc6rrSUIZa6xrRDqrI63r+d9rxZeoEBKSFf4rbEmWLusq1zQdAk5KUg1Qtwk7bR0b4a A+Tnd9QGQpjmHMRH/03ShKSOK5+Nxq9Q2Cjo1wzRj6yv0vTd72ewk3siZzIabPeLAcn3RDP frLsJs9D3QdguCbAMfhsg== X-UI-Out-Filterresults: notjunk:1;V01:K0:DNPT5pzWLbM=:jRQLS2FP7XXTIXiKYig/k3 nxTF8KbaleWdVw/RInOfbkmZ5yeWylvs1/xv61CLXnTzPPZ2pQKy2iyBbwtl1Q5gF19/y0jUr o4KlXlz/cHLQvy7q9dOLW6aWjKy2uD4HZyD2LWzxXKHzuXPUCusKDQp/TAocBZ2iNOOWSS3lu LKWEP4d8SpceZbimf41lIU3MseH0yW8py8OAkUrXZpHS79e780h1GeLvG0LZthcimTNXbAv77 Do0mKxk2Ogxdx1uUseayig8aVNXTNMM1Bp7XbFrsJSwDVU/9vhv3cvPaDpLXRRbyyKbquQxkA COMoH1o6Fjzq5bzU1r4bAt6qgvyAxPfbFCQ9mah/2s0wFEvx/kBYkI7V+EDUEVn2w+AFyEmED 
eujSw/DigDXl8FhbLyThVL47JI0/nePOD0S1JqAuajIVwfT1X6iBSIQU6aGUdL8E9WGAc1PvE r6FMB+K3RdHRv47cSwD/rkrwXYZV2BKc9rMCX7TdhchyoXxwFcAzwcgFyfQy0Rw4FyNQMWn9G avqIT3eGlvSfwVheQ0GbisExtz8gY1S/M7twVQcqcR43WljZlaPfRS/Gx+uHAsa88ZH7lxh7c Lr9n/HJdgIXB6Ccx7QDjqmaQI1Q79Diqhnng2kTh130iJHQrnGrJI5y9jSA5Kvdou5tuUjnxh Eam7T4S+8plG5QkZE5EdA633S58+bk9W/mFIDPU4T2yFf1zVS2L50uEvxPb8D8cD7cyJ9LSDG d3JvOVY5ScOEL9ilTQC3hjsG1wWmtG7pFazNv+M7XX/1dNx8vIcQsoIWg5A= X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.1 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Mon, 29 Feb 2016 17:08:24 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.1 (----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --DgbmpuRGUEAo4ACKfvSfbT7bIwNNPR7Qi Content-Type: multipart/mixed; boundary="963ABlATK6GHxCrJkcNjH9aARB4CsonD7" From: "Holger Bruenjes" To: bug-grep@gnu.org Message-ID: <56D4AA3A.1060103@bru.lan> Subject: Re: bug#22838: New 'Binary file' detection considered harmful References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> In-Reply-To: <56D4A5D6.5040709@perathoner.de> --963ABlATK6GHxCrJkcNjH9aARB4CsonD7 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Am 2016-02-29 um 21:11 schrieb Marcello Perathoner: > As a German user I have on my disk files in many encodings: utf-8,=20 > iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like = > CP850, CP847, "German 7-bit ASCII" that replaced braces 
with Umlauts,
> old WordStar files that used control characters inside.
>
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.

You can use a wrapper for grep:

    mv grep in.grep

and create a new grep file with the following:

    LC_ALL=C "/usr/bin/in.grep" "${@}"

That worked perfectly.

Holger

--963ABlATK6GHxCrJkcNjH9aARB4CsonD7-- --DgbmpuRGUEAo4ACKfvSfbT7bIwNNPR7Qi Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBAgAGBQJW1KpAAAoJEI5ZH+avTVIQwZ4P/1sGWIvCYBr7ZxNsWt9Q2rcc nRc9ZICoeUXDn03lJ4HUBbQvVMhZZ8AAyS629gHIXQArNqC6qhQXaj6zEfv0QRS2 050Ztrt7nHPiAwbke3axgTljp7vtK5esKf26dI+roFJpP+aq4G2dIaG4gFTAZOAY tahb2ZOJLOfWcCQDWoEjrPKhU51Dj0/8+hvKFPVQ+ya9jypI6iJo6aV4SWAqVrtj sKzEWKSZM522mtYi+IjI9Jvbvb26hSM7SeqpWkjrHlnP6cjpmeOx7hrtIZ9Ur/2K vyHFtHp8U3Ag13IrNTvheYo8PYwrJ1Peby3shwKiRiJxMk5RrjLRefB3+pv0Wb4X XoaPv7dc81NZmiOoEpImF8+nipiL1H+PgYfpDf6+IwsGk+xCInHmwvWLCefW93tQ H9E/CuYj8M54WXoiXpbXu5PU+o+ZgcRIlF+jKr41JAkQdixy6vR262225jB6KEDR hr8kjRByUoPJkAIexuGYj+0rzQTVnXcY24jBFwVmcqhBXon7ibfXP+BqDedpvFM7 LjqgjyMGSuY139t1G1xdA1jOsxFlsNukLRJJTjFY3WhVTp29d7h0pCU5P0ECSs0J DTywBUPvrhJf5qcT9+FCtV1z1RR9ubXKbpGOA2PBO2fKIHTg+JCfLDq125yzPreq chLkz8YN1WjZ5KomaDKv =Vz0c -----END PGP SIGNATURE----- --DgbmpuRGUEAo4ACKfvSfbT7bIwNNPR7Qi-- From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 17:38:01 2016 Received: (at 22838) by debbugs.gnu.org; 29 Feb 2016 22:38:01 +0000 Received: from localhost ([127.0.0.1]:54241 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaWS1-0002QU-9g for submit@debbugs.gnu.org; Mon, 29 Feb 2016 17:38:01 -0500 Received: from mx1.redhat.com ([209.132.183.28]:52610) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaWRz-0002QL-7n for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 17:37:59 -0500 Received: from
int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (Postfix) with ESMTPS id 1074AC00B8C6; Mon, 29 Feb 2016 22:37:57 +0000 (UTC) Received: from [10.3.113.165] (ovpn-113-165.phx2.redhat.com [10.3.113.165]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u1TMbuq8001112; Mon, 29 Feb 2016 17:37:56 -0500 Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Marcello Perathoner , Paul Eggert , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> From: Eric Blake Openpgp: url=http://people.redhat.com/eblake/eblake.gpg X-Enigmail-Draft-Status: N1110 Organization: Red Hat, Inc. Message-ID: <56D4C843.4040405@redhat.com> Date: Mon, 29 Feb 2016 15:37:55 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4A5D6.5040709@perathoner.de> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="WjIBwcrCoFVIcIIHHTjgVhaDa9kAOWrvq" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --WjIBwcrCoFVIcIIHHTjgVhaDa9kAOWrvq Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 02/29/2016 01:11 PM, Marcello Perathoner wrote: >> Yes, locale dependencies on standard behavior can be annoying. 
>>
>
> You assume that a user will only ever want to grep text files encoded in
> the machine's locale. That is not so.

You've been relying on undefined behavior, and it caught up with you.
It's the same as asking for us to keep use-after-free "working" in a
multithreaded program because it has always "worked" in your older
single-threaded program when nothing was perturbing the memory between
free() and its latent use. A latent bug in your usage is still a bug in
your usage, even if it took a change in grep's defaults to expose your
problem.

And meanwhile, newer grep 2.23 has improved the heuristics to only
complain about a binary file if it would otherwise be outputting
encoding errors (rather than blindly complaining about the encoding
error up front and stopping processing immediately), which does
alleviate some of the worst of the change caused by your undefined usage
(that is, you can still grep for valid encodings, and get reasonable
results so long as the valid text doesn't mix with lines with invalid
encodings).

>
> As a German user I have on my disk files in many encodings: utf-8,
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
> old WordStar files that used control characters inside.
>
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.

Yes, but then you are no longer relying on undefined behavior, and
therefore have a leg to stand on if we break that behavior.
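To make the locale dependence discussed above concrete: the same bytes can be well-formed in one encoding and an encoding error in another. The following is only a sketch. It uses iconv as a stand-in validity check (an illustrative choice, grep itself is not involved), and the helper name is_valid_in is made up for this example.

```shell
#!/bin/sh
# is_valid_in ENCODING: hypothetical helper (not part of grep or iconv).
# Succeeds if standard input is well-formed in ENCODING, judged solely by
# whether iconv can decode it.
is_valid_in() {
    iconv -f "$1" -t UTF-8 >/dev/null 2>&1
}

# 'caf' followed by byte 0xE9: a complete word in ISO-8859-1,
# but a malformed sequence when read as UTF-8.
if printf 'caf\351\n' | is_valid_in ISO-8859-1; then
    echo "ISO-8859-1: valid text"
fi
if ! printf 'caf\351\n' | is_valid_in UTF-8; then
    echo "UTF-8: encoding error"
fi
```

Whether grep then reports such a file as binary depends on the locale it runs in, which is why the thread keeps coming back to -a.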
--=20 Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --WjIBwcrCoFVIcIIHHTjgVhaDa9kAOWrvq Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCAAGBQJW1MhEAAoJEKeha0olJ0NqOhYH/1OVlOC0wipLmNirPwbgmuYk 7JpVFv/XheuCLVsRZ1sERdYjxDLHlmvePUJiJ7CKS0q790YM3Z+/GeGqpxqMCr/h LyxzZPVUZaK1r7ey5kk20yOaGKJMw4tLny4RAzVFSehNju5EinYxijnwWW4VTJlp vH/jTyeAhThU7fB1Fz8KhRJUAZC0yMlCkQ9w5iFOcElVoeROYiHXhjb3v71AhPdT C16sMqLd3kQM55gMCe1bHLbzikV9XgaEvrsIUl+wBLkosqYRWhEVbiFK5c9wnHHb K8XsQWsKjJLY7ajesZvSX7yna/SN7BtjvOs7Q7/BV6KmL4AeXeo2Govx4gVSZmk= =uMs9 -----END PGP SIGNATURE----- --WjIBwcrCoFVIcIIHHTjgVhaDa9kAOWrvq-- From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 18:35:28 2016 Received: (at 22838) by debbugs.gnu.org; 29 Feb 2016 23:35:28 +0000 Received: from localhost ([127.0.0.1]:54305 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaXLc-0003q2-75 for submit@debbugs.gnu.org; Mon, 29 Feb 2016 18:35:28 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:41853) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaXLa-0003pm-4U for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 18:35:26 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 28FBE160F53; Mon, 29 Feb 2016 15:35:20 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id w9EgO9RF9akG; Mon, 29 Feb 2016 15:35:19 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 70258160FDA; Mon, 29 Feb 2016 15:35:19 -0800 (PST) X-Virus-Scanned: amavisd-new at 
zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id QZ8T6xYHl0eP; Mon, 29 Feb 2016 15:35:19 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 58212160F53; Mon, 29 Feb 2016 15:35:19 -0800 (PST) Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Marcello Perathoner , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D49C14.2020506@cs.ucla.edu> <56D4AB66.4010203@perathoner.de> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <56D4D5B7.6040903@cs.ucla.edu> Date: Mon, 29 Feb 2016 15:35:19 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4AB66.4010203@perathoner.de> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On 02/29/2016 12:34 PM, Marcello Perathoner wrote: > On 02/29/2016 08:29 PM, Paul Eggert wrote: > > They would 'blast' their terminals without grep too. Sure, but in practice it's common for users to do something like this: grep -r getaddrinfo_a * I just now did this in my working copy of the GNU Emacs source code. If -a were the default, I would see 13874778 bytes on my screen, the vast majority of which would be useless or even harmful. As grep stands now, I see just 5480 bytes and they're mostly useful. > I was lucky in that I noticed that a 17GB tar file could not be a > complete backup of a 500GB drive. 
Yes, you were lucky there. But you were unlucky in that your backup software invoked grep without worrying about file name validity. Suppose a file name contained a newline? Your backups could be toast. > At least ... make the new behaviour optional. It is optional; we merely disagree about the option's default value. > Since 2.21 I will now have to always specify -a or LC_ALL=C when > grepping my files. I suggest using -a. LC_ALL=C won't work the way that you want on platforms where the C locale is UTF-8, or is pure ASCII. For example, on Fedora 23 or RHEL 7 with grep 2.23 we have: $ printf '\200\n' | LC_ALL=C grep . Binary file (standard input) matches This is because the C locale is pure ASCII on these platforms, i.e., '\200' is not a valid character the way it is with traditional Unix. I don't know why Red Hat made that change. From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 18:55:21 2016 Received: (at 22838) by debbugs.gnu.org; 29 Feb 2016 23:55:21 +0000 Received: from localhost ([127.0.0.1]:54319 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaXer-0004JP-EP for submit@debbugs.gnu.org; Mon, 29 Feb 2016 18:55:21 -0500 Received: from mx1.redhat.com ([209.132.183.28]:34677) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaXep-0004JB-Ib for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 18:55:20 -0500 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by mx1.redhat.com (Postfix) with ESMTPS id 6A68AC0005D1; Mon, 29 Feb 2016 23:55:15 +0000 (UTC) Received: from [10.3.113.165] (ovpn-113-165.phx2.redhat.com [10.3.113.165]) by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u1TNtEgY009106; Mon, 29 Feb 2016 18:55:15 -0500 Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Paul Eggert , Marcello Perathoner , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> 
<56D47C5C.8000509@perathoner.de> <56D49C14.2020506@cs.ucla.edu> <56D4AB66.4010203@perathoner.de> <56D4D5B7.6040903@cs.ucla.edu> From: Eric Blake Openpgp: url=http://people.redhat.com/eblake/eblake.gpg Organization: Red Hat, Inc. Message-ID: <56D4DA62.8020403@redhat.com> Date: Mon, 29 Feb 2016 16:55:14 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4D5B7.6040903@cs.ucla.edu> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="wXkiidHoLkUPvHEhoKnr9r4ds6tKkiIEM" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --wXkiidHoLkUPvHEhoKnr9r4ds6tKkiIEM Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable

On 02/29/2016 04:35 PM, Paul Eggert wrote:

> I suggest using -a. LC_ALL=C won't work the way that you want on
> platforms where the C locale is UTF-8, or is pure ASCII. For example, on
> Fedora 23 or RHEL 7 with grep 2.23 we have:
>
> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e.,
> '\200' is not a valid character the way it is with traditional Unix. I
> don't know why Red Hat made that change.
I _think_ the Austin Group is leaning towards requiring the "C" locale to always be a unibyte locale with all 256 bytes as valid characters, so neither strict 7-bit ASCII nor UTF-8 would be usable as the "C" locale; but for that to happen, POSIX would also need to allow a way to get a UTF-8 locale easily accessible and describe how it differs from the "C" locale under such a ruling. But it's still all conjecture on what the final results will be - even in the standards committee, gracefully documenting how locale corner cases must behave vs. leaving implementations some latitude is tricky business; and any such change is at least 3 or 4 years down the road before it could be standardized in Issue 8 (right now, the focus is on Technical Corrigendum 2 for Issue 7).= --=20 Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --wXkiidHoLkUPvHEhoKnr9r4ds6tKkiIEM Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCAAGBQJW1NpiAAoJEKeha0olJ0Nq4I8H/R+kYocQSM/18NpHXc/eNFut 6FMksCLuzffLtVp229TX+43HgSm2Qwds7kj446IQVnodzGZYz/JNLfqUZRHFLgVV tObQgOezZcaOSqMlkPT+VGDW6sCwfY5y6sZUb4arxXIwx2REJIIuX1vauyMThDoA QVKszp7Sw5v9uCrVA2wxmxAKbrPYOujayHh48+NZpN8PjJGSThpmhW/YojEX8gVr PFA5pkuLNZ13vkH9n36yfbjsjiGanyJujMj7dzoiIG1rBHdbHj1ED4wujSPph2X+ iOzgUXTipHqcDp1Cl2AISorrWk1p1dihP1BXJ2k0p8I8TpmYbRaujJk2MsanM8w= =nc0Z -----END PGP SIGNATURE----- --wXkiidHoLkUPvHEhoKnr9r4ds6tKkiIEM-- From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 20:24:13 2016 Received: (at 22838) by debbugs.gnu.org; 1 Mar 2016 01:24:13 +0000 Received: from localhost ([127.0.0.1]:54383 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 
1aaZ2r-0008C0-5I for submit@debbugs.gnu.org; Mon, 29 Feb 2016 20:24:13 -0500 Received: from mail-oi0-f42.google.com ([209.85.218.42]:34512) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaZ2p-0008Bh-6s for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 20:24:11 -0500 Received: by mail-oi0-f42.google.com with SMTP id m82so118576572oif.1 for <22838@debbugs.gnu.org>; Mon, 29 Feb 2016 17:24:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=D43HHz2KfmH17kKaZmFvVMCF+x7aKXuGxUzllDRS2bY=; b=epf+dPRm69M31paLHDCVpBYd9MWZF4rio4ahy8O+Fuo6pUdASugHHCcj28H/r7Mvlo 2NToNV9z9OkP+sa1uvgOSd2llJxTRxb1xU7d29KZqEPMc/x5XOwEAvmkwVDpBA7l24lL 9dGPO0AB6N/IFH2TlTlPcZdHyfIlHoyY4jGjmibQVve0vklcuHezPHA3Pj7J27FNv/L2 XiIcQNCjVOoUWdjOv5IXT2Usq3ombHqQo0A0wvr8sFOZm8oNIAvq9xif2ot23XZtssM5 g8Tht4fFzzbPAoMWdzvkiM+IOMez5FSUb84SeZrs+6CSRYvgQokqBUxz/9PJP4yGK4OO seww== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=D43HHz2KfmH17kKaZmFvVMCF+x7aKXuGxUzllDRS2bY=; b=dPSsGkwfpTEJuZbcWb+FVsd2Oq2NdH3igAuY40XVu2arJRWxSYLymHunMJATQiP+jG k/9oUafs90tR+GDNtUFjZVcQPIph71+iUX+nseIr2wu7vQHIfsXEpxBIIvcCvAFG2H/s pFJ/avq3UVGQpSACrMBWhXtbpIPvnQSFed5STeMKvlXYMOuwAMHtnBFThAwTG9RLitQR uIP+lZiMvpjkbpBYl3U2F2iU6+E2tbxcJx0EJYrRjORgFDy948VbS8o/HMq+HIbLcS9M DmsynBcRcTKrL5Fowq877dD89yRxDjj47m0n0RnJMaUC6S0MCjouSRiCOjRnLCjwnvep rgrQ== X-Gm-Message-State: AD7BkJJi1Zk141sX6VIHZodhaHoVsE3LtXPdT7zhXfmDg/3gaeGle+XHn906ZWJsY+ummXDqBbFzOZtKNejcOg== X-Received: by 10.202.84.82 with SMTP id i79mr13967578oib.130.1456795445652; Mon, 29 Feb 2016 17:24:05 -0800 (PST) MIME-Version: 1.0 Received: by 10.202.44.194 with HTTP; Mon, 29 Feb 2016 17:23:45 -0800 (PST) In-Reply-To: <56D4D5B7.6040903@cs.ucla.edu> References: <56D2D733.60506@perathoner.de> 
<56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D49C14.2020506@cs.ucla.edu> <56D4AB66.4010203@perathoner.de> <56D4D5B7.6040903@cs.ucla.edu> From: Jim Meyering Date: Mon, 29 Feb 2016 17:23:45 -0800 X-Google-Sender-Auth: gznx7TB4u5FTOLBYTEPCX-ahS8M Message-ID: Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Paul Eggert Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.5 (/) X-Debbugs-Envelope-To: 22838 Cc: 22838@debbugs.gnu.org, Marcello Perathoner X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) On Mon, Feb 29, 2016 at 3:35 PM, Paul Eggert wrote: > On 02/29/2016 12:34 PM, Marcello Perathoner wrote: ... >> Since 2.21 I will now have to always specify -a or LC_ALL=C when >> grepping my files. > > I suggest using -a. LC_ALL=C won't work the way that you want on platforms > where the C locale is UTF-8, or is pure ASCII. For example, on Fedora 23 or > RHEL 7 with grep 2.23 we have: > > $ printf '\200\n' | LC_ALL=C grep . > Binary file (standard input) matches > > This is because the C locale is pure ASCII on these platforms, i.e., '\200' > is not a valid character the way it is with traditional Unix. I don't know > why Red Hat made that change. Wow. I hadn't noticed that using LC_ALL=C is inadequate. Disturbing... 
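As an illustration of the advice quoted above (prefer -a over LC_ALL=C), here is a minimal sketch of a wrapper that does not depend on how the platform defines the C locale. The function name grep_text is hypothetical and not part of grep; the sketch assumes only that the installed grep accepts -a, as GNU grep does.

```shell
#!/bin/sh
# grep_text: hypothetical helper (not part of grep) that always passes -a,
# so input is processed as text whatever the current locale considers a
# valid character.
grep_text() {
    grep -a "$@"
}

# Example: byte 0x80 is not valid UTF-8, yet with -a the matching line is
# written out as-is instead of being replaced by a "Binary file" notice.
printf 'abc\200 ok\n' | grep_text ok
```

A shell function (or an alias) has the side benefit of leaving the installed grep binary untouched for scripts that expect the default behaviour.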
From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 21:24:43 2016 Received: (at 22838) by debbugs.gnu.org; 1 Mar 2016 02:24:43 +0000 Received: from localhost ([127.0.0.1]:54477 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaZzP-0002wI-GH for submit@debbugs.gnu.org; Mon, 29 Feb 2016 21:24:43 -0500 Received: from mail-ob0-f179.google.com ([209.85.214.179]:36543) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aaZzN-0002w5-Lk for 22838@debbugs.gnu.org; Mon, 29 Feb 2016 21:24:41 -0500 Received: by mail-ob0-f179.google.com with SMTP id jj9so2206146obb.3 for <22838@debbugs.gnu.org>; Mon, 29 Feb 2016 18:24:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to:content-transfer-encoding; bh=BL1krdbbjqFbF45loGV+EmqfnhsX+9ujsVshpby9Gx8=; b=YAZEg2h74I6V7EkiRmpWNMC/iEgB92xJirQCGvinIjHJ/NCxZLpHWXYFLwR7TAm9GQ gc28IPfnK4fqNs0ZYeZ7jv1Mx/cRwoQEbex/bHIQ2exVbvJK/0xpI+oa/Wx1jOIP7YBx IJhQ6+pj8+JzuJEsg3CzPTeLBAXLmwUt7fT2rR5uPf4xNOGssp1MAvfLYViiBZqfU67L gg69cPZzOyxU2Zl4/rzWj3xYBNuZ4q7lQ8JAH0iFt0IwDsFVlznZj7E7dqPvFABII9US XJ8qih8wnX9/KC0Q3FdYBVN/tt57olPocUcRsyfMea2Nz6sohi/v/wHnWU5caSD4wk1m GBNw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-transfer-encoding; bh=BL1krdbbjqFbF45loGV+EmqfnhsX+9ujsVshpby9Gx8=; b=SsW/PltEvkj766/Oy6U4H9rlzSWqNuTrX7EigpzsUBdk5f0lM1Ha9NN7cwJOjvmstF SdxjCajcZGELRNQaEge7UKiLSpZyzWCBYHBNWcRRBhe3zbFgbAgpnJFDhtKoUkOrxtLh CsAE4ZiqlP3ew2EJRlXSsWIKDJcYG+Q7o/dr64nm3lqN7HJrmk7yzHKEK6rjOI3g/R47 Jf2vzd4mh0zZEudG3y2IMNg5MALX59kWM51NoY2N3m4mn2k/4UKvmkbNC75DOxnH1457 pr61msy0yEHr3cNwq4MpPiuIGwpa18/6SnCFIIXwYSUIU8FTj/o5rlRYnGM5PjgQi2TT G6xg== X-Gm-Message-State: AD7BkJK7enQvlJjcQD8k8oT3SNhbCgwNjj3Awm9KSdutb8MlOpG5g/cJLZ873LizwS9ncA== X-Received: by 
10.182.55.10 with SMTP id n10mr14139196obp.68.1456799076048; Mon, 29 Feb 2016 18:24:36 -0800 (PST) Received: from [192.168.0.76] (cpe-70-123-244-133.satx.res.rr.com. [70.123.244.133]) by smtp.gmail.com with ESMTPSA id x4sm20072046oek.17.2016.02.29.18.24.35 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 29 Feb 2016 18:24:35 -0800 (PST) Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Paul Eggert , Marcello Perathoner , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D49C14.2020506@cs.ucla.edu> <56D4AB66.4010203@perathoner.de> <56D4D5B7.6040903@cs.ucla.edu> From: Bruce Dubbs Message-ID: <56D4FD62.6020703@gmail.com> Date: Mon, 29 Feb 2016 20:24:34 -0600 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0 SeaMonkey/2.39 MIME-Version: 1.0 In-Reply-To: <56D4D5B7.6040903@cs.ucla.edu> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

Paul Eggert wrote:

>   $ printf '\200\n' | LC_ALL=C grep .
>   Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e.,
> '\200' is not a valid character the way it is with traditional Unix. I
> don't know why Red Hat made that change.

I also get the 'Binary file (standard input) matches' output from the
above string on a Linux From Scratch system. We build everything in a
fairly generic way and did nothing special in this area. I suspect this
is something buried deep into glibc.

-- Bruce From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 29 23:02:15 2016 Received: (at submit) by debbugs.gnu.org; 1 Mar 2016 04:02:15 +0000 Received: from localhost ([127.0.0.1]:54564 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aabVn-0000Lf-1o for submit@debbugs.gnu.org; Mon, 29 Feb 2016 23:02:15 -0500 Received: from eggs.gnu.org ([208.118.235.92]:49975) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aabVm-0000HX-0m for submit@debbugs.gnu.org; Mon, 29 Feb 2016 23:02:14 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aabVg-0003zX-0K for submit@debbugs.gnu.org; Mon, 29 Feb 2016 23:02:08 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:55043) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aabVf-0003zS-UH for submit@debbugs.gnu.org; Mon, 29 Feb 2016 23:02:07 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40940) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aabVf-0004Ij-1L for bug-grep@gnu.org; Mon, 29 Feb 2016 23:02:07 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aabVb-0003yw-SC for bug-grep@gnu.org; Mon, 29 Feb 2016 23:02:06 -0500 Received: from smtp01.mail.online.nl ([194.134.25.71]:20727) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aabVb-0003vf-MW for bug-grep@gnu.org; Mon, 29 Feb 2016 23:02:03 -0500 Received: from [192.168.1.65] (s51447d83.adsl.online.nl [81.68.125.131]) by smtp01.mail.online.nl (Postfix) with ESMTP id A6A3740026 for ; Tue, 1 Mar 2016 05:01:55 +0100 (CET) Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: bug-grep@gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> 
<56D47C5C.8000509@perathoner.de> <56D49C14.2020506@cs.ucla.edu> <56D4AB66.4010203@perathoner.de> <56D4D5B7.6040903@cs.ucla.edu> <56D4DA62.8020403@redhat.com> From: Hans Pelleboer Message-ID: <56D51433.4040206@online.nl> Date: Tue, 1 Mar 2016 05:01:55 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4DA62.8020403@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----)

On 03/01/2016 12:55 AM, Eric Blake wrote:
> I _think_ the Austin Group is leaning towards requiring the "C" locale
> to always be a unibyte locale with all 256 bytes as valid characters,
> so neither strict 7-bit ASCII nor UTF-8 would be usable as the "C"
> locale; but for that to happen, POSIX would also need to allow a way
> to get a UTF-8 locale easily accessible and

You do realize that this leaves all _non-US_users_, who rely on
diacritics or even different character sets entirely for their language,
completely out in the cold.

> describe how it differs from the "C" locale under such a ruling. But
> it's still all conjecture on what the final results will be - even in
> the standards committee, gracefully documenting how locale corner
> cases must behave vs. leaving implementations some latitude is tricky
> business; and any such change is at least 3 or 4 years down the road
> before it could be standardized in Issue 8 (right now, the focus is on
> Technical Corrigendum 2 for Issue 7).

Already back in _1987_, an IT professor in Leiden was especially
appointed for the streamlining of all the competing character sets that
later were merged to become Unicode. Given the current state of affairs,
nearly thirty years down the road, I do not share your optimism that this
issue will be resolved in the next couple of years.

Hans Pelleboer

From debbugs-submit-bounces@debbugs.gnu.org Tue Mar 01 05:05:26 2016 Received: (at 22838) by debbugs.gnu.org; 1 Mar 2016 10:05:27 +0000 Received: from localhost ([127.0.0.1]:54736 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aahBG-0000r7-NX for submit@debbugs.gnu.org; Tue, 01 Mar 2016 05:05:26 -0500 Received: from larissa.perathoner.de ([85.10.209.172]:57750) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aahBE-0000qy-RW for 22838@debbugs.gnu.org; Tue, 01 Mar 2016 05:05:25 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=perathoner.de; s=2; h=Content-Transfer-Encoding:Content-Type:In-Reply-To:MIME-Version:Date:Message-ID:From:References:To:Subject; bh=5ambc/xzPc8q1eYDhDowFpv4LAsb3Jf4SiH68Kkqqos=; b=KmmCe0IdtW1Zq3sTqaKn854Pm664Y4GzDSD6VsqHOk0xBjccOEn7O1dhFoJKMbtHipvtltLR8Gb5WjEH2+d5GC+An9TEOVrjuxJmt24zfDnS9vbR30ql7mxEghgC2GMyYLlA2pWLE3/G5ESCAokEZiEYHzrt+Uk5Q9BgXP6VwJj0OFwUeUe1vs2ZZ1Nii8N2T1De+fjNUix/gcmHD6wPsw1J/oOjJPNh6JqQxjLW8qy5yQWFAZD826khjQND++ZTwAxTDjPiUwFm7q9xcNsxBU1vEE0FRZU/QQfT61pnChRYI/Za6uPzAfyddHJNnt7pYcNl+YfOwj8XodoT; Received: from zappa.cceh.uni-koeln.de ([134.95.65.225]) by larissa.perathoner.de with esmtpsa (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.84) (envelope-from ) id 1aahBB-0006Wr-Sr; Tue, 01 Mar 2016 11:05:22 +0100 Subject: Re: bug#22838: New 'Binary file' detection considered harmful
To: Eric Blake , Paul Eggert , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> <56D4C843.4040405@redhat.com> From: Marcello Perathoner Message-ID: <56D56961.4060904@perathoner.de> Date: Tue, 1 Mar 2016 11:05:21 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Icedove/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D4C843.4040405@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Scanner: Spamassassin on larissa X-Spam-Level: -- X-Spam-Score: -2.9 X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/)

On 02/29/2016 11:37 PM, Eric Blake wrote:
> On 02/29/2016 01:11 PM, Marcello Perathoner wrote:
>
>>> Yes, locale dependencies on standard behavior can be annoying.
>>>
>>
>> You assume that a user will only ever want to grep text files encoded in
>> the machine's locale. That is not so.
>
> You've been relying on undefined behavior, and it caught up with you.

(The backup2l author has been relying. I'm just a user of that package
and I already filed a bug against backup2l too.)

You confuse 'undefined' with 'undocumented'.

The old behaviour was very well defined, even if it could turn out nasty.
It was defined by implementation: it was a de-facto standard.

OTOH it was nowhere documented that grepping non-locale files was
considered marginal or illegal. The old documentation explicitly stated:

"""
If the first few bytes of a file indicate that the file contains binary
data, assume that the file is of type TYPE. By default, TYPE is binary,
and grep normally outputs either a one-line message saying that a binary
file matches, or no message if there is no match.
"""

--- from an old man page

The new behaviour changes documented old behaviour.

Furthermore there's no need to fix the old bug in such a heavy-handed
way. Less disrupting alternatives:

1) Make the new behaviour an opt-in. Print a deprecation warning that
gives people a chance to fix their scripts. After a while make the new
behaviour the default.

2) If you just output

   binary line 42 in file x matches

and continue regular output after the next newline, the breakage would
be much more confined.

3) Fail in the old documented way of printing only the error message
instead of introducing a new mode of failure that looks like success
and loses the error message in the noise.

4) Don't implement this change between minor releases. A breaking
change deserves a major release.

Regards

--
Marcello Perathoner

From debbugs-submit-bounces@debbugs.gnu.org Tue Mar 01 12:14:18 2016 Received: (at 22838) by debbugs.gnu.org; 1 Mar 2016 17:14:18 +0000 Received: from localhost ([127.0.0.1]:56576 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aansI-0008VA-Gr for submit@debbugs.gnu.org; Tue, 01 Mar 2016 12:14:18 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:50724) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aansG-0008Uu-FY for 22838@debbugs.gnu.org; Tue, 01 Mar 2016 12:14:17 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 7C2F7160E44; Tue, 1 Mar 2016 09:14:10 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id pb9UvQ-grHpC; Tue, 1 Mar 2016 09:14:08 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id B7135160FDB; Tue, 1 Mar 2016 09:14:08 -0800 (PST) X-Virus-Scanned:
amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id fjIOkhr8Wt_3; Tue, 1 Mar 2016 09:14:08 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 9D458160E44; Tue, 1 Mar 2016 09:14:08 -0800 (PST) Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Marcello Perathoner , Eric Blake , 22838@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> <56D4C843.4040405@redhat.com> <56D56961.4060904@perathoner.de> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <56D5CDE0.6020501@cs.ucla.edu> Date: Tue, 1 Mar 2016 09:14:08 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <56D56961.4060904@perathoner.de> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 22838 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/)

On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
> 1) Make the new behaviour an opt-in.

Again, this is arguing over what the default should be. For many users,
the new sort of behavior is better.

>
> 2) If you just output
>
>    binary line 42 in file x matches
>
> and continue regular output after the next newline, the breakage would
> be much more confined.

This sounds like a good suggestion. That is, grep could keep going if its
only problem is an attempt to output encoding errors (as opposed to
reading null bytes, which are a more-reliable indication of binary data).
It would probably be better to output just one "Binary file matches" line
per file, at the end of the other matches, so that it's more likely to be
noticed.

>
> 3) Fail in the old documented way of printing only the error message
> instead of introducing a new mode of failure that looks like success
> and loses the error message in the noise.

I don't understand this suggestion, as it's not an error or an error
message. But since I like (2) better perhaps it doesn't matter.

>
> 4) Don't implement this change between minor releases. A breaking
> change deserves a major release.
>

Grep does not have minor releases. Whether to call the next release
"2.24" or "3.0" is primarily a marketing decision, not a technical one.

From debbugs-submit-bounces@debbugs.gnu.org Thu Sep 08 21:43:53 2016 Received: (at 22838-done) by debbugs.gnu.org; 9 Sep 2016 01:43:53 +0000 Received: from localhost ([127.0.0.1]:54033 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1biArA-0008FF-T0 for submit@debbugs.gnu.org; Thu, 08 Sep 2016 21:43:53 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:43276) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1biAr8-0008F2-GJ for 22838-done@debbugs.gnu.org; Thu, 08 Sep 2016 21:43:51 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id AF8A81611FB; Thu, 8 Sep 2016 18:43:44 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id xIsErDcZvfqE; Thu, 8 Sep 2016 18:43:43 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id B29EE1611DF; Thu, 8 Sep 2016 18:43:43 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from
zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id pLcl5JCsCMuq; Thu, 8 Sep 2016 18:43:43 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 83E041611FB; Thu, 8 Sep 2016 18:43:43 -0700 (PDT) Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Marcello Perathoner , Eric Blake , 22838-done@debbugs.gnu.org References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> <56D4C843.4040405@redhat.com> <56D56961.4060904@perathoner.de> <56D5CDE0.6020501@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <3fa28b6a-9a78-375a-5978-46987a9bb681@cs.ucla.edu> Date: Thu, 8 Sep 2016 18:43:43 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 In-Reply-To: <56D5CDE0.6020501@cs.ucla.edu> Content-Type: multipart/mixed; boundary="------------49D2F601757011799F5D774E" X-Spam-Score: -1.3 (-) X-Debbugs-Envelope-To: 22838-done Cc: Hans Pelleboer , Bruce Dubbs , Jim Meyering X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.3 (-) This is a multi-part message in MIME format. 
--------------49D2F601757011799F5D774E
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>
>>    binary line 42 in file x matches
>>
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
>
> This sounds like a good suggestion. That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data). It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.

I finally got around to implementing this, which turned out to be considerably
easier than I thought it would be. I installed the attached patch into the grep
Savannah master. I am boldly closing this old bug report; we can always start a
new report if further problems turn up.

--------------49D2F601757011799F5D774E
Content-Type: text/x-diff;
 name="0001-grep-encoding-errors-suppress-just-their-line.patch"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
 filename="0001-grep-encoding-errors-suppress-just-their-line.patch"

From 0f1fb0747fdac7043124df4cead5c845bd64fd77 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Thu, 8 Sep 2016 18:33:14 -0700
Subject: [PATCH] grep: encoding errors suppress just their line

From a suggestion by Marcello Perathoner (Bug#22838).
* NEWS, doc/grep.texi (File and Directory Selection):
Document this.
* src/grep.c (print_line_head): Do not suppress later output
lines merely because an earlier output line would have had an
encoding error.
* tests/encoding-error: Test for the new behavior.
---
 NEWS                 |  5 +++++
 doc/grep.texi        | 13 +++++++------
 src/grep.c           | 13 ++++++-------
 tests/encoding-error |  4 ++++
 4 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/NEWS b/NEWS
index 01be350..a63a7b2 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,11 @@ GNU grep NEWS -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  Grep no longer omits output merely because it follows an output line
+  suppressed due to encoding errors.  [bug introduced in grep-2.21]
+
 ** Improvements
 
   grep can be much faster now when standard output is /dev/null.
diff --git a/doc/grep.texi b/doc/grep.texi
index 7e51d45..fcfad42 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -610,18 +610,19 @@ Variables}), or null input bytes when the
 @option{-z} (@option{--null-data}) option is not given
 (@pxref{Other Options}).
 
-By default, @var{type} is @samp{binary}, and when @command{grep}
-discovers that a file is binary it suppresses any further output, and
-instead outputs either a one-line message saying that a binary file
-matches, or no message if there is no match.
+By default, @var{type} is @samp{binary}, and @command{grep}
+suppresses output afer null input binary data is discovered,
+and suppresses output lines that contain improperly encoded data.
+When some output is suppressed, @command{grep} follows any output
+with a one-line message saying that a binary file matches.
 
 If @var{type} is @samp{without-match},
-when @command{grep} discovers that a file is binary
+when @command{grep} discovers null input binary data
 it assumes that the rest of the file does not match;
 this is equivalent to the @option{-I} option.
 
 If @var{type} is @samp{text},
-@command{grep} processes a binary file as if it were text;
+@command{grep} processes binary data as if it were text;
 this is equivalent to the @option{-a} option.
 
 When @var{type} is @samp{binary}, @command{grep} may treat non-text
diff --git a/src/grep.c b/src/grep.c
index d07f5da..65916ca 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1108,17 +1108,16 @@ print_offset (uintmax_t pos, int min_width, const char *color)
 static bool
 print_line_head (char *beg, size_t len, char const *lim, char sep)
 {
-  bool encoding_errors = false;
   if (binary_files != TEXT_BINARY_FILES)
     {
       char ch = beg[len];
-      encoding_errors = buf_has_encoding_errors (beg, len);
+      bool encoding_errors = buf_has_encoding_errors (beg, len);
       beg[len] = ch;
-    }
-  if (encoding_errors)
-    {
-      encoding_error_output = done_on_match = out_quiet = true;
-      return false;
+      if (encoding_errors)
+        {
+          encoding_error_output = true;
+          return false;
+        }
     }
 
   bool pending_sep = false;
diff --git a/tests/encoding-error b/tests/encoding-error
index 4b5fcb5..0cbeffc 100755
--- a/tests/encoding-error
+++ b/tests/encoding-error
@@ -35,6 +35,10 @@ grep '^X' in >out
 test $? = 1 || fail=1
 compare /dev/null out || fail=1
 
+grep . in >out || fail=1
+(cat a j && printf 'Binary file in matches\n') >exp || framework_failure_
+compare exp out || fail=1
+
 grep -a . in >out || fail=1
 compare in out
 
-- 
2.7.4

--------------49D2F601757011799F5D774E--

From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 09 01:21:05 2016 Received: (at 22838-done) by debbugs.gnu.org; 9 Sep 2016 05:21:05 +0000 Received: from localhost ([127.0.0.1]:54086 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1biEFN-000545-3X for submit@debbugs.gnu.org; Fri, 09 Sep 2016 01:21:05 -0400 Received: from mail-vk0-f41.google.com ([209.85.213.41]:36365) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1biEFK-00053G-Sz for 22838-done@debbugs.gnu.org; Fri, 09 Sep 2016 01:21:03 -0400 Received: by mail-vk0-f41.google.com with SMTP id m62so5615378vkd.3 for <22838-done@debbugs.gnu.org>; Thu, 08 Sep 2016 22:21:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=TApBwFKp3t02BYPrEjboU4lBGskx3xp4cY4ZbZgB8eQ=; b=KNTktw9Dwzjt8fKemw6adt19Pr6rYQNceXV+WPdjzuFCQcjEwjgzw0+nKFR+uloRy5 BEQP8vbjMta8mKfm34VFQA/SQcfNu+0Tvi9ztcwirMoA718E9VdrqA8W+vArGgpLqKEN +XxxW/Tk/tDYzW2GXlPh+MqVV5UNZbj7+QEH/rzMEMzj1QaHBonOaoVgtP4CFMOirvzN ofVMBm6LdDbmAgF+Q8XYD8OVFyxBVNZUkylxBySmQBC8BO+vt8jtzwwz/dV9n+1nsFeh nwQcQWRLxqxvKLpp0Jk5EJswzmRxNRR1pEMkBM/5vfmsPOJUJSRiYXos+QpxvXa1qG8d LAKg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=TApBwFKp3t02BYPrEjboU4lBGskx3xp4cY4ZbZgB8eQ=; b=TwqbOiqawhYg7pO8lgUhdn4+eSoqo42PaOsCPCbwj8qgQZMXGCQ92z/xWrDLkvcOJP NusqmEJPljap8XMXTSYxvJ7R+Orqoh8p4OCmlWeYnNjyzFkVntulYnKivQMpCJqdlkNO 368KHXcIKQxBycx+/oqUHvHUxivPcJGmtmegLB+NLLGl+FBvWvCluqxG1mss8gtBVr7b eem/y3wo7H0eHJchjJLCPgep0Z4/fVTrOSTjgQ1GTcVEyCKIWvwB1qC/4WLIJv/7GUa6 MUQmatCAum26dw5HnisF5k9rs4qR9VHlvzbPBST36mLx/uPNm0EcvQQWcmHfLj8DLKol DEVg== X-Gm-Message-State:
AE9vXwN/jQXSzPbHDk7kth+uRk9UywT1NvxM8cll2u1osXzJEYFFX+Mz1MDXiIFwNI0kWCsuXRXPKOYJtgcQpA== X-Received: by 10.31.92.143 with SMTP id q137mr1048756vkb.92.1473398457367; Thu, 08 Sep 2016 22:20:57 -0700 (PDT) MIME-Version: 1.0 Received: by 10.176.80.212 with HTTP; Thu, 8 Sep 2016 22:20:36 -0700 (PDT) In-Reply-To: <3fa28b6a-9a78-375a-5978-46987a9bb681@cs.ucla.edu> References: <56D2D733.60506@perathoner.de> <56D37117.5060007@cs.ucla.edu> <56D47C5C.8000509@perathoner.de> <56D47E4E.4060409@redhat.com> <56D48298.40503@perathoner.de> <56D485EC.6040008@redhat.com> <56D4863E.5040205@redhat.com> <56D4A5D6.5040709@perathoner.de> <56D4C843.4040405@redhat.com> <56D56961.4060904@perathoner.de> <56D5CDE0.6020501@cs.ucla.edu> <3fa28b6a-9a78-375a-5978-46987a9bb681@cs.ucla.edu> From: Jim Meyering Date: Thu, 8 Sep 2016 22:20:36 -0700 X-Google-Sender-Auth: xaGvQtTVbnE_gRjqlcHjVLS1RKI Message-ID: Subject: Re: bug#22838: New 'Binary file' detection considered harmful To: Paul Eggert Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 22838-done Cc: 22838-done@debbugs.gnu.org, Eric Blake , Hans Pelleboer , Marcello Perathoner , Bruce Dubbs X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

On Thu, Sep 8, 2016 at 6:43 PM, Paul Eggert wrote:
> Paul Eggert wrote:
>>
>> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>>>
>>> 2) If you just output
>>>
>>> binary line 42 in file x matches
>>>
>>> and continue regular output after the next newline, the breakage would be
>>> much
>>> more confined.
>>
>>
>> This sounds like a good suggestion. That is, grep could keep going if its
>> only
>> problem is an attempt to output encoding errors (as opposed to reading
>> null
>> bytes, which are a more-reliable indication of binary data). It would
>> probably
>> be better to output just one "Binary file matches" line per file, at the
>> end of
>> the other matches, so that it's more likely to be noticed.
>
>
> I finally got around to implementing this, which turned out to be
> considerably easier than I thought it would be. I installed the attached
> patch into the grep Savannah master. I am boldly closing this old bug
> report; we can always start a new report if further problems turn up.

Very nice. Thank you!

From unknown Fri Jun 20 07:21:42 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Fri, 07 Oct 2016 11:24:04 +0000 User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
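
The behaviour the patch in this thread aims for can be checked by hand with a sketch along the following lines. The names a, j and in echo the ones visible in the patch's test hunk; the file p, all file contents, the temporary directory, and the en_US.UTF-8 locale name are assumptions made for illustration. Going by the NEWS entry in the patch, a grep that includes the change should print the two clean lines followed by one binary-file notice, while an unpatched grep from 2.21 onward is expected to stop printing at the first badly encoded line. No output is asserted for that call, since it depends on the installed grep and locales.

```shell
# Three input lines; the middle one holds byte 0xE9 (octal 351),
# which is not a valid UTF-8 sequence on its own.
dir=$(mktemp -d)
printf 'Alfred Jones\n'    > "$dir/a"
printf 'Pedro P\351rez\n'  > "$dir/p"
printf 'John Smith\n'      > "$dir/j"
cat "$dir/a" "$dir/p" "$dir/j" > "$dir/in"

# Default mode in a UTF-8 locale (assumed to be installed under this name).
# The result depends on the grep version, so it is only displayed.
LC_ALL=en_US.UTF-8 grep . "$dir/in" || true

# -a bypasses the binary check, so all three lines are counted as matches.
all=$(grep -a -c . "$dir/in")
echo "lines matched with -a: $all"

rm -rf "$dir"
```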