From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 01 13:01:04 2014 Received: (at submit) by debbugs.gnu.org; 1 Dec 2014 18:01:04 +0000 Received: from localhost ([127.0.0.1]:51053 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XvVHT-0006A4-5p for submit@debbugs.gnu.org; Mon, 01 Dec 2014 13:01:03 -0500 Received: from eggs.gnu.org ([208.118.235.92]:58189) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XvUQi-0002LM-Tp for submit@debbugs.gnu.org; Mon, 01 Dec 2014 12:06:33 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XvUQZ-0005IR-HH for submit@debbugs.gnu.org; Mon, 01 Dec 2014 12:06:32 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=BAYES_40,HTML_MESSAGE autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:52119) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XvUQZ-0005IL-FE for submit@debbugs.gnu.org; Mon, 01 Dec 2014 12:06:23 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:59769) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XvUQT-0000xq-7s for bug-grep@gnu.org; Mon, 01 Dec 2014 12:06:23 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XvUQM-0005Ck-DZ for bug-grep@gnu.org; Mon, 01 Dec 2014 12:06:17 -0500 Received: from demumfd001.nsn-inter.net ([93.183.12.32]:53396) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XvUQM-0005Bf-4Q for bug-grep@gnu.org; Mon, 01 Dec 2014 12:06:10 -0500 Received: from demuprx017.emea.nsn-intra.net ([10.150.129.56]) by demumfd001.nsn-inter.net (8.14.3/8.14.3) with ESMTP id sB1H5q5o009533 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Mon, 1 Dec 2014 17:05:54 GMT Received: from [10.149.138.145] ([10.149.138.145]) by demuprx017.emea.nsn-intra.net (8.12.11.20060308/8.12.11) with ESMTP id 
sB1H5pBx002417; Mon, 1 Dec 2014 18:05:51 +0100 Message-ID: <547C9FEF.6090809@computer.org> Date: Mon, 01 Dec 2014 18:05:51 +0100 From: Thomas Wolff User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: bug-grep@gnu.org Subject: latest grep considers text files as binary Content-Type: multipart/alternative; boundary="------------010909060809080607020703" X-purgate-type: clean X-purgate-Ad: Categorized by eleven eXpurgate (R) http://www.eleven.de X-purgate: clean X-purgate: This mail is considered clean (visit http://www.eleven.de for further information) X-purgate-size: 2185 X-purgate-ID: 151667::1417453555-0000658F-53BBB669/0/0 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Mon, 01 Dec 2014 13:01:02 -0500 Cc: meyering@fb.com, eggert@cs.ucla.edu, noritnk@kcn.ne.jp X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) This is a multi-part message in MIME format. --------------010909060809080607020703 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Since grep 2.21, grep fails to report matches in a UTF-8 file with a few non-UTF-8 bytes interspersed. This is likely to be related to one of the recent patches related to encoding or multi-byte issues I see in the change log. I have a number of large UTF-8 source files with some non-UTF-8 characters used as constants and it was quite useful that grep nonetheless would simply report the requested matches. Now it claims just "Binary file ... 
matches" even if the file contains only one single non-UTF-8 byte which I consider quite inappropriate. I would appreciate to get the previous behaviour restored, at least in a UTF-8 locale, as the mentioned patches are apparently intended to fix issues in non-UTF-8 locales. Kind regards, Thomas
--------------010909060809080607020703-- From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 01 17:42:05 2014 Received: (at 19242-done) by debbugs.gnu.org; 1 Dec 2014 22:42:05 +0000 Received: from localhost ([127.0.0.1]:51262 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XvZfQ-0007hr-O5 for submit@debbugs.gnu.org; Mon, 01 Dec 2014 17:42:05 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:34254) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XvZfP-0007hk-Bs for 19242-done@debbugs.gnu.org; Mon, 01 Dec 2014 17:42:03 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id C85FFA60051; Mon, 1 Dec 2014 14:42:02 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id PUkiHtOev14F; Mon, 1 Dec 2014 14:41:54 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 33DF8A6001B; Mon, 1 Dec 2014 14:41:54 -0800 (PST) Message-ID: <547CEEB1.2070305@cs.ucla.edu> Date: Mon, 01 Dec 2014 14:41:53 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: 19242-done@debbugs.gnu.org Subject: Re: latest grep considers text files as binary References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> In-Reply-To: <547CA56B.4070002@cs.ucla.edu> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 19242-done Cc: noritnk@kcn.ne.jp X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: 
"Debbugs-submit" X-Spam-Score: -2.3 (--) Also marking Bug#19242 as done, since it's the same as Bug#19241. From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 05 04:59:09 2014 Received: (at 19242) by debbugs.gnu.org; 5 Dec 2014 09:59:10 +0000 Received: from localhost ([127.0.0.1]:54269 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XwpfJ-0005Vh-7B for submit@debbugs.gnu.org; Fri, 05 Dec 2014 04:59:09 -0500 Received: from demumfd001.nsn-inter.net ([93.183.12.32]:52427) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XwpfE-0005VW-PS for 19242@debbugs.gnu.org; Fri, 05 Dec 2014 04:59:06 -0500 Received: from demuprx017.emea.nsn-intra.net ([10.150.129.56]) by demumfd001.nsn-inter.net (8.14.3/8.14.3) with ESMTP id sB59wopN020935 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Fri, 5 Dec 2014 09:58:51 GMT Received: from [10.149.138.145] ([10.149.138.145]) by demuprx017.emea.nsn-intra.net (8.12.11.20060308/8.12.11) with ESMTP id sB59wotg019975; Fri, 5 Dec 2014 10:58:50 +0100 Message-ID: <548181D9.4030108@computer.org> Date: Fri, 05 Dec 2014 10:58:49 +0100 From: Thomas Wolff User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Paul Eggert , Jim Meyering Subject: Re: latest grep considers text files as binary References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> In-Reply-To: <547CA56B.4070002@cs.ucla.edu> X-TagToolbar-Keys: D20141205105849888 Content-Type: multipart/alternative; boundary="------------050009080007050500090806" X-purgate-type: clean X-purgate-Ad: Categorized by eleven eXpurgate (R) http://www.eleven.de X-purgate: clean X-purgate: This mail is considered clean (visit http://www.eleven.de for further information) X-purgate-size: 3703 X-purgate-ID: 151667::1417773532-0000658F-1C9A97DF/0/0 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: 19242 Cc: 19242@debbugs.gnu.org, noritnk@kcn.ne.jp X-BeenThere: 
debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----) This is a multi-part message in MIME format. --------------050009080007050500090806 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit

Paul Eggert wrote:
>> the mentioned patches are apparently intended to fix issues in
>> non-UTF-8 locales.
> No, they're also needed for UTF-8 locales I'm afraid. There are some
> security issues, not only having to do with grep's internals, but also
> for the behavior of downstream programs that may be expecting UTF-8 text.
>
> You can work around the problem with 'grep -a'.
I was aware of this workaround but I claim it should not be needed
because the files affected are in fact not binary files but text files.
The manual clearly says about -a: "Process a binary file as if it were
text" but partial content in a different text encoding does not make a
file binary.

Jim Meyering wrote:
> this is due to documented and desirable behavior.
I deny this is desirable behavior and I doubt there is a security issue
as described. If any other, independent software has a security issue
with non-UTF-8 input, it should decide itself to filter it and use
accordingly stable decoding functions. It cannot be the task of any tool
(grep in this case) to filter output to work around possible security
issues in other programs in a pipe. This would be completely against the
concept of pipes in the Unix tradition.

Honestly I think this is another case of practical usefulness losing
against dogma in software design.

Kind regards,
Thomas
--------------050009080007050500090806-- From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 05 10:00:47 2014 Received: (at 19242) by debbugs.gnu.org; 5 Dec 2014 15:00:47 +0000 Received: from localhost ([127.0.0.1]:54778 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XwuNC-0005mu-Om for submit@debbugs.gnu.org; Fri, 05 Dec 2014 10:00:47 -0500 Received: from mail-yh0-f42.google.com ([209.85.213.42]:62340) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XwuN8-0005mh-CC for 19242@debbugs.gnu.org; Fri, 05 Dec 2014 10:00:43 -0500 Received: by mail-yh0-f42.google.com with SMTP id v1so384986yhn.15 for <19242@debbugs.gnu.org>; Fri, 05 Dec 2014 07:00:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=x3e1XbYOG5vjKeaK3vDWJWmXxiuuUrVyjn4Bo/THP3U=; b=kiYFjCjxAJLUBZzEQZcTPXqkeDbaYnS88xpSKS1q0L+aMdPazIqjNE4nzbFav3Wkkc IL5HmWt4YY2RiWFAQvk3jAYYE8UYP9rZtKMRgK5dBCzcAQviIK5oM0/rioS9VR6YR6Wl xAeau9bai92n1xknJK54QT/w5wk8aGXvbhD22wCpKiHj74pXvLz8D2/RxkAxXVj2bV0G c+eOvms5H6ck4OlKz5eqCVV5Pdze2yJWOMpyM/yiz7N6aMHZ+JyjuOXUukRwFWPQJiHk t4fxee46/BSmQVWnJxyN8CBrTjhXMxZJ+5uuGlqbIIFiAWxnpgEkbEtZh2mOYPBEKq13 Tg2g== X-Received: by 10.170.90.68 with SMTP id h65mr22779596yka.94.1417791641633; Fri, 05 Dec 2014 07:00:41 -0800 (PST) MIME-Version: 1.0 Received: by 10.170.139.67 with HTTP; Fri, 5 Dec 2014 07:00:21 -0800 (PST) In-Reply-To: <548181D9.4030108@computer.org> References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> <548181D9.4030108@computer.org> From: Jim Meyering Date: Fri, 5 Dec 2014 07:00:21 -0800 X-Google-Sender-Auth: oZbOql0-FD4BXWHIXnR2mdkGnEE Message-ID: Subject: Re: bug#19242: latest grep considers text files as binary To: Thomas Wolff Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 19242 Cc: Jim Meyering , Paul Eggert , 
19242@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

On Fri, Dec 5, 2014 at 1:58 AM, Thomas Wolff wrote:
> Paul Eggert wrote:
>>>
>>> the mentioned patches are apparently intended to fix issues in non-UTF-8
>>> locales.
>>
>> No, they're also needed for UTF-8 locales I'm afraid. There are some
>> security issues, not only having to do with grep's internals, but also for
>> the behavior of downstream programs that may be expecting UTF-8 text.
>>
>> You can work around the problem with 'grep -a'.
>
> I was aware of this workaround but I claim it should not be needed because
> the files affected are in fact not binary files but text files. The manual
> clearly says about -a: "Process a binary file as if it were text" but
> partial content in a different text encoding does not make a file binary.
>
> Jim Meyering wrote:
>>
>> this is due to documented and desirable behavior.
>
> I deny this is desirable behavior and I doubt there is a security issue as
> described. If any other, independent software has a security issue with
> non-UTF-8 input, it should decide itself to filter it and use accordingly
> stable decoding functions. It cannot be the task of any tool (grep in this
> case) to filter output to work around possible security issues in other
> programs in a pipe. This would be completely against the concept of pipes in
> the Unix tradition.

This is another side effect of using a multibyte locale.
As long as there are no NUL bytes in your input, you can work
around the issue by running grep in the C locale:

  LC_ALL=C grep ...
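
A rough illustration of why the C-locale workaround helps, added as an annotation rather than as part of the message above: in the C locale every byte counts as a single-byte character, so matching proceeds byte by byte and a stray non-UTF-8 byte cannot make a line undecodable. The Python sketch below mimics that idea with a bytes regular expression. It is an analogy only, not grep's implementation; `grep_bytes` and the sample data are hypothetical names made up for this example.

```python
import re


def grep_bytes(pattern: bytes, data: bytes) -> list:
    """Return the lines of `data` that match `pattern`, comparing raw bytes.

    Hypothetical helper: it imitates byte-oriented matching (as in the C
    locale) and is not how grep itself is implemented.
    """
    if b"\0" in data:
        # The caveat from the message above: NUL bytes still mean "binary".
        raise ValueError("input contains NUL bytes")
    regex = re.compile(pattern)
    # No decoding step, so an invalid UTF-8 byte is just another byte.
    return [line for line in data.splitlines() if regex.search(line)]


# Mostly UTF-8 text with one stray Latin-1 byte (0xE9) in the first line.
sample = (
    b"#define E_ACUTE '\xe9'\n"
    b"#define NAME \"caf\xc3\xa9\"\n"
    b"int main(void);\n"
)

matches = grep_bytes(rb"#define", sample)
print(matches)
```

Both `#define` lines are reported, including the one carrying the stray byte, which is the behaviour the original report asked for. The price is that multi-byte characters are no longer treated as single units by the pattern.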
From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 05 10:35:04 2014 Received: (at 19242) by debbugs.gnu.org; 5 Dec 2014 15:35:04 +0000 Received: from localhost ([127.0.0.1]:54786 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XwuuO-0006dK-9O for submit@debbugs.gnu.org; Fri, 05 Dec 2014 10:35:04 -0500 Received: from mx1.redhat.com ([209.132.183.28]:49684) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XwuuL-0006cr-Ku for 19242@debbugs.gnu.org; Fri, 05 Dec 2014 10:35:02 -0500 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id sB5FYvEB026287 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 5 Dec 2014 10:34:57 -0500 Received: from [10.3.113.183] (ovpn-113-183.phx2.redhat.com [10.3.113.183]) by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id sB5FYupL004155; Fri, 5 Dec 2014 10:34:57 -0500 Message-ID: <5481D09F.2060801@redhat.com> Date: Fri, 05 Dec 2014 08:34:55 -0700 From: Eric Blake Organization: Red Hat, Inc. 
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Thomas Wolff , Paul Eggert , Jim Meyering Subject: Re: bug#19242: latest grep considers text files as binary References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> <548181D9.4030108@computer.org> In-Reply-To: <548181D9.4030108@computer.org> OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="bLWqBRNSL6N2skIk6N4QIpqN8PG2Ibl2P" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 19242 Cc: 19242@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --bLWqBRNSL6N2skIk6N4QIpqN8PG2Ibl2P Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable

On 12/05/2014 02:58 AM, Thomas Wolff wrote:
> Paul Eggert wrote:
>>> the mentioned patches are apparently intended to fix issues in
>>> non-UTF-8 locales.
>> No, they're also needed for UTF-8 locales I'm afraid. There are some
>> security issues, not only having to do with grep's internals, but also
>> for the behavior of downstream programs that may be expecting UTF-8 text.
>>
>> You can work around the problem with 'grep -a'.
> I was aware of this workaround but I claim it should not be needed
> because the files affected are in fact not binary files but text files.

No, they are binary. The POSIX definition of a text file states that
the file may consist ONLY of characters in the current locale. If you
have files created under different locales, such that the bytes in the
file are NOT characters in the current locale, then that file is binary
under the current locale, even though it may be text in a better locale.

> The manual clearly says about -a: "Process a binary file as if it were
> text" but partial content in a different text encoding does not make a
> file binary.

Yes, it does, per POSIX.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

--bLWqBRNSL6N2skIk6N4QIpqN8PG2Ibl2P Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg

iQEcBAEBCAAGBQJUgdCgAAoJEKeha0olJ0NqZ/8H/jF6MDCr4PBS10RYdAT42uqn
U11qvV7HweV80gdx4Ivk3LCktuPk68o4H3gmRFMHMhgKYbNCHiGc6Hf9gyst3Fsz
QCv+nt2T4Sxa0cbeInK9TYXJ4mpgpD5NQdzyGMtzTBTycn4NocFPrvrC6COig3FZ
+M0GY2GOpTjdP6QxuO/u6v3tEocxgt1Wj9OLlGeA7jXbQ1VM4OMdrIduLfSO7zdN
pGZJdjQD3J7YPR4cO1RxAWnQkvcvvHlERo4sgghgDSxFq6E37S5fOv19MqxQ5rTB
VdmG4bZCQyEtKbQ+TqLNQvXT/rkIi5N1U/szPwRn+OMmPXeodBukHktVT6GNAxk=
=BpfF
-----END PGP SIGNATURE-----

--bLWqBRNSL6N2skIk6N4QIpqN8PG2Ibl2P--
From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 05 10:36:23 2014 Received: (at 19242) by debbugs.gnu.org; 5 Dec 2014 15:36:23 +0000 Received: from localhost ([127.0.0.1]:54790 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Xwuve-0006fO-Ox for submit@debbugs.gnu.org; Fri, 05 Dec 2014 10:36:22 -0500 Received: from mx1.redhat.com ([209.132.183.28]:37729) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Xwuvc-0006fF-M8 for 19242@debbugs.gnu.org; Fri, 05 Dec 2014 10:36:21 -0500 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id sB5FaIEn029193
(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 5 Dec 2014 10:36:18 -0500 Received: from [10.3.113.183] (ovpn-113-183.phx2.redhat.com [10.3.113.183]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id sB5FaHvV031868; Fri, 5 Dec 2014 10:36:18 -0500 Message-ID: <5481D0F1.2000402@redhat.com> Date: Fri, 05 Dec 2014 08:36:17 -0700 From: Eric Blake Organization: Red Hat, Inc. User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Jim Meyering , Thomas Wolff Subject: Re: bug#19242: latest grep considers text files as binary References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> <548181D9.4030108@computer.org> In-Reply-To: OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="B4jlMbPrtXjsONcGlVSaCtN6li2hN8AOX" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 19242 Cc: Jim Meyering , Paul Eggert , 19242@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --B4jlMbPrtXjsONcGlVSaCtN6li2hN8AOX Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable

On 12/05/2014 08:00 AM, Jim Meyering wrote:
>>
>> I deny this is desirable behavior and I doubt there is a security issue as
>> described. If any other, independent software has a security issue with
>> non-UTF-8 input, it should decide itself to filter it and use accordingly
>> stable decoding functions. It cannot be the task of any tool (grep in this
>> case) to filter output to work around possible security issues in other
>> programs in a pipe. This would be completely against the concept of pipes in
>> the Unix tradition.
>
> This is another side effect of using a multibyte locale.
> As long as there are no NUL bytes in your input, you can work
> around the issue by running grep in the C locale:
>
>   LC_ALL=C grep ...

Yes, the C locale has the nice effect of EVERY byte being a valid single
byte character, leaving only NUL bytes and a non-empty file not ending
in newline as the only reasons for a file to be marked binary.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

--B4jlMbPrtXjsONcGlVSaCtN6li2hN8AOX Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg

iQEcBAEBCAAGBQJUgdDxAAoJEKeha0olJ0NqVJkIAIEAi4cAT0xIthHigmfv5EbV
I8/UTKIVEHaO83vgmQ03ulQn1MCP/+hBvZly54aGqnuDzATEuHISscHV+iiHe8NI
hs8BUKTYEoHEqK8AhQlGko4/EnaW7JQSBgh4jAyo0XW+7vN/fNDc7EOuc9AD4y7W
8sg0rMI366eLNotkrc3E1LzpCkR4ZySr62WBWz+aUPqJVEJtxQkmeUgLbLH0D7nJ
zTSbhCA25CitXGcj1n7SxAVG5SMsyBGVNcZJUimu4zYf5AWMm/8LYsS18WlzcNQg
Nxn71DKqClduGezircGWd3WAbpGBUupyyZtZlDqFVNkSqpwNABAW3+wQgD0jKtU=
=z7Hv
-----END PGP SIGNATURE-----

--B4jlMbPrtXjsONcGlVSaCtN6li2hN8AOX--
From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 05 10:39:54 2014 Received: (at 19242) by debbugs.gnu.org; 5 Dec 2014 15:39:55 +0000 Received: from localhost ([127.0.0.1]:54798 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Xwuz4-0006mK-2l for submit@debbugs.gnu.org; Fri, 05 Dec 2014 10:39:54 -0500 Received: from mx1.redhat.com ([209.132.183.28]:51328) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id
1Xwuz1-0006m9-94 for 19242@debbugs.gnu.org; Fri, 05 Dec 2014 10:39:51 -0500 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id sB5Fdm5J028108 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 5 Dec 2014 10:39:49 -0500 Received: from [10.3.113.183] (ovpn-113-183.phx2.redhat.com [10.3.113.183]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id sB5Fdmhx019628; Fri, 5 Dec 2014 10:39:48 -0500 Message-ID: <5481D1C4.40107@redhat.com> Date: Fri, 05 Dec 2014 08:39:48 -0700 From: Eric Blake Organization: Red Hat, Inc. User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Thomas Wolff , Paul Eggert , Jim Meyering Subject: Re: bug#19242: latest grep considers text files as binary References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> <548181D9.4030108@computer.org> <5481D09F.2060801@redhat.com> In-Reply-To: <5481D09F.2060801@redhat.com> OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="8v3jcqM1SKCIjVPdXxjs6je6LtRNd3Kbu" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.26 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 19242 Cc: 19242@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --8v3jcqM1SKCIjVPdXxjs6je6LtRNd3Kbu Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable

On 12/05/2014 08:34 AM, Eric Blake wrote:
> On 12/05/2014 02:58 AM, Thomas Wolff wrote:
>> Paul Eggert wrote:
>>>> the mentioned patches are apparently intended to fix issues in
>>>> non-UTF-8 locales.
>>> No, they're also needed for UTF-8 locales I'm afraid. There are some
>>> security issues, not only having to do with grep's internals, but also
>>> for the behavior of downstream programs that may be expecting UTF-8 text.
>>>
>>> You can work around the problem with 'grep -a'.
>> I was aware of this workaround but I claim it should not be needed
>> because the files affected are in fact not binary files but text files.
>
> No, they are binary. The POSIX definition of a text file states that
> the file may consist ONLY of characters in the current locale. If you
> have files created under different locales, such that the bytes in the
> file are NOT characters in the current locale, then that file is binary
> under the current locale, even though it may be text in a better locale.
>
>> The manual clearly says about -a: "Process a binary file as if it were
>> text" but partial content in a different text encoding does not make a
>> file binary.
>
> Yes, it does, per POSIX.

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_397

A file that contains characters organized into zero or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
in length, including the <newline> character. Although POSIX.1-2008 does
not distinguish between text files and binary files (see the ISO C
standard), many utilities only produce predictable or meaningful output
when operating on text files. The standard utilities that have such
restrictions always specify "text files" in their STDIN or INPUT FILES
sections.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

--8v3jcqM1SKCIjVPdXxjs6je6LtRNd3Kbu Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg

iQEcBAEBCAAGBQJUgdHEAAoJEKeha0olJ0NqfXcH/1hnf/0RF2FOE49GA75SXakW
/LCs1UamBmsBB1AloRzXFfn74YfCB4Q4fsukFmtOzB7X70jGBgaiYw6wiTO8qpYM
uA7wA150VrEVL56XGFh5fxT8tgmpYH7beCfIhEgMBJ+9f8kkDx6FHQKccrQrPRdR
3cQJi1llyw1PVfUaNmHAwZna+fX62CaGIv9jskorqy8KNXFuo21/XSLipcarOk1u
JjJ8O4LU3ss3MPgPs5QAe0AesjkQqoqlJdlKu/D9MkCHHhs+Y0ynKv2S8Wu4JFmY
pNPMLKAaEPSTGFgXM66YU9utb7ZrraZ6JBUvmbxODEG9Yg1VWPFdqRJGfj38tuY=
=vNFb
-----END PGP SIGNATURE-----

--8v3jcqM1SKCIjVPdXxjs6je6LtRNd3Kbu--
From unknown Sat Jun 21 03:28:44 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Sat, 03 Jan 2015 12:24:03 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator
From unknown Sat Jun 21 03:28:44 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: Did not alter fixed versions and reopened. Date: Mon, 23 Mar 2015 00:10:03 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # Did not alter fixed versions and reopened.
thanks
# This fakemail brought to you by your local debbugs
# administrator

From debbugs-submit-bounces@debbugs.gnu.org Sun Mar 22 20:41:39 2015
Received: (at control) by debbugs.gnu.org; 23 Mar 2015 00:41:39 +0000
Received: from localhost ([127.0.0.1]:32881 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YZqR0-00033W-OA for submit@debbugs.gnu.org; Sun, 22 Mar 2015 20:41:38 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:39453) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YZqQy-00033F-G4 for control@debbugs.gnu.org; Sun, 22 Mar 2015 20:41:36 -0400
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id CD33E39E8019 for ; Sun, 22 Mar 2015 17:41:30 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id rRGOnH9wbPvp for ; Sun, 22 Mar 2015 17:41:30 -0700 (PDT)
Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 920FD39E8015 for ; Sun, 22 Mar 2015 17:41:30 -0700 (PDT)
Message-ID: <550F613A.2050409@cs.ucla.edu>
Date: Sun, 22 Mar 2015 17:41:30 -0700
From: Paul Eggert
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: control@debbugs.gnu.org
Subject: 19242 discussion continues
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -2.3 (--)

unarchive 19242

From debbugs-submit-bounces@debbugs.gnu.org Sun Mar 22 20:42:36 2015
Received: (at 19242) by debbugs.gnu.org; 23 Mar 2015 00:42:36 +0000
Received: from localhost ([127.0.0.1]:32886 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YZqRv-00035R-4v for submit@debbugs.gnu.org; Sun, 22 Mar 2015 20:42:35 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:39492) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YZqRr-000357-M8 for 19242@debbugs.gnu.org; Sun, 22 Mar 2015 20:42:32 -0400
Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 2BEAC39E8019; Sun, 22 Mar 2015 17:42:26 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id hVfvOWDhMGuO; Sun, 22 Mar 2015 17:42:25 -0700 (PDT)
Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 6C7E539E8015; Sun, 22 Mar 2015 17:42:25 -0700 (PDT)
Message-ID: <550F6171.5090109@cs.ucla.edu>
Date: Sun, 22 Mar 2015 17:42:25 -0700
From: Paul Eggert
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Thomas Wolff , Jim Meyering
Subject: Re: latest grep considers text files as binary
References: <547C9FEF.6090809@computer.org> <547CA56B.4070002@cs.ucla.edu> <548181D9.4030108@computer.org> <550EAC87.2020107@towo.net>
In-Reply-To: <550EAC87.2020107@towo.net>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 19242
Cc: 19242@debbugs.gnu.org, noritnk@kcn.ne.jp
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -2.3 (--)

Thomas Wolff wrote:
> Hi Paul and Jim,
>
> Thanks for your previous quick responses on this matter and excuse my very late
> additional statement.
>
> However, the arguments are not convincing.
> The new behavior violates the principle of least astonishment which is well
> established in software design.

That cuts both ways. Older versions of grep could dump core when given
improperly encoded text, which is even more astonishing. The new version is
an improvement in that particular area. It is not clear how grep could be
modified to avoid the core dumps while still preserving the old behavior in
question.

> It is not convincing that a text file is not considered a text file for a few
> bytes that are not properly encoded in the current locale. Also the quoted POSIX
> clause does not support that claim.

Not by itself, but from the chain of definitions it's clear that a text file
must contain properly encoded text. The quoted POSIX clause (3.397) says that
a text file contains "characters", and an earlier clause (3.87) defines
"character" to be "A sequence of one or more bytes representing a single
graphic symbol or control code. Note: This term corresponds to the ISO C
standard term multi-byte character".

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87

Because encoding errors are not characters, they are not text.

> And, considering the "pipe security" argument, shall all classic Unix tools now
> get additional options -a, so that something like
> grep 'bla' | sed -e 'expr' | tr '' '' | grep -v 'argl'
> would in future look like
> grep -a 'bla' | sed -a -e 'expr' | tr -a '' '' | grep -a -v 'argl'
> ?

It shouldn't be needed for tr, as tr's input is not required to be a text
file. GNU sed doesn't worry about whether files are text or binary. I expect
this is because the problem of spitting out random binary data tends to be
less of an issue for 'sed' in practice. However, portable scripts should not
assume that 'sed' will work on arbitrary binary data.

> What about backwards compability of scripts then?
> This is breaking decades of Unix tradition of modular tools for the mere
> dogmatics of some peculiar and strict locale theory.

UTF-8 does tend to have that effect, yes. From the traditional Unix point of
view, patterns like 'a.b' are "broken" with modern grep in UTF-8 locales,
since the "." no longer matches only single bytes. This has been true for
decades, not just for 'grep' but also for 'sed' etc. These days, though,
users tend to be more interested in dealing with multibyte characters than in
insisting on circa-1977 semantics in all cases.

> If you insist on this priority of locale strategy over Unix tradition,
> please offer at least a compatibility option that does not break scripts,
> i.e. an environment setting that enforces compatible behaviour (like other tools
> have, e.g. LS_COLORS etc).

Instead of an environment variable I suggest using a script. Please see:

http://bugs.gnu.org/19998#8

> As a last remark, I wonder why my report does not show up in
> http://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> and apparently I cannot submit anything there myself. Please get the issue
> documented there.

I unarchived that bug report and am quoting the entire new part of your
message, which should do the trick.

> Kind regards,
> Thomas

From unknown Sat Jun 21 03:28:44 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Mon, 20 Apr 2015 11:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator