From unknown Sun Jun 22 00:05:03 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#20526 <20526@debbugs.gnu.org> To: bug#20526 <20526@debbugs.gnu.org> Subject: Status: BUG: text file is detected as binary Reply-To: bug#20526 <20526@debbugs.gnu.org> Date: Sun, 22 Jun 2025 07:05:03 +0000 retitle 20526 BUG: text file is detected as binary reassign 20526 grep submitter 20526 Sebastian Poehn severity 20526 normal thanks From debbugs-submit-bounces@debbugs.gnu.org Thu May 07 11:40:15 2015 Received: (at submit) by debbugs.gnu.org; 7 May 2015 15:40:15 +0000 Received: from localhost ([127.0.0.1]:37970 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqNuH-0006aV-PX for submit@debbugs.gnu.org; Thu, 07 May 2015 11:40:15 -0400 Received: from eggs.gnu.org ([208.118.235.92]:53263) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqJfD-0006iV-1N for submit@debbugs.gnu.org; Thu, 07 May 2015 07:08:23 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YqJf3-00088F-DL for submit@debbugs.gnu.org; Thu, 07 May 2015 07:08:17 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:34134) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YqJf3-00088B-AT for submit@debbugs.gnu.org; Thu, 07 May 2015 07:08:13 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:38521) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YqJf1-0000Ha-U4 for bug-grep@gnu.org; Thu, 07 May 2015 07:08:13 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YqJf0-00087c-Nb for bug-grep@gnu.org; Thu, 07 May 2015 
07:08:11 -0400 Received: from mail-wg0-x22a.google.com ([2a00:1450:400c:c00::22a]:35393) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YqJf0-00087P-Gf for bug-grep@gnu.org; Thu, 07 May 2015 07:08:10 -0400 Received: by wgyo15 with SMTP id o15so39806704wgy.2 for ; Thu, 07 May 2015 04:08:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:message-id:subject:to:date:content-type:mime-version; bh=46L3aBxRSvndTt5J5edgVaRChh8O5WolMef+Ot6RNLA=; b=xA0YSAahfQ0Xba0lwl98XndyDGwFzi27qnUQcy39yRQK+02cV0lonz/286wyNJXbMy yKsRhixNuzAeau+qJSJFCV1aEp4+MDUq7l8c6sV418CMJ3rWKTq0XipXwWxbGl3Nf/ap rMmbIiv9adXVDnluDlHQdTSCdX9bnvTeckKYkMoZHJNgvNOLR2RO4sCVV4E/q0naExuc KU8u+E+kJvAGqHkjHUEFKTr3HkmOnfoGXji3iq8kXIaacpt+HLwJs3jHZHy1xT1UYOmX avlrp2SzvPLbMBszqDZF8DCs3HYghfJe92WHiVvPJ3Asy7WgYAx5WqdtGQ+Rwy5Y2HuB be6w== X-Received: by 10.180.108.147 with SMTP id hk19mr5391173wib.51.1430996889983; Thu, 07 May 2015 04:08:09 -0700 (PDT) Received: from de-ka-36785.green.sophos ([2001:1a80:2000:2:4637:e6ff:feaa:80c4]) by mx.google.com with ESMTPSA id ex2sm2846517wjd.28.2015.05.07.04.08.09 for (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 May 2015 04:08:09 -0700 (PDT) From: Sebastian Poehn X-Google-Original-From: Sebastian Poehn Message-ID: <1430996888.2678.8.camel@googlemail.com> Subject: BUG: text file is detected as binary To: bug-grep@gnu.org Date: Thu, 07 May 2015 13:08:08 +0200 Content-Type: multipart/mixed; boundary="=-yg0+CMVRUiOHQ6Hli6jm" X-Mailer: Evolution 3.12.11 (3.12.11-1.fc21) Mime-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). 
X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Thu, 07 May 2015 11:40:12 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----)

--=-yg0+CMVRUiOHQ6Hli6jm Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit

Fedora 21
grep (GNU grep) 2.21

Grep detects text file as Binary. File is attached.

file Makefile
Makefile: ISO-8859 text

ack PKG_NAME Makefile
10:PKG_NAME:=clearsilver
14:PKG_SOURCE:=$(PKG_NAME)-$(PKG_VERSION).tar.gz

grep --version ; grep "PKG_NAME" Makefile
grep (GNU grep) 2.7
...
PKG_NAME:=clearsilver
PKG_SOURCE:=$(PKG_NAME)-$(PKG_VERSION).tar.gz

grep --version ; grep "PKG_NAME" Makefile
grep (GNU grep) 2.21
...
Binary file Makefile matches

--=-yg0+CMVRUiOHQ6Hli6jm Content-Disposition: attachment; filename="Makefile" Content-Type: text/x-makefile; name="Makefile"; charset="UTF-8" Content-Transfer-Encoding: base64

IwojIENvcHlyaWdodCAoQykgMjAwNi0yMDEwIE9wZW5XcnQub3JnCiMKIyBUaGlzIGlzIGZyZWUg c29mdHdhcmUsIGxpY2Vuc2VkIHVuZGVyIHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSB2 Mi4KIyBTZWUgL0xJQ0VOU0UgZm9yIG1vcmUgaW5mb3JtYXRpb24uCiMKCmluY2x1ZGUgJChUT1BE SVIpL3J1bGVzLm1rCgpQS0dfTkFNRTo9Y2xlYXJzaWx2ZXIKUEtHX1ZFUlNJT046PTAuMTAuNQpQ S0dfUkVMRUFTRTo9NQoKUEtHX1NPVVJDRTo9JChQS0dfTkFNRSktJChQS0dfVkVSU0lPTikudGFy Lmd6ClBLR19TT1VSQ0VfVVJMOj1odHRwOi8vd3d3LmNsZWFyc2lsdmVyLm5ldC9kb3dubG9hZHMv ClBLR19NRDVTVU06PWI4YzBjN2ZiZTBlZjVlMDZlMGM5MzVmMTM0MzA0ZDQ0CgpQS0dfQ09ORklH X0RFUEVORFM6PSBcCglDT05GSUdfQ0xFQVJTSUxWRVJfRU5BQkxFX0NPTVBSRVNTSU9OIFwKCUNP TkZJR19DTEVBUlNJTFZFUl9FTkFCTEVfUkVNT1RFX0RFQlVHR0VSIFwKCUNPTkZJR19DTEVBUlNJ TFZFUl9FTkFCTEVfR0VUVEVYVAoKUEtHX0ZJWFVQOj1saWJ0b29sClBLR19JTlNUQUxMOj0xClBL R19CVUlMRF9QQVJBTExFTDo9MAoKaW5jbHVkZSAkKElOQ0xVREVfRElSKS9wYWNrYWdlLm1rCgpk 
ZWZpbmUgUGFja2FnZS9jbGVhcnNpbHZlcgogIFNFQ1RJT046PWxpYnMKICBDQVRFR09SWTo9TGli cmFyaWVzCiAgVElUTEU6PUNsZWFyU2lsdmVyIHRlbXBsYXRlIHN5c3RlbQogIFVSTDo9aHR0cDov L3d3dy5jbGVhcnNpbHZlci5uZXQvCiAgTUFJTlRBSU5FUjo9UmFwaGHrbCBIVUNLIDxyaGtAY2tz dW0ub3JnPgogIERFUEVORFM6PStDTEVBUlNJTFZFUl9FTkFCTEVfQ09NUFJFU1NJT046emxpYiAr Q0xFQVJTSUxWRVJfRU5BQkxFX0dFVFRFWFQ6bGliaW50bAplbmRlZgoKZGVmaW5lIFBhY2thZ2Uv Y2xlYXJzaWx2ZXIvY29uZmlnCiAgbWVudSAiQ29uZmlndXJhdGlvbiIKICBkZXBlbmRzIG9uIFBB Q0tBR0VfY2xlYXJzaWx2ZXIKICBzb3VyY2UgIiQoU09VUkNFKS9Db25maWcuaW4iCiAgZW5kbWVu dQplbmRlZgoKZGVmaW5lIFBhY2thZ2UvY2xlYXJzaWx2ZXIvZGVzY3JpcHRpb24KQ2xlYXJzaWx2 ZXIgaXMgYSBmYXN0LCBwb3dlcmZ1bCwgYW5kIGxhbmd1YWdlLW5ldXRyYWwgSFRNTCB0ZW1wbGF0 ZSBzeXN0ZW0uIEluCmJvdGggc3RhdGljIGNvbnRlbnQgc2l0ZXMgYW5kIGR5bmFtaWMgSFRNTCBh cHBsaWNhdGlvbnMsIGl0IHByb3ZpZGVzIGEgc2VwYXJhdGlvbgpiZXR3ZWVuIHByZXNlbnRhdGlv biBjb2RlIGFuZCBhcHBsaWNhdGlvbiBsb2dpYyB3aGljaCBtYWtlcyB3b3JraW5nIHdpdGggeW91 cgpwcm9qZWN0IGVhc2llci4KZW5kZWYKCkNPTkZJR1VSRV9BUkdTKz0gXAoJLS1kaXNhYmxlLXdk YiBcCgktLWRpc2FibGUtYXBhY2hlIFwKCS0tZGlzYWJsZS1weXRob24gXAoJLS1kaXNhYmxlLXBl cmwgXAoJLS1kaXNhYmxlLXJ1YnkgXAoJLS1kaXNhYmxlLWphdmEgXAoJLS1kaXNhYmxlLWNzaGFy cCBcCgktLXByZWZpeD0vdXNyCgppZmVxICgkKFNESykkKENPTkZJR19DTEVBUlNJTFZFUl9FTkFC TEVfQ09NUFJFU1NJT04pLHkpCglDT05GSUdVUkVfQVJHUys9IFwKCQktLWVuYWJsZS1jb21wcmVz c2lvbgplbHNlCglDT05GSUdVUkVfQVJHUys9IFwKCQktLWRpc2FibGUtY29tcHJlc3Npb24KZW5k aWYKCmlmZXEgKCQoU0RLKSQoQ09ORklHX0NMRUFSU0lMVkVSX0VOQUJMRV9SRU1PVEVfREVCVUdH RVIpLHkpCglDT05GSUdVUkVfQVJHUys9IFwKCQktLWVuYWJsZS1yZW1vdGUtZGVidWdnZXIKZWxz ZQoJQ09ORklHVVJFX0FSR1MrPSBcCgkJLS1kaXNhYmxlLXJlbW90ZS1kZWJ1Z2dlcgplbmRpZgoK aWZlcSAoJChTREspJChDT05GSUdfQ0xFQVJTSUxWRVJfRU5BQkxFX0dFVFRFWFQpLHkpCglDT05G SUdVUkVfQVJHUys9IFwKCQktLWVuYWJsZS1nZXR0ZXh0CmVsc2UKCUNPTkZJR1VSRV9BUkdTKz0g XAoJCS0tZGlzYWJsZS1nZXR0ZXh0CmVuZGlmCgpUQVJHRVRfQ0ZMQUdTKz0kKEZQSUMpCgpNQUtF X0ZMQUdTKz0gXAoJJChUQVJHRVRfQ09ORklHVVJFX09QVFMpIFwKCUFSPSIkKEFSKSBjciIgXAoJ 
TEQ9IiQoVEFSR0VUX0NDKSAtbyIKCmRlZmluZSBCdWlsZC9JbnN0YWxsRGV2CgkkKENQKSAkKFBL R19JTlNUQUxMX0RJUikvKiAkKDEpLwplbmRlZgoKZGVmaW5lIFBhY2thZ2UvY2xlYXJzaWx2ZXIv aW5zdGFsbAoJJChJTlNUQUxMX0RJUikgJCgxKS91c3IvbGliCmVuZGVmCgokKGV2YWwgJChjYWxs IEJ1aWxkUGFja2FnZSxjbGVhcnNpbHZlcikpCg== --=-yg0+CMVRUiOHQ6Hli6jm-- From debbugs-submit-bounces@debbugs.gnu.org Thu May 07 12:23:43 2015 Received: (at 20526) by debbugs.gnu.org; 7 May 2015 16:23:44 +0000 Received: from localhost ([127.0.0.1]:38006 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqOaN-0000h6-At for submit@debbugs.gnu.org; Thu, 07 May 2015 12:23:43 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:55222) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqOaK-0000gs-HI for 20526@debbugs.gnu.org; Thu, 07 May 2015 12:23:41 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 776A5A6006F; Thu, 7 May 2015 09:23:34 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id s9X6qCcF64b0; Thu, 7 May 2015 09:23:33 -0700 (PDT) Received: from Penguin.CS.UCLA.EDU (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id CB4DEA60065; Thu, 7 May 2015 09:23:33 -0700 (PDT) Message-ID: <554B917F.80902@cs.ucla.edu> Date: Thu, 07 May 2015 09:23:27 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Sebastian Poehn , 20526@debbugs.gnu.org Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> In-Reply-To: <1430996888.2678.8.camel@googlemail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 20526 
X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

That file uses ISO 8859 encoding (presumably Latin-1 or Latin-9), so you need to grep it in a locale compatible with that encoding. It appears that you ran grep in a UTF-8 or other incompatible locale, which meant the ISO 8859 encoding wasn't valid and was treated as binary gibberish. You could try working around it with this:

   grep -a PKG_NAME Makefile

or this:

   LC_ALL=de_DE.iso885915 grep PKG_NAME Makefile

but in either case 'grep' might output the binary gibberish, which could cause other problems. So it might be better to change that non-ASCII character in the file's string "Raphaël" to use an encoding compatible with the encoding of your locale.

From debbugs-submit-bounces@debbugs.gnu.org Thu May 07 13:47:34 2015 Received: (at 20526) by debbugs.gnu.org; 7 May 2015 17:47:34 +0000 Received: from localhost ([127.0.0.1]:38036 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqPtV-0002hQ-3g for submit@debbugs.gnu.org; Thu, 07 May 2015 13:47:34 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:38498) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqPtT-0002hB-67 for 20526@debbugs.gnu.org; Thu, 07 May 2015 13:47:31 -0400 Received: by wiun10 with SMTP id n10so521542wiu.1 for <20526@debbugs.gnu.org>; Thu, 07 May 2015 10:47:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc:content-type; bh=a5yz5A3IKH5rfPEKvRKN4B9o0YbY6vMnGx/TnUIbPRg=; b=ye3p3s8NFMNlpzKFG58kGLxbNMqttfaY9nycpEFDkuTVxzsEh57UY4IcqB8RNTgz0s qoRwVDGmgTbLUy8GwBSFjkN2d6OsDvj2RKdIPy3c44DjKqZY1qtVbBAONM32GfOiegVC 
a9VnR8y8FWaXqGk1A8nSBLK/7M42o/QxgTp9dmtsQAeVR9Cj8gXnTQ7qsBSX9cKCRUDB Ph+qtIMkbXKHLCvXUHhi3ic8y9QxrXqM3swl0JtSWnlQwC4DBPSVFCPchdGURQsnW0wL U+l9N/5NFSUSsYwytXLFwlMM3FoYew9YuoM5oSEsWKhOJ7y7HVLrRyBuOoVMc1vXSn7n bPvA== MIME-Version: 1.0 X-Received: by 10.195.18.103 with SMTP id gl7mr9629795wjd.34.1431020845171; Thu, 07 May 2015 10:47:25 -0700 (PDT) Received: by 10.27.126.139 with HTTP; Thu, 7 May 2015 10:47:25 -0700 (PDT) Received: by 10.27.126.139 with HTTP; Thu, 7 May 2015 10:47:25 -0700 (PDT) In-Reply-To: <554B917F.80902@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <554B917F.80902@cs.ucla.edu> Date: Thu, 7 May 2015 19:47:25 +0200 Message-ID: Subject: Re: bug#20526: BUG: text file is detected as binary From: =?UTF-8?Q?Sebastian_P=C3=B6hn?= To: Paul Eggert Content-Type: multipart/alternative; boundary=001a11c28a82248a4405158181d5 X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 20526 Cc: =?UTF-8?Q?P=C3=B6hn=2C_Sebastian?= , 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

--001a11c28a82248a4405158181d5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit

Thanks for this fast feedback. Your explanation sounds very reasonable. As you may have noticed this a makefile out of openwrt with is mainlined there.

1) I downgraded to grep 2.20. Issue is gone with the same environment. So this is in my eyes a regression.

2) I will also open a report at fedora, maybe the use some strange setting in building the new packet.

3) I will send a short notice to openwrt asking if they think it is fine to use ë or ö. I personally have a strong opinion on that ;)

On 07.05.2015 at 6:23 p.m., "Paul Eggert" wrote:

> That file uses ISO 8859 encoding (presumably Latin-1 or Latin-9), so you
> need to grep it in a locale compatible with that encoding. It appears that
> you ran grep in a UTF-8 or other incompatible locale, which meant the ISO
> 8859 encoding wasn't valid and was treated as binary gibberish. You could
> try working around it with this:
>
> grep -a PKG_NAME Makefile
>
> or this:
>
> LC_ALL=de_DE.iso885915 grep PKG_NAME Makefile
>
> but in either case 'grep' might output the binary gibberish, which could
> cause other problems. So it might be better to change that non-ASCII
> character in the file's string "Raphaël" to use an encoding compatible with
> the encoding of your locale.
>

--001a11c28a82248a4405158181d5 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

--001a11c28a82248a4405158181d5-- From debbugs-submit-bounces@debbugs.gnu.org Thu May 07 15:11:55 2015 Received: (at 20526) by debbugs.gnu.org; 7 May 2015 19:11:55 +0000 Received: from localhost ([127.0.0.1]:38091 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqRD9-0006Gw-5W for submit@debbugs.gnu.org; Thu, 07 May 2015 15:11:55 -0400 Received: from mx1.redhat.com ([209.132.183.28]:54143) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqRD6-0006Gk-AU for 20526@debbugs.gnu.org; Thu, 07 May 2015 15:11:53 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t47JBoCA013699 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Thu, 7 May 2015 15:11:50 -0400 Received: from [10.3.113.103] (ovpn-113-103.phx2.redhat.com [10.3.113.103]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t47JBoYU011017; Thu, 7 May 2015 15:11:50 -0400 Message-ID: <554BB8F5.9020505@redhat.com> Date: Thu, 07 May 2015 13:11:49 -0600 From: Eric Blake Organization: Red Hat, Inc. 
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: =?UTF-8?B?U2ViYXN0aWFuIFDDtmhu?= , Paul Eggert Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> <554B917F.80902@cs.ucla.edu> In-Reply-To: OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="U7iKLWHo7kDMRA3wf1J4CJbs9W8wbBJH1" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org, =?UTF-8?Q?P=C3=B6hn@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----)

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)

--U7iKLWHo7kDMRA3wf1J4CJbs9W8wbBJH1 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit

On 05/07/2015 11:47 AM, Sebastian Pöhn wrote:
> Thanks for this fast feedback. Your explanation sounds very reasonable. As
> you may have noticed this a makefile out of openwrt with is mainlined there.
>
> 1) I downgraded to grep 2.20. Issue is gone with the same environment. So
> this is in my eyes a regression.

No, it is a bug fix, and documented in the NEWS:

  If a file contains data improperly encoded for the current locale, and this is discovered before any of the file's contents are output, grep now treats the file as binary.

>
> 2) I will also open a report at fedora, maybe the use some strange setting
> in building the new packet.

But as the change is intentional, there is probably nothing that Fedora would do about it.

>
> 3) I will send a short notice to openwrt asking if they think it is fine to
> use ë or ö. I personally have a strong opinion on that ;)

It would be fine if they would recode their file to use UTF-8, as that is pretty much a standard encoding these days. Latin-1 files are getting harder and harder to process, as more people move to multibyte UTF-8 locales.

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

--U7iKLWHo7kDMRA3wf1J4CJbs9W8wbBJH1 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBCAAGBQJVS7j1AAoJEKeha0olJ0Nq1sAH/RSAEz3HwN1c69ZDp8riL7gE
mSzxufRbra05AkE3GnX4S8DE6Dy6QjJT8qgfGeiPS9q9OYB6hOgtdbCpo+UGvgOC
FyLlOou/g1Mus4W+74WVKj9Dnmbh6pcD9W9C1MrISwzYH4wXe0Jy35B70Gih5uNW
d1Fr5ONoQ8VRJX9SzZfupssnmEXvT1SWOzhZD21S0Qj+Z6GSm038uOTCrCdJh8Nf
Jwvwf1UXlXXGECulBvj4HWp9c6HjyespAK5shM0sDrixQpyRINtm/NE0pgQx3YDZ
6W2wEOkKgNQSV7OlCK/eloQUiCTBHdvpzSqIsv4QNwWsK3fHY6nRjsQFj/skYZk=
=aamu
-----END PGP SIGNATURE-----

--U7iKLWHo7kDMRA3wf1J4CJbs9W8wbBJH1--

From debbugs-submit-bounces@debbugs.gnu.org Thu May 07 16:07:34 2015 Received: (at 20526) by debbugs.gnu.org; 7 May 2015 20:07:34 +0000 Received: from localhost ([127.0.0.1]:38101 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqS4z-0007cQ-Rz for submit@debbugs.gnu.org; Thu, 07 May 2015 16:07:34 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:38194) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YqS4x-0007cC-12 for 20526@debbugs.gnu.org; Thu, 07 May 2015 16:07:32 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 816ACA6007D; Thu, 7 May 2015 13:07:24 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) 
by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id Rby8YoWVb1qX; Thu, 7 May 2015 13:07:21 -0700 (PDT) Received: from Penguin.CS.UCLA.EDU (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 92B95A6007A; Thu, 7 May 2015 13:07:21 -0700 (PDT) Message-ID: <554BC5F8.8070400@cs.ucla.edu> Date: Thu, 07 May 2015 13:07:20 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: =?UTF-8?B?U2ViYXN0aWFuIFDDtmhu?= Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> <554B917F.80902@cs.ucla.edu> In-Reply-To: Content-Type: multipart/mixed; boundary="------------070506010006000609020804" X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) This is a multi-part message in MIME format. --------------070506010006000609020804 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit On 05/07/2015 10:47 AM, Sebastian Pöhn wrote: > > Thanks for this fast feedback. Your explanation sounds very > reasonable. As you may have noticed this a makefile out of openwrt > with is mainlined there. > > 1) I downgraded to grep 2.20. Issue is gone with the same environment. > So this is in my eyes a regression. > Not really, as Openwrt is relying on undefined behavior. The spec for grep has never defined what grep does when you feed it binary data that is not properly encoded for the current locale. Different versions of grep (and we're not just talking GNU grep here, but other implementations) do different things. 
Some grep implementations dump core. These behaviors all conform to the spec. (Well, GNU grep isn't supposed to dump core, but older versions of GNU grep are buggy and will dump core sometimes anyway, so you'll need good luck with them.)

> 2) I will also open a report at fedora, maybe the use some strange
> setting in building the new packet.
>

Nowadays most people are using UTF-8, so I suggest encoding the Makefiles in UTF-8 and specifying a UTF-8 locale when you build. Another possibility is the attached hack (I haven't tried it). The most conservative course would be to insist that Makefiles be ASCII, although ....

> 3) I will send a short notice to openwrt asking if they think it is
> fine to use ë or ö. I personally have a strong opinion on that ;)
>

Don't blame you a bit.

--------------070506010006000609020804 Content-Type: text/x-patch; name="openwrt.diff" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="openwrt.diff"

diff --git a/include/scan.mk b/include/scan.mk
index c2a8f7e..2a83924 100644
--- a/include/scan.mk
+++ b/include/scan.mk
@@ -55,7 +55,7 @@ endif
 
 $(FILELIST): $(OVERRIDELIST)
 	rm -f $(TMP_DIR)/info/.files-$(SCAN_TARGET)-*
-	$(call FIND_L, $(SCAN_DIR)) $(SCAN_EXTRA) -mindepth 1 $(if $(SCAN_DEPTH),-maxdepth $(SCAN_DEPTH)) -name Makefile | xargs grep -HE 'call $(GREP_STRING)' | sed -e 's#^$(SCAN_DIR)/##' -e 's#/Makefile:.*##' | uniq | awk -v of=$(OVERRIDELIST) -f include/scan.awk > $@
+	LC_ALL=C; export LC_ALL; $(call FIND_L, $(SCAN_DIR)) $(SCAN_EXTRA) -mindepth 1 $(if $(SCAN_DEPTH),-maxdepth $(SCAN_DEPTH)) -name Makefile | xargs grep -HE 'call $(GREP_STRING)' | sed -e 's#^$(SCAN_DIR)/##' -e 's#/Makefile:.*##' | uniq | awk -v of=$(OVERRIDELIST) -f include/scan.awk > $@
 
 $(TMP_DIR)/info/.files-$(SCAN_TARGET).mk: $(FILELIST)
 	( \

--------------070506010006000609020804--

From debbugs-submit-bounces@debbugs.gnu.org Fri May 08 03:29:31 2015 Received: (at submit) by debbugs.gnu.org; 8 May 2015 07:29:32 +0000 Received: from 
localhost ([127.0.0.1]:38232 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yqciw-0002Wr-U7 for submit@debbugs.gnu.org; Fri, 08 May 2015 03:29:31 -0400 Received: from eggs.gnu.org ([208.118.235.92]:46863) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yqciu-0002WH-Hd for submit@debbugs.gnu.org; Fri, 08 May 2015 03:29:28 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Yqcio-0005oL-EO for submit@debbugs.gnu.org; Fri, 08 May 2015 03:29:23 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.0 required=5.0 tests=BAYES_20 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:45580) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Yqcio-0005oH-C9 for submit@debbugs.gnu.org; Fri, 08 May 2015 03:29:22 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:60370) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Yqcin-00062c-Dk for bug-grep@gnu.org; Fri, 08 May 2015 03:29:22 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Yqcii-0005nm-4l for bug-grep@gnu.org; Fri, 08 May 2015 03:29:21 -0400 Received: from cantor2.suse.de ([195.135.220.15]:37181 helo=mx2.suse.de) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Yqcih-0005ne-VC for bug-grep@gnu.org; Fri, 08 May 2015 03:29:16 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 266E9AAC7; Fri, 8 May 2015 07:29:14 +0000 (UTC) Date: Fri, 8 May 2015 09:29:11 +0200 (CEST) From: Johannes Meixner To: Sebastian Poehn Subject: Re: bug#20526: BUG: text file is detected as binary In-Reply-To: <554B917F.80902@cs.ucla.edu> Message-ID: References: <1430996888.2678.8.camel@googlemail.com> <554B917F.80902@cs.ucla.edu> User-Agent: Alpine 2.00 
(LNX 1167 2008-08-23) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x (no timestamps) [generic] X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit Cc: bug-grep@gnu.org, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) Hello, only an addendum FYI: On May 7 09:23 Paul Eggert wrote (excerpt): > That file uses ISO 8859 encoding (presumably Latin-1 or Latin-9), > so you need to grep it in a locale compatible with that encoding. For some general information about that kind of issue have a look at https://en.opensuse.org/SDB:Plain_Text_versus_Locale Kind Regards Johannes Meixner -- SUSE LINUX GmbH - GF: Felix Imendoerffer, Jane Smithard, Jennifer Guild, Dilip Upmanyu, Graham Norton - HRB 21284 (AG Nuernberg) From debbugs-submit-bounces@debbugs.gnu.org Fri May 08 03:40:59 2015 Received: (at 20526) by debbugs.gnu.org; 8 May 2015 07:41:00 +0000 Received: from localhost ([127.0.0.1]:38238 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yqcu2-0002op-4S for submit@debbugs.gnu.org; Fri, 08 May 2015 03:40:59 -0400 Received: from mail-wi0-f181.google.com ([209.85.212.181]:37479) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yqcty-0002ob-Jm for 20526@debbugs.gnu.org; Fri, 08 May 2015 03:40:55 -0400 Received: by widdi4 with SMTP id di4so17812134wid.0 for <20526@debbugs.gnu.org>; Fri, 08 May 2015 00:40:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=from:message-id:subject:to:cc:date:in-reply-to:references 
:content-type:mime-version:content-transfer-encoding; bh=SDcBaW0I08K6qQTQR/fB1YpvtEWlIjF7lqzNwhykjSA=; b=VjFMz3NJYpDlruA+5QtO/QmliBBDR93/VWmwaOqVeNy9IiRgQfcAcNTX8Bzy+IVFtL ApTvbvLL2VGZR7MLLXce7+nalyEDw68PmnSd0DU2Jms4Z54VrMgeS3NciddNDxoLJ3un ZokKquiDayLxnJIrfIq7lttfnifca9T0yqDmvmQifcNEKS//8fHltfsk7SwJ+S3YfAsp X/NAduUnzvXvnZdglazVlVBUi80ZOLyEguy3GkzXIRkigXajFLimTkEevHcX4qFWcYhi XrVDWwiDq1p4CwrMHNnmw+Bh84/srJUmsVSeENIXku17xhXVmVpCH0HPb+OSRzS8saSl ShZQ== X-Received: by 10.180.188.35 with SMTP id fx3mr3737371wic.43.1431070848650; Fri, 08 May 2015 00:40:48 -0700 (PDT) Received: from de-ka-36785.green.sophos ([2001:1a80:2000:2:4637:e6ff:feaa:80c4]) by mx.google.com with ESMTPSA id ei8sm7047862wjd.32.2015.05.08.00.40.47 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 May 2015 00:40:47 -0700 (PDT) From: Sebastian Poehn X-Google-Original-From: Sebastian Poehn Message-ID: <1431070846.2678.20.camel@googlemail.com> Subject: Re: bug#20526: BUG: text file is detected as binary To: Paul Eggert Date: Fri, 08 May 2015 09:40:46 +0200 In-Reply-To: <554BC5F8.8070400@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <554B917F.80902@cs.ucla.edu> <554BC5F8.8070400@cs.ucla.edu> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.12.11 (3.12.11-1.fc21) Mime-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org, Sebastian =?ISO-8859-1?Q?P=F6hn?= , Eric Blake X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Thu, 2015-05-07 at 13:07 -0700, Paul Eggert wrote: > On 05/07/2015 10:47 AM, Sebastian Pöhn wrote: > > > > Thanks for this fast feedback. Your explanation sounds very > > reasonable. 
As you may have noticed this a makefile out of openwrt
> > with is mainlined there.
> >
> > 1) I downgraded to grep 2.20. Issue is gone with the same environment.
> > So this is in my eyes a regression.
> >
>
> Not really, as Openwrt is relying on undefined behavior. The spec for
> grep has never defined what grep does when you feed it binary data that
> is not properly encoded for the current locale. Different versions of
> grep (and we're not just talking GNU grep here, but other
> implementations) do different things. Some grep implementations dump
> core. These behaviors all conform to the spec. (Well, GNU grep isn't
> supposed to dump core, but older versions of GNU grep are buggy and will
> dump core sometimes anyway, so you'll need good luck with them.)

Ok, agree. It's not a regression. It's just that we got a little stricter.

>
> > 2) I will also open a report at fedora, maybe the use some strange
> > setting in building the new packet.
> >
>
> Nowadays most people are using UTF-8, so I suggest encoding the
> Makefiles in UTF-8 and specifying a UTF-8 locale when you build. Another
> possibility is the attached hack (I haven't tried it). The most
> conservative course would be to insist that Makefiles be ASCII, although
> ....

There is already a report for this. Let's see what they do.

>
> > 3) I will send a short notice to openwrt asking if they think it is
> > fine to use ë or ö. I personally have a strong opinion on that ;)
> >
>
> Don't blame you a bit.

I checked openwrt upstream. They changed all Makefiles not being ASCII to UTF-8 three months ago as they run into exactly this.
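
For readers who want to try the re-encoding discussed above, here is a minimal sketch, not taken from the thread. It assumes `iconv` and `mktemp` are installed and that the file really is ISO 8859-1. The names `sample.mk`, `sample.utf8.mk` and `is_utf8` are hypothetical, and `sample.mk` is only a two-line stand-in for the attached Makefile, with "Raphaël" written using the single Latin-1 byte 0xEB. It checks UTF-8 validity without depending on the current locale, converts the file, and checks again.

```shell
#!/bin/sh
# Sketch only: sample.mk, sample.utf8.mk and is_utf8 are hypothetical names.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Two-line stand-in for the attached Makefile. \353 is octal for 0xEB,
# the Latin-1 encoding of the letter e with diaeresis.
printf 'PKG_NAME:=clearsilver\n  MAINTAINER:=Rapha\353l HUCK\n' > sample.mk

# iconv exits non-zero when its input is not valid in the source encoding,
# so a UTF-8 to UTF-8 pass serves as a locale-independent validity check.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

if is_utf8 sample.mk; then before=valid; else before=invalid; fi

# Re-encode, assuming the source encoding is ISO 8859-1.
iconv -f ISO-8859-1 -t UTF-8 sample.mk > sample.utf8.mk

if is_utf8 sample.utf8.mk; then after=valid; else after=invalid; fi

# Count the lines that match in the converted file.
matches=$(grep -c PKG_NAME sample.utf8.mk)

echo "before=$before after=$after matches=$matches"
```

On a real tree the source encoding should be confirmed before converting; in the original report `file` identified the Makefile only as "ISO-8859 text", which does not distinguish Latin-1 from Latin-9.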
From debbugs-submit-bounces@debbugs.gnu.org Fri May 08 12:27:55 2015 Received: (at 20526) by debbugs.gnu.org; 8 May 2015 16:27:55 +0000 Received: from localhost ([127.0.0.1]:38924 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yql7y-0001sl-Mm for submit@debbugs.gnu.org; Fri, 08 May 2015 12:27:55 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:52472) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yql7w-0001sY-87 for 20526@debbugs.gnu.org; Fri, 08 May 2015 12:27:52 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 4A204A600B3; Fri, 8 May 2015 09:27:46 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 7kxdfZb2OzaP; Fri, 8 May 2015 09:27:45 -0700 (PDT) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id C339AA60007; Fri, 8 May 2015 09:27:45 -0700 (PDT) Message-ID: <554CE401.5090405@cs.ucla.edu> Date: Fri, 08 May 2015 09:27:45 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Sebastian Poehn Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> <554B917F.80902@cs.ucla.edu> <554BC5F8.8070400@cs.ucla.edu> <1431070846.2678.20.camel@googlemail.com> In-Reply-To: <1431070846.2678.20.camel@googlemail.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 20526 Cc: Eric Blake , 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , 
Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) Sebastian Poehn wrote: > They changed all Makefiles not being ASCII > to UTF-8 three months ago as they run into exactly this. Hah! Great minds think alike. But they missed a few files (not Makefiles). The following shell command finds every openwrt file that's not UTF-8 (and isn't obviously binary). It works because '.' matches only properly-encoded characters. You may need a new GNU grep for this command to be reliable. LC_ALL=en_US.utf8 grep -lv '^.*$' \ $(git ls-files | grep -Ev '\.(patch|bin|squashfs)$') From debbugs-submit-bounces@debbugs.gnu.org Mon May 11 07:05:31 2015 Received: (at 20526) by debbugs.gnu.org; 11 May 2015 11:05:31 +0000 Received: from localhost ([127.0.0.1]:40604 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YrlWc-0002kI-MP for submit@debbugs.gnu.org; Mon, 11 May 2015 07:05:31 -0400 Received: from mx1.redhat.com ([209.132.183.28]:47665) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YrlWZ-0002k6-OR; Mon, 11 May 2015 07:05:28 -0400 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) by mx1.redhat.com (Postfix) with ESMTPS id ABCB572; Mon, 11 May 2015 11:05:26 +0000 (UTC) Received: from kdudka.brq.redhat.com (kdudka.brq.redhat.com [10.34.4.67]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t4BB5OHQ003871 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO); Mon, 11 May 2015 07:05:26 -0400 From: Kamil Dudka To: Eric Blake Subject: Re: bug#20526: BUG: text file is detected as binary Date: Mon, 11 May 2015 13:05:23 +0200 Message-ID: <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> User-Agent: KMail/4.14.7 (Linux/4.0.1-300.fc22.x86_64; KDE/4.14.7; x86_64; ; ) In-Reply-To: <554BB8F5.9020505@redhat.com> References: <1430996888.2678.8.camel@googlemail.com> <554BB8F5.9020505@redhat.com> MIME-Version: 1.0 
Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="iso-8859-1" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.26 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org, Paul Eggert , Sebastian =?ISO-8859-1?Q?P=F6hn?= , debbugs-submit@debbugs.gnu.org, =?UTF-8?Q?P=C3=B6hn@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----)

On Thursday 07 May 2015 13:11:49 Eric Blake wrote:
> On 05/07/2015 11:47 AM, Sebastian Pöhn wrote:
> > Thanks for this fast feedback. Your explanation sounds very reasonable. As
> > you may have noticed this a makefile out of openwrt with is mainlined
> > there.
> >
> > 1) I downgraded to grep 2.20. Issue is gone with the same environment. So
> > this is in my eyes a regression.
>
> No, it is a bug fix, and documented in the NEWS:
>
> If a file contains data improperly encoded for the current locale,
> and this is discovered before any of the file's contents are output,
> grep now treats the file as binary.

Which bug does it fix?

The upstream commit in question (cd36abd4) does not refer to any bug report.
Also the fact that the commit had to change existing regression tests to
prevent them from failing suggests that it can be seen as a regression.

> > 2) I will also open a report at fedora, maybe the use some strange setting
> > in building the new packet.
>
> But as the change is intentional, there is probably nothing that Fedora
> would do about it.

I already created a bug for Fedora:

https://bugzilla.redhat.com/1219141

Kamil

> > 3) I will send a short notice to openwrt asking if they think it is fine
> > to
> > use ë or ö.
> > I personally have a strong opinion on that ;)
>
> It would be fine if they would recode their file to use UTF-8, as that
> is pretty much a standard encoding these days. Latin-1 files are
> getting harder and harder to process, as more people move to multibyte
> UTF-8 locales.

From debbugs-submit-bounces@debbugs.gnu.org Tue May 12 00:27:44 2015 Received: (at 20526) by debbugs.gnu.org; 12 May 2015 04:27:44 +0000 Received: from localhost ([127.0.0.1]:41438 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ys1nE-00069L-5a for submit@debbugs.gnu.org; Tue, 12 May 2015 00:27:44 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:36715) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ys1nC-000694-5u; Tue, 12 May 2015 00:27:43 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 2B15DA6000E; Mon, 11 May 2015 21:27:36 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id NkDCvhx9hjBW; Mon, 11 May 2015 21:27:35 -0700 (PDT) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 685A5A6000D; Mon, 11 May 2015 21:27:35 -0700 (PDT) Message-ID: <55518137.60706@cs.ucla.edu> Date: Mon, 11 May 2015 21:27:35 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Kamil Dudka , Eric Blake Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> <554BB8F5.9020505@redhat.com> <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> In-Reply-To: <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3
(--) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org, =?UTF-8?B?U2ViYXN0aWFuIFDDtmhu?= , debbugs-submit@debbugs.gnu.org, =?UTF-8?Q?P=C3=B6hn@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) Kamil Dudka wrote: > Which bug does it fix? I don't recall a bug report being filed for it, but the old grep behavior had real problems: as I remember at times it dumped core, and at other times it spit out improperly encoded data to the terminal. We've fixed the core dumps I know about, though I think grep still outputs improperly encoded data at times (and this should get fixed too -- see below for a suggestion). At any rate, applications could never assume a particular behavior for improperly encoded files, so the current behavior is clearly not a bug. Users may be able to scrape along by setting LC_ALL=C before running 'grep' -- the problems LC_ALL=C runs into are about the same as the problems with using old 'grep' (except that the new grep doesn't dump core :-). Perhaps we can improve the behavior of grep by changing its heuristic slightly. Currently grep reports "Binary file FOO matches" if it finds binary data in FOO before it finds the first match. Instead, perhaps we could change grep to report "Binary file FOO matches" when it sees that it's about to generate binary *output* copied from FOO, regardless of whether this output represents the first match. That is, when grep sees that it's about to output binary data, grep instead outputs "Binary file FOO matches" and then stops output for FOO (even if it already output some lines for ordinary matches in FOO). 
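
The heuristic proposed in the paragraph above can be modelled in a few lines of Python. This is a rough sketch and not grep's implementation: it assumes a UTF-8 locale, matches with Python's byte-oriented re module instead of grep's matchers, ignores NUL bytes, and the names has_encoding_error and grep_file are invented for the example.

```python
import re
from typing import Iterator


def has_encoding_error(line: bytes) -> bool:
    """Return True if line is not valid UTF-8 (stand-in for the locale check)."""
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


def grep_file(pattern: bytes, data: bytes, name: str) -> Iterator[str]:
    """Yield matching lines of data; give up on the first *matching* line
    that would put an encoding error on the output stream."""
    regex = re.compile(pattern)
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after a trailing newline
    for line in lines:
        if not regex.search(line):
            continue  # bad bytes in non-matching lines are never output
        if has_encoding_error(line):
            yield f"Binary file {name} matches"
            return  # stop output for this file, as proposed
        yield line.decode("utf-8")
```

With this model, a file that is mostly valid text still yields its ordinary matches up to the first match that contains a bad byte, and badly encoded lines that do not match the pattern have no effect at all.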
This approach would fix the problem of grep trashing the output stream, and it should be less drastic than grep's current approach, in that it would make grep more likely to do what Kamil Dudka is asking for (assuming grep is given mostly valid input interspersed with small amounts of binary data). From debbugs-submit-bounces@debbugs.gnu.org Tue May 12 04:42:05 2015 Received: (at 20526) by debbugs.gnu.org; 12 May 2015 08:42:06 +0000 Received: from localhost ([127.0.0.1]:41612 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ys5lN-0005iN-BW for submit@debbugs.gnu.org; Tue, 12 May 2015 04:42:05 -0400 Received: from mx1.redhat.com ([209.132.183.28]:45215) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ys5lL-0005hg-1o; Tue, 12 May 2015 04:42:04 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t4C8fu5a018246 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Tue, 12 May 2015 04:41:56 -0400 Received: from kdudka.brq.redhat.com (kdudka.brq.redhat.com [10.34.4.67]) by int-mx11.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t4C8fsl2032643 (version=TLSv1/SSLv3 cipher=AES256-SHA bits=256 verify=NO); Tue, 12 May 2015 04:41:55 -0400 From: Kamil Dudka To: Paul Eggert , Eric Blake Subject: Re: bug#20526: BUG: text file is detected as binary Date: Tue, 12 May 2015 10:41:53 +0200 Message-ID: <2410990.OH3v9jzhSG@kdudka.brq.redhat.com> User-Agent: KMail/4.14.7 (Linux/4.0.1-300.fc22.x86_64; KDE/4.14.7; x86_64; ; ) In-Reply-To: <55518137.60706@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> <55518137.60706@cs.ucla.edu> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.24 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 20526 Cc: 
20526@debbugs.gnu.org, Sebastian =?ISO-8859-1?Q?P=F6hn?= , debbugs-submit@debbugs.gnu.org, =?UTF-8?Q?P=C3=B6hn@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) On Monday 11 May 2015 21:27:35 Paul Eggert wrote: > Perhaps we can improve the behavior of grep by changing its heuristic > slightly. Currently grep reports "Binary file FOO matches" if it finds > binary data in FOO before it finds the first match. Instead, perhaps we > could change grep to report "Binary file FOO matches" when it sees that > it's about to generate binary *output* copied from FOO, regardless of > whether this output represents the first match. That is, when grep sees > that it's about to output binary data, grep instead outputs "Binary file > FOO matches" and then stops output for FOO (even if it already output some > lines for ordinary matches in FOO). > > This approach would fix the problem of grep trashing the output stream, and > it should be less drastic than grep's current approach, in that it would > make grep more likely to do what Kamil Dudka is asking for (assuming grep > is given mostly valid input interspersed with small amounts of binary > data). Thanks for the suggestion! I believe that such approach would work for me. Do you want me to write a patch implementing it? Eric, what do you think about the change proposed above? 
Kamil From debbugs-submit-bounces@debbugs.gnu.org Tue May 12 08:06:26 2015 Received: (at 20526) by debbugs.gnu.org; 12 May 2015 12:06:26 +0000 Received: from localhost ([127.0.0.1]:41731 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ys8x8-0003rr-0J for submit@debbugs.gnu.org; Tue, 12 May 2015 08:06:26 -0400 Received: from mx1.redhat.com ([209.132.183.28]:55965) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Ys8x5-0003rf-Kj; Tue, 12 May 2015 08:06:24 -0400 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id t4CC6EFa011140 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Tue, 12 May 2015 08:06:15 -0400 Received: from [10.3.113.96] (ovpn-113-96.phx2.redhat.com [10.3.113.96]) by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id t4CC6DMs011834; Tue, 12 May 2015 08:06:13 -0400 Message-ID: <5551ECB5.8050007@redhat.com> Date: Tue, 12 May 2015 06:06:13 -0600 From: Eric Blake Organization: Red Hat, Inc. 
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Kamil Dudka , Paul Eggert Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> <55518137.60706@cs.ucla.edu> <2410990.OH3v9jzhSG@kdudka.brq.redhat.com> In-Reply-To: <2410990.OH3v9jzhSG@kdudka.brq.redhat.com> OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="vfbN8UHajfQHHDkNSgHVNIpe8nFjJFqOR" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.27 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org, =?UTF-8?B?U2ViYXN0aWFuIFDDtmhu?= , debbugs-submit@debbugs.gnu.org, =?UTF-8?Q?P=C3=B6hn@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --vfbN8UHajfQHHDkNSgHVNIpe8nFjJFqOR Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable

On 05/12/2015 02:41 AM, Kamil Dudka wrote:
> On Monday 11 May 2015 21:27:35 Paul Eggert wrote:
>> Perhaps we can improve the behavior of grep by changing its heuristic
>> slightly. Currently grep reports "Binary file FOO matches" if it finds
>> binary data in FOO before it finds the first match. Instead, perhaps we
>> could change grep to report "Binary file FOO matches" when it sees that
>> it's about to generate binary *output* copied from FOO, regardless of
>> whether this output represents the first match.
>> That is, when grep sees
>> that it's about to output binary data, grep instead outputs "Binary file
>> FOO matches" and then stops output for FOO (even if it already output some
>> lines for ordinary matches in FOO).
>>
>> This approach would fix the problem of grep trashing the output stream, and
>> it should be less drastic than grep's current approach, in that it would
>> make grep more likely to do what Kamil Dudka is asking for (assuming grep
>> is given mostly valid input interspersed with small amounts of binary
>> data).
>
> Thanks for the suggestion! I believe that such approach would work for me.
> Do you want me to write a patch implementing it?
>
> Eric, what do you think about the change proposed above?

I'm still a bit worried that encoding errors encountered on input, even
though they don't match for output, may still cause issues for some
patterns (we've had cases of encoding errors causing 'grep -P' to go
into an infinite loop, for example); but yes, as the behavior is
undefined, we are still justified in adopting those heuristics, if
someone is willing to contribute a patch along those lines.
--=20 Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --vfbN8UHajfQHHDkNSgHVNIpe8nFjJFqOR Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCAAGBQJVUey1AAoJEKeha0olJ0Nqs3oIAI7n9Jvyi2pO+XceyzyUw9vl p66eQTI/bqf+QhLmVb++85aqfEaxUFNpNl1oXgKrxI/lK7zDplVPlMZvWYkeEAbZ bO4C4mdYk236vLiVg8CgD3DdqbOh5IycyMQEb0nCygF369H+naqW83fkEdMW0ZXs 2bbSSmwnge6JfMbkZuDtuNSRLUQ550aPwE4+5RXykdTX0Qoscq28aDVSNUuFCaFO ZXDJLPxlxP/HneffWlnDlvOeGG5+IAfOQQAnNPyEmrq+24ntJpV+zlGBaOjs5vsE I/E05/t9Slcz8KIBWYV4ysllD/cHHkedoQITvQ4TSBKjJE1uxkfsgwG4P7eZQ1U= =f+ss -----END PGP SIGNATURE----- --vfbN8UHajfQHHDkNSgHVNIpe8nFjJFqOR-- From debbugs-submit-bounces@debbugs.gnu.org Tue May 12 20:08:57 2015 Received: (at 20526) by debbugs.gnu.org; 13 May 2015 00:08:58 +0000 Received: from localhost ([127.0.0.1]:42780 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YsKEL-0004GB-BB for submit@debbugs.gnu.org; Tue, 12 May 2015 20:08:57 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:54139) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YsKEI-0004Fs-Dj; Tue, 12 May 2015 20:08:55 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id DBD6EA60026; Tue, 12 May 2015 17:08:47 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id zrizPc1Omn-s; Tue, 12 May 2015 17:08:47 -0700 (PDT) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 37B72A60023; Tue, 
12 May 2015 17:08:47 -0700 (PDT) Message-ID: <5552960A.7090503@cs.ucla.edu> Date: Tue, 12 May 2015 17:08:42 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0 MIME-Version: 1.0 To: Eric Blake , Kamil Dudka Subject: Re: bug#20526: BUG: text file is detected as binary References: <1430996888.2678.8.camel@googlemail.com> <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> <55518137.60706@cs.ucla.edu> <2410990.OH3v9jzhSG@kdudka.brq.redhat.com> <5551ECB5.8050007@redhat.com> In-Reply-To: <5551ECB5.8050007@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 20526 Cc: 20526@debbugs.gnu.org, =?UTF-8?B?U2ViYXN0aWFuIFDDtmhu?= , debbugs-submit@debbugs.gnu.org, =?UTF-8?Q?P=C3=B6hn@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) Eric Blake wrote: > I'm still a bit worried that encoding errors encountered on input, even > though they don't match for output, may still cause issues for some > patterns (we've had cases of encoding errors causing 'grep -P' to go > into an infinite loop, for example); Yes, that's right. We can't go back to the old way of doing things. Encoding errors in the data must not be matched by any regular expression (not even "."). 'grep -P' won't loop if we never pass encoding errors to the PCRE matcher, so that's what we gotta do. > but yes, as the behavior is > undefined, we are still justified in adopting those heuristics, if > someone is willing to contribute a patch along those lines. 
The hard part about it (and the reason I haven't written up a patch yet) is making sure the above property holds, while continuing to have good performance in the typical case where the input is validly encoded. I suppose it's OK, though, if the change hurts performance only for the -P case, since -P is so slow anyway. From debbugs-submit-bounces@debbugs.gnu.org Wed May 20 20:49:44 2015 Received: (at submit) by debbugs.gnu.org; 21 May 2015 00:49:44 +0000 Received: from localhost ([127.0.0.1]:51605 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YvEgB-0003CM-CK for submit@debbugs.gnu.org; Wed, 20 May 2015 20:49:43 -0400 Received: from eggs.gnu.org ([208.118.235.92]:57052) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YvEL8-0002f3-5S for submit@debbugs.gnu.org; Wed, 20 May 2015 20:27:58 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YvEL2-0000SL-1S for submit@debbugs.gnu.org; Wed, 20 May 2015 20:27:52 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:45901) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YvEL1-0000SH-Un for submit@debbugs.gnu.org; Wed, 20 May 2015 20:27:51 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:42317) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YvEL0-0003WD-UG for bug-grep@gnu.org; Wed, 20 May 2015 20:27:51 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1YvEKx-0000RI-Ng for bug-grep@gnu.org; Wed, 20 May 2015 20:27:50 -0400 Received: from mailer.hiddenmail.net ([199.195.249.9]:46146) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1YvEKx-0000RA-Kj for bug-grep@gnu.org; Wed, 20 May 2015 20:27:47 -0400 Received: from mailer by mailer.hiddenmail.net with local 
(Exim 4.80) (envelope-from ) id 1YvEKv-0005fv-UH for bug-grep@gnu.org; Thu, 21 May 2015 02:27:46 +0200 Message-ID: <1432168063.1854.21.camel@16bits.net> Subject: Re: bug#20526: BUG: text file is detected as binary From: =?ISO-8859-1?Q?=C1ngel_Gonz=E1lez?= To: bug-grep@gnu.org Date: Thu, 21 May 2015 02:27:43 +0200 In-Reply-To: <55518137.60706@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <554BB8F5.9020505@redhat.com> <3109063.HrMoyCBUhY@kdudka.brq.redhat.com> <55518137.60706@cs.ucla.edu> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Mailer: Evolution 3.16.2.1 Mime-Version: 1.0 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.1 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Wed, 20 May 2015 20:49:41 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.1 (----)

Paul Eggert wrote:
> Perhaps we can improve the behavior of grep by changing its heuristic
> slightly.
> Currently grep reports "Binary file FOO matches" if it finds binary
> data in FOO before it finds the first match. Instead, perhaps we
> could change grep to report "Binary file FOO matches" when it sees
> that it's about to generate binary *output* copied from FOO,
> regardless of whether this output represents the first match. That
> is, when grep sees that it's about to output binary
> data, grep instead outputs "Binary file FOO matches" and then stops
> output for FOO (even if it already output some lines for ordinary
> matches in FOO).

Another option would be to escape the problematic binary data (but how
to escape the escape char?)
or maybe even replace it with U+FFFD if our output is UTF-8 (this has its
own sort of problems when trying to determine what was really matched,
though).

> This approach would fix the problem of grep trashing the output
> stream, and it should be less drastic than grep's current approach,
> in that it would make grep more likely to do what Kamil Dudka is
> asking for (assuming grep is given mostly valid input interspersed
> with small amounts of binary data).

+1

When grep is the last component of a pipeline, it isn't too bad. The
danger comes when grep sits in the middle of a pipeline instead.
Sebastian's Makefile is one such case. Another silly example: we might
have a list of people and be interested in knowing how many of their
names begin with J (but excluding pseudonyms):

printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > defendants-2015-05-15
grep ^J defendants-2015-05-* | sort -u | grep -vc "John Doe"

works perfectly, until the day someone provides an incorrectly encoded
entry:

printf 'Pedro P\xe9rez\n' >> defendants-2015-05-15

and havoc ensues.
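
The two alternatives floated earlier in this message (escaping the bad bytes, or substituting U+FFFD) can be tried out with Python's standard decoder error handlers. This is just a sketch to make the trade-offs concrete, assuming UTF-8 output; grep offers no such mode in this thread, and the function names are hypothetical.

```python
def replace_invalid(raw: bytes) -> str:
    """Substitute U+FFFD for each byte sequence that is not valid UTF-8."""
    return raw.decode("utf-8", errors="replace")


def escape_invalid(raw: bytes) -> str:
    r"""Render invalid bytes as \xNN escapes.

    Literal backslashes in the input are NOT escaped, which is exactly
    the "how to escape the escape char?" problem.
    """
    return raw.decode("utf-8", errors="backslashreplace")


if __name__ == "__main__":
    bad = b"Pedro P\xe9rez"          # Latin-1 e-acute, invalid as UTF-8
    literal = b"Pedro P\\xe9rez"     # valid ASCII containing a real backslash
    print(replace_invalid(bad))      # the bad byte becomes U+FFFD
    print(escape_invalid(bad))       # the bad byte becomes the text \xe9
    # True: after escaping, the two different inputs look identical.
    print(escape_invalid(bad) == escape_invalid(literal))
```

Either way the output stream stays well-formed, but the replacement variant loses the original byte and the escaping variant is ambiguous unless backslashes are escaped too, which would change the output for perfectly valid input.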
It's something that should never happen, but someone else prepared the file for you, or it comes from a third party (and sometimes it only makes sense for them to be ANSI, yet one day there are unencoded high bytes) From debbugs-submit-bounces@debbugs.gnu.org Sat May 30 16:04:40 2015 Received: (at control) by debbugs.gnu.org; 30 May 2015 20:04:41 +0000 Received: from localhost ([127.0.0.1]:33783 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yymzo-0007X7-Fr for submit@debbugs.gnu.org; Sat, 30 May 2015 16:04:40 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:53645) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Yymzm-0007Wi-MZ for control@debbugs.gnu.org; Sat, 30 May 2015 16:04:39 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 3C93E39E801B for ; Sat, 30 May 2015 13:04:33 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 9SwSbBss7Jia for ; Sat, 30 May 2015 13:04:32 -0700 (PDT) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 4CC9C39E8016 for ; Sat, 30 May 2015 13:04:32 -0700 (PDT) Message-ID: <556A17D0.4000303@cs.ucla.edu> Date: Sat, 30 May 2015 13:04:32 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.7.0 MIME-Version: 1.0 To: control@debbugs.gnu.org Subject: grep bug maintainance Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: 
debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) tag 20605 notabug close 20605 severity 20657 wishlist tag 20638 notabug close 20638 merge 20526 19985 19230 tag 19837 notabug close 19837 merge 16444 19777 close 19563 close 19486 tag 19330 notabug close 19330 tag 19193 notabug close 19193 tag 19071 notabug close 19071 tag 19005 notabug close 19005 close 19000 tag 18888 notabug close 18888 From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 25 14:04:27 2015 Received: (at control) by debbugs.gnu.org; 25 Sep 2015 18:04:27 +0000 Received: from localhost ([127.0.0.1]:43731 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1ZfXMB-0002Vk-HL for submit@debbugs.gnu.org; Fri, 25 Sep 2015 14:04:27 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:60076) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1ZfXM9-0002VX-Go for control@debbugs.gnu.org; Fri, 25 Sep 2015 14:04:25 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id E6EEF161131 for ; Fri, 25 Sep 2015 11:04:19 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 4Eb4D0PhnFAA for ; Fri, 25 Sep 2015 11:04:19 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 573ED1611B3 for ; Fri, 25 Sep 2015 11:04:19 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id ETL_bYe9AWli for ; Fri, 25 Sep 2015 11:04:19 -0700 (PDT) Received: from Penguin.CS.UCLA.EDU (Penguin.CS.UCLA.EDU [131.179.64.200]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 4188C161131 for ; Fri, 25 Sep 2015 11:04:19 -0700 (PDT) To: control@debbugs.gnu.org From: Paul Eggert Subject: merge 21558 into 20526 Organization: UCLA Computer Science 
Department Message-ID: <56058CA3.2010804@cs.ucla.edu> Date: Fri, 25 Sep 2015 11:04:19 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) merge 20526 21558 thanks From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 30 22:25:18 2015 Received: (at 20526-done) by debbugs.gnu.org; 31 Dec 2015 03:25:18 +0000 Received: from localhost ([127.0.0.1]:50806 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aETrZ-00052r-Dh for submit@debbugs.gnu.org; Wed, 30 Dec 2015 22:25:18 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35453) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aETrW-00052W-6b for 20526-done@debbugs.gnu.org; Wed, 30 Dec 2015 22:25:15 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 6DD67160ED6; Wed, 30 Dec 2015 19:25:07 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id EC6ZTSMTQvEi; Wed, 30 Dec 2015 19:25:05 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 35149160ED9; Wed, 30 Dec 2015 19:25:05 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id jlokzvghP-Z1; Wed, 30 Dec 2015 19:25:05 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by 
zimbra.cs.ucla.edu (Postfix) with ESMTPSA id E9853160ED6; Wed, 30 Dec 2015 19:25:04 -0800 (PST) To: 20526-done@debbugs.gnu.org From: Paul Eggert Organization: UCLA Computer Science Department Subject: Re: grep BUG: text file is detected as binary Message-ID: <5684A010.4000302@cs.ucla.edu> Date: Wed, 30 Dec 2015 19:25:04 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------------020606020907080809070705" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526-done Cc: Kamil Dudka , Benno Schulenberg , Mike Frysinger , Johannes Meixner , Hans Pelleboer , Sebastian Poehn , =?UTF-8?Q?=c3=81ngel_Gonz=c3=a1lez?= , Eric Blake X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. --------------020606020907080809070705 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit I installed into Savannah a patch (attached) that should fix this problem in typical cases, and am boldly marking the bug as done. Please give the fix a try if you have the time. Thanks. --------------020606020907080809070705 Content-Type: text/x-diff; name="0001-grep-be-less-picky-about-encoding-errors.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-grep-be-less-picky-about-encoding-errors.patch" >From ba23b4ee721750399ede8933cf472e0c6aa6e37f Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Wed, 30 Dec 2015 19:10:14 -0800 Subject: [PATCH] grep: be less picky about encoding errors This fixes a longstanding problem introduced in grep 2.21, which is overly picky about binary files. * NEWS: * doc/grep.texi (File and Directory Selection): Document this. 
* src/grep.c (input_textbin, textbin_is_binary, buffer_textbin) (file_textbin): Remove. All uses removed. (encoding_error_output): New static var. (buf_has_encoding_errors, buf_has_nulls, file_must_have_nulls): New functions, which reuse bits and pieces of the removed functions. (lastout, print_line_head, print_line_middle, print_line_tail, prline) (prpending, prtext, grepbuf): Avoid use of const, now that we have functions that require modifying a sentinel. (print_line_head): New arg LEN. All uses changed. (print_line_head, print_line_tail): Return indicator whether the output line was printed. All uses changed. (print_line_middle): Exit early on encoding error. (grep): Use new method for determining whether file is binary. * src/grep.h (enum textbin, TEXTBIN_BINARY, TEXTBIN_UNKNOWN) (TEXTBIN_TEXT, input_textbin): Remove decls. All uses removed. * src/pcresearch.c (Pexecute): Remove multiline optimization, since the main program no longer checks for encoding errors on input. * tests/encoding-error: New file. * tests/Makefile.am (TESTS): Add it. --- NEWS | 7 ++ doc/grep.texi | 11 +-- src/grep.c | 221 ++++++++++++++++++++++++++------------------------- src/grep.h | 18 ----- src/pcresearch.c | 56 ++----------- tests/Makefile.am | 1 + tests/encoding-error | 41 ++++++++++ 7 files changed, 174 insertions(+), 181 deletions(-) create mode 100755 tests/encoding-error diff --git a/NEWS b/NEWS index 4e54b49..a14597f 100644 --- a/NEWS +++ b/NEWS @@ -4,6 +4,13 @@ GNU grep NEWS -*- outline -*- ** Bug fixes + Binary files are now less likely to generate diagnostics. grep now + reports "Binary file FOO matches" and suppresses further output when + grep is about to output a match that contains an encoding error. + Formerly, grep reported FOO to be binary merely because grep found + an encoding error in FOO before generating output for FOO. + [bug introduced in grep-2.21] + grep -oP is no longer susceptible to an infinite loop when processing invalid UTF8 just before a match. 
[bug introduced in grep-2.22] diff --git a/doc/grep.texi b/doc/grep.texi index 76c7f46..58e7f48 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -596,13 +596,13 @@ If a file's allocation metadata, or if its data read before a line is selected for output, indicate that the file contains binary data, assume that the file is of type @var{type}. -Non-text bytes indicate binary data; these are either data bytes -improperly encoded for the current locale, or null bytes when the +Non-text bytes indicate binary data; these are either output bytes that are +improperly encoded for the current locale, or null input bytes when the @option{-z} (@option{--null-data}) option is not given (@pxref{Other Options}). -By default, @var{type} is @samp{binary}, -and @command{grep} normally outputs either +By default, @var{type} is @samp{binary}, and when @command{grep} +discovers that a file is binary it normally outputs either a one-line message saying that a binary file matches, or no message if there is no match. When processing binary data, @command{grep} may treat non-text bytes @@ -611,7 +611,8 @@ not match a null byte, as the null byte might be treated as a line terminator even without the @option{-z} (@option{--null-data}) option. If @var{type} is @samp{without-match}, -@command{grep} assumes that a binary file does not match; +when @command{grep} discovers that a file is binary +it assumes that the rest of the file does not match; this is equivalent to the @option{-I} option. If @var{type} is @samp{text}, diff --git a/src/grep.c b/src/grep.c index 19ba208..e059a46 100644 --- a/src/grep.c +++ b/src/grep.c @@ -377,7 +377,6 @@ bool match_icase; bool match_words; bool match_lines; char eolbyte; -enum textbin input_textbin; static char const *matcher; @@ -389,6 +388,10 @@ static bool omit_dot_slash; static bool errseen; static bool write_error_seen; +/* True if output from the current input file has been suppressed + because an output line had an encoding error. 
*/ +static bool encoding_error_output; + enum directories_type { READ_DIRECTORIES = 2, @@ -481,12 +484,6 @@ clean_up_stdout (void) close_stdout (); } -static bool -textbin_is_binary (enum textbin textbin) -{ - return textbin < TEXTBIN_UNKNOWN; -} - /* The high-order bit of a byte. */ enum { HIBYTE = 0x80 }; @@ -551,58 +548,60 @@ skip_easy_bytes (char const *buf) return p; } -/* Return the text type of data in BUF, of size SIZE. +/* Return true if BUF, of size SIZE, has an encoding error. BUF must be followed by at least sizeof (uword) bytes, - which may be arbitrarily written to or read from. */ -static enum textbin -buffer_textbin (char *buf, size_t size) + the first of which may be modified. */ +static bool +buf_has_encoding_errors (char *buf, size_t size) { - if (eolbyte && memchr (buf, '\0', size)) - return TEXTBIN_BINARY; + if (MB_CUR_MAX <= 1) + return false; - if (1 < MB_CUR_MAX) - { - mbstate_t mbs = { 0 }; - size_t clen; - char const *p; + mbstate_t mbs = { 0 }; + size_t clen; - buf[size] = -1; - for (p = buf; (p = skip_easy_bytes (p)) < buf + size; p += clen) - { - clen = mbrlen (p, buf + size - p, &mbs); - if ((size_t) -2 <= clen) - return clen == (size_t) -2 ? TEXTBIN_UNKNOWN : TEXTBIN_BINARY; - } + buf[size] = -1; + for (char const *p = buf; (p = skip_easy_bytes (p)) < buf + size; p += clen) + { + clen = mbrlen (p, buf + size - p, &mbs); + if ((size_t) -2 <= clen) + return true; } - return TEXTBIN_TEXT; + return false; } -/* Return the text type of a file. BUF, of size SIZE, is the initial - buffer read from the file with descriptor FD and status ST. - BUF must be followed by at least sizeof (uword) bytes, + +/* Return true if BUF, of size SIZE, has a null byte. + BUF must be followed by at least one byte, which may be arbitrarily written to or read from. 
*/ -static enum textbin -file_textbin (char *buf, size_t size, int fd, struct stat const *st) +static bool +buf_has_nulls (char *buf, size_t size) { - enum textbin textbin = buffer_textbin (buf, size); - if (textbin_is_binary (textbin)) - return textbin; + buf[size] = 0; + return strlen (buf) != size; +} +/* Return true if a file is known to contain null bytes. + SIZE bytes have already been read from the file + with descriptor FD and status ST. */ +static bool +file_must_have_nulls (size_t size, int fd, struct stat const *st) +{ if (usable_st_size (st)) { if (st->st_size <= size) - return textbin == TEXTBIN_UNKNOWN ? TEXTBIN_BINARY : textbin; + return false; /* If the file has holes, it must contain a null byte somewhere. */ - if (SEEK_HOLE != SEEK_SET && eolbyte) + if (SEEK_HOLE != SEEK_SET) { off_t cur = size; if (O_BINARY || fd == STDIN_FILENO) { cur = lseek (fd, 0, SEEK_CUR); if (cur < 0) - return TEXTBIN_UNKNOWN; + return false; } /* Look for a hole after the current location. */ @@ -612,12 +611,12 @@ file_textbin (char *buf, size_t size, int fd, struct stat const *st) if (lseek (fd, cur, SEEK_SET) < 0) suppressible_error (filename, errno); if (hole_start < st->st_size) - return TEXTBIN_BINARY; + return true; } } } - return TEXTBIN_UNKNOWN; + return false; } /* Convert STR to a nonnegative integer, storing the result in *OUT. @@ -899,7 +898,7 @@ static char *label = NULL; /* Fake filename for stdin */ /* Internal variables to keep track of byte count, context, etc. */ static uintmax_t totalcc; /* Total character count before bufbeg. */ static char const *lastnl; /* Pointer after last newline counted. */ -static char const *lastout; /* Pointer after last character output; +static char *lastout; /* Pointer after last character output; NULL if no character has been output or if it's conceptually before bufbeg. */ static intmax_t outleft; /* Maximum number of lines to be output. 
*/ @@ -971,10 +970,31 @@ print_offset (uintmax_t pos, int min_width, const char *color) pr_sgr_end_if (color); } -/* Print a whole line head (filename, line, byte). */ -static void -print_line_head (char const *beg, char const *lim, char sep) +/* Print a whole line head (filename, line, byte). The output data + starts at BEG and contains LEN bytes; it is followed by at least + sizeof (uword) bytes, the first of which may be temporarily modified. + The output data comes from what is perhaps a larger input line that + goes until LIM, where LIM[-1] is an end-of-line byte. Use SEP as + the separator on output. + + Return true unless the line was suppressed due to an encoding error. */ + +static bool +print_line_head (char *beg, size_t len, char const *lim, char sep) { + bool encoding_errors = false; + if (binary_files != TEXT_BINARY_FILES) + { + char ch = beg[len]; + encoding_errors = buf_has_encoding_errors (beg, len); + beg[len] = ch; + } + if (encoding_errors) + { + encoding_error_output = done_on_match = out_quiet = true; + return false; + } + bool pending_sep = false; if (out_file) @@ -1021,22 +1041,27 @@ print_line_head (char const *beg, char const *lim, char sep) print_sep (sep); } + + return true; } -static const char * -print_line_middle (const char *beg, const char *lim, +static char * +print_line_middle (char *beg, char *lim, const char *line_color, const char *match_color) { size_t match_size; size_t match_offset; - const char *cur = beg; - const char *mid = NULL; - - while (cur < lim - && ((match_offset = execute (beg, lim - beg, &match_size, cur)) - != (size_t) -1)) + char *cur = beg; + char *mid = NULL; + char *b; + + for (cur = beg; + (cur < lim + && ((match_offset = execute (beg, lim - beg, &match_size, cur)) + != (size_t) -1)); + cur = b + match_size) { - char const *b = beg + match_offset; + b = beg + match_offset; /* Avoid matching the empty line at the end of the buffer. 
*/ if (b == lim) @@ -1056,8 +1081,11 @@ print_line_middle (const char *beg, const char *lim, /* This function is called on a matching line only, but is it selected or rejected/context? */ if (only_matching) - print_line_head (b, lim, (out_invert ? SEP_CHAR_REJECTED - : SEP_CHAR_SELECTED)); + { + char sep = out_invert ? SEP_CHAR_REJECTED : SEP_CHAR_SELECTED; + if (! print_line_head (b, match_size, lim, sep)) + return NULL; + } else { pr_sgr_start (line_color); @@ -1075,7 +1103,6 @@ print_line_middle (const char *beg, const char *lim, if (only_matching) fputs ("\n", stdout); } - cur = b + match_size; } if (only_matching) @@ -1086,8 +1113,8 @@ print_line_middle (const char *beg, const char *lim, return cur; } -static const char * -print_line_tail (const char *beg, const char *lim, const char *line_color) +static char * +print_line_tail (char *beg, const char *lim, const char *line_color) { size_t eol_size; size_t tail_size; @@ -1108,14 +1135,15 @@ print_line_tail (const char *beg, const char *lim, const char *line_color) } static void -prline (char const *beg, char const *lim, char sep) +prline (char *beg, char *lim, char sep) { bool matching; const char *line_color; const char *match_color; if (!only_matching) - print_line_head (beg, lim, sep); + if (! print_line_head (beg, lim - beg - 1, lim, sep)) + return; matching = (sep == SEP_CHAR_SELECTED) ^ out_invert; @@ -1135,7 +1163,11 @@ prline (char const *beg, char const *lim, char sep) { /* We already know that non-matching lines have no match (to colorize). */ if (matching && (only_matching || *match_color)) - beg = print_line_middle (beg, lim, line_color, match_color); + { + beg = print_line_middle (beg, lim, line_color, match_color); + if (! 
beg) + return; + } if (!only_matching && *line_color) { @@ -1169,7 +1201,7 @@ prpending (char const *lim) lastout = bufbeg; while (pending > 0 && lastout < lim) { - char const *nl = memchr (lastout, eolbyte, lim - lastout); + char *nl = memchr (lastout, eolbyte, lim - lastout); size_t match_size; --pending; if (outleft @@ -1184,7 +1216,7 @@ prpending (char const *lim) /* Output the lines between BEG and LIM. Deal with context. */ static void -prtext (char const *beg, char const *lim) +prtext (char *beg, char *lim) { static bool used; /* Avoid printing SEP_STR_GROUP before any output. */ char eol = eolbyte; @@ -1192,7 +1224,7 @@ prtext (char const *beg, char const *lim) if (!out_quiet && pending > 0) prpending (beg); - char const *p = beg; + char *p = beg; if (!out_quiet) { @@ -1218,7 +1250,7 @@ prtext (char const *beg, char const *lim) while (p < beg) { - char const *nl = memchr (p, eol, beg - p); + char *nl = memchr (p, eol, beg - p); nl++; prline (p, nl, SEP_CHAR_REJECTED); p = nl; @@ -1231,7 +1263,7 @@ prtext (char const *beg, char const *lim) /* One or more lines are output. */ for (n = 0; p < lim && n < outleft; n++) { - char const *nl = memchr (p, eol, lim - p); + char *nl = memchr (p, eol, lim - p); nl++; if (!out_quiet) prline (p, nl, SEP_CHAR_SELECTED); @@ -1278,13 +1310,12 @@ zap_nuls (char *p, char *lim, char eol) between matching lines if OUT_INVERT is true). Return a count of lines printed. Replace all NUL bytes with NUL_ZAPPER as we go. 
*/ static intmax_t -grepbuf (char const *beg, char const *lim) +grepbuf (char *beg, char const *lim) { intmax_t outleft0 = outleft; - char const *p; - char const *endp; + char *endp; - for (p = beg; p < lim; p = endp) + for (char *p = beg; p < lim; p = endp) { size_t match_size; size_t match_offset = execute (p, lim - p, &match_size, NULL); @@ -1295,15 +1326,15 @@ grepbuf (char const *beg, char const *lim) match_offset = lim - p; match_size = 0; } - char const *b = p + match_offset; + char *b = p + match_offset; endp = b + match_size; /* Avoid matching the empty line at the end of the buffer. */ if (!out_invert && b == lim) break; if (!out_invert || p < b) { - char const *prbeg = out_invert ? p : b; - char const *prend = out_invert ? b : endp; + char *prbeg = out_invert ? p : b; + char *prend = out_invert ? b : endp; prtext (prbeg, prend); if (!outleft || done_on_match) { @@ -1324,7 +1355,6 @@ static intmax_t grep (int fd, struct stat const *st) { intmax_t nlines, i; - enum textbin textbin; size_t residue, save; char oldc; char *beg; @@ -1333,6 +1363,7 @@ grep (int fd, struct stat const *st) char nul_zapper = '\0'; bool done_on_match_0 = done_on_match; bool out_quiet_0 = out_quiet; + bool has_nulls = false; if (! 
reset (fd, st)) return 0; @@ -1344,6 +1375,7 @@ grep (int fd, struct stat const *st) after_last_match = 0; pending = 0; skip_nuls = skip_empty_lines && !eol; + encoding_error_output = false; seek_data_failed = false; nlines = 0; @@ -1356,26 +1388,20 @@ grep (int fd, struct stat const *st) return 0; } - if (binary_files == TEXT_BINARY_FILES) - textbin = TEXTBIN_TEXT; - else + for (bool firsttime = true; ; firsttime = false) { - textbin = file_textbin (bufbeg, buflim - bufbeg, fd, st); - if (textbin_is_binary (textbin)) + if (!has_nulls && eol && binary_files != TEXT_BINARY_FILES + && (buf_has_nulls (bufbeg, buflim - bufbeg) + || (firsttime && file_must_have_nulls (buflim - bufbeg, fd, st)))) { + has_nulls = true; if (binary_files == WITHOUT_MATCH_BINARY_FILES) return 0; done_on_match = out_quiet = true; nul_zapper = eol; skip_nuls = skip_empty_lines; } - else if (execute != Pexecute) - textbin = TEXTBIN_TEXT; - } - for (;;) - { - input_textbin = textbin; lastnl = bufbeg; if (lastout) lastout = bufbeg; @@ -1426,13 +1452,8 @@ grep (int fd, struct stat const *st) } /* Detect whether leading context is adjacent to previous output. */ - if (lastout) - { - if (textbin == TEXTBIN_UNKNOWN) - textbin = TEXTBIN_TEXT; - if (beg != lastout) - lastout = 0; - } + if (beg != lastout) + lastout = 0; /* Handle some details and read more data to scan. */ save = residue + lim - beg; @@ -1445,22 +1466,6 @@ grep (int fd, struct stat const *st) suppressible_error (filename, errno); goto finish_grep; } - - /* If the file's textbin has not been determined yet, assume - it's binary if the next input buffer suggests so. 
*/ - if (textbin == TEXTBIN_UNKNOWN) - { - enum textbin tb = buffer_textbin (bufbeg, buflim - bufbeg); - if (textbin_is_binary (tb)) - { - if (binary_files == WITHOUT_MATCH_BINARY_FILES) - return 0; - textbin = tb; - done_on_match = out_quiet = true; - nul_zapper = eol; - skip_nuls = skip_empty_lines; - } - } } if (residue) { @@ -1474,7 +1479,7 @@ grep (int fd, struct stat const *st) finish_grep: done_on_match = done_on_match_0; out_quiet = out_quiet_0; - if (textbin_is_binary (textbin) && !out_quiet && nlines != 0) + if ((has_nulls || encoding_error_output) && !out_quiet && nlines != 0) printf (_("Binary file %s matches\n"), filename); return nlines; } diff --git a/src/grep.h b/src/grep.h index 580eb11..2e4527c 100644 --- a/src/grep.h +++ b/src/grep.h @@ -29,22 +29,4 @@ extern bool match_words; /* -w */ extern bool match_lines; /* -x */ extern char eolbyte; /* -z */ -/* An enum textbin describes the file's type, inferred from data read - before the first line is selected for output. */ -enum textbin - { - /* Binary, as it contains null bytes and the -z option is not in effect, - or it contains encoding errors. */ - TEXTBIN_BINARY = -1, - - /* Not known yet. Only text has been seen so far. */ - TEXTBIN_UNKNOWN = 0, - - /* Text. */ - TEXTBIN_TEXT = 1 - }; - -/* Input file type. */ -extern enum textbin input_textbin; - #endif diff --git a/src/pcresearch.c b/src/pcresearch.c index dc68345..c403032 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -194,32 +194,13 @@ Pexecute (char const *buf, size_t size, size_t *match_size, error. */ char const *subject = buf; - /* If the input type is unknown, the caller is still testing the - input, which means the current buffer cannot contain encoding - errors and a multiline search is typically more efficient. - Otherwise, a single-line search is typically faster, so that - pcre_exec doesn't waste time validating the entire input - buffer. 
*/ - bool multiline = input_textbin == TEXTBIN_UNKNOWN; - for (; p < buf + size; p = line_start = line_end + 1) { - bool too_big; - - if (multiline) - { - size_t pcre_size_max = MIN (INT_MAX, SIZE_MAX - 1); - size_t scan_size = MIN (pcre_size_max + 1, buf + size - p); - line_end = memrchr (p, eolbyte, scan_size); - too_big = ! line_end; - } - else - { - line_end = memchr (p, eolbyte, buf + size - p); - too_big = INT_MAX < line_end - p; - } - - if (too_big) + /* A single-line search is typically faster, so that + pcre_exec doesn't waste time validating the entire input + buffer. */ + line_end = memchr (p, eolbyte, buf + size - p); + if (INT_MAX < line_end - p) error (EXIT_TROUBLE, 0, _("exceeded PCRE's line length limit")); for (;;) @@ -247,27 +228,11 @@ Pexecute (char const *buf, size_t size, size_t *match_size, int options = 0; if (!bol) options |= PCRE_NOTBOL; - if (multiline) - options |= PCRE_NO_UTF8_CHECK; e = jit_exec (subject, line_end - subject, search_offset, options, sub); if (e != PCRE_ERROR_BADUTF8) - { - if (0 < e && multiline && sub[1] - sub[0] != 0) - { - char const *nl = memchr (subject + sub[0], eolbyte, - sub[1] - sub[0]); - if (nl) - { - /* This match crosses a line boundary; reject it. */ - p = subject + sub[0]; - line_end = nl; - continue; - } - } - break; - } + break; int valid_bytes = sub[0]; /* Try to match the string before the encoding error. 
*/ @@ -339,15 +304,6 @@ Pexecute (char const *buf, size_t size, size_t *match_size, beg = matchbeg; end = matchend; } - else if (multiline) - { - char const *prev_nl = memrchr (line_start - 1, eolbyte, - matchbeg - (line_start - 1)); - char const *next_nl = memchr (matchend, eolbyte, - line_end + 1 - matchend); - beg = prev_nl + 1; - end = next_nl + 1; - } else { beg = line_start; diff --git a/tests/Makefile.am b/tests/Makefile.am index 37bb501..f1b8c43 100644 --- a/tests/Makefile.am +++ b/tests/Makefile.am @@ -70,6 +70,7 @@ TESTS = \ empty \ empty-line \ empty-line-mb \ + encoding-error \ epipe \ equiv-classes \ ere \ diff --git a/tests/encoding-error b/tests/encoding-error new file mode 100755 index 0000000..fe52de2 --- /dev/null +++ b/tests/encoding-error @@ -0,0 +1,41 @@ +#! /bin/sh +# Test grep's behavior on encoding errors. +# +# Copyright 2015 Free Software Foundation, Inc. +# +# Copying and distribution of this file, with or without modification, +# are permitted in any medium without royalty provided the copyright +# notice and this notice are preserved. + +. "${srcdir=.}/init.sh"; path_prepend_ ../src + +require_en_utf8_locale_ + +LC_ALL=en_US.UTF-8 +export LC_ALL + +printf 'Alfred Jones\n' > a || framework_failure_ +printf 'John Smith\n' >j || framework_failure_ +printf 'Pedro P\xe9rez\n' >p || framework_failure_ +cat a p j >in || framework_failure_ + +fail=0 + +grep '^A' in >out || fail=1 +compare a out || fail=1 + +grep '^P' in >out || fail=1 +printf 'Binary file in matches\n' >exp || framework_failure_ +compare exp out || fail=1 + +grep '^J' in >out || fail=1 +compare j out || fail=1 + +grep '^X' in >out +test $? = 1 || fail=1 +compare /dev/null out || fail=1 + +grep -a . 
in >out || fail=1 +compare in out + +Exit $fail -- 2.5.0 --------------020606020907080809070705-- From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 30 23:59:58 2015 Received: (at 20526) by debbugs.gnu.org; 31 Dec 2015 04:59:58 +0000 Received: from localhost ([127.0.0.1]:50834 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aEVLC-0007Y5-CU for submit@debbugs.gnu.org; Wed, 30 Dec 2015 23:59:58 -0500 Received: from mail-ig0-f171.google.com ([209.85.213.171]:34873) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aEVL9-0007Xk-W3; Wed, 30 Dec 2015 23:59:56 -0500 Received: by mail-ig0-f171.google.com with SMTP id to4so190498211igc.0; Wed, 30 Dec 2015 20:59:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=WZbvFoKlXU4ez7kfBmeObOP06MkYwPKd9AHpZXZG/9s=; b=h1ZGGKyKKmlFnGq7jObv9VnNOPRzxQvgCMhS05p17BeC6oE/mZwlNgs1N+8UNzhT8o pTropxKggN3svrah2MZgx7JJR2IFLMm+3FzAURB3C0Uyu7ehHrzvfKPlBHcgN0GlwiOV JGEk1S34/4etoWiKn0T06Ec2zG9K80/HQ8D/OEBOE2r34MC1BSpl8+qWfvbBbxR0TKa+ jJIDETtHa69GY6aPz1TPB3Fq4wsfLc+IGLBF/pALCIdG153TnNUAVEEka7KYLPH2scOD bQkrfO2pXSlLGC37t0WoBOEBNXYH0MfzGGVfVM2aevwYMAFK4xYEGCg86UUoAEZBIqPu 47Nw== X-Received: by 10.50.117.33 with SMTP id kb1mr66997092igb.89.1451537990057; Wed, 30 Dec 2015 20:59:50 -0800 (PST) MIME-Version: 1.0 Received: by 10.36.10.18 with HTTP; Wed, 30 Dec 2015 20:59:30 -0800 (PST) In-Reply-To: <5684A010.4000302@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> From: Jim Meyering Date: Wed, 30 Dec 2015 20:59:30 -0800 X-Google-Sender-Auth: P9CHlDxnOfY_Wo_EATTzIuaRSbU Message-ID: Subject: Re: bug#20526: grep BUG: text file is detected as binary To: 20526@debbugs.gnu.org, Paul Eggert , sebastian.poehn@gmail.com Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.4 (/) X-Debbugs-Envelope-To: 20526 Cc: Johannes 
Meixner , Kamil Dudka , Benno Schulenberg , 20526-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.4 (/) On Wed, Dec 30, 2015 at 7:25 PM, Paul Eggert wrote: > I installed into Savannah a patch (attached) that should fix this problem in > typical cases, and am boldly marking the bug as done. Please give the fix a > try if you have the time. Thanks. Thank you! The combination of this and the grep -oP infloop fix make this look like a good time for a bug-fix release. If there are any other pending bug fixes or small+safe changes people would like to see included, please let us know. I would like to publish a pre-release snapshot soon. From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 31 04:29:45 2015 Received: (at 20526) by debbugs.gnu.org; 31 Dec 2015 09:29:45 +0000 Received: from localhost ([127.0.0.1]:50952 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aEZYH-0007vN-9o for submit@debbugs.gnu.org; Thu, 31 Dec 2015 04:29:45 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:42494) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aEZYG-0007vA-9O for 20526@debbugs.gnu.org; Thu, 31 Dec 2015 04:29:44 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id E6FED160ED0; Thu, 31 Dec 2015 01:29:36 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id PAqVn7peyyys; Thu, 31 Dec 2015 01:29:36 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 2F248160ED6; Thu, 31 Dec 2015 01:29:36 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by 
localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 9i2XYJGcdfy0; Thu, 31 Dec 2015 01:29:36 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id ED891160ED0; Thu, 31 Dec 2015 01:29:35 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Jim Meyering , 20526@debbugs.gnu.org, sebastian.poehn@gmail.com References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <5684F57F.2090802@cs.ucla.edu> Date: Thu, 31 Dec 2015 01:29:35 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: Johannes Meixner , Kamil Dudka , Benno Schulenberg X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) Jim Meyering wrote: > The combination of this and the grep -oP infloop fix make this look > like a good time for a bug-fix release. If there are any other pending > bug fixes or small+safe changes people would like to see included, > please let us know. I have one major qualm about this: since 'grep' no longer checks whether the input is correctly encoded, I expect this may hurt -P performance significantly (though it may help non -P performance). This is because PCRE is slow at checking whether input data are valid UTF-8. I just now did a brief check and found one major performance issue: grep -rP 'fed.*cba' . 
On my machine the above command is 125x slower with the new grep than the old one, which suggests some tuning is in order before releasing. (It's bogged down inside libpcre somewhere.) Since you wrote your email I did a triage of the outstanding bugs, except for the bugs where patches are available which are mostly performance-related, and where I expect there will be some stuff that is relevant to -P slowdown. From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 31 10:23:27 2015 Received: (at 20526) by debbugs.gnu.org; 31 Dec 2015 15:23:27 +0000 Received: from localhost ([127.0.0.1]:51869 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aEf4Y-0001l2-QW for submit@debbugs.gnu.org; Thu, 31 Dec 2015 10:23:26 -0500 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:59903) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aEf4W-0001kl-Ex for 20526@debbugs.gnu.org; Thu, 31 Dec 2015 10:23:25 -0500 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 68E66F800F for <20526@debbugs.gnu.org>; Fri, 1 Jan 2016 00:23:16 +0900 (JST) X-matriXscan-loop-detect: 391b764c80a9efb184d175c4183ffd6f2b90deae Received: from mail01.kcn.ne.jp ([61.86.6.180]) by mxs01-s with ESMTP; Fri, 01 Jan 2016 00:23:14 +0900 (JST) Received: from [10.120.1.67] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail01.kcn.ne.jp (Postfix) with ESMTPA id 5B7B95A8059; Fri, 1 Jan 2016 00:23:14 +0900 (JST) Date: Fri, 01 Jan 2016 00:23:11 +0900 From: Norihiro Tanaka To: eggert@cs.ucla.edu Subject: Re: bug#20526: grep BUG: text file is detected as binary In-Reply-To: <5684A010.4000302@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> Message-Id: <20160101002311.8FB1.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)
On Wed, 30 Dec 2015 19:25:04 -0800 Paul Eggert wrote:
> I installed into Savannah a patch (attached) that should fix this
> problem in typical cases, and am boldly marking the bug as done.
> Please give the fix a try if you have the time. Thanks.
I get the following output after applying the patch. Is it expected?
$ printf 'a\na\377\na\n' | LANG=en_US.utf8 src/grep a
a
Binary file (standard input) matches
From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 01 19:07:53 2016 Received: (at 20526) by debbugs.gnu.org; 2 Jan 2016 00:07:53 +0000 Received: from localhost ([127.0.0.1]:33949 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aF9jd-0000cY-FZ for submit@debbugs.gnu.org; Fri, 01 Jan 2016 19:07:53 -0500 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:52448) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aF9jc-0000cO-17 for 20526@debbugs.gnu.org; Fri, 01 Jan 2016 19:07:52 -0500 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id BC9F9E80026 for <20526@debbugs.gnu.org>; Sat, 2 Jan 2016 06:39:06 +0900 (JST) X-matriXscan-loop-detect: a802958dc72c430baf039e12091e0e2633c4b174 Received: from mail04.kcn.ne.jp ([61.86.6.183]) by mxs01-s with ESMTP; Sat, 02 Jan 2016 06:39:05 +0900 (JST) Received: from [10.120.1.6] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail04.kcn.ne.jp (Postfix) with ESMTPA id 6CA2F1290022; Sat, 2 Jan 2016 06:39:05 +0900 (JST) Date: Sat, 02 Jan 2016 06:39:03 +0900 From:
Norihiro Tanaka To: Paul Eggert Subject: Re: bug#20526: grep BUG: text file is detected as binary In-Reply-To: <56856E16.3010207@cs.ucla.edu> References: <20160101002311.8FB1.27F6AC2D@kcn.ne.jp> <56856E16.3010207@cs.ucla.edu> Message-Id: <20160102063903.C3A5.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Thu, 31 Dec 2015 10:04:06 -0800 Paul Eggert wrote: > Yes, it's expected. Thanks, this should be stated more clearly, so I installed the attached documentation patch. Thanks. By the way, why this check is applied in only multi-byte locale? e.g. if \200 is included in en_US.iso88591 which is not POSIX locale, I think grep may need to return `Binary file ... matches', as mbrlen(3) returns -1 for \200. 
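The question above turns on how mbrlen(3) classifies bytes in the current locale. As a minimal sketch (not grep's actual code; the function name has_encoding_error is hypothetical), a buffer can be scanned for encoding errors as follows. Whether a particular byte such as \200 is reported as an error depends on the C library and on the locale in effect, so the sketch makes no claim about any specific locale.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only.  Return true if
   BUF[0..SIZE-1] contains a byte sequence that mbrlen rejects in the
   current locale, or ends in the middle of a multibyte character.  */
bool
has_encoding_error (char const *buf, size_t size)
{
  assert (buf != NULL || size == 0);

  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);   /* start in the initial shift state */

  size_t i = 0;
  while (i < size)
    {
      size_t n = mbrlen (buf + i, size - i, &mbs);
      if (n == (size_t) -1 || n == (size_t) -2)
        return true;              /* invalid or truncated sequence */
      i += n ? n : 1;             /* mbrlen returns 0 for a null byte */
    }
  return false;
}
```

A caller would normally run setlocale (LC_ALL, "") first so that the check reflects the user's locale; without that call the program stays in the "C" locale.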
From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 01 19:30:46 2016 Received: (at 20526) by debbugs.gnu.org; 2 Jan 2016 00:30:46 +0000 Received: from localhost ([127.0.0.1]:33994 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aFA5l-0002rV-Qg for submit@debbugs.gnu.org; Fri, 01 Jan 2016 19:30:46 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:37254) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aFA5k-0002rI-1K for 20526@debbugs.gnu.org; Fri, 01 Jan 2016 19:30:44 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 26775160F50; Thu, 31 Dec 2015 10:04:08 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id ET584wXN1j_X; Thu, 31 Dec 2015 10:04:07 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 07F06160F59; Thu, 31 Dec 2015 10:04:07 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id VeF4bA1ZfSrj; Thu, 31 Dec 2015 10:04:06 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D4E0A160F50; Thu, 31 Dec 2015 10:04:06 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Norihiro Tanaka References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> <20160101002311.8FB1.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <56856E16.3010207@cs.ucla.edu> Date: Thu, 31 Dec 2015 10:04:06 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <20160101002311.8FB1.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; 
boundary="------------080804070000080409030806" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. --------------080804070000080409030806 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Norihiro Tanaka wrote: > I get following output after apply the patch. Is it expected? > > $ printf 'a\na\377\na\n' | LANG=en_US.utf8 src/grep a > a > Binary file (standard input) matches Yes, it's expected. Thanks, this should be stated more clearly, so I installed the attached documentation patch. --------------080804070000080409030806 Content-Type: text/x-diff; name="0001-doc-clarify-text-vs-binary-match-output.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-doc-clarify-text-vs-binary-match-output.patch" >From 5bba395e00c01b8fc263d576ab3b34121fd6a3c0 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 31 Dec 2015 10:02:31 -0800 Subject: [PATCH] doc: clarify text vs binary match output * NEWS: * doc/grep.texi (File and Directory Selection): Make it clearer that grep can now output matching text before reporting a binary match. Problem reported by Norihiro Tanaka in: http://bugs.gnu.org/20526#83 --- NEWS | 14 +++++++++----- doc/grep.texi | 9 ++++----- 2 files changed, 13 insertions(+), 10 deletions(-) diff --git a/NEWS b/NEWS index b451d76..6e97e45 100644 --- a/NEWS +++ b/NEWS @@ -4,11 +4,15 @@ GNU grep NEWS -*- outline -*- ** Bug fixes - Binary files are now less likely to generate diagnostics. grep now - reports "Binary file FOO matches" and suppresses further output when - grep is about to output a match that contains an encoding error. 
- Formerly, grep reported FOO to be binary merely because grep found - an encoding error in FOO before generating output for FOO. + Binary files are now less likely to generate diagnostics and more + likely to yield text matches. grep now reports "Binary file FOO + matches" and suppresses further output instead of outputting a line + containing a encoding error; hence grep can now report matching text + before a later binary match. Formerly, grep reported FOO to be + binary when it found an encoding error in FOO before generating + output for FOO, which meant it never reported both matching text and + matching binary data; this was less useful for searching text + containing encoding errors in non-matching lines. [bug introduced in grep-2.21] grep -c no longer stops counting when finding binary data. diff --git a/doc/grep.texi b/doc/grep.texi index 73151e4..b9a4d25 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -594,8 +594,7 @@ this is equivalent to the @samp{--binary-files=text} option. @item --binary-files=@var{type} @opindex --binary-files @cindex binary files -If a file's allocation metadata, -or if its data read before a line is selected for output, +If a file's data or metadata indicate that the file contains binary data, assume that the file is of type @var{type}. Non-text bytes indicate binary data; these are either output bytes that are @@ -604,9 +603,9 @@ improperly encoded for the current locale, or null input bytes when the Options}). By default, @var{type} is @samp{binary}, and when @command{grep} -discovers that a file is binary it normally outputs either -a one-line message saying that a binary file matches, -or no message if there is no match. +discovers that a file is binary it suppresses any further output, and +instead outputs either a one-line message saying that a binary file +matches, or no message if there is no match. 
When processing binary data, @command{grep} may treat non-text bytes as line terminators; for example, the pattern @samp{.} (period) might not match a null byte, as the null byte might be treated as a line -- 2.5.0 --------------080804070000080409030806-- From debbugs-submit-bounces@debbugs.gnu.org Sat Jan 02 00:23:12 2016 Received: (at 20526) by debbugs.gnu.org; 2 Jan 2016 05:23:12 +0000 Received: from localhost ([127.0.0.1]:34247 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aFEel-0001FA-Pw for submit@debbugs.gnu.org; Sat, 02 Jan 2016 00:23:12 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:42094) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aFEej-0001Ew-Dj for 20526@debbugs.gnu.org; Sat, 02 Jan 2016 00:23:10 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D7706160EF1; Fri, 1 Jan 2016 21:23:02 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id hGmnFLkUTSpS; Fri, 1 Jan 2016 21:23:00 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id DE859160F50; Fri, 1 Jan 2016 21:23:00 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id YeH2D0QqUq39; Fri, 1 Jan 2016 21:23:00 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id B7EB9160EF1; Fri, 1 Jan 2016 21:23:00 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Norihiro Tanaka References: <20160101002311.8FB1.27F6AC2D@kcn.ne.jp> <56856E16.3010207@cs.ucla.edu> <20160102063903.C3A5.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: 
<56875EAE.7030309@cs.ucla.edu> Date: Fri, 1 Jan 2016 21:22:54 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <20160102063903.C3A5.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------040004030702080408030204" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. --------------040004030702080408030204 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Norihiro Tanaka wrote: > why this check is applied in only multi-byte locale? Ouch, good point. I missed the possibility of a unibyte encoding where not all bytes are valid unibyte characters. I installed the attached additional patch to fix this, and to test for the bug I recently introduced here. 
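For readers who do not want to decode the attachment, here is a minimal sketch of the mask idea it introduces. It is an illustration, not the code in src/grep.c. The mask calculation follows the corrected version that appears in the "Fix calculation of unibyte_mask" patch later in this thread. The names byte_mask, replicate_mask and word_needs_closer_look, and the is_unibyte table, are hypothetical; grep itself consults its own mbclen_cache. The sketch assumes 8-bit bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IS_UNIBYTE[b] says whether byte B is a complete single-byte
   character (hypothetical table supplied by the caller).  Return a
   byte mask M such that B & M is nonzero for every byte B that is not
   a unibyte character.  M may also flag some valid bytes.  */
unsigned char
byte_mask (bool const is_unibyte[UCHAR_MAX + 1])
{
  assert (is_unibyte != NULL);
  unsigned char mask = 0;
  int ms1b = 1;   /* most significant 1 bit of the bytes seen so far */
  for (int i = 1; i <= UCHAR_MAX; i++)
    if (!is_unibyte[i] && !(mask & i))
      {
        while (ms1b * 2 <= i)
          ms1b *= 2;
        mask |= ms1b;
      }
  return mask;
}

/* Repeat MASK in every byte of a word, e.g. 0x80 -> 0x8080...80.  */
uintmax_t
replicate_mask (unsigned char mask)
{
  uintmax_t all_ones = -1;
  return all_ones / UCHAR_MAX * mask;
}

/* True if some byte of W might not be a unibyte character, so the
   caller should fall back to a byte-by-byte check.  */
bool
word_needs_closer_look (uintmax_t w, uintmax_t wide_mask)
{
  return (w & wide_mask) != 0;
}
```

With a table where only bytes below 0x80 are unibyte characters (as in UTF-8), byte_mask yields 0x80. With a table where every byte is a unibyte character it yields 0, so no word is ever flagged. Because the mask can flag valid bytes too, a flagged word only means "look more closely", never "this is an encoding error".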
--------------040004030702080408030204 Content-Type: text/plain; charset=UTF-8; name="0001-grep-fix-bug-with-with-invalid-unibyte-sequence.txt" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="0001-grep-fix-bug-with-with-invalid-unibyte-sequence.txt" RnJvbSBkMzE5MDBlM2ZjMjQwNmNjYjhhYTk3MmRkMTczZjI3ZjU4M2M5NWVjIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBGcmksIDEgSmFuIDIwMTYgMjE6MTY6MTIgLTA4MDAKU3ViamVjdDogW1BBVENI XSBncmVwOiBmaXggYnVnIHdpdGggd2l0aCBpbnZhbGlkIHVuaWJ5dGUgc2VxdWVuY2UKClRo aXMgd2FzIGludHJvZHVjZWQgYnkgdGhlIHJlY2VudCBiaW5hcnktZGF0YS1kZXRlY3Rpb24g Y2hhbmdlcy4KUHJvYmxlbSByZXBvcnRlZCBieSBOb3JpaGlybyBUYW5ha2EgaW46IGh0dHA6 Ly9idWdzLmdudS5vcmcvMjA1MjYjODYKKiBzcmMvZ3JlcC5jIChISUJZVEUsIGVhc3lfZW5j b2RpbmcsIGluaXRfZWFzeV9lbmNvZGluZyk6IFJlbW92ZSwKcmVwbGFjaW5nIHdpdGggLi4u Cih1d29yZF9tYXgsIHVuaWJ5dGVfbWFzaywgaW5pdGlhbGl6ZV91bmlieXRlX21hc2spOiAu Li4gdGhpcyBuZXcKY29uc3RhbnQsIHN0YXRpYyB2YXIsIGFuZCBmdW5jdGlvbi4gIEFsbCB1 c2VzIGNoYW5nZWQuICBUaGUKdW5pYnl0ZV9tYXNrIHZhciBnZW5lcmFsaXplcyB0aGUgb2xk IGxvY2FsIHZhciBoaWJ5dGVfbWFzaywgd2hpY2gKd29ya2VkIG9ubHkgZm9yIGVuY29kaW5n cyB3aGVyZSBldmVyeSBieXRlIHdpdGggMHg4MCB0dXJuZWQgb2ZmIGlzCmEgc2luZ2xlLWJ5 dGUgY2hhcmFjdGVyLgooYnVmX2hhc19lbmNvZGluZ19lcnJvcnMpOiBSZXR1cm4gZmFsc2Ug aW1tZWRpYXRlbHkgaWYKdW5pYnl0ZV9tYXNrIGlzIHplcm8sIG5vdCB3aGV0aGVyIHRoZSBj dXJyZW50IGVuY29kaW5nIGlzIHVuaWJ5dGUuClRoZSBvbGQgdGVzdCB3YXMgaW5jb3JyZWN0 IGluIHVuaWJ5dGUgbG9jYWxlcyBpbiB3aGljaCBzb21lIGJ5dGVzCndlcmUgZW5jb2Rpbmcg ZXJyb3JzLgoqIHRlc3RzL3BjcmUtejogUmVxdWlyZSBVVEYtOCBsb2NhbGUsIHNpbmNlIHRo ZSBncmVwIC16IC4gdGVzdCBub3cKbmVlZHMgdGhpcy4gIFVzZSBwcmludGYgXDAgcmF0aGVy IHRoYW4gdHIuICBQb3J0IHRoZSAnZ3JlcCAteiAuJwp0ZXN0IHRvIHBsYXRmb3JtcyB3aGVy ZSB0aGUgQyBsb2NhbGUgc2F5cyAnXDIwMCcgaXMgYW4gZW5jb2RpbmcKZXJyb3IuICBVc2Ug Y21wIHJhdGhlciB0aGFuIGNvbXBhcmUsIGFzIHRoZSBmaWxlIGlzIGJpbmFyeSBhbmQKc28g bm9uLUdOVSBkaWZmIG1pZ2h0IG5vdCB3b3JrLgoqIHRlc3RzL3VuaWJ5dGUtYmluYXJ5OiBO 
ZXcgZmlsZS4KKiB0ZXN0cy9NYWtlZmlsZS5hbSAoVEVTVFMpOiBBZGQgaXQuCi0tLQogc3Jj L2dyZXAuYyAgICAgICAgICAgfCA1NyArKysrKysrKysrKysrKysrKysrKysrKysrLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tCiB0ZXN0cy9NYWtlZmlsZS5hbSAgICB8ICAxICsKIHRl c3RzL3BjcmUteiAgICAgICAgIHwgIDkgKysrKystLS0tCiB0ZXN0cy91bmlieXRlLWJpbmFy eSB8IDI4ICsrKysrKysrKysrKysrKysrKysrKysrKysrCiA0IGZpbGVzIGNoYW5nZWQsIDYx IGluc2VydGlvbnMoKyksIDM0IGRlbGV0aW9ucygtKQogY3JlYXRlIG1vZGUgMTAwNzU1IHRl c3RzL3VuaWJ5dGUtYmluYXJ5CgpkaWZmIC0tZ2l0IGEvc3JjL2dyZXAuYyBiL3NyYy9ncmVw LmMKaW5kZXggMTIwN2E3Ni4uYTVmMWZhMiAxMDA2NDQKLS0tIGEvc3JjL2dyZXAuYworKysg Yi9zcmMvZ3JlcC5jCkBAIC00ODQsMjEgKzQ4NCw2IEBAIGNsZWFuX3VwX3N0ZG91dCAodm9p ZCkKICAgICBjbG9zZV9zdGRvdXQgKCk7CiB9CiAKLS8qIFRoZSBoaWdoLW9yZGVyIGJpdCBv ZiBhIGJ5dGUuICAqLwotZW51bSB7IEhJQllURSA9IDB4ODAgfTsKLQotLyogVHJ1ZSBpZiBl dmVyeSBieXRlIHdpdGggSElCWVRFIG9mZiBpcyBhIHNpbmdsZS1ieXRlIGNoYXJhY3Rlci4K LSAgIFVURi04IGhhcyB0aGlzIHByb3BlcnR5LiAgKi8KLXN0YXRpYyBib29sIGVhc3lfZW5j b2Rpbmc7Ci0KLXN0YXRpYyB2b2lkCi1pbml0X2Vhc3lfZW5jb2RpbmcgKHZvaWQpCi17Ci0g IGVhc3lfZW5jb2RpbmcgPSB0cnVlOwotICBmb3IgKGludCBpID0gMDsgaSA8IEhJQllURTsg aSsrKQotICAgIGVhc3lfZW5jb2RpbmcgJj0gbWJjbGVuX2NhY2hlW2ldID09IDE7Ci19Ci0K IC8qIEEgY2FzdCB0byBUWVBFIG9mIFZBTC4gIFVzZSB0aGlzIHdoZW4gVFlQRSBpcyBhIHBv aW50ZXIgdHlwZSwgVkFMCiAgICBpcyBwcm9wZXJseSBhbGlnbmVkIGZvciBUWVBFLCBhbmQg J2djYyAtV2Nhc3QtYWxpZ24nIGNhbm5vdCBpbmZlcgogICAgdGhlIGFsaWdubWVudCBhbmQg d291bGQgb3RoZXJ3aXNlIGNvbXBsYWluIGFib3V0IHRoZSBjYXN0LiAgKi8KQEAgLTUxNywy MSArNTAyLDMzIEBAIGluaXRfZWFzeV9lbmNvZGluZyAodm9pZCkKIC8qIEFuIHVuc2lnbmVk IHR5cGUgc3VpdGFibGUgZm9yIGZhc3QgbWF0Y2hpbmcuICAqLwogdHlwZWRlZiB1aW50bWF4 X3QgdXdvcmQ7CiAKKy8qIEFsbCBieXRlcyB0aGF0IGFyZSBub3QgdW5pYnl0ZSBjaGFyYWN0 ZXJzLCBBTkRlZCB0b2dldGhlciwgYW5kIHRoZW4KKyAgIHdpdGggdGhlIHBhdHRlcm4gcmVw ZWF0ZWQgdG8gZmlsbCBhIHV3b3JkLiAgRm9yIGFuIGVuY29kaW5nIHdoZXJlCisgICBhbGwg Ynl0ZXMgYXJlIHVuaWJ5dGUgY2hhcmFjdGVycywgdGhpcyBpcyAwLiAgRm9yIFVURi04LCB0 aGlzIGlzCisgICAweDgwODA4MC4uLi4gIEZvciBlbmNvZGluZ3Mgd2hlcmUgdW5pYnl0ZSBj 
aGFyYWN0ZXJzIGhhdmUgbm8gdXNlZnVsCisgICBwYXR0ZXJuLCB0aGlzIGlzIGFsbCAxcy4g IFRoZSB1bnNpZ25lZCBjaGFyIEMgaXMgYSB1bmlieXRlCisgICBjaGFyYWN0ZXIgaWYgQyAm IFVOSUJZVEVfTUFTSyBpcyB6ZXJvLiAgSWYgdGhlIHV3b3JkIFcgaXMgdGhlCisgICBjb25j YXRlbmF0aW9uIG9mIGJ5dGVzLCB0aGUgYnl0ZXMgYXJlIGFsbCB1bmlieXRlIGNoYXJhY3Rl cnMKKyAgIGlmIFcgJiBVTklCWVRFX01BU0sgaXMgemVyby4gICovCitzdGF0aWMgdXdvcmQg dW5pYnl0ZV9tYXNrOworCitzdGF0aWMgdm9pZAoraW5pdGlhbGl6ZV91bmlieXRlX21hc2sg KHZvaWQpCit7CisgIHVuc2lnbmVkIGNoYXIgbWFzayA9IFVDSEFSX01BWDsKKyAgZm9yIChp bnQgaSA9IDE7IGkgPD0gVUNIQVJfTUFYOyBpKyspCisgICAgaWYgKG1iY2xlbl9jYWNoZVtp XSAhPSAxKQorICAgICAgbWFzayAmPSBpOworICB1d29yZCB1d29yZF9tYXggPSAtMTsKKyAg dW5pYnl0ZV9tYXNrID0gdXdvcmRfbWF4IC8gVUNIQVJfTUFYICogbWFzazsKK30KKwogLyog U2tpcCB0aGUgZWFzeSBieXRlcyBpbiBhIGJ1ZmZlciB0aGF0IGlzIGd1YXJhbnRlZWQgdG8g aGF2ZSBhIHNlbnRpbmVsCiAgICB0aGF0IGlzIG5vdCBlYXN5LCBhbmQgcmV0dXJuIGEgcG9p bnRlciB0byB0aGUgZmlyc3Qgbm9uLWVhc3kgYnl0ZS4KLSAgIEluIGVhc3kgZW5jb2Rpbmdz LCB0aGUgZWFzeSBieXRlcyBhbGwgaGF2ZSBISUJZVEUgb2ZmLgotICAgSW4gb3RoZXIgZW5j b2RpbmdzLCBubyBieXRlIGlzIGVhc3kuICAqLworICAgVGhlIGVhc3kgYnl0ZXMgYWxsIGhh dmUgVU5JQllURV9NQVNLIG9mZi4gICovCiBzdGF0aWMgY2hhciBjb25zdCAqIF9HTF9BVFRS SUJVVEVfUFVSRQogc2tpcF9lYXN5X2J5dGVzIChjaGFyIGNvbnN0ICpidWYpCiB7Ci0gIGlm ICghZWFzeV9lbmNvZGluZykKLSAgICByZXR1cm4gYnVmOwotCi0gIHV3b3JkIHV3b3JkX21h eCA9IC0xOwotCi0gIC8qIDB4ODA4MC4uLiwgZXh0ZW5kZWQgdG8gYmUgd2lkZSBlbm91Z2gg Zm9yIHV3b3JkLiAgKi8KLSAgdXdvcmQgaGlieXRlX21hc2sgPSB1d29yZF9tYXggLyBVQ0hB Ul9NQVggKiBISUJZVEU7Ci0KICAgLyogU2VhcmNoIGEgYnl0ZSBhdCBhIHRpbWUgdW50aWwg dGhlIHBvaW50ZXIgaXMgYWxpZ25lZCwgdGhlbiBhCiAgICAgIHV3b3JkIGF0IGEgdGltZSB1 bnRpbCBhIG1hdGNoIGlzIGZvdW5kLCB0aGVuIGEgYnl0ZSBhdCBhIHRpbWUgdG8KICAgICAg aWRlbnRpZnkgdGhlIGV4YWN0IGJ5dGUuICBUaGUgdXdvcmQgc2VhcmNoIG1heSBnbyBzbGln aHRseSBwYXN0CkBAIC01MzksMTEgKzUzNiwxMSBAQCBza2lwX2Vhc3lfYnl0ZXMgKGNoYXIg Y29uc3QgKmJ1ZikKICAgY2hhciBjb25zdCAqcDsKICAgdXdvcmQgY29uc3QgKnM7CiAgIGZv ciAocCA9IGJ1ZjsgKHVpbnRwdHJfdCkgcCAlIHNpemVvZiAodXdvcmQpICE9IDA7IHArKykK 
LSAgICBpZiAoKnAgJiBISUJZVEUpCisgICAgaWYgKHRvX3VjaGFyICgqcCkgJiB1bmlieXRl X21hc2spCiAgICAgICByZXR1cm4gcDsKLSAgZm9yIChzID0gQ0FTVF9BTElHTkVEICh1d29y ZCBjb25zdCAqLCBwKTsgISAoKnMgJiBoaWJ5dGVfbWFzayk7IHMrKykKKyAgZm9yIChzID0g Q0FTVF9BTElHTkVEICh1d29yZCBjb25zdCAqLCBwKTsgISAoKnMgJiB1bmlieXRlX21hc2sp OyBzKyspCiAgICAgY29udGludWU7Ci0gIGZvciAocCA9IChjaGFyIGNvbnN0ICopIHM7ICEg KCpwICYgSElCWVRFKTsgcCsrKQorICBmb3IgKHAgPSAoY2hhciBjb25zdCAqKSBzOyAhICh0 b191Y2hhciAoKnApICYgdW5pYnl0ZV9tYXNrKTsgcCsrKQogICAgIGNvbnRpbnVlOwogICBy ZXR1cm4gcDsKIH0KQEAgLTU1NCw3ICs1NTEsNyBAQCBza2lwX2Vhc3lfYnl0ZXMgKGNoYXIg Y29uc3QgKmJ1ZikKIHN0YXRpYyBib29sCiBidWZfaGFzX2VuY29kaW5nX2Vycm9ycyAoY2hh ciAqYnVmLCBzaXplX3Qgc2l6ZSkKIHsKLSAgaWYgKE1CX0NVUl9NQVggPD0gMSkKKyAgaWYg KCEgdW5pYnl0ZV9tYXNrKQogICAgIHJldHVybiBmYWxzZTsKIAogICBtYnN0YXRlX3QgbWJz ID0geyAwIH07CkBAIC0yNTkyLDcgKzI1ODksNyBAQCBtYWluIChpbnQgYXJnYywgY2hhciAq KmFyZ3YpCiAgICAgdXNhZ2UgKEVYSVRfVFJPVUJMRSk7CiAKICAgYnVpbGRfbWJjbGVuX2Nh Y2hlICgpOwotICBpbml0X2Vhc3lfZW5jb2RpbmcgKCk7CisgIGluaXRpYWxpemVfdW5pYnl0 ZV9tYXNrICgpOwogCiAgIC8qIEluIGEgdW5pYnl0ZSBsb2NhbGUsIHN3aXRjaCBmcm9tIGZn cmVwIHRvIGdyZXAgaWYKICAgICAgdGhlIHBhdHRlcm4gbWF0Y2hlcyB3b3JkcyAod2hlcmUg Z3JlcCBpcyB0eXBpY2FsbHkgZmFzdGVyKS4KZGlmZiAtLWdpdCBhL3Rlc3RzL01ha2VmaWxl LmFtIGIvdGVzdHMvTWFrZWZpbGUuYW0KaW5kZXggZjM0OWFhMy4uYTM4MzAzYyAxMDA2NDQK LS0tIGEvdGVzdHMvTWFrZWZpbGUuYW0KKysrIGIvdGVzdHMvTWFrZWZpbGUuYW0KQEAgLTEz Myw2ICsxMzMsNyBAQCBURVNUUyA9CQkJCQkJXAogICB0dXJraXNoLUktd2l0aG91dC1kb3QJ CQkJXAogICB0dXJraXNoLWV5ZXMJCQkJCVwKICAgdHdvLWZpbGVzCQkJCQlcCisgIHVuaWJ5 dGUtYmluYXJ5CQkJCVwKICAgdW5pYnl0ZS1icmFja2V0LWV4cHIJCQkJXAogICB1bmlieXRl LW5lZ2F0ZWQtY2lyY3VtZmxleAkJCVwKICAgdXRmOC1icmFja2V0CQkJCQlcCmRpZmYgLS1n aXQgYS90ZXN0cy9wY3JlLXogYi90ZXN0cy9wY3JlLXoKaW5kZXggNmJiZGU5NC4uNGNlOWE5 MyAxMDA3NTUKLS0tIGEvdGVzdHMvcGNyZS16CisrKyBiL3Rlc3RzL3BjcmUtegpAQCAtMiwx MCArMiwxMSBAQAogIyBUZXN0IFBlcmwgcmVnZXggd2l0aCBOVUwtc2VwYXJhdGVkIGlucHV0 CiAuICIke3NyY2Rpcj0ufS9pbml0LnNoIjsgcGF0aF9wcmVwZW5kXyAuLi9zcmMKIHJlcXVp 
cmVfcGNyZV8KK3JlcXVpcmVfZW5fdXRmOF9sb2NhbGVfCiAKIFJFR0VYPWEKIAotcHJpbnRm ICIlc1xuMCIgYWJjIGRlZiBnaGkgYWFhIGdhaCB8IHRyIDAgXFwwID4gaW4KK3ByaW50ZiAn JXNcblwwJyBhYmMgZGVmIGdoaSBhYWEgZ2FoID4gaW4gfHwgZnJhbWV3b3JrX2ZhaWx1cmVf CiAKIGdyZXAgLXogIiRSRUdFWCIgaW4gPiBleHAgMj5lcnIgfHwgZmFpbF8gJ0Nhbm5vdCBk byBCUkUgKGdyZXAgLXopIG1hdGNoLicKIGNvbXBhcmUgL2Rldi9udWxsIGVyciB8fCBmYWls XyAnc3RkZXJyIG5vdCBlbXB0eSBvbiBncmVwIC16LicKQEAgLTIwLDggKzIxLDggQEAgZ3Jl cCAtUHogIiRSRUdFWCIgaW4gPiBvdXQgMj5lcnIgfHwgZmFpbD0xCiBjb21wYXJlIGV4cCBv dXQgfHwgZmFpbD0xCiBjb21wYXJlIC9kZXYvbnVsbCBlcnIgfHwgZmFpbD0xCiAKLXByaW50 ZiAnXDIwMFwwJyA+aW4wCi1MQ19BTEw9QyBncmVwIC16IC4gaW4wID5vdXQgfHwgZmFpbD0x Ci1jb21wYXJlIGluMCBvdXQgfHwgZmFpbD0xCitwcmludGYgJ1wzMDNcMjAwXDAnID5pbjAg IyAiw4AiIGZvbGxvd2VkIGJ5IGEgTlVMLgorTENfQUxMPWVuX1VTLlVURi04IGdyZXAgLXog LiBpbjAgPm91dCB8fCBmYWlsPTEKK2NtcCBpbjAgb3V0IHx8IGZhaWw9MQogCiBFeGl0ICRm YWlsCmRpZmYgLS1naXQgYS90ZXN0cy91bmlieXRlLWJpbmFyeSBiL3Rlc3RzL3VuaWJ5dGUt YmluYXJ5Cm5ldyBmaWxlIG1vZGUgMTAwNzU1CmluZGV4IDAwMDAwMDAuLjc4NzM1YjgKLS0t IC9kZXYvbnVsbAorKysgYi90ZXN0cy91bmlieXRlLWJpbmFyeQpAQCAtMCwwICsxLDI4IEBA CisjIS9iaW4vc2gKKyMgVGVzdCBiaW5hcnkgZmlsZXMgaW4gdW5pYnl0ZSBsb2NhbGVzIHdp dGggZW5jb2RpbmcgZXJyb3JzCisKKyMgQ29weXJpZ2h0IDIwMTYgRnJlZSBTb2Z0d2FyZSBG b3VuZGF0aW9uLCBJbmMuCisKKyMgVGhpcyBwcm9ncmFtIGlzIGZyZWUgc29mdHdhcmU6IHlv dSBjYW4gcmVkaXN0cmlidXRlIGl0IGFuZC9vciBtb2RpZnkKKyMgaXQgdW5kZXIgdGhlIHRl cm1zIG9mIHRoZSBHTlUgR2VuZXJhbCBQdWJsaWMgTGljZW5zZSBhcyBwdWJsaXNoZWQgYnkK KyMgdGhlIEZyZWUgU29mdHdhcmUgRm91bmRhdGlvbiwgZWl0aGVyIHZlcnNpb24gMyBvZiB0 aGUgTGljZW5zZSwgb3IKKyMgKGF0IHlvdXIgb3B0aW9uKSBhbnkgbGF0ZXIgdmVyc2lvbi4K KworIyBUaGlzIHByb2dyYW0gaXMgZGlzdHJpYnV0ZWQgaW4gdGhlIGhvcGUgdGhhdCBpdCB3 aWxsIGJlIHVzZWZ1bCwKKyMgYnV0IFdJVEhPVVQgQU5ZIFdBUlJBTlRZOyB3aXRob3V0IGV2 ZW4gdGhlIGltcGxpZWQgd2FycmFudHkgb2YKKyMgTUVSQ0hBTlRBQklMSVRZIG9yIEZJVE5F U1MgRk9SIEEgUEFSVElDVUxBUiBQVVJQT1NFLiAgU2VlIHRoZQorIyBHTlUgR2VuZXJhbCBQ dWJsaWMgTGljZW5zZSBmb3IgbW9yZSBkZXRhaWxzLgorCisjIFlvdSBzaG91bGQgaGF2ZSBy 
ZWNlaXZlZCBhIGNvcHkgb2YgdGhlIEdOVSBHZW5lcmFsIFB1YmxpYyBMaWNlbnNlCisjIGFs b25nIHdpdGggdGhpcyBwcm9ncmFtLiAgSWYgbm90LCBzZWUgPGh0dHA6Ly93d3cuZ251Lm9y Zy9saWNlbnNlcy8+LgorCisuICIke3NyY2Rpcj0ufS9pbml0LnNoIjsgcGF0aF9wcmVwZW5k XyAuLi9zcmMKK3JlcXVpcmVfdW5pYnl0ZV9sb2NhbGUKKworZmFpbD0wCisKK3ByaW50ZiAn YVxuXDIwMFxuYlxuJyA+aW4gfHwgZnJhbWV3b3JrX2ZhaWx1cmVfCitwcmludGYgJ2FcbkJp bmFyeSBmaWxlIGluIG1hdGNoZXNcbicgPmV4cCB8fCBmcmFtZXdvcmtfZmFpbHVyZV8KK2dy ZXAgLiBpbiA+b3V0IHx8IGZhaWw9MQorY29tcGFyZSBleHAgb3V0IHx8IGZhaWw9MQorRXhp dCAkZmFpbAotLSAKMi41LjAKCg== --------------040004030702080408030204-- From debbugs-submit-bounces@debbugs.gnu.org Sat Jan 02 20:32:18 2016 Received: (at 20526) by debbugs.gnu.org; 3 Jan 2016 01:32:18 +0000 Received: from localhost ([127.0.0.1]:35725 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aFXWr-0001Ky-Pp for submit@debbugs.gnu.org; Sat, 02 Jan 2016 20:32:18 -0500 Received: from mailgw05.kcn.ne.jp ([61.86.7.212]:53273) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aFXWp-0001Kk-9o for 20526@debbugs.gnu.org; Sat, 02 Jan 2016 20:32:16 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw05.kcn.ne.jp (Postfix) with ESMTP id 079BE8805DC for <20526@debbugs.gnu.org>; Sun, 3 Jan 2016 10:32:10 +0900 (JST) X-matriXscan-loop-detect: a802958dc72c430baf039e12091e0e2633c4b174 Received: from mail04.kcn.ne.jp ([61.86.6.183]) by mxs02-s with ESMTP; Sun, 03 Jan 2016 10:32:07 +0900 (JST) Received: from [10.120.1.30] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail04.kcn.ne.jp (Postfix) with ESMTPA id 7CE981290022; Sun, 3 Jan 2016 10:32:07 +0900 (JST) Date: Sun, 03 Jan 2016 10:32:06 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#20526: grep BUG: text file is detected as binary In-Reply-To: <56875EAE.7030309@cs.ucla.edu> References: <20160102063903.C3A5.27F6AC2D@kcn.ne.jp> <56875EAE.7030309@cs.ucla.edu> Message-Id: <20160103103157.4131.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 
Content-Type: multipart/mixed; boundary="------_56887909000000004124_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) --------_56887909000000004124_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit On Fri, 1 Jan 2016 21:22:54 -0800 Paul Eggert wrote: > Ouch, good point. I missed the possibility of a unibyte encoding where > not all bytes are valid unibyte characters. I installed the attached > additional patch to fix this, and to test for the bug I recently > introduced here. Thanks, I see that it is good idea, but I propose minor change for your fix. Perhaps, it will be what you want. 
--------_56887909000000004124_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-grep-minor-improvements-to-previous-change.patch" Content-Disposition: attachment; filename="0001-grep-minor-improvements-to-previous-change.patch" Content-Transfer-Encoding: base64 RnJvbSBkMzZjZjQyMDgzNjNjMGY1NmZmMzJkMzhhOWZlYTQyMjM0MjAzNmZlIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBTYXQsIDIgSmFuIDIwMTYgMDA6MjA6NDMgKzA5MDAKU3ViamVjdDogW1BBVENIXSBncmVw OiBtaW5vciBpbXByb3ZlbWVudHMgdG8gcHJldmlvdXMgY2hhbmdlCgoqIHNyYy9ncmVwLmMgKHNr aXBfZWFzeV9ieXRlcyk6IERvIG5vdGhpbmcgaWYgdGhlIGxvY2FsZSBkb2VzIG5vdCBoYXZlCmFu eSBza2lwcGFibGUgY2hhcmFjdGVyLgoqIChidWZfaGFzX2VuY29kaW5nX2Vycm9ycyk6IERvIG5v dGhpbmcgaWYgYWxsIGJ5dGVzIGFyZSBzaW5nbGUgYnl0ZQpjaGFyYWN0ZXIgaW4gdGhlIGxvY2Fs ZS4KLS0tCiBzcmMvZ3JlcC5jIHwgNCArKystCiAxIGZpbGUgY2hhbmdlZCwgMyBpbnNlcnRpb25z KCspLCAxIGRlbGV0aW9uKC0pCgpkaWZmIC0tZ2l0IGEvc3JjL2dyZXAuYyBiL3NyYy9ncmVwLmMK aW5kZXggYTVmMWZhMi4uZDVhODE4MyAxMDA2NDQKLS0tIGEvc3JjL2dyZXAuYworKysgYi9zcmMv Z3JlcC5jCkBAIC01MzUsNiArNTM1LDggQEAgc2tpcF9lYXN5X2J5dGVzIChjaGFyIGNvbnN0ICpi dWYpCiAgICAgIHRoZSBidWZmZXIgZW5kLCBidXQgdGhhdCdzIGJlbmlnbi4gICovCiAgIGNoYXIg Y29uc3QgKnA7CiAgIHV3b3JkIGNvbnN0ICpzOworICBpZiAoISB1bmlieXRlX21hc2spCisgICAg cmV0dXJuIGJ1ZjsKICAgZm9yIChwID0gYnVmOyAodWludHB0cl90KSBwICUgc2l6ZW9mICh1d29y ZCkgIT0gMDsgcCsrKQogICAgIGlmICh0b191Y2hhciAoKnApICYgdW5pYnl0ZV9tYXNrKQogICAg ICAgcmV0dXJuIHA7CkBAIC01NTEsNyArNTUzLDcgQEAgc2tpcF9lYXN5X2J5dGVzIChjaGFyIGNv bnN0ICpidWYpCiBzdGF0aWMgYm9vbAogYnVmX2hhc19lbmNvZGluZ19lcnJvcnMgKGNoYXIgKmJ1 Ziwgc2l6ZV90IHNpemUpCiB7Ci0gIGlmICghIHVuaWJ5dGVfbWFzaykKKyAgaWYgKHVuaWJ5dGVf bWFzayA9PSAodXdvcmQpIC0xKQogICAgIHJldHVybiBmYWxzZTsKIAogICBtYnN0YXRlX3QgbWJz ID0geyAwIH07Ci0tIAoyLjYuNAoK --------_56887909000000004124_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Tue Jan 05 06:27:04 2016 Received: (at 20526-done) by debbugs.gnu.org; 5 Jan 2016 11:27:04 +0000 Received: from localhost 
([127.0.0.1]:38129 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGPlY-0005He-Lu for submit@debbugs.gnu.org; Tue, 05 Jan 2016 06:27:04 -0500 Received: from mx1.redhat.com ([209.132.183.28]:50450) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGPlW-0005H9-Cj for 20526-done@debbugs.gnu.org; Tue, 05 Jan 2016 06:27:02 -0500 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) by mx1.redhat.com (Postfix) with ESMTPS id 2B2BAC803; Tue, 5 Jan 2016 11:26:56 +0000 (UTC) Received: from kdudka.brq.redhat.com (kdudka.brq.redhat.com [10.34.4.67]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u05BQrQO016465 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 5 Jan 2016 06:26:54 -0500 From: Kamil Dudka To: Paul Eggert Subject: Re: grep BUG: text file is detected as binary Date: Tue, 05 Jan 2016 12:26:52 +0100 Message-ID: <2421010.Mtp0VzAiTZ@kdudka.brq.redhat.com> User-Agent: KMail/4.14.10 (Linux/4.2.8-300.fc23.x86_64; KDE/4.14.14; x86_64; ; ) In-Reply-To: <5684A010.4000302@cs.ucla.edu> References: <5684A010.4000302@cs.ucla.edu> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.26 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: 20526-done Cc: Benno Schulenberg , 20526-done@debbugs.gnu.org, =?ISO-8859-1?Q?=C1ngel_Gonz=E1lez?= , Johannes Meixner , Hans Pelleboer , Sebastian Poehn , Mike Frysinger , Eric Blake X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) On Wednesday 30 December 2015 19:25:04 Paul Eggert wrote: > I installed into Savannah a patch (attached) that should fix this problem in > typical cases, and 
am boldly marking the bug as done. Please give the fix a > try if you have the time. Thanks. Thanks for the fixup! I can confirm that it resolves the issue described at: https://bugzilla.redhat.com/1219141 Kamil From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 06 02:33:54 2016 Received: (at 20526) by debbugs.gnu.org; 6 Jan 2016 07:33:54 +0000 Received: from localhost ([127.0.0.1]:39424 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGibS-00026J-0D for submit@debbugs.gnu.org; Wed, 06 Jan 2016 02:33:54 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:49629) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGibQ-000264-3W for 20526@debbugs.gnu.org; Wed, 06 Jan 2016 02:33:52 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 3A4691601BC; Tue, 5 Jan 2016 23:33:45 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id vjMfM0m-gVEQ; Tue, 5 Jan 2016 23:33:44 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 63DDC1601BE; Tue, 5 Jan 2016 23:33:44 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id NgjXXYswuDLQ; Tue, 5 Jan 2016 23:33:44 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 3BCCE1601BC; Tue, 5 Jan 2016 23:33:44 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Norihiro Tanaka References: <20160102063903.C3A5.27F6AC2D@kcn.ne.jp> <56875EAE.7030309@cs.ucla.edu> <20160103103157.4131.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <568CC353.5050805@cs.ucla.edu> Date: Tue, 5 Jan 2016 
23:33:39 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <20160103103157.4131.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------060704080709090903020904" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. --------------060704080709090903020904 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Norihiro Tanaka wrote: > I see that it is good idea, but I propose minor change for your > fix. Perhaps, it will be what you want. I think the problem here is that the code was not computing unibyte_mask correctly; that is, the comment for unibyte_mask is correct, and usage of unibyte_mask is correct, but unibyte_mask was sometimes initialized incorrectly in unusual locales. I installed the attached patch to try to fix that. Computing an optimal unibyte_mask (for a reasonable definition of "optimal") is likely more trouble than it is worth. --------------060704080709090903020904 Content-Type: text/x-diff; name="0001-Fix-calculation-of-unibyte_mask.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-Fix-calculation-of-unibyte_mask.patch" >From d5b5b9af641ba2c02e040c7c5678547763937145 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Tue, 5 Jan 2016 23:29:07 -0800 Subject: [PATCH] Fix calculation of unibyte_mask * src/grep.c (initialize_unibyte_mask): The old method worked for UTF-8 and other typical encodings, but did not work for weird encodings, e.g., one where all bytes other than 0x7f and 0x80 are unibyte characters. 
--- src/grep.c | 25 +++++++++++++++++++------ 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/src/grep.c b/src/grep.c index a5f1fa2..f6fb0bc 100644 --- a/src/grep.c +++ b/src/grep.c @@ -502,10 +502,10 @@ clean_up_stdout (void) /* An unsigned type suitable for fast matching. */ typedef uintmax_t uword; -/* All bytes that are not unibyte characters, ANDed together, and then - with the pattern repeated to fill a uword. For an encoding where +/* A mask to test for unibyte characters, with the pattern repeated to + fill a uword. For a multibyte character encoding where all bytes are unibyte characters, this is 0. For UTF-8, this is - 0x808080.... For encodings where unibyte characters have no useful + 0x808080.... For encodings where unibyte characters have no discerned pattern, this is all 1s. The unsigned char C is a unibyte character if C & UNIBYTE_MASK is zero. If the uword W is the concatenation of bytes, the bytes are all unibyte characters @@ -515,10 +515,23 @@ static uword unibyte_mask; static void initialize_unibyte_mask (void) { - unsigned char mask = UCHAR_MAX; + /* For each encoding error I that MASK does not already match, + accumulate I's most significant 1 bit by ORing it into MASK. + Although any 1 bit of I could be used, in practice high-order + bits work better. */ + unsigned char mask = 0; + int ms1b = 1; for (int i = 1; i <= UCHAR_MAX; i++) - if (mbclen_cache[i] != 1) - mask &= i; + if (mbclen_cache[i] != 1 && ! (mask & i)) + { + while (ms1b * 2 <= i) + ms1b *= 2; + mask |= ms1b; + } + + /* Now MASK will detect any encoding-error byte, although it may + cry wolf and it may not be optimal. Build a uword-length mask by + repeating MASK. 
*/ uword uword_max = -1; unibyte_mask = uword_max / UCHAR_MAX * mask; } -- 2.5.0 --------------060704080709090903020904-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 06 03:32:27 2016 Received: (at 20526) by debbugs.gnu.org; 6 Jan 2016 08:32:27 +0000 Received: from localhost ([127.0.0.1]:39443 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGjW7-0003TO-61 for submit@debbugs.gnu.org; Wed, 06 Jan 2016 03:32:27 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:50937) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGjW5-0003TC-Bw for 20526@debbugs.gnu.org; Wed, 06 Jan 2016 03:32:26 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 8DD5D1601BC; Wed, 6 Jan 2016 00:32:19 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id s36yhDG9HL17; Wed, 6 Jan 2016 00:32:18 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 5904F1601E7; Wed, 6 Jan 2016 00:32:18 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 5mL07kphXEmg; Wed, 6 Jan 2016 00:32:18 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 296571601BC; Wed, 6 Jan 2016 00:32:18 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Jim Meyering , 20526@debbugs.gnu.org, sebastian.poehn@gmail.com References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> <5684F57F.2090802@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <568CD111.5010801@cs.ucla.edu> Date: Wed, 6 Jan 2016 00:32:17 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; 
rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <5684F57F.2090802@cs.ucla.edu> Content-Type: multipart/mixed; boundary="------------090306030502020805020807" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: Johannes Meixner , Kamil Dudka , Benno Schulenberg X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. --------------090306030502020805020807 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Paul Eggert wrote: > grep -rP 'fed.*cba' . > > On my machine the above command is 125x slower with the new grep than the old > one, which suggests some tuning is in order before releasing. (It's bogged down > inside libpcre somewhere.) I installed the attached patch, which fixed this performance bug for me. --------------090306030502020805020807 Content-Type: text/x-diff; name="0001-grep-restore-P-PCRE_NO_UTF8_CHECK-optimization.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename*0="0001-grep-restore-P-PCRE_NO_UTF8_CHECK-optimization.patch" >From 6e8f5b27ab033f4551e61740c1bdd6ffa13e9047 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Wed, 6 Jan 2016 00:26:26 -0800 Subject: [PATCH] grep: restore -P PCRE_NO_UTF8_CHECK optimization On my platform in the en_US.utf8 locale, this makes 'grep -P "z.*a" k' 220x faster, where k is created by the shell command: yes 'abcdefg hijklmn opqrstu vwxyz' | head -n 10000000 >k * src/dfasearch.c (EGexecute): * src/grep.c (execute_fp_t): * src/kwsearch.c (Fexecute): * src/pcresearch.c (Pexecute): First arg is now char *, not char const *, since Pexecute now temporarily modifies this argument. * src/grep.c, src/grep.h (buf_has_encoding_errors): Now extern. 
* src/pcresearch.c (Pexecute): Use it. If the input is free of encoding errors, use a multiline search and the PCRE_NO_UTF8_CHECK option, as this is typically way faster. This restores an optimization that was removed with the recent changes for binary file detection. --- src/dfasearch.c | 2 +- src/grep.c | 4 ++-- src/grep.h | 2 ++ src/kwsearch.c | 2 +- src/pcresearch.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++------- 5 files changed, 56 insertions(+), 11 deletions(-) diff --git a/src/dfasearch.c b/src/dfasearch.c index 0205011..a330eac 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -202,7 +202,7 @@ GEAcompile (char const *pattern, size_t size, reg_syntax_t syntax_bits) } size_t -EGexecute (char const *buf, size_t size, size_t *match_size, +EGexecute (char *buf, size_t size, size_t *match_size, char const *start_ptr) { char const *buflim, *beg, *end, *ptr, *match, *best_match, *mb_start; diff --git a/src/grep.c b/src/grep.c index f6fb0bc..10aabf9 100644 --- a/src/grep.c +++ b/src/grep.c @@ -462,7 +462,7 @@ enum { SEEK_HOLE = SEEK_SET }; /* Functions we'll use to search. */ typedef void (*compile_fp_t) (char const *, size_t); -typedef size_t (*execute_fp_t) (char const *, size_t, size_t *, char const *); +typedef size_t (*execute_fp_t) (char *, size_t, size_t *, char const *); static compile_fp_t compile; static execute_fp_t execute; @@ -561,7 +561,7 @@ skip_easy_bytes (char const *buf) /* Return true if BUF, of size SIZE, has an encoding error. BUF must be followed by at least sizeof (uword) bytes, the first of which may be modified. */ -static bool +bool buf_has_encoding_errors (char *buf, size_t size) { if (! 
unibyte_mask) diff --git a/src/grep.h b/src/grep.h index 577fb72..75b7ef7 100644 --- a/src/grep.h +++ b/src/grep.h @@ -29,4 +29,6 @@ extern bool match_words; /* -w */ extern bool match_lines; /* -x */ extern char eolbyte; /* -z */ +extern bool buf_has_encoding_errors (char *, size_t); + #endif diff --git a/src/kwsearch.c b/src/kwsearch.c index e33caaf..e9966d4 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -78,7 +78,7 @@ Fcompile (char const *pattern, size_t size) } size_t -Fexecute (char const *buf, size_t size, size_t *match_size, +Fexecute (char *buf, size_t size, size_t *match_size, char const *start_ptr) { char const *beg, *try, *end, *mb_start; diff --git a/src/pcresearch.c b/src/pcresearch.c index a647514..8f3d935 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -174,7 +174,7 @@ Pcompile (char const *pattern, size_t size) } size_t -Pexecute (char const *buf, size_t size, size_t *match_size, +Pexecute (char *buf, size_t size, size_t *match_size, char const *start_ptr) { #if !HAVE_LIBPCRE @@ -194,13 +194,31 @@ Pexecute (char const *buf, size_t size, size_t *match_size, error. */ char const *subject = buf; + /* If the input is free of encoding errors a multiline search is + typically more efficient. Otherwise, a single-line search is + typically faster, so that pcre_exec doesn't waste time validating + the entire input buffer. */ + bool multiline = ! buf_has_encoding_errors (buf, size - 1); + buf[size - 1] = eolbyte; + for (; p < buf + size; p = line_start = line_end + 1) { - /* A single-line search is typically faster, so that - pcre_exec doesn't waste time validating the entire input - buffer. */ - line_end = memchr (p, eolbyte, buf + size - p); - if (INT_MAX < line_end - p) + bool too_big; + + if (multiline) + { + size_t pcre_size_max = MIN (INT_MAX, SIZE_MAX - 1); + size_t scan_size = MIN (pcre_size_max + 1, buf + size - p); + line_end = memrchr (p, eolbyte, scan_size); + too_big = ! 
line_end; + } + else + { + line_end = memchr (p, eolbyte, buf + size - p); + too_big = INT_MAX < line_end - p; + } + + if (too_big) error (EXIT_TROUBLE, 0, _("exceeded PCRE's line length limit")); for (;;) @@ -228,11 +246,27 @@ Pexecute (char const *buf, size_t size, size_t *match_size, int options = 0; if (!bol) options |= PCRE_NOTBOL; + if (multiline) + options |= PCRE_NO_UTF8_CHECK; e = jit_exec (subject, line_end - subject, search_offset, options, sub); if (e != PCRE_ERROR_BADUTF8) - break; + { + if (0 < e && multiline && sub[1] - sub[0] != 0) + { + char const *nl = memchr (subject + sub[0], eolbyte, + sub[1] - sub[0]); + if (nl) + { + /* This match crosses a line boundary; reject it. */ + p = subject + sub[0]; + line_end = nl; + continue; + } + } + break; + } int valid_bytes = sub[0]; /* Try to match the string before the encoding error. */ @@ -304,6 +338,15 @@ Pexecute (char const *buf, size_t size, size_t *match_size, beg = matchbeg; end = matchend; } + else if (multiline) + { + char const *prev_nl = memrchr (line_start - 1, eolbyte, + matchbeg - (line_start - 1)); + char const *next_nl = memchr (matchend, eolbyte, + line_end + 1 - matchend); + beg = prev_nl + 1; + end = next_nl + 1; + } else { beg = line_start; -- 2.5.0 --------------090306030502020805020807-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 06 12:57:58 2016 Received: (at 20526) by debbugs.gnu.org; 6 Jan 2016 17:57:58 +0000 Received: from localhost ([127.0.0.1]:40539 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGsLO-0006gk-KX for submit@debbugs.gnu.org; Wed, 06 Jan 2016 12:57:58 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:42944) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGsLM-0006gY-Pb for 20526@debbugs.gnu.org; Wed, 06 Jan 2016 12:57:57 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id EA1A21601E7; Wed, 6 Jan 2016 09:57:50 -0800 (PST) Received: from 
zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id oqizPgGGqliX; Wed, 6 Jan 2016 09:57:50 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 2B99D1608D3; Wed, 6 Jan 2016 09:57:50 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id nYy3hYzwjXnN; Wed, 6 Jan 2016 09:57:50 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 0CB971601E7; Wed, 6 Jan 2016 09:57:50 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Jim Meyering , 20526@debbugs.gnu.org, sebastian.poehn@gmail.com References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> <5684F57F.2090802@cs.ucla.edu> <568CD111.5010801@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <568D559A.6050000@cs.ucla.edu> Date: Wed, 6 Jan 2016 09:57:46 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <568CD111.5010801@cs.ucla.edu> Content-Type: multipart/mixed; boundary="------------030203020100050103000008" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: Johannes Meixner , Kamil Dudka , Benno Schulenberg X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. 
--------------030203020100050103000008 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit On 01/06/2016 12:32 AM, Paul Eggert wrote: > I installed the attached patch, which fixed this performance bug for me. Whoops! I forgot to 'git add src/search.h' before committing. We also need the attached followup patch, which I installed. --------------030203020100050103000008 Content-Type: text/x-patch; name="0001-grep-restore-P-optimization-followup-fix.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-grep-restore-P-optimization-followup-fix.patch" >From 5a71d9d4afc2ec1a7a2c6e5c3fac33709ddc6551 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Wed, 6 Jan 2016 09:55:28 -0800 Subject: [PATCH] grep: restore -P optimization (followup fix) * src/search.h (EGexecute, Fexecute, Pexecute): Change decls to match new implementations. I forgot to add this file to the previous commit. --- src/search.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/search.h b/src/search.h index 5031d67..a69bf19 100644 --- a/src/search.h +++ b/src/search.h @@ -57,15 +57,15 @@ extern wint_t mb_next_wc (char const *, char const *); /* dfasearch.c */ extern void GEAcompile (char const *, size_t, reg_syntax_t); -extern size_t EGexecute (char const *, size_t, size_t *, char const *); +extern size_t EGexecute (char *, size_t, size_t *, char const *); /* kwsearch.c */ extern void Fcompile (char const *, size_t); -extern size_t Fexecute (char const *, size_t, size_t *, char const *); +extern size_t Fexecute (char *, size_t, size_t *, char const *); /* pcresearch.c */ extern void Pcompile (char const *, size_t); -extern size_t Pexecute (char const *, size_t, size_t *, char const *); +extern size_t Pexecute (char *, size_t, size_t *, char const *); /* Return the number of bytes in the character at the start of S, which is of size N. N must be positive. MBS is the conversion state. 
-- 2.5.0 --------------030203020100050103000008-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 06 13:12:02 2016 Received: (at 20526) by debbugs.gnu.org; 6 Jan 2016 18:12:02 +0000 Received: from localhost ([127.0.0.1]:40561 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGsZ0-0007lL-KA for submit@debbugs.gnu.org; Wed, 06 Jan 2016 13:12:02 -0500 Received: from mail-ig0-f174.google.com ([209.85.213.174]:32917) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aGsYz-0007kt-E6 for 20526@debbugs.gnu.org; Wed, 06 Jan 2016 13:12:01 -0500 Received: by mail-ig0-f174.google.com with SMTP id z14so33816620igp.0 for <20526@debbugs.gnu.org>; Wed, 06 Jan 2016 10:12:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=R4sGcYI5M4Rbqnp4REeuDY3i2R6+5SY/5wzOIoRXSPc=; b=D9qL1Mkgvges/iEV9w5H5iAkk22PoaLsPb3FqivwxhjOe8W7tV9iUyF3tjMRr1xZJn xMe1g7daElVoNvvPYYE2Eadjzj29zS0eBNBKFXf2/YJt6drim9lVm5UlS3khGG3zgz3f gLIx3TPq2xWAVmHCl+B9tN3jBdLmJyGt3OpG6EJSiTqSxCTa+XgZ6dSdt8ucpUmlpq86 DXIdkQmMyKM47Cp9BDP1ZzJNwIxeUbKvSeyQuR3A3+GdG2cAP8lBm0l9xCmD4+43kzR5 GpOdeV8EKUgs7CLnLmZ8Q/6Q5oTLuLJFbS4ZpuyVCOzCeiALULX0UXpfLZaP1JjtE/sl nCCA== X-Received: by 10.50.150.5 with SMTP id ue5mr9930314igb.50.1452103916019; Wed, 06 Jan 2016 10:11:56 -0800 (PST) MIME-Version: 1.0 Received: by 10.36.10.18 with HTTP; Wed, 6 Jan 2016 10:11:36 -0800 (PST) In-Reply-To: <568D559A.6050000@cs.ucla.edu> References: <1430996888.2678.8.camel@googlemail.com> <5684A010.4000302@cs.ucla.edu> <5684F57F.2090802@cs.ucla.edu> <568CD111.5010801@cs.ucla.edu> <568D559A.6050000@cs.ucla.edu> From: Jim Meyering Date: Wed, 6 Jan 2016 10:11:36 -0800 X-Google-Sender-Auth: XymAh83JFStohigb9e-rs7QqvmA Message-ID: Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Paul Eggert Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.5 (/) 
X-Debbugs-Envelope-To: 20526 Cc: =?UTF-8?Q?Sebastian_P=C3=B6hn?= , Kamil Dudka , Benno Schulenberg , 20526@debbugs.gnu.org, Johannes Meixner X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) On Wed, Jan 6, 2016 at 9:57 AM, Paul Eggert wrote: > On 01/06/2016 12:32 AM, Paul Eggert wrote: >> >> I installed the attached patch, which fixed this performance bug for me. > > Whoops! I forgot to 'git add src/search.h' before committing. We also need > the attached followup patch, which I installed. Oh, perfect! Thank you once again. Happy new year. Interestingly, while running tests of the just-updated code, I've just noticed an unrelated false-positive failure on fast systems: I will adjust the mb-non-UTF8-performance test to be more adaptive: rather than using a fixed-size input, I'll choose one that is large enough to make the unibyte grep invocation take a certain amount of time. Once that's resolved, I'll make a pre-release snapshot, planning to let that soak for a couple weeks before releasing grep-2.23. 
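
For readers following the -P optimization patch quoted in this thread: once a whole buffer is searched in one call, a match may span an end-of-line byte, so it has to be rejected, and an accepted match has to be widened to its enclosing line before output. The following is only a minimal sketch of those two steps in isolation. The helper names match_crosses_line and widen_to_line are hypothetical and do not appear in grep; the backward scan is written by hand rather than with memrchr so that the sketch stays within standard C.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of grep: return a pointer to the first
   end-of-line byte EOL inside the match [MATCH_BEG, MATCH_END), or
   NULL if the match stays within a single line.  */
static char const *
match_crosses_line (char const *match_beg, char const *match_end, char eol)
{
  return memchr (match_beg, eol, match_end - match_beg);
}

/* Hypothetical helper, not part of grep: widen the match [*BEG, *END)
   to the whole line that contains it.  The buffer is [BUF, BUF_END).
   On return, *BEG points just after the previous EOL (or at BUF), and
   *END points just after the next EOL (or at BUF_END if there is
   none).  */
static void
widen_to_line (char const *buf, char const *buf_end, char eol,
               char const **beg, char const **end)
{
  char const *b = *beg;

  /* Scan backward to the start of the line.  */
  while (buf < b && b[-1] != eol)
    b--;

  /* Scan forward to the end of the line, including its EOL byte.  */
  char const *nl = memchr (*end, eol, buf_end - *end);

  *beg = b;
  *end = nl ? nl + 1 : buf_end;
}
```

With the buffer "abc\ndef\nghi\n", a match on "ef" does not cross a line and widens to "def\n", whereas a match on "c\nd" is reported as crossing at the first newline.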
From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 08 08:44:40 2016 Received: (at 20526) by debbugs.gnu.org; 8 Jan 2016 13:44:40 +0000 Received: from localhost ([127.0.0.1]:42394 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aHXLM-0006J0-3l for submit@debbugs.gnu.org; Fri, 08 Jan 2016 08:44:40 -0500 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:58089) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aHXLJ-0006Il-RJ for 20526@debbugs.gnu.org; Fri, 08 Jan 2016 08:44:38 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id B16F380576 for <20526@debbugs.gnu.org>; Fri, 8 Jan 2016 22:44:29 +0900 (JST) X-matriXscan-loop-detect: 0a2bea3fa4dd6538ff7cae9d2b51040f15110c2e Received: from mail03.kcn.ne.jp ([61.86.6.182]) by mxs02-s with ESMTP; Fri, 08 Jan 2016 22:44:28 +0900 (JST) Received: from [10.120.1.74] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail03.kcn.ne.jp (Postfix) with ESMTPA id 9DBFF141009A; Fri, 8 Jan 2016 22:44:27 +0900 (JST) Date: Fri, 08 Jan 2016 22:44:28 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#20526: grep BUG: text file is detected as binary In-Reply-To: <568D559A.6050000@cs.ucla.edu> References: <568CD111.5010801@cs.ucla.edu> <568D559A.6050000@cs.ucla.edu> Message-Id: <20160108224427.A9B6.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: Kamil Dudka , Benno Schulenberg , Jim Meyering , Johannes Meixner , sebastian.poehn@gmail.com, 22103-done@gnu.org, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On Wed, 6 Jan 2016 09:57:46 -0800 Paul Eggert wrote: > On 01/06/2016 12:32 AM, Paul Eggert wrote: > > I installed the attached patch, which fixed this performance bug for me. > Whoops! I forgot to 'git add src/search.h' before committing. We also need the attached followup patch, which I installed. Great! Thanks, many issues, including the output of invalid sequences, are fixed by your patches. bug#22103 is also fixed by them, so I am closing it. 
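
As a recap of the unibyte_mask fix discussed earlier in this thread, the mask calculation can be sketched outside of grep as shown below. This is a minimal sketch, not the code in src/grep.c: the function names compute_byte_mask and repeat_mask and the TABLE parameter (a stand-in for grep's mbclen_cache, where an entry of 1 means "this byte is a unibyte character") are assumptions made for illustration. For a UTF-8-like table the result is 0x80. For the unusual encoding mentioned in the commit message, where only 0x7f and 0x80 are not unibyte characters, the result is 0xC0, whereas ANDing the two bytes together as the old code did yields 0x7f & 0x80 = 0, which would wrongly claim that every byte is a unibyte character.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* An unsigned type suitable for fast matching.  */
typedef uintmax_t uword;

/* Return a one-byte mask such that (C & mask) != 0 for every byte C
   whose TABLE entry is not 1, that is, for every byte that is not a
   unibyte character.  TABLE is a hypothetical stand-in for grep's
   mbclen_cache.  The mask may also flag some valid bytes; that is
   safe but can make later scanning slower.  */
static unsigned char
compute_byte_mask (signed char const *table)
{
  unsigned char mask = 0;
  int ms1b = 1;

  for (int i = 1; i <= UCHAR_MAX; i++)
    if (table[i] != 1 && ! (mask & i))
      {
        /* Advance MS1B to the most significant 1 bit of I.  Because I
           only increases, MS1B never needs to shrink.  */
        while (ms1b * 2 <= i)
          ms1b *= 2;
        mask |= ms1b;
      }

  /* Postcondition: every non-unibyte byte is flagged by MASK.  */
  for (int i = 1; i <= UCHAR_MAX; i++)
    assert (table[i] == 1 || (mask & i));

  return mask;
}

/* Repeat MASK in every byte of a uword, so that several input bytes
   can be tested with a single AND.  */
static uword
repeat_mask (unsigned char mask)
{
  uword uword_max = -1;
  return uword_max / UCHAR_MAX * mask;
}
```

The division uword_max / UCHAR_MAX gives a word with the value 1 in each byte (0x0101... when bytes are 8 bits wide), so multiplying by the one-byte mask copies it into every byte position.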
From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 08 10:28:11 2016 Received: (at 20526) by debbugs.gnu.org; 8 Jan 2016 15:28:11 +0000 Received: from localhost ([127.0.0.1]:43335 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aHYxW-0002CO-ON for submit@debbugs.gnu.org; Fri, 08 Jan 2016 10:28:11 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35221) by debbugs.gnu.org with esmtp (Exim 4.84) (envelope-from ) id 1aHYxV-0002CC-H9 for 20526@debbugs.gnu.org; Fri, 08 Jan 2016 10:28:10 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id BE927160D3D; Fri, 8 Jan 2016 07:28:02 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id J9Sc3JogJ2Km; Fri, 8 Jan 2016 07:28:01 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 981AF160D51; Fri, 8 Jan 2016 07:28:01 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 9IflcM-CltZY; Fri, 8 Jan 2016 07:28:01 -0800 (PST) Received: from [192.168.1.9] (pool-100-32-155-148.lsanca.fios.verizon.net [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 70E46160D3D; Fri, 8 Jan 2016 07:28:01 -0800 (PST) Subject: Re: bug#20526: grep BUG: text file is detected as binary To: Norihiro Tanaka References: <20160101002311.8FB1.27F6AC2D@kcn.ne.jp> <56856E16.3010207@cs.ucla.edu> <20160102063903.C3A5.27F6AC2D@kcn.ne.jp> <56875EAE.7030309@cs.ucla.edu> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <568FD57C.9040201@cs.ucla.edu> Date: Fri, 8 Jan 2016 07:27:56 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.4.0 MIME-Version: 1.0 In-Reply-To: <56875EAE.7030309@cs.ucla.edu> Content-Type: 
multipart/mixed; boundary="------------050906080803040606050809" X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 20526 Cc: sebastian.poehn@gmail.com, 20526@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) This is a multi-part message in MIME format. --------------050906080803040606050809 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Paul Eggert wrote: > I missed the possibility of a unibyte encoding where not all bytes are valid > unibyte characters. I found a significant performance problem related to that bug and bug fix, and installed the attached further patch 0001. Come to think of it, this issue should be in NEWS too, so I added the attached patch 0002. --------------050906080803040606050809 Content-Type: text/x-diff; name="0001-grep-improve-unibyte-P-performance.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-grep-improve-unibyte-P-performance.patch" >From d1160ec6d239b2e0f20c2fb3395e3b70963bf916 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 7 Jan 2016 21:28:23 -0800 Subject: [PATCH 1/2] grep: improve unibyte -P performance This is a followon to the recent changes prompted by Bug#20526. In Norihiro Tanaka pointed out that grep mistakenly assumed that unibyte locales cannot have encoding errors. Here, the mistake hurt performance significantly. On Fedora 23 x86-64 in the C locale, this patch improved grep's performance by a factor of 7 when run as "grep -P 'z.*a'" on the output of "yes $(printf '\200\n') | head -n 1000000000". * src/pcresearch.c (multibyte_locale) [HAVE_LIBPCRE]: New static var. (Pcompile): Set it. (Pexecute): Use it to avoid the need to call buf_has_encoding_errors in unibyte locales. 
--- src/pcresearch.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/src/pcresearch.c b/src/pcresearch.c index c0b8678..1fae94d 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -84,6 +84,8 @@ jit_exec (char const *subject, int search_bytes, int search_offset, /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty string matches when that flag is used. */ static int empty_match[2]; + +static bool multibyte_locale; #endif void @@ -104,10 +106,14 @@ Pcompile (char const *pattern, size_t size) char const *p; char const *pnul; - if (using_utf8 ()) - flags |= PCRE_UTF8; - else if (MB_CUR_MAX != 1) - error (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales")); + if (1 < MB_CUR_MAX) + { + if (! using_utf8 ()) + error (EXIT_TROUBLE, 0, + _("-P supports only unibyte and UTF-8 locales")); + multibyte_locale = true; + flags |= PCRE_UTF8; + } /* FIXME: Remove these restrictions. */ if (memchr (pattern, '\n', size)) @@ -194,12 +200,16 @@ Pexecute (char *buf, size_t size, size_t *match_size, error. */ char const *subject = buf; - /* If the input is free of encoding errors a multiline search is + /* If the input is unibyte or is free of encoding errors a multiline search is typically more efficient. Otherwise, a single-line search is typically faster, so that pcre_exec doesn't waste time validating the entire input buffer. */ - bool multiline = ! buf_has_encoding_errors (buf, size - 1); - buf[size - 1] = eolbyte; + bool multiline = true; + if (multibyte_locale) + { + multiline = ! 
buf_has_encoding_errors (buf, size - 1); + buf[size - 1] = eolbyte; + } for (; p < buf + size; p = line_start = line_end + 1) { -- 2.5.0 --------------050906080803040606050809 Content-Type: text/x-diff; name="0002-doc-mention-unibyte-encoding-fix.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0002-doc-mention-unibyte-encoding-fix.patch" >From ca68df394ba1d9359c0e4d825394ab875c7fe1c2 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Thu, 7 Jan 2016 21:34:00 -0800 Subject: [PATCH 2/2] doc: mention unibyte encoding fix * NEWS: Document recent fix for encoding errors in unibyte locales. --- NEWS | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/NEWS b/NEWS index f572a0c..a0f6bbb 100644 --- a/NEWS +++ b/NEWS @@ -18,6 +18,11 @@ GNU grep NEWS -*- outline -*- grep -c no longer stops counting when finding binary data. [bug introduced in grep-2.21] + grep no longer outputs encoding errors in unibyte locales. + For example, if the byte '\x81' is not a valid character in a + unibyte locale, grep treats the byte as binary data. + [bug introduced in grep-2.21] + grep -oP is no longer susceptible to an infinite loop when processing invalid UTF8 just before a match. [bug introduced in grep-2.22] -- 2.5.0 --------------050906080803040606050809-- From unknown Sun Jun 22 00:05:03 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Sat, 06 Feb 2016 12:24:04 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator