From unknown Mon Jun 16 23:51:26 2025
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
From: bug#28255 <28255@debbugs.gnu.org>
To: bug#28255 <28255@debbugs.gnu.org>
Subject: Status: grep erroneously skips Microsoft UTF-8 text files as being binary
Reply-To: bug#28255 <28255@debbugs.gnu.org>
Date: Tue, 17 Jun 2025 06:51:26 +0000

retitle 28255 grep erroneously skips Microsoft UTF-8 text files as being binary
reassign 28255 grep
submitter 28255 Simon
severity 28255 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 27 17:23:52 2017
To: bug-grep@gnu.org
From: Simon
Subject: grep erroneously skips Microsoft UTF-8 text files as being binary
Message-ID: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@teksavvy.com>
Date: Sun, 27 Aug 2017 17:13:34 -0400
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Windows text files can start with a byte order mark of U+FEFF and then
be encoded in UTF-8. These are skipped as being binary files.

From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 27 17:47:37 2017
Subject: Re: bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary
To: Simon, 28255@debbugs.gnu.org
References: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@teksavvy.com>
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <80b5a5bd-7b47-74b8-01b4-b681d8cc12ee@cs.ucla.edu>
Date: Sun, 27 Aug 2017 14:47:28 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@teksavvy.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit

Simon wrote:
> Windows text files can start with a byte order mark of U+FEFF and then
> be encoded in UTF-8. These are skipped as being binary files.

I can't reproduce this problem on Fedora 26 x86-64. Here's how I tried:

$ printf '\357\273\277x\n' >t
$ LC_ALL=C grep x t | od -c
0000000 357 273 277 x \n
0000005

To help us diagnose the problem, please send a simple, self-contained
example, and mention your platform.

From debbugs-submit-bounces@debbugs.gnu.org Sun Aug 27 20:18:58 2017
Subject: Re: bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary
To: Simon
Cc: 28255@debbugs.gnu.org
References: <8a3899b9-0117-f694-eff5-bcfbdd8150a3@teksavvy.com> <80b5a5bd-7b47-74b8-01b4-b681d8cc12ee@cs.ucla.edu> <148439f0-7616-e9bd-9ccd-fe114e6ab602@teksavvy.com>
From: Paul Eggert
Organization: UCLA Computer Science Department
Date: Sun, 27 Aug 2017 17:18:47 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <148439f0-7616-e9bd-9ccd-fe114e6ab602@teksavvy.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 7bit

Simon wrote:
> Sorry my description was slightly ambiguous. I should not have said
> skip so much as treats the file as binary and does not find a match
> because each character takes 2 octets as per utf-8.
>
> $ mkdir tmp
> $ cd tmp
> $
> $ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
> $ printf 'test2\r\n' >2.txt
> $
> $ hexdump -C 1.txt
> 00000000  ff fe 74 00 65 00 73 00  74 00 31 00 0d 00 0a 00  |..t.e.s.t.1.....|
> 00000010
> $ hexdump -C 2.txt
> 00000000  74 65 73 74 32 0d 0a  |test2..|
> 00000007
> $
> $ grep --include=*.txt test *
> 2.txt:test2
> $
>
> I've made the two files as they appear on a Windows system (since lots
> of us move lots of files between operating systems). As you can see,
> the "1.txt" is skipped because the characters are encoded two octets per
> byte.
>
> As an example that "1.txt" is a valid Windows text file, if you edit
> "1.txt" with Notepad on a Windows system, Notepad will detect BOM at the
> beginning and switch to UTF-8 encoding, and preserve it upon saving.
>
> That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
> format for Windows text files. (I can only confirm Win 7 or higher.)
>
> I guess this should really be considered a feature, not a bug.
>
> Similar happens for Cygwin grep running under windows.

You're right. grep and most other GNU tools do not support UTF-16. You
can use the 'recode' command to convert to UTF-8, which grep does support.

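The conversion suggested in the reply above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `iconv` rather than the `recode` command named in the thread (a substitution, chosen because `iconv` is more commonly installed), it assumes an `iconv` that honors the byte order mark when decoding `UTF-16`, and the scratch directory and the `result` variable are hypothetical names introduced for the demo.

```shell
# Recreate the UTF-16LE file from the report (BOM ff fe, then "test1\r\n")
# in a throwaway scratch directory (hypothetical location).
tmpdir=$(mktemp -d)
printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' > "$tmpdir/1.txt"

# Assumption: iconv stands in for 'recode' here. It decodes the UTF-16 input
# (using the BOM to pick the byte order) and writes UTF-8, which grep can
# search as ordinary text. tr strips the Windows carriage return.
result=$(iconv -f UTF-16 -t UTF-8 "$tmpdir/1.txt" | grep test | tr -d '\r')
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$tmpdir"
```

Converting in a pipeline leaves the original file untouched, which may be preferable when the file still has to be read by Windows programs that expect UTF-16.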
From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 31 14:47:31 2019
To: control@debbugs.gnu.org
From: Paul Eggert
Subject: 28255 is not a bug
Organization: UCLA Computer Science Department
Message-ID: <6bb40db3-4e5b-2611-fb60-56e768f8f4a3@cs.ucla.edu>
Date: Tue, 31 Dec 2019 11:47:22 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit

close 28255

From unknown Mon Jun 16 23:51:26 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Wed, 29 Jan 2020 12:24:06 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator