GNU bug report logs - #28255
grep erroneously skips Microsoft UTF-8 text files as being binary

Package: grep

Reported by: Simon <ixlr82c <at> teksavvy.com>

Date: Sun, 27 Aug 2017 21:24:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Simon <ixlr82c <at> teksavvy.com>
Cc: 28255 <at> debbugs.gnu.org
Subject: bug#28255: grep erroneously skips Microsoft UTF-8 text files as being binary
Date: Sun, 27 Aug 2017 17:18:47 -0700
Simon wrote:
> Sorry my description was slightly ambiguous.  I should not have said
> skip so much as treats the file as binary and does not find a match
> because each character takes 2 octets as per utf-8.
> 
> $ mkdir tmp
> $ cd tmp
> $
> $ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
> $ printf 'test2\r\n' >2.txt
> $
> $ hexdump -C 1.txt
> 00000000  ff fe 74 00 65 00 73 00  74 00 31 00 0d 00 0a 00  |..t.e.s.t.1.....|
> 00000010
> $ hexdump -C 2.txt
> 00000000  74 65 73 74 32 0d 0a                              |test2..|
> 00000007
> $
> $ grep --include=*.txt test *
> 2.txt:test2
> $
> 
> I've made the two files as they appear on a Windows system (since lots
> of us move lots of files between operating systems).  As you can see,
> the "1.txt" is skipped because the characters are encoded two octets per
> character.
> 
> As an example that "1.txt" is a valid Windows text file, if you edit
> "1.txt" with Notepad on a Windows system, Notepad will detect BOM at the
> beginning and switch to UTF-8 encoding, and preserve it upon saving.
> 
> That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
> format for Windows text files.  (I can only confirm Win 7 or higher.)
> 
> I guess this should really be considered a feature, not a bug.
> 
> Similar happens for Cygwin grep running under windows.

You're right. grep and most other GNU tools do not support UTF-16. You can use 
the 'recode' command to convert to UTF-8, which grep does support.
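
A minimal sketch of that workaround, assuming `iconv` is installed. It is swapped in here for the `recode` command named above, purely as an illustration. The sketch recreates the reporter's `1.txt` in a scratch directory, converts it from UTF-16 to UTF-8 on the fly, and searches the converted stream. The `dir` and `result` variable names are hypothetical.

```shell
# Work in a scratch directory so no existing files are touched.
dir=$(mktemp -d)

# Recreate the sample from the report: a UTF-16 BOM (ff fe)
# followed by "test1\r\n" at two octets per character.
printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >"$dir/1.txt"

# Convert to UTF-8 before searching. With "-f UTF-16", iconv uses the BOM
# to pick the byte order. tr strips the carriage return left over from the
# Windows line ending, so the matched line is clean.
# Assumption: iconv stands in for the recode command suggested in the reply.
result=$(iconv -f UTF-16 -t UTF-8 "$dir/1.txt" | tr -d '\r' | grep test)
printf '%s\n' "$result"

# Clean up the scratch directory.
rm -rf "$dir"
```

Piping through a converter leaves the original file untouched. Converting the file in place once would be the alternative if it is searched often.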




This bug report was last modified 5 years and 140 days ago.