GNU bug report logs - #28255
grep erroneously skips Microsoft UTF-8 text files as being binary
Reported by: Simon <ixlr82c <at> teksavvy.com>
Date: Sun, 27 Aug 2017 21:24:02 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Simon wrote:
> Sorry my description was slightly ambiguous. I should not have said
> skip so much as treats the file as binary and does not find a match
> because each character takes 2 octets as per utf-8.
>
> $ mkdir tmp
> $ cd tmp
> $
> $ printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >1.txt
> $ printf 'test2\r\n' >2.txt
> $
> $ hexdump -C 1.txt
> 00000000 ff fe 74 00 65 00 73 00 74 00 31 00 0d 00 0a 00 |..t.e.s.t.1.....|
> 00000010
> $ hexdump -C 2.txt
> 00000000 74 65 73 74 32 0d 0a |test2..|
> 00000007
> $
> $ grep --include=*.txt test *
> 2.txt:test2
> $
>
> I've made the two files as they appear on a Windows system (since lots
> of us move lots of files between operating systems). As you can see,
> the "1.txt" is skipped because the characters are encoded two octets per
> byte.
>
> As an example that "1.txt" is a valid Windows text file, if you edit
> "1.txt" with Notepad on a Windows system, Notepad will detect BOM at the
> beginning and switch to UTF-8 encoding, and preserve it upon saving.
>
> That is, UTF-8 (BOM + 2 octet characters) is an acceptable text file
> format for Windows text files. (I can only confirm Win 7 or higher.)
>
> I guess this should really be considered a feature, not a bug.
>
> Similar happens for Cygwin grep running under windows.
You're right. grep and most other GNU tools do not support UTF-16. You can use
the 'recode' command to convert to UTF-8, which grep does support.
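
A minimal sketch of that workaround, for illustration only. It assumes `iconv` and `mktemp` are installed, and it swaps in `iconv` for the 'recode' command suggested above; the directory and variable names are hypothetical. The sample file is the UTF-16LE "1.txt" from the report (BOM ff fe, then two octets per character).

```shell
# Work in a scratch directory (assumption: mktemp is available).
dir=$(mktemp -d)

# Recreate the reporter's 1.txt: UTF-16LE BOM followed by "test1\r\n".
printf '\377\376\164\000\145\000\163\000\164\000\061\000\015\000\012\000' >"$dir/1.txt"

# Convert to UTF-8 on the fly. With "-f UTF-16", iconv takes the byte
# order from the BOM. tr strips the Windows carriage return so the
# captured line is clean.
matches=$(iconv -f UTF-16 -t UTF-8 "$dir/1.txt" | tr -d '\r' | grep test)

printf '%s\n' "$matches"
```

Piping through a converter leaves the original file untouched, which may be preferable when the files are shared with a Windows system.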
This bug report was last modified 5 years and 140 days ago.