GNU bug report logs - #22838: New 'Binary file' detection considered harmful
On 02/29/2016 06:56 PM, Eric Blake wrote:
> On 02/29/2016 10:54 AM, Eric Blake wrote:
>> Encoding errors are not characters, but bytes. A line cannot contain
>> encoding errors. Therefore, a file with encoding errors is not a text file.
>
> Corollary - there exist files which are text files in some locales, but
> binary files in others (based on whether the locale interprets the bytes
> as an encoding error or as valid characters).
>
> Yes, locale dependencies on standard behavior can be annoying.
>
You assume that a user will only ever want to grep text files encoded in
the machine's locale. That is not so.
As a German user I have files on my disk in many encodings: utf-8,
iso-8859-1, win-1252, iso-8859-15, now-defunct encodings like
CP850, CP847, and "German 7-bit ASCII" (which replaced braces with umlauts),
and old WordStar files that used embedded control characters.
Since grep 2.21 I now always have to specify -a or LC_ALL=C when
grepping my files.
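
To illustrate the locale dependence discussed above, here is a minimal Python sketch. It is not grep's actual detection code; the helper name decodes_cleanly and the sample bytes are assumptions made up for this example. It only shows that the same bytes are free of encoding errors under one encoding and contain an encoding error under another.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` has no encoding errors under `encoding`.

    Hypothetical helper for illustration only; grep's real check works
    on the C library's multibyte routines, not on Python codecs.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# The German word "Gruesse" with u-umlaut (0xFC) and sharp s (0xDF),
# as an ISO-8859-1 editor would have written it to disk.
latin1_bytes = b"Gr\xfc\xdfe\n"

for enc in ("utf-8", "iso-8859-1"):
    verdict = "no encoding errors" if decodes_cleanly(latin1_bytes, enc) else "encoding error"
    print(f"{enc}: {verdict}")
```

Under a UTF-8 interpretation the byte 0xFC cannot start a valid sequence, so the line counts as containing an encoding error; under ISO-8859-1 every byte is a valid character. This is the situation in which the same file is "text" in one locale and "binary" in another.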
Regards
--
Marcello Perathoner
This bug report was last modified 8 years and 256 days ago.