GNU bug report logs -
#22838
New 'Binary file' detection considered harmful
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 22838 in the body.
You can then email your comments to 22838 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Sun, 28 Feb 2016 18:13:01 GMT) Full text and rfc822 format available.
Acknowledgement sent to Marcello Perathoner <marcello <at> perathoner.de>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 28 Feb 2016 18:13:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
The new heuristics to detect 'Binary files' should be reverted to the
old one (before 2.20) as the new one has too big a potential to silently
fail important tasks.
One of the most important use cases of grep is processing file lists,
e.g. in the pipeline find | grep | tar. This is often done by backup
software, e.g. in the Debian package 'backup2l'.
The new behaviour of grep -- to output 'Binary file matches' after
output started -- has silently broken the 'backup2l' script and has the
potential of silently breaking many other backup scripts as well.
Test case:
$ find /etc/ssl/certs/ | LANG= grep pem
Outcome:
grep will stop with 'Binary file (standard input) matches' after
outputting a small percentage of the existing .pem files.
Expected behaviour:
grep should list all .pem files.
This behaviour is particularly insidious because users may not notice
that their backup archives are a bit smaller than before or that their
backups complete a bit faster, while many thousand files may be missing.
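The truncation described above can be checked on a small, made-up file list. This is only a minimal sketch, assuming GNU grep and a POSIX shell; the three paths and the variable names are hypothetical. The count without -a is printed but not stated here, because it depends on the grep version and the locale:

```shell
# Hypothetical file list; the second name contains byte 0xFF,
# which is not valid in UTF-8.
list=$(mktemp)
printf '/etc/ssl/certs/a.pem\n' > "$list"
printf '/etc/ssl/certs/b\377.pem\n' >> "$list"
printf '/etc/ssl/certs/c.pem\n' >> "$list"

# Default mode: the number of names printed varies by grep version and locale.
default_lines=$(grep pem "$list" | wc -l)

# With -a (--text) grep treats the input as text and prints every match.
text_lines=$(grep -a pem "$list" | wc -l)

echo "default: $default_lines, with -a: $text_lines"
```

Comparing the two counts in a backup script is one cheap way to notice that names were dropped.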
Q: Why do you use LANG= ?
A: To illustrate the problem and because 'backup2l' does that.
Q: Why don't people use the -a switch?
A: People may not notice anything wrong with their backups until they
need them.
Q: Why don't you file a bug against 'backup2l'?
A: I will. But this is such a common use case that I suspect that many
of the backup scripts that people wrote just for themselves are now broken.
Q: Why don't you just set the correct locale?
A: Even then it suffices to have one bogus-encoded filename somewhere to
break your whole backup. It is easy to catch such a file from the
internet or from song or picture metadata.
Regards
--
Marcello Perathoner
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Sun, 28 Feb 2016 22:14:01 GMT)
Message #8 received at 22838 <at> debbugs.gnu.org (full text, mbox):
Marcello Perathoner wrote:
> The new behaviour of grep -- to output 'Binary file matches' after output
> started
I assume that the "new behavior" you're talking about is for grep 2.21
(2014-11-23) and later, as that's the version of grep that started outputting
"Binary file matches" due to input encoding errors. For example, on my platform
(Ubuntu 15.10), the shell command:
LC_ALL=C awk 'BEGIN {for(i=1; i<256; i++) printf "%c %d\n", i, i}' |
LC_ALL=en_US.utf8 grep 126
outputs "Binary file (standard input) matches" in grep 2.21.
These changes were put in partly due to security issues, not only having to do
with grep's internals (the old 'grep' would dump core sometimes when given
encoding errors), but also for the benefit of invokers expecting properly
encoded text.
To some extent we were stuck between a rock and a hard place here. No matter
what 'grep' does, it will do the wrong thing for some usages. But overall we
thought it better for grep's output to be valid text.
I think you can work around the problem for unfixed backup2l by setting your
system's locale to a unibyte locale where all bytes are valid. The
en_US.ISO-8859-15 locale, say.
Of course backup2l should get fixed, regardless of what we do with 'grep' or
with your system locale.
> $ find /etc/ssl/certs/ | LANG= grep pem
Wouldn't the following be better?
find /etc/ssl/certs/ -name '*.pem'
This avoids false matches like '/etc/ssl/certs/pemmican'. Alternatively:
find /etc/ssl/certs/ -print | grep -a '\.pem$'
> It is easy to catch such a file from the internet or from song or picture metadata.
None of the above approaches will work for arbitrary file names ("off the
Internet"), because they all mishandle file names containing newlines. backup2l
needs to do something like this:
find /etc/ssl/certs/ -name '*.pem' -print0
or like this:
find /etc/ssl/certs/ -print0 | grep -az '\.pem$'
with remaining code using null bytes instead of newlines to terminate file
names. This is the sort of thing that backup2l should be doing.
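A minimal sketch of such a null-terminated pipeline, assuming GNU find, GNU grep and GNU tar (the --null and -T options belong to GNU tar). The demo directory, its file names and the list_pem helper are hypothetical and exist only for illustration:

```shell
# Hypothetical demo tree; one file name contains a newline.
demo=$(mktemp -d)
mkdir "$demo/certs"
touch "$demo/certs/a.pem" "$demo/certs/b.txt" "$demo/certs/odd
name.pem"

# List .pem files as NUL-terminated records.
# -print0 (find) and -z (grep) switch the record separator to NUL;
# -a keeps grep from reporting the stream as binary.
list_pem() {
  find "$1" -print0 | grep -az '\.pem$'
}

# Count the records, then archive exactly those files.
pem_count=$(list_pem "$demo/certs" | tr -cd '\0' | wc -c)
list_pem "$demo/certs" | tar --null -T - -cf "$demo/pem.tar" 2>/dev/null

echo "archived $pem_count files"
```

The file whose name contains a newline is counted once, which a newline-delimited pipeline would not guarantee.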
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 17:15:02 GMT)
Message #11 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/28/2016 11:13 PM, Paul Eggert wrote:
> These changes were put in partly due to security issues, not only having
> to do with grep's internals (the old 'grep' would dump core sometimes
> when given encoding errors), but also for the benefit of invokers
> expecting properly encoded text.
>
> To some extent we were stuck between a rock and a hard place here. No
> matter what 'grep' does, it will do the wrong thing for some usages. But
> overall we thought it better for grep's output to be valid text.
You are driving out demons by Beelzebub.
grep is a core component of every unix system. You cannot change the
behaviour or interface of such a fundamental tool without incurring
substantial breakage. Keeping the old bug is far wiser than to fix it
and introduce a new bug.
Copying faulty input to the output is a preferable failure mode to
dropping part of the expected output. People do not expect grep to
validate their input but they do expect grep to produce a complete
result set.
A text file with encoding problems is a text file and not a binary file.
>> $ find /etc/ssl/certs/ | LANG= grep pem
>
> Wouldn't the following be better?
>
> find /etc/ssl/certs/ -name '*.pem'
I'm not doing that. That was just an example to show how grep now gives
incorrect results.
Many more cases can be made: any process that feeds tainted
(user-provided) strings to grep can now be made to fail. Eg. a process
that greps apache logs for known exploit signatures will now fail if the
attacker sends a bogus user-agent string.
Regards
--
Marcello Perathoner
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 17:23:02 GMT)
Message #14 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 10:14 AM, Marcello Perathoner wrote:
>
> A text file with encoding problems is a text file and not a binary file.
Wrong, at least according to the POSIX definition of text file. A text
file is one with no encoding errors.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 17:41:01 GMT)
Message #17 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 06:22 PM, Eric Blake wrote:
> On 02/29/2016 10:14 AM, Marcello Perathoner wrote:
>>
>> A text file with encoding problems is a text file and not a binary file.
>
> Wrong, at least according to the POSIX definition of text file. A text
> file is one with no encoding errors.
"""
3.397 Text File
A file that contains characters organized into zero or more lines. The
lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
in length, including the <newline> character. Although POSIX.1-2008 does
not distinguish between text files and binary files (see the ISO C
standard), many utilities only produce predictable or meaningful output
when operating on text files. The standard utilities that have such
restrictions always specify "text files" in their STDIN or INPUT FILES
sections.
"""
-- The Open Group Base Specifications Issue 7
IEEE Std 1003.1, 2013 Edition
Copyright © 2001-2013 The IEEE and The Open Group
Regards
--
Marcello Perathoner
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 17:55:02 GMT)
Message #20 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 10:40 AM, Marcello Perathoner wrote:
>> Wrong, at least according to the POSIX definition of text file. A text
>> file is one with no encoding errors.
>
>
> """
> 3.397 Text File
>
> A file that contains characters organized into zero or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline> character. Although POSIX.1-2008 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections.
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html
>
> 3.206 Line
>
> A sequence of zero or more non- <newline> characters plus a terminating <newline> character.
>
> 3.87 Character
>
> A sequence of one or more bytes representing a single graphic symbol or control code.
>
> Note:
> This term corresponds to the ISO C standard term multi-byte character, where a single-byte character is a special case of a multi-byte character. Unlike the usage in the ISO C standard, character here has no necessary relationship with storage space, and byte is used when storage space is discussed.
>
> See the definition of the portable character set in Portable Character Set for a further explanation of the graphical representations of (abstract) characters, as opposed to character encodings.
>
Encoding errors are not characters, but bytes. A line cannot contain
encoding errors. Therefore, a file with encoding errors is not a text file.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 17:57:02 GMT)
Message #23 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 10:54 AM, Eric Blake wrote:
> Encoding errors are not characters, but bytes. A line cannot contain
> encoding errors. Therefore, a file with encoding errors is not a text file.
Corollary - there exist files which are text files in some locales, but
binary files in others (based on whether the locale interprets the bytes
as an encoding error or as valid characters).
Yes, locale dependencies on standard behavior can be annoying.
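The corollary can be illustrated without depending on which locales are installed, by asking iconv whether the same bytes decode under two encodings. This is a sketch only: valid_in is a hypothetical helper, and iconv's decoding stands in for a locale's notion of a valid character; it is not what grep does internally.

```shell
# valid_in ENCODING: succeed if standard input decodes cleanly as ENCODING.
valid_in() {
  iconv -f "$1" -t UTF-8 > /dev/null 2>&1
}

# 'caf' followed by byte 0xE9: a valid character in ISO-8859-1,
# an incomplete multibyte sequence in UTF-8.
if printf 'caf\351\n' | valid_in ISO-8859-1; then latin1=text; else latin1=binary; fi
if printf 'caf\351\n' | valid_in UTF-8; then utf8=text; else utf8=binary; fi

echo "ISO-8859-1: $latin1, UTF-8: $utf8"
```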
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 19:30:02 GMT)
Message #26 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 09:14 AM, Marcello Perathoner wrote:
> Keeping the old bug is far wiser than to fix it and introduce a new bug.
That depends on the bugs in question. The old bugs were pretty bad.
> Copying faulty input to the output is a preferable failure mode
Again, we cannot satisfy everybody. There are reasonable complaints from
users if 'grep' blasts improperly-encoded data to their terminals, or
more generally if grep's improperly-encoded output trashes other
programs that read the output. This is why grep has the -a option. It
sounds like you need grep's -a option for your application, and it
should be easy to use -a. It's not clear that -a should be the default.
> any process that feeds tainted (user-provided) strings to grep can now
> be made to fail. Eg. a process that greps apache logs for known
> exploit signatures will now fail if the attacker sends a bogus
> user-agent string.
Such a process won't fail if it uses grep's -a option, or if it treats
the "Binary file matches" diagnostic as an indication that there are
possible attacks, or if it is run in a unibyte locale where all bytes
are valid characters, or if it looks at grep's exit status. Granted,
slapdash approaches that don't do any of these things will be
vulnerable, but they'll be vulnerable even with older grep versions.
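Two of the options listed above, -a and a check of grep's exit status, can be combined as in the following minimal sketch. The log contents, the '/cgi-bin/probe' signature and the variable names are hypothetical:

```shell
# Hypothetical access log; the second line carries byte 0xFF in the user agent.
log=$(mktemp)
printf 'GET /index.html "Mozilla"\n' > "$log"
printf 'GET /cgi-bin/probe "bad\377agent"\n' >> "$log"
printf 'GET /cgi-bin/probe "Mozilla"\n' >> "$log"

# -a treats the log as text; -c prints the number of matching lines.
hits=$(grep -a -c '/cgi-bin/probe' "$log")
status=$?

# grep exits 0 on a match, 1 on no match, and greater than 1 on error.
case $status in
  0) echo "signature found on $hits lines" ;;
  1) echo "no signature found" ;;
  *) echo "grep failed; treat the scan as incomplete" >&2 ;;
esac
```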
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 20:12:01 GMT)
Message #29 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 06:56 PM, Eric Blake wrote:
> On 02/29/2016 10:54 AM, Eric Blake wrote:
>> Encoding errors are not characters, but bytes. A line cannot contain
>> encoding errors. Therefore, a file with encoding errors is not a text file.
>
> Corollary - there exist files which are text files in some locales, but
> binary files in others (based on whether the locale interprets the bytes
> as an encoding error or as valid characters).
>
> Yes, locale dependencies on standard behavior can be annoying.
>
You assume that a user will only ever want to grep text files encoded in
the machine's locale. That is not so.
As a German user I have on my disk files in many encodings: utf-8,
iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
old WordStar files that used control characters inside.
Since 2.21 I will now have to always specify -a or LC_ALL=C when
grepping my files.
Regards
--
Marcello Perathoner
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 20:35:01 GMT)
Message #32 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 08:29 PM, Paul Eggert wrote:
> On 02/29/2016 09:14 AM, Marcello Perathoner wrote:
>> Keeping the old bug is far wiser than to fix it and introduce a new bug.
>
> That depends on the bugs in question. The old bugs were pretty bad.
>
>> Copying faulty input to the output is a preferable failure mode
>
> Again, we cannot satisfy everybody. There are reasonable complaints from
> users if 'grep' blasts improperly-encoded data to their terminals, or
> more generally if grep's improperly-encoded output trashes other
> programs that read the output.
They would 'blast' their terminals without grep too. I don't see any
grounds for a complaint like that. Grep is not a sanitizer.
> This is why grep has the -a option. It
> sounds like you need grep's -a option for your application, and it
> should be easy to use -a. It's not clear that -a should be the default.
I was lucky in that I noticed that a 17GB tar file could not be a
complete backup of a 500GB drive. I was lucky because the now offending
filename (the same filename that didn't bother grep for over 10 years)
was early in the file list. If it had been late in the file list I
wouldn't have noticed that a 400GB tar file was missing a few thousand
files.
Other people may not be that lucky and they could get understandably
angry at losing their data.
At least, if you must turn grep into a text file sanitizer, make the new
behaviour optional. You can then tell people who complain about
'blasted' terminals to turn on that option, while other people would not
blindly run into the new bug.
Regards
--
Marcello Perathoner
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 22:09:01 GMT)
Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):
On 2016-02-29 at 21:11, Marcello Perathoner wrote:
> As a German user I have on my disk files in many encodings: utf-8,
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
> old WordStar files that used control characters inside.
>
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.
You can use a wrapper for grep:
mv /usr/bin/grep /usr/bin/in.grep
and create a new executable /usr/bin/grep file with the following:
#!/bin/sh
LC_ALL=C exec /usr/bin/in.grep "$@"
That worked perfectly.
Holger
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 22:38:02 GMT)
Message #38 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 01:11 PM, Marcello Perathoner wrote:
>> Yes, locale dependencies on standard behavior can be annoying.
>>
>
> You assume that a user will only ever want to grep text files encoded in
> the machine's locale. That is not so.
You've been relying on undefined behavior, and it caught up with you.
It's the same as asking for us to keep use-after-free "working" in a
multithreaded program because it has always "worked" in your older
single-threaded program when nothing was perturbing the memory between
free() and its latent use. A latent bug in your usage is still a bug in
your usage, even if it took a change in grep's defaults to expose your
problem.
And meanwhile, newer grep 2.23 has improved the heuristics to only
complain about a binary file if it would otherwise be outputting
encoding errors (rather than blindly complaining about the encoding
error up front and stopping processing immediately), which does
alleviate some of the worst of the change caused by your undefined usage
(that is, you can still grep for valid encodings, and get reasonable
results so long as the valid text doesn't mix with lines with invalid
encodings).
>
> As a German user I have on my disk files in many encodings: utf-8,
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
> old WordStar files that used control characters inside.
>
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.
Yes, but then you are no longer relying on undefined behavior, and
therefore have a leg to stand on if we break that behavior.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 23:36:02 GMT)
Message #41 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 12:34 PM, Marcello Perathoner wrote:
> On 02/29/2016 08:29 PM, Paul Eggert wrote:
>
> They would 'blast' their terminals without grep too.
Sure, but in practice it's common for users to do something like this:
grep -r getaddrinfo_a *
I just now did this in my working copy of the GNU Emacs source code. If
-a were the default, I would see 13874778 bytes on my screen, the vast
majority of which would be useless or even harmful. As grep stands now,
I see just 5480 bytes and they're mostly useful.
> I was lucky in that I noticed that a 17GB tar file could not be a
> complete backup of a 500GB drive.
Yes, you were lucky there. But you were unlucky in that your backup
software invoked grep without worrying about file name validity. Suppose
a file name contained a newline? Your backups could be toast.
> At least ... make the new behaviour optional.
It is optional; we merely disagree about the option's default value.
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.
I suggest using -a. LC_ALL=C won't work the way that you want on
platforms where the C locale is UTF-8, or is pure ASCII. For example, on
Fedora 23 or RHEL 7 with grep 2.23 we have:
$ printf '\200\n' | LC_ALL=C grep .
Binary file (standard input) matches
This is because the C locale is pure ASCII on these platforms, i.e.,
'\200' is not a valid character the way it is with traditional Unix. I
don't know why Red Hat made that change.
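Whether a given platform behaves this way can be probed before relying on LC_ALL=C. The following is a sketch under assumptions: c_locale_accepts_high_bytes is a hypothetical helper, and it only checks whether grep passes the single byte 0x80 through unchanged, so the result depends on the installed grep and C library:

```shell
# Succeeds if grep, run in the C locale, passes byte 0x80 through as text.
# od prints the output bytes in octal; 200 012 is 0x80 followed by newline.
c_locale_accepts_high_bytes() {
  out=$(printf '\200\n' | LC_ALL=C grep . 2>/dev/null | od -An -to1 | tr -d ' \n')
  [ "$out" = "200012" ]
}

if c_locale_accepts_high_bytes; then
  c_locale=all-bytes-valid
else
  c_locale=high-bytes-rejected
fi
echo "C locale: $c_locale"
```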
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Mon, 29 Feb 2016 23:56:01 GMT)
Message #44 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 04:35 PM, Paul Eggert wrote:
> I suggest using -a. LC_ALL=C won't work the way that you want on
> platforms where the C locale is UTF-8, or is pure ASCII. For example, on
> Fedora 23 or RHEL 7 with grep 2.23 we have:
>
> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e.,
> '\200' is not a valid character the way it is with traditional Unix. I
> don't know why Red Hat made that change.
I _think_ the Austin Group is leaning towards requiring the "C" locale
to always be a unibyte locale with all 256 bytes as valid characters, so
neither strict 7-bit ASCII nor UTF-8 would be usable as the "C" locale;
but for that to happen, POSIX would also need to allow a way to get a
UTF-8 locale easily accessible and describe how it differs from the "C"
locale under such a ruling. But it's still all conjecture on what the
final results will be - even in the standards committee, gracefully
documenting how locale corner cases must behave vs. leaving
implementations some latitude is tricky business; and any such change is
at least 3 or 4 years down the road before it could be standardized in
Issue 8 (right now, the focus is on Technical Corrigendum 2 for Issue 7).
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Tue, 01 Mar 2016 01:25:02 GMT)
Message #47 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On Mon, Feb 29, 2016 at 3:35 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 02/29/2016 12:34 PM, Marcello Perathoner wrote:
...
>> Since 2.21 I will now have to always specify -a or LC_ALL=C when
>> grepping my files.
>
> I suggest using -a. LC_ALL=C won't work the way that you want on platforms
> where the C locale is UTF-8, or is pure ASCII. For example, on Fedora 23 or
> RHEL 7 with grep 2.23 we have:
>
> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e., '\200'
> is not a valid character the way it is with traditional Unix. I don't know
> why Red Hat made that change.
Wow. I hadn't noticed that using LC_ALL=C is inadequate.
Disturbing...
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Tue, 01 Mar 2016 02:25:02 GMT)
Message #50 received at 22838 <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> $ printf '\200\n' | LC_ALL=C grep .
> Binary file (standard input) matches
>
> This is because the C locale is pure ASCII on these platforms, i.e.,
> '\200' is not a valid character the way it is with traditional Unix. I
> don't know why Red Hat made that change.
I also get the 'Binary file (standard input) matches' output from the
above string on a Linux From Scratch system. We build everything in a
fairly generic way and did nothing special in this area. I suspect this
is something buried deep into glibc.
-- Bruce
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Tue, 01 Mar 2016 04:03:01 GMT)
Message #53 received at submit <at> debbugs.gnu.org (full text, mbox):
On 03/01/2016 12:55 AM, Eric Blake wrote:
> I _think_ the Austin Group is leaning towards requiring the "C" locale
> to always be a unibyte locale with all 256 bytes as valid characters,
> so neither strict 7-bit ASCII nor UTF-8 would be usable as the "C"
> locale; but for that to happen, POSIX would also need to allow a way
> to get a UTF-8 locale easily accessible and
You do realize that this leaves all _non-US_users_, who rely on diacritics
or even different character sets entirely for their language, completely
out in the cold.
> describe how it differs from the "C" locale under such a ruling. But
> it's still all conjecture on what the final results will be - even in
> the standards committee, gracefully documenting how locale corner
> cases must behave vs. leaving implementations some latitude is tricky
> business; and any such change is at least 3 or 4 years down the road
> before it could be standardized in Issue 8 (right now, the focus is on
> Technical Corrigendum 2 for Issue 7).
Already back in _1987_, an IT professor in Leiden was especially appointed
for the streamlining of all the competing character sets that later were
merged to become Unicode. Given the current state of affairs, nearly thirty
years down the road, I do not share your optimism that this issue will be
resolved in the next couple of years.
Hans Pelleboer
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Tue, 01 Mar 2016 10:06:02 GMT)
Message #56 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 02/29/2016 11:37 PM, Eric Blake wrote:
> On 02/29/2016 01:11 PM, Marcello Perathoner wrote:
>
>>> Yes, locale dependencies on standard behavior can be annoying.
>>>
>>
>> You assume that a user will only ever want to grep text files encoded in
>> the machine's locale. That is not so.
>
> You've been relying on undefined behavior, and it caught up with you.
(The backup2l author has been relying. I'm just a user of that package
and I already filed a bug against backup2l too.)
You confuse 'undefined' with 'undocumented'. The old behaviour was very
well defined, even if it could turn out nasty. It was defined by
implementation: it was a de-facto standard.
OTOH it was nowhere documented that grepping non-locale files was
considered marginal or illegal.
The old documentation explicitly stated:
"""
If the first few bytes of a file indicate that the file contains
binary data, assume that the file is of type TYPE. By default, TYPE is
binary, and grep normally outputs either a one-line message saying
that a binary file matches, or no message if there is no match.
"""
--- from an old man page
The new behaviour changes documented old behaviour.
Furthermore there's no need to fix the old bug in such a heavy-handed
way. Less disruptive alternatives:
1) Make the new behaviour an opt-in. Print a deprecation warning that
gives people a chance to fix their scripts. After a while make the new
behaviour the default.
2) If you just output
binary line 42 in file x matches
and continue regular output after the next newline, the breakage would
be much more confined.
3) Fail in the old documented way of printing only the error message
instead of introducing a new mode of failure that looks like success and
loses the error message in the noise.
4) Don't implement this change between minor releases. A breaking change
deserves a major release.
Regards
--
Marcello Perathoner
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Tue, 01 Mar 2016 17:15:02 GMT)
Message #59 received at 22838 <at> debbugs.gnu.org (full text, mbox):
On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
> 1) Make the new behaviour an opt-in.
Again, this is arguing over what the default should be. For many users,
the new sort of behavior is better.
>
> 2) If you just output
>
> binary line 42 in file x matches
>
> and continue regular output after the next newline, the breakage would
> be much more confined.
This sounds like a good suggestion. That is, grep could keep going if
its only problem is an attempt to output encoding errors (as opposed to
reading null bytes, which are a more-reliable indication of binary
data). It would probably be better to output just one "Binary file
matches" line per file, at the end of the other matches, so that it's
more likely to be noticed.
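The distinction drawn here, null bytes versus mere encoding errors, can be sketched outside grep. This is an illustration under assumptions, not grep's implementation: classify is a hypothetical helper, UTF-8 is assumed to be the expected encoding, and iconv stands in for the encoding check.

```shell
# classify FILE: print nul-bytes, text, or encoding-errors.
classify() {
  if [ "$(tr -cd '\0' < "$1" | wc -c)" -gt 0 ]; then
    echo nul-bytes            # strong hint of binary data
  elif iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
    echo text                 # decodes cleanly as UTF-8
  else
    echo encoding-errors      # text-like, but not valid UTF-8
  fi
}

# Hypothetical sample files.
tmp=$(mktemp -d)
printf 'plain text\n' > "$tmp/plain"
printf 'caf\351\n' > "$tmp/latin1"
printf 'a\0b\n' > "$tmp/nul"

for f in plain latin1 nul; do
  echo "$f: $(classify "$tmp/$f")"
done
```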
>
> 3) Fail in the old documented way of printing only the error message
> instead of introducing a new mode of failure that looks like success
> and loses the error message in the noise.
I don't understand this suggestion, as it's not an error or an error
message. But since I like (2) better perhaps it doesn't matter.
>
> 4) Don't implement this change between minor releases. A breaking
> change deserves a major release.
>
Grep does not have minor releases. Whether to call the next release
"2.24" or "3.0" is primarily a marketing decision, not a technical one.
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Fri, 09 Sep 2016 01:44:02 GMT)
Notification sent to Marcello Perathoner <marcello <at> perathoner.de>: bug acknowledged by developer. (Fri, 09 Sep 2016 01:44:02 GMT)
Message #64 received at 22838-done <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>> 2) If you just output
>>
>> binary line 42 in file x matches
>>
>> and continue regular output after the next newline, the breakage would be much
>> more confined.
>
> This sounds like a good suggestion. That is, grep could keep going if its only
> problem is an attempt to output encoding errors (as opposed to reading null
> bytes, which are a more-reliable indication of binary data). It would probably
> be better to output just one "Binary file matches" line per file, at the end of
> the other matches, so that it's more likely to be noticed.
I finally got around to implementing this, which turned out to be considerably
easier than I thought it would be. I installed the attached patch into the grep
Savannah master. I am boldly closing this old bug report; we can always start a
new report if further problems turn up.
[0001-grep-encoding-errors-suppress-just-their-line.patch (text/x-diff, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#22838; Package grep. (Fri, 09 Sep 2016 05:22:01 GMT)
Message #67 received at 22838-done <at> debbugs.gnu.org (full text, mbox):
On Thu, Sep 8, 2016 at 6:43 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Paul Eggert wrote:
>>
>> On 03/01/2016 02:05 AM, Marcello Perathoner wrote:
>>>
>>> 2) If you just output
>>>
>>> binary line 42 in file x matches
>>>
>>> and continue regular output after the next newline, the breakage would be
>>> much
>>> more confined.
>>
>>
>> This sounds like a good suggestion. That is, grep could keep going if its
>> only
>> problem is an attempt to output encoding errors (as opposed to reading
>> null
>> bytes, which are a more-reliable indication of binary data). It would
>> probably
>> be better to output just one "Binary file matches" line per file, at the
>> end of
>> the other matches, so that it's more likely to be noticed.
>
>
> I finally got around to implementing this, which turned out to be
> considerably easier than I thought it would be. I installed the attached
> patch into the grep Savannah master. I am boldly closing this old bug
> report; we can always start a new report if further problems turn up.
Very nice. Thank you!
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 07 Oct 2016 11:24:04 GMT)
This bug report was last modified 8 years and 256 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.