GNU bug report logs - #25749
grep 3.0 skips "binary" lines in ssconvert output

Package: grep;

Reported by: Alexey Shipunov <dactylorhiza <at> gmail.com>

Date: Thu, 16 Feb 2017 05:01:01 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 25749 in the body.
You can then email your comments to 25749 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 05:01:01 GMT)

Acknowledgement sent to Alexey Shipunov <dactylorhiza <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Thu, 16 Feb 2017 05:01:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Alexey Shipunov <dactylorhiza <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: grep 3.0 skips "binary" lines in ssconvert output
Date: Wed, 15 Feb 2017 22:36:36 -0600
Dear Madam or Sir,

That problem almost ruined my work today.

I made the following note to myself, but you might also be interested:

===
current grep (2.25) is much faster than 2.5.4 from Lucid but SKIPS
"binary" lines in ssconvert output; freshly compiled grep 3.0 skips
fewer but still does it. Workaround: look for the "binary match" phrase
at the end of the file and apply grep -a. Report to
https://www.gnu.org/software/grep/manual/html_node/Reporting-Bugs.html
?

===

The file in question (gzipped) is attached.

My system:

===
$ uname -a
Linux ... 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017
x86_64 x86_64 x86_64 GNU/Linux
===

Commands which reproduce the problem:

===
grep . usa-format.txt > 1
grep -a . usa-format.txt > 2
diff 1 2
===

Again, the problem exists with both the Ubuntu Xenial default grep 2.25
and the new grep 3.0.

With best wishes,

Alexey Shipunov
[usa-format.txt.gz (application/x-gzip, attachment)]
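
The workaround in the note above (check the end of the output for the "binary match" notice) could be automated roughly as follows. This is only a sketch: the helper name `output_was_truncated` is hypothetical, and the notice wording "Binary file NAME matches" is an assumption based on the grep releases discussed in this report; other releases may word or route the notice differently.

```python
import re

# Assumed wording of the notice that grep prints on stdout in the releases
# discussed here; adjust the pattern if your grep words it differently.
BINARY_NOTICE = re.compile(r"^Binary file .+ matches$")


def output_was_truncated(grep_output: str) -> bool:
    """Return True if the last line of saved grep output is the binary notice."""
    lines = grep_output.splitlines()
    return bool(lines) and BINARY_NOTICE.match(lines[-1]) is not None


# Illustrative strings only, not real output from the attached file.
print(output_was_truncated("some line\nBinary file usa-format.txt matches\n"))
print(output_was_truncated("some line\nanother line\n"))
```

If the helper returns True, the result is incomplete and the search should be rerun with "grep -a".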

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Thu, 16 Feb 2017 07:12:02 GMT)

Notification sent to Alexey Shipunov <dactylorhiza <at> gmail.com>:
bug acknowledged by developer. (Thu, 16 Feb 2017 07:12:02 GMT)

Message #10 received at 25749-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Alexey Shipunov <dactylorhiza <at> gmail.com>, 25749-done <at> debbugs.gnu.org
Subject: Re: bug#25749: grep 3.0 skips "binary" lines in ssconvert output
Date: Wed, 15 Feb 2017 23:11:04 -0800
When I tried to read that attachment, gedit complained "There was a problem 
opening" it, and then "The file you opened has some invalid characters. If you 
continue editing this file you could corrupt this document. You can also choose 
another character encoding and try again." So it is not only "grep" that is 
having problems with the file.

Looking into it further, the file contains a non-text byte in line 13676, in the 
string "1@8MI W OF RALEIGH", where the "@" denotes a byte with octal value 233. 
This is invalid UTF-8 text. You can work around the issue by replacing the 
non-text byte with a valid character, or by using "grep -a" as you noted, or by 
setting the LC_ALL environment variable to "C", or by using a grep pattern that 
does not match the non-text line.
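
To locate such bytes before running grep, something like the following sketch may help. It is not part of grep; the function name `find_invalid_utf8_lines` and the sample data are made up for illustration, with the sample mimicking the string quoted above (octal 233 is hex 9B, a UTF-8 continuation byte that cannot start a character).

```python
def find_invalid_utf8_lines(data: bytes):
    """Return (line_number, byte_offset_in_line, offending_byte) for every
    line that fails to decode as UTF-8. Only the first bad byte of each
    line is reported."""
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            bad.append((lineno, err.start, line[err.start]))
    return bad


# Hypothetical three-line sample; the middle line carries the byte \x9b (octal 233).
sample = b"first line\n1\x9b8MI W OF RALEIGH\nlast line\n"

for lineno, offset, byte in find_invalid_utf8_lines(sample):
    print(f"line {lineno}, byte offset {offset}: octal {byte:o}")
# prints: line 2, byte offset 1: octal 233
```

On a real file one would pass the result of `open(path, "rb").read()`; the reported lines can then be repaired by hand or the file can be searched with "grep -a".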




Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 07:16:01 GMT)

Message #13 received at submit <at> debbugs.gnu.org:

From: Alexey Shipunov <dactylorhiza <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: grep 3.0 skips "binary" lines in ssconvert output
Date: Wed, 15 Feb 2017 22:47:06 -0600
P.S.

That problem does not exist in 2.5.4.

AS





Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 07:16:02 GMT)

Message #16 received at submit <at> debbugs.gnu.org:

From: Alexey Shipunov <dactylorhiza <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: Re: grep 3.0 skips "binary" lines in ssconvert output
Date: Wed, 15 Feb 2017 23:03:35 -0600
P.P.S.

Attached are three diff files for grep 2.5.4, grep 2.25 and grep 3.0.

AS

[diffs.tar.gz (application/x-gzip, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 07:40:01 GMT)

Message #19 received at 25749-done <at> debbugs.gnu.org:

From: Alexey Shipunov <dactylorhiza <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 25749-done <at> debbugs.gnu.org
Subject: Re: bug#25749: grep 3.0 skips "binary" lines in ssconvert output
Date: Thu, 16 Feb 2017 01:39:10 -0600
Hi,

Thanks for the explanation. However, it does not explain why grep 2.5.4
has no problem with this file.

With best wishes,

Alexey





Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 07:54:01 GMT)

Message #22 received at 25749-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Alexey Shipunov <dactylorhiza <at> gmail.com>
Cc: 25749-done <at> debbugs.gnu.org
Subject: Re: bug#25749: grep 3.0 skips "binary" lines in ssconvert output
Date: Wed, 15 Feb 2017 23:53:41 -0800
Alexey Shipunov wrote:
> it does not explain why grep 2.5.4
> has no problem with this file

Your test case relies on undefined behavior. In such cases, the behavior might 
be what you want, and it might not. Although 2.5.4 happened to work the way you 
wanted, its behavior was not guaranteed and had some other downsides, which is 
why it was changed.




Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 22:08:01 GMT)

Message #25 received at 25749 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Alexey Shipunov <dactylorhiza <at> gmail.com>
Cc: 25749 <at> debbugs.gnu.org
Subject: Re: bug#25749: grep 3.0 skips "binary" lines in ssconvert output
Date: Thu, 16 Feb 2017 14:07:46 -0800
On 02/16/2017 10:10 AM, Alexey Shipunov wrote:
> I wonder how much work it would take to invent a new option, say
> --binary-text-strict, which would cause grep to stop with an error and
> not output anything...

Something like that shouldn't be hard, although "do not output anything" 
is too simple, as we can't expect grep to make two passes through the 
whole input, which means that it could already have output something 
before it discovers the encoding error.
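
The single-pass limitation can be illustrated with a small sketch. This is not grep's code and no such option exists in grep; `strict_one_pass` and `EncodingErrorInInput` are hypothetical names, and the input lines are made up. The point is only that lines preceding the bad one have already been emitted when the error is detected.

```python
import re


class EncodingErrorInInput(Exception):
    """Raised by the hypothetical strict mode when a line is not valid UTF-8."""


def strict_one_pass(raw_lines, pattern):
    """Yield matching lines in a single pass; stop with an error at the
    first line that does not decode as UTF-8."""
    regex = re.compile(pattern)
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            raise EncodingErrorInInput(f"line {lineno} is not valid UTF-8") from None
        if regex.search(text):
            yield text


emitted = []      # stands in for what would already be on stdout
failure = None
try:
    for line in strict_one_pass(
        [b"alpha", b"beta", b"1\x9b8MI W OF RALEIGH", b"gamma"], "."
    ):
        emitted.append(line)
except EncodingErrorInInput as err:
    failure = str(err)

print(emitted)   # ['alpha', 'beta'] were output before the error surfaced
print(failure)   # line 3 is not valid UTF-8
```

Suppressing all output would require either buffering everything or reading the input twice, which is the cost mentioned above.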





Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 22:15:01 GMT)

Message #28 received at 25749 <at> debbugs.gnu.org:

From: Alexey Shipunov <dactylorhiza <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 25749 <at> debbugs.gnu.org
Subject: Re: bug#25749: grep 3.0 skips "binary" lines in ssconvert output
Date: Thu, 16 Feb 2017 16:14:09 -0600
Yes, I understand. So maybe the proposed option should make grep go
through the data twice? That could be hard, though.

But reporting error not (only) in stdout but (also) to stderr should
be really helpful!

AS





Information forwarded to bug-grep <at> gnu.org:
bug#25749; Package grep. (Thu, 16 Feb 2017 22:26:02 GMT)

Message #31 received at 25749 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Alexey Shipunov <dactylorhiza <at> gmail.com>
Cc: 25749 <at> debbugs.gnu.org
Subject: Re: bug#25749: grep 3.0 skips "binary" lines in ssconvert output
Date: Thu, 16 Feb 2017 14:25:43 -0800
On 02/16/2017 02:14 PM, Alexey Shipunov wrote:
> But reporting error not (only) in stdout but (also) to stderr should
> be really helpful!

That sounds like a better idea.
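
As a rough sketch of what such reporting could look like (again not grep's implementation; the function name `filter_with_stderr_notice` and the notice text are invented for illustration), a single-pass filter can write the notice to both streams, so that it is still visible when stdout is redirected to a file:

```python
import io
import re
import sys


def filter_with_stderr_notice(raw_lines, pattern, out=sys.stdout, err=sys.stderr):
    """Print matching lines to `out`. At the first line that is not valid
    UTF-8, write a notice to both `out` and `err` and stop.
    Returns True if the whole input was processed as text."""
    regex = re.compile(pattern)
    for lineno, raw in enumerate(raw_lines, start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Hypothetical wording; a real tool would choose its own message.
            notice = f"input line {lineno}: invalid UTF-8, remaining output suppressed"
            print(notice, file=out)
            print(notice, file=err)
            return False
        if regex.search(text):
            print(text, file=out)
    return True


# In-memory streams stand in for stdout and stderr in this demonstration.
out, err = io.StringIO(), io.StringIO()
ok = filter_with_stderr_notice(
    [b"alpha", b"1\x9b8MI W OF RALEIGH", b"gamma"], ".", out, err
)
```

With "> 1" style redirection as in the original reproduction commands, only the copy written to `err` would reach the terminal.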





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 17 Mar 2017 11:24:04 GMT)


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.