GNU bug report logs - #20678
new bug that Paul "asked" for... grep -P aborts on non-utf8 input.


Package: coreutils

Reported by: "L. A. Walsh" <coreutils <at> tlinx.org>

Date: Wed, 27 May 2015 21:42:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20678 in the body.
You can then email your comments to 20678 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-coreutils <at> gnu.org:
bug#20678; Package coreutils. (Wed, 27 May 2015 21:42:02 GMT)

Acknowledgement sent to "L. A. Walsh" <coreutils <at> tlinx.org>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 27 May 2015 21:42:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: "L. A. Walsh" <coreutils <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-coreutils <at> gnu.org
Subject: new bug that Paul "asked" for... grep -P aborts on non-utf8 input.
Date: Wed, 27 May 2015 14:41:12 -0700
(skip to end if you don't care to read how I found this
mess)...

Paul Eggert wrote:
> Linda Walsh wrote:
>
>> I had one file that it bailed on
>> saying it has an invalid UTF-8 encoding -- but the line was
>> recursive starting from '.' -- and it didn't name the file
>
> That's pretty vague.  Can you reproduce that problem?  I don't observe 
> it:
----
I'm not quite *sure* how to tell someone else to reproduce this, but
I can now pretty reliably get some output from a checker....:
*** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkPVClientServerCoreCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libsystemd.so.0
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkParallelCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----

Now before you think I'm too daft, the code that produces those
messages is in perl and is:

for my $k (@sorted_missing) {
   P "*** file = %s", $k;
   open(my $gh, "grep -rP  '/$k'  /home/rpms/13.2|");
   while (<$gh>) {
       print
   }
   P "-----";
}

Those files are files that came up "missing" as pre-reqs.
in /home/rpms/...., I have the *file listings* of each of
the rpms, created in the same structure as in the distro, so
a file under that dir /home/rpms/13.2.. This is why I had
a problem finding it:
Ishtar:rpms/13.2/repo/oss/suse> file -bi x86_64/*>/tmp/x86files.txt
Ishtar:rpms/13.2/repo/oss/suse> sort </tmp/x86files.txt |uniq -c
     2 text/plain; charset=iso-8859-1
 13269 text/plain; charset=us-ascii
     2 text/plain; charset=utf-8
--- I'd say it's likely 1-2 files out of 13274 files that could
have the problem.  Yeah, I run into a lot of needles in haystacks..
but trying to find the needle... just generating the file of types:
>  time file -i x86_64/*>/tmp/fullx86files.txt  
27.71sec 27.07usr 0.63sys (99.99% cpu)

Then grep helps!

Ishtar:rpms/13.2/repo/oss/suse> grep iso-88 /tmp/fullx86files.txt
x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
x86_64/aspell-nb-0.50.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
---
Ishtar:rpms/13.2/repo/oss/suse> more 
x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm   
/usr/lib64/aspell-0.60/icelandic.alias
/usr/lib64/aspell-0.60/is.dat
/usr/lib64/aspell-0.60/is.multi
/usr/lib64/aspell-0.60/is.rws
/usr/lib64/aspell-0.60/is_phonet.dat
/usr/lib64/aspell-0.60/355slenska.alias <<-- the 355 was in inverse color
/usr/share/doc/packages/aspell-is
/usr/share/doc/packages/aspell-is/COPYING
/usr/share/doc/packages/aspell-is/Copyright
/usr/share/doc/packages/aspell-is/README
----
Same w/the other file (it had this 1 'violation'):

/usr/lib64/aspell-0.60/bokmal.alias
/usr/lib64/aspell-0.60/bokm345l.alias <-3

So those are 'octal' code points (using a little calc prog):
>  pcalc
pcalc V0.1.8: Type 'constants' to see constants
(1)> 0355
  = 237  (0x00ed)  "í" 

(2)> 0345
  = 229  (0x00e5)  "å"
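
The same conversion can be checked without pcalc. The following is a minimal sketch in Python; the helper name `describe_octal` is made up for illustration and is not part of the thread. It assumes the inverse-colour digits are octal values of single bytes, and it also shows why such a byte trips a UTF-8 check: a lone 0xED or 0xE5 is a multi-byte lead byte with no continuation bytes after it.

```python
def describe_octal(octal_digits):
    """Return (value, latin1_char, is_valid_utf8_alone) for one octal byte."""
    value = int(octal_digits, 8)            # e.g. "355" -> 237
    raw = bytes([value])                    # the single byte as stored on disk
    char = raw.decode("iso-8859-1")         # Latin-1 maps byte N to code point N
    try:
        raw.decode("utf-8")                 # strict decoding rejects a lone lead byte
        valid = True
    except UnicodeDecodeError:
        valid = False
    return value, char, valid


for digits in ("355", "345"):
    value, char, valid = describe_octal(digits)
    print(f"0{digits} = {value} (0x{value:04x}) {char!r}, valid UTF-8 on its own: {valid}")
```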
-------------------------------------------------------------------------------
So the 1st part of the bug is the message w/no filename.
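
Until the message names the file, the offender has to be tracked down separately. A rough sketch of one way to do that in Python follows; `find_non_utf8` is a hypothetical helper, not something grep or the thread provides, and it assumes the files are small enough to read whole. It walks a tree and reports every file whose bytes fail strict UTF-8 decoding, by name.

```python
import os


def find_non_utf8(root):
    """Return the sorted paths under root whose contents are not valid UTF-8."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    fh.read().decode("utf-8")   # strict decode; raises on bad bytes
            except UnicodeDecodeError:
                bad.append(path)                # remember the file by name
            except OSError:
                continue                        # unreadable file: skip it
    return sorted(bad)
```

Called as `find_non_utf8("x86_64")` on the listing directory above, it would be expected to return the two aspell listings that `file -i` flagged as iso-8859-1, assuming those are the only files with non-UTF-8 bytes.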

the 2nd part of the bug is this (looking for '^nobody' in
"/etc/passwd" works, as shown in the 1st example):

>  grep -P '^nobody' /etc/passwd
nobody:x:65534:65533:(group Nobody):/var/lib/nobody:/bin/nologin

but the 'error' message aborts any further file searches:
---
>  grep -P '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd 
grep: invalid UTF-8 byte sequence in input
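
For comparison, the behaviour being asked for (keep searching the remaining files) can be sketched in Python. This uses Python's `re` module on raw bytes, not PCRE, so it only illustrates the idea and says nothing about how grep is implemented; `grep_bytes` is a hypothetical name. Because nothing is decoded, a stray Latin-1 byte in one file cannot stop the search of the next.

```python
import re


def grep_bytes(pattern, paths):
    """Return (path, line) pairs for lines matching a bytes regex in each file."""
    regex = re.compile(pattern)                 # pattern is bytes, e.g. rb"^nobody"
    hits = []
    for path in paths:
        try:
            with open(path, "rb") as fh:        # binary mode: no decoding step
                for line in fh:
                    if regex.search(line):
                        hits.append((path, line.rstrip(b"\n")))
        except OSError as exc:
            print(f"{path}: {exc}")             # report the problem and carry on
    return hits
```

With the two files from the command above, `grep_bytes(rb"^nobody", [...])` would skip past the aspell listing and still return the `nobody` line from /etc/passwd.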

----------------------------------------------------------

This is why I objected to '\000' being treated as a binary
file (and why I think it's bad grep can't look for that):
If one works with windows, it's far more likely
just to be in UTF-16 encoding.

-l




Information forwarded to bug-coreutils <at> gnu.org:
bug#20678; Package coreutils. (Wed, 27 May 2015 22:05:02 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "L. A. Walsh" <coreutils <at> tlinx.org>
Cc: bug-coreutils <at> gnu.org
Subject: Re: new bug that Paul "asked" for... grep -P aborts on non-utf8 input.
Date: Wed, 27 May 2015 15:04:18 -0700
On 05/27/2015 02:41 PM, L. A. Walsh wrote:
> *** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
> grep: invalid UTF-8 byte sequence in input 

This looks like you're using an old version of libpcre, or of grep. I 
can't reproduce the problem with the latest stable versions of both 
(libpcre 8.37, grep-2.21).  I can find similar problems if I use old 
libpcre.




Information forwarded to bug-coreutils <at> gnu.org:
bug#20678; Package coreutils. (Wed, 27 May 2015 22:25:02 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Linda Walsh <coreutils <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-coreutils <at> gnu.org
Subject: Re: bug#20678 new bug that Paul "asked" for... grep -P aborts on
 non-utf8 input.
Date: Wed, 27 May 2015 15:24:03 -0700

Paul Eggert wrote:
> On 05/27/2015 02:41 PM, L. A. Walsh wrote:
>> *** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
>> grep: invalid UTF-8 byte sequence in input 
> 
> This looks like you're using an old version of libpcre, or of grep. I 
> can't reproduce the problem with the latest stable versions of both 
> (libpcre 8.37, grep-2.21).  I can find similar problems if I use old 
> libpcre.
---

ok... ARG -- I just installed the new version of grep from
my distro (suse13.2) -- grep-2.20-2.4.1.x86_64

I think they'll be out with a new distro release in about
a year...(yes, I can probably build my own...like I have
to with a growing body of Software) -- something that has
gotten me in trouble with my distro at times when I've caught
them locking different pieces of software to specific 
libraries (not >== xxx but "==")... grrr...I could acknowledge
their point that most people wouldn't bother rebuilding 
all the perl modules if they upgraded perl... but that's
not *everyone*!...sigh.


coreutils isn't as stable as it used to be (not entirely the
CU-devel team either: I've caught suse's hand in 1-2)...

Just ran into problems in their new gvim & sudo -- I think
the sudo prob is the sudo-dev team...but the gvim I filed
a bug on in previous version... guess it didn't get fixed.

Filing bugs more often than not is a big waste of time.
*grump* *grump*...

;-)




bug closed, send any further explanations to 20678 <at> debbugs.gnu.org and "L. A. Walsh" <coreutils <at> tlinx.org> Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 27 May 2015 22:45:03 GMT)

Information forwarded to bug-coreutils <at> gnu.org:
bug#20678; Package coreutils. (Thu, 28 May 2015 07:18:03 GMT)

Message #16 received at 20678 <at> debbugs.gnu.org:

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Linda Walsh <coreutils <at> tlinx.org>, Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20678 <at> debbugs.gnu.org
Subject: Re: bug#20678: new bug that Paul "asked" for... grep -P aborts on
 non-utf8 input.
Date: Thu, 28 May 2015 09:17:41 +0200
On 05/28/2015 12:24 AM, Linda Walsh wrote:
> ok... ARG -- I just installed the new version of grep from
> my distro (suse13.2) -- grep-2.20-2.4.1.x86_64
>
> I think they'll be out with a new distro release in about
> a year...(yes, I can probably build my own...like I have
> to with a growing body of Software)

This is openSUSE specific.
When you've built your own version with a patch for a problem,
nothing prevents you from simply creating a submit request for
that patch on OBS to "Base:System/grep", and maybe even creating
a maintenance request for "openSUSE:13.2/grep".  Get involved.

Have a nice day,
Berny




Information forwarded to bug-coreutils <at> gnu.org:
bug#20678; Package coreutils. (Thu, 28 May 2015 13:18:02 GMT)

Message #19 received at 20678 <at> debbugs.gnu.org:

From: Linda Walsh <coreutils <at> tlinx.org>
To: Bernhard Voelker <mail <at> bernhard-voelker.de>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 20678 <at> debbugs.gnu.org
Subject: Re: bug#20678: new bug that Paul "asked" for... grep -P aborts on
 non-utf8 input.
Date: Thu, 28 May 2015 06:17:32 -0700

Bernhard Voelker wrote:
> On 05/28/2015 12:24 AM, Linda Walsh wrote:
>> ok... ARG -- I just installed the new version of grep from
>> my distro (suse13.2) -- grep-2.20-2.4.1.x86_64
>>
>> I think they'll be out with a new distro release in about
>> a year...(yes, I can probably build my own...like I have
>> to with a growing body of Software)
> 
> This is openSUSE specific.
> When you've built your own version with a patch for a problem,
> nothing prevents you from simply creating a submit request for
> that patch on OBS to "Base:System/grep", and maybe even creating
> a maintenance request for "openSUSE:13.2/grep".  Get involved.
----
	Main thing my patch is restoring functionality of 'rm'
to allow "rm -fr .", I'm not daft enough to try to sneak that in
as a default.  Maybe in a different command, maybe as a non-default,
but I'm anything but duplicitous (unfortunately).

	I _have_ always thought that a shorthand combination of rd and rm, 
might be nice -- maybe 'r'...   Of course it would only work like rmdir
on empty dirs unless they specify the "-r" flag so it could remove contents
first.  And of course it would pay attention to the posix rule about not
trying to delete '.' after it finished its depth first traversal...
But no one else seems to really care that much, so I'm not sure how much
effort I want to put into it to package something like that up.  But
it has entered my mind... 

Cheers,
Linda




Information forwarded to bug-coreutils <at> gnu.org:
bug#20678; Package coreutils. (Thu, 28 May 2015 13:30:08 GMT)

Message #22 received at 20678 <at> debbugs.gnu.org:

From: Bernhard Voelker <mail <at> bernhard-voelker.de>
To: Linda Walsh <coreutils <at> tlinx.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 20678 <at> debbugs.gnu.org
Subject: Re: bug#20678: new bug that Paul "asked" for... grep -P aborts on
 non-utf8 input.
Date: Thu, 28 May 2015 15:28:47 +0200
On 05/28/2015 03:17 PM, Linda Walsh wrote:
> Bernhard Voelker wrote:
>> On 05/28/2015 12:24 AM, Linda Walsh wrote:
>>> ok... ARG -- I just installed the new version of grep from
>>> my distro (suse13.2) -- grep-2.20-2.4.1.x86_64
>>>
>>> I think they'll be out with a new distro release in about
>>> a year...(yes, I can probably build my own...like I have
>>> to with a growing body of Software)
>>
>> This is openSUSE specific.
>> When you've built your own version with a patch for a problem,
>> nothing prevents you from simply creating a submit request for
>> that patch on OBS to "Base:System/grep", and maybe even creating
>> a maintenance request for "openSUSE:13.2/grep".  Get involved.
> ----
>      Main thing my patch is restoring functionality of 'rm'
> to allow "rm -fr ." [...]

stop, 'rm -rf .' is a completely different story - your bug report
was about 'grep -P' (for which the bug report is OT on the coreutils
mailing list btw.).
Having your own (probably non-generally wanted) patches in your own
OBS project is perfect.  I was just talking about submitting the
patch for the non-utf8 issue (for which I personally didn't check
whether it is already included downstream).

Have a nice day,
Berny





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 26 Jun 2015 11:24:05 GMT)
