GNU bug report logs - #20678
new bug that Paul "asked" for... grep -P aborts on non-utf8 input.


Package: coreutils

Reported by: "L. A. Walsh" <coreutils <at> tlinx.org>

Date: Wed, 27 May 2015 21:42:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: "L. A. Walsh" <coreutils <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20678 <at> debbugs.gnu.org
Subject: bug#20678: new bug that Paul "asked" for... grep -P aborts on non-utf8 input.
Date: Wed, 27 May 2015 14:41:12 -0700
(skip to end if you don't care to read how I found this
mess)...

Paul Eggert wrote:
> Linda Walsh wrote:
>
>> I had one file that it bailed on
>> saying it has an invalid UTF-8 encoding -- but the line was
>> recursive starting from '.' -- and it didn't name the file
>
> That's pretty vague.  Can you reproduce that problem?  I don't observe it:
----
I'm not quite *sure* how to tell someone else to reproduce this, but
I can now pretty reliably show some output from a checker....:
*** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkPVClientServerCoreCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libsystemd.so.0
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkParallelCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----

Now before you think I'm too daft, the code that produces those
messages is in perl and is:

for my $k (@sorted_missing) {
   P "*** file = %s", $k;
   open(my $gh, "grep -rP  '/$k'  /home/rpms/13.2|");
   while (<$gh>) {
       print
   }
   P "-----";
}

Those files came up "missing" as pre-reqs.
In /home/rpms/...., I have the *file listings* of each of
the rpms, created in the same structure as in the distro, so
each rpm is a plain-text file under that dir, /home/rpms/13.2.. This is why I had
a problem finding the offending file:
Ishtar:rpms/13.2/repo/oss/suse> file -bi x86_64/*>/tmp/x86files.txt
Ishtar:rpms/13.2/repo/oss/suse> sort </tmp/x86files.txt |uniq -c
     2 text/plain; charset=iso-8859-1
 13269 text/plain; charset=us-ascii
     2 text/plain; charset=utf-8
--- I'd say it's likely 1-2 files out of 13273 files that could
have the problem.  Yeah, I run into a lot of needles in haystacks..
but trying to find the needle... just generating the file of types:
>  time file -i x86_64/*>/tmp/fullx86files.txt  
27.71sec 27.07usr 0.63sys (99.99% cpu)

Then grep helps!

Ishtar:rpms/13.2/repo/oss/suse> grep iso-88 /tmp/fullx86files.txt
x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
x86_64/aspell-nb-0.50.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
---
Ishtar:rpms/13.2/repo/oss/suse> more x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm
/usr/lib64/aspell-0.60/icelandic.alias
/usr/lib64/aspell-0.60/is.dat
/usr/lib64/aspell-0.60/is.multi
/usr/lib64/aspell-0.60/is.rws
/usr/lib64/aspell-0.60/is_phonet.dat
/usr/lib64/aspell-0.60/355slenska.alias <<-- the 355 was in inverse color
/usr/share/doc/packages/aspell-is
/usr/share/doc/packages/aspell-is/COPYING
/usr/share/doc/packages/aspell-is/Copyright
/usr/share/doc/packages/aspell-is/README
----
Same w/the other file (it had this 1 'violation'):

/usr/lib64/aspell-0.60/bokmal.alias
/usr/lib64/aspell-0.60/bokm345l.alias <-3

So those are 'octal' code points (using a little calc prog):
>  pcalc
pcalc V0.1.8: Type 'constants' to see constants
(1)> 0355
  = 237  (0x00ed)  "í" 

(2)> 0345
  = 229  (0x00e5)  "å"
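
For readers without pcalc, the same octal arithmetic can be checked with a short Python sketch. It is only an illustration: the `names` list and the `describe` helper are made up for this example, and the two bytes are typed in as the octal escapes that `more` displayed. It shows that each byte reads as a letter in ISO-8859-1 but does not form a valid UTF-8 sequence, which is consistent with the grep message above.

```python
# The two file names from the listings, written as raw bytes.
# \355 and \345 are octal escapes for single bytes, as `more` showed them.
names = [
    b"/usr/lib64/aspell-0.60/\355slenska.alias",
    b"/usr/lib64/aspell-0.60/bokm\345l.alias",
]


def describe(raw: bytes) -> str:
    """Return the ISO-8859-1 reading of raw and whether raw is valid UTF-8."""
    try:
        raw.decode("utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False
    return f"{raw.decode('iso-8859-1')} (valid UTF-8: {utf8_ok})"


for raw in names:
    print(describe(raw))

# Octal value, hex value, and the ISO-8859-1 / Unicode character for each byte.
for value in (0o355, 0o345):
    print(value, hex(value), chr(value))
```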
-------------------------------------------------------------------------------
So the 1st part of the bug is the message w/no filename.
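
As a rough illustration of the kind of report that would have saved the hunt above, here is a minimal Python sketch that walks a directory tree and names each file containing an invalid UTF-8 sequence, with the line number and byte offset. This is not how grep works internally; `find_invalid_utf8` and the demo file names are hypothetical, chosen only to mirror the listings in this report.

```python
import os
import tempfile


def find_invalid_utf8(root):
    """Yield (path, line_number, byte_offset) for the first invalid
    UTF-8 sequence found in each file under root."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                for lineno, line in enumerate(fh, start=1):
                    try:
                        line.decode("utf-8")
                    except UnicodeDecodeError as err:
                        # err.start is the offset of the bad byte within the line.
                        yield path, lineno, err.start
                        break


# Demo on a throwaway directory holding one clean and one ISO-8859-1 listing.
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "ascii.rpm"), "wb") as fh:
        fh.write(b"/usr/lib64/aspell-0.60/is.dat\n")
    with open(os.path.join(root, "latin1.rpm"), "wb") as fh:
        fh.write(b"/usr/lib64/aspell-0.60/is.rws\n"
                 b"/usr/lib64/aspell-0.60/\355slenska.alias\n")
    hits = [(os.path.basename(p), n, off) for p, n, off in find_invalid_utf8(root)]
    print(hits)
```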

The 2nd part of the bug is this: looking for '^nobody' in
"/etc/passwd" works, as shown in the 1st example:

>  grep -P '^nobody' /etc/passwd
nobody:x:65534:65533:(group Nobody):/var/lib/nobody:/bin/nologin

but the 'error' aborts any further file searches, so /etc/passwd is never searched:
---
>  grep -P '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd 
grep: invalid UTF-8 byte sequence in input
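
For comparison, the behaviour being asked for could be sketched as below: warn about the bad line, name the file, and carry on with the remaining files. This is a hypothetical Python illustration that uses Python's `re` module rather than PCRE, so it says nothing about grep's actual code; `search_files` and the demo file names are assumptions made for the example.

```python
import os
import re
import sys
import tempfile


def search_files(pattern, paths):
    """Return (path, line) pairs for matching lines. Lines that are not
    valid UTF-8 get a warning naming the file, then the search continues."""
    regex = re.compile(pattern)
    matches = []
    for path in paths:
        with open(path, "rb") as fh:
            for lineno, raw in enumerate(fh, start=1):
                try:
                    line = raw.decode("utf-8")
                except UnicodeDecodeError:
                    print(f"{path}:{lineno}: invalid UTF-8 byte sequence",
                          file=sys.stderr)
                    continue  # skip this line only, keep searching
                if regex.search(line):
                    matches.append((path, line.rstrip("\n")))
    return matches


# Demo: a listing with a stray ISO-8859-1 byte, followed by a passwd-like file.
with tempfile.TemporaryDirectory() as root:
    bad = os.path.join(root, "listing.rpm")
    good = os.path.join(root, "passwd")
    with open(bad, "wb") as fh:
        fh.write(b"/usr/lib64/aspell-0.60/\355slenska.alias\n")
    with open(good, "wb") as fh:
        fh.write(b"root:x:0:0:root:/root:/bin/bash\n"
                 b"nobody:x:65534:65533:(group Nobody):/var/lib/nobody:/bin/nologin\n")
    results = [(os.path.basename(p), text)
               for p, text in search_files(r"^nobody", [bad, good])]
    print(results)
```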

----------------------------------------------------------

This is why I objected to '\000' being treated as the sign of a binary
file (and why I think it's bad that grep can't look for that):
if one works with Windows, such a file is far more likely
just to be text in UTF-16 encoding.
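
To make the UTF-16 point concrete, here is a small Python sketch (the sample `text` is made up for illustration). Plain ASCII text stored as UTF-16 has a zero byte after every character, so a "contains \000" test would classify it as binary even though it decodes cleanly back to text.

```python
# A passwd-style line, chosen only as sample text.
text = "nobody:x:65534\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, as commonly written on Windows

# Every ASCII character becomes two bytes in UTF-16, the second being zero.
has_nul_utf8 = b"\x00" in utf8
has_nul_utf16 = b"\x00" in utf16
decoded = utf16.decode("utf-16-le")

print(len(utf8), len(utf16))
print(has_nul_utf8, has_nul_utf16)
print(decoded == text)
```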

-l

This bug report was last modified 10 years and 53 days ago.
