#20678 - new bug that Paul "asked" for... grep -P aborts on non-utf8 input.

GNU bug report logs - #20678
new bug that Paul "asked" for... grep -P aborts on non-utf8 input.

Reported by: "L. A. Walsh" <coreutils <at> tlinx.org>

Date: Wed, 27 May 2015 21:42:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: "L. A. Walsh" <coreutils <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 20678 <at> debbugs.gnu.org
Subject: bug#20678: new bug that Paul "asked" for... grep -P aborts on non-utf8 input.
Date: Wed, 27 May 2015 14:41:12 -0700

(skip to end if you don't care to read how I found this mess)...

Paul Eggert wrote:
> Linda Walsh wrote:
>
>> I had one file that it bailed on
>> saying it has an invalid UTF-8 encoding -- but the line was
>> recursive starting from '.' -- and it didn't name the file
>
> That's pretty vague.  Can you reproduce that problem?  I don't observe
> it:
----

I'm not quite *sure* how to tell someone else to reproduce this, but I
can reproduce it pretty reliably now. Some output from a checker:

*** file = libvtkUtilitiesPythonInitializer-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkPVClientServerCoreCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----
*** file = libsystemd.so.0
grep: invalid UTF-8 byte sequence in input
-----
*** file = libvtkParallelCore-pv4.2.so.1
grep: invalid UTF-8 byte sequence in input
-----

Now before you think I'm too daft, the code that produces those
messages is in perl and is:

  for my $k (@sorted_missing) {
    P "*** file = %s", $k;
    open(my $gh, "grep -rP '/$k' /home/rpms/13.2|");
    while (<$gh>) { print }
    P "-----";
  }

Those files are files that came up "missing" as pre-reqs. In
/home/rpms/...., I have the *file listings* of each of the rpms,
created in the same structure as in the distro, so a file under
that dir /home/rpms/13.2..

This is why I had a problem finding it:

Ishtar:rpms/13.2/repo/oss/suse> file -bi x86_64/*>/tmp/x86files.txt
Ishtar:rpms/13.2/repo/oss/suse> sort </tmp/x86files.txt |uniq -c
      2 text/plain; charset=iso-8859-1
  13269 text/plain; charset=us-ascii
      2 text/plain; charset=utf-8
---

I'd say it's likely 1-2 files out of 13274 files that could have
the problem. Yeah, I run into a lot of needles in haystacks..
but trying to find the needle... just generating the file of types:

> time file -i x86_64/*>/tmp/fullx86files.txt
27.71sec 27.07usr 0.63sys (99.99% cpu)

Then grep helps!

Ishtar:rpms/13.2/repo/oss/suse> grep iso-88 /tmp/fullx86files.txt
x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
x86_64/aspell-nb-0.50.10-46.1.2.x86_64.rpm:text/plain; charset=iso-8859-1
---
Ishtar:rpms/13.2/repo/oss/suse> more x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm
/usr/lib64/aspell-0.60/icelandic.alias
/usr/lib64/aspell-0.60/is.dat
/usr/lib64/aspell-0.60/is.multi
/usr/lib64/aspell-0.60/is.rws
/usr/lib64/aspell-0.60/is_phonet.dat
/usr/lib64/aspell-0.60/355slenska.alias   <<-- the 355 was in inverse color
/usr/share/doc/packages/aspell-is
/usr/share/doc/packages/aspell-is/COPYING
/usr/share/doc/packages/aspell-is/Copyright
/usr/share/doc/packages/aspell-is/README
----

Same w/the other file (it had this 1 'violation'):

/usr/lib64/aspell-0.60/bokmal.alias
/usr/lib64/aspell-0.60/bokm345l.alias   <-3

So those are 'octal' code points (using a little calc prog):

> pcalc
pcalc V0.1.8: Type 'constants' to see constants
(1)> 0355 = 237 (0x00ed) "í"
(2)> 0345 = 229 (0x00e5) "å"
-------------------------------------------------------------------------------

So the 1st part of the bug is the message w/no filename.

The 2nd part of the bug is this: looking for '^nobody' in "/etc/passwd"
works, as shown in the 1st example:

> grep -P '^nobody' /etc/passwd
nobody:x:65534:65533:(group Nobody):/var/lib/nobody:/bin/nologin

but the 'error' message aborts any further file searches:
---
> grep -P '^nobody' x86_64/aspell-is-0.51.10-46.1.2.x86_64.rpm /etc/passwd
grep: invalid UTF-8 byte sequence in input
----------------------------------------------------------

This is why I objected to '\000' being treated as a binary file
(and why I think it's bad grep can't look for that): if one
works with windows, it's far more likely just to be in UTF-16
encoding.

-l
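
Since the error message names no file, one way to find the offending
inputs is to test each file for UTF-8 validity on its own. The following
is a minimal sketch, not something taken from the report: it assumes an
iconv utility that exits non-zero on an invalid input sequence (as
glibc's does), and list_invalid_utf8 is a hypothetical helper name.

```shell
# Hypothetical helper: print the name of each regular file whose
# contents are not valid UTF-8.
# Assumption: iconv exits non-zero when it hits an invalid sequence.
list_invalid_utf8() {
    for f in "$@"; do
        # Skip directories and anything else that is not a regular file.
        [ -f "$f" ] || continue
        # Converting UTF-8 to UTF-8 changes nothing; it only validates.
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run in the directory shown above as "list_invalid_utf8 x86_64/*", this
would presumably print the two aspell listings that "file -i" reported
as iso-8859-1, because a lone 0355 or 0345 byte followed by an ASCII
letter is not a valid UTF-8 sequence.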

This bug report was last modified 10 years and 53 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
