GNU bug report logs -
#78276
grep on file with 0xF3 byte in utf-8 locale
[Message part 1 (text/plain, inline)]
Your message dated Tue, 6 May 2025 02:12:25 -0700
with message-id <9803c83b-83e9-4e76-ad05-7fe01dd1476e <at> cs.ucla.edu>
and subject line Re: bug#78276: grep on file with 0xF3 byte in utf-8 locale
has caused the debbugs.gnu.org bug report #78276,
regarding grep on file with 0xF3 byte in utf-8 locale
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
78276: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=78276
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
Hi.
I was trying to grep mail logs for some entries, and a spammer had used a
0xF3 byte to try to hide / trick things. For grep it looks like this:
$ printf 'a\xF3bcdefgh' > x2
$ LC_ALL=C.UTF-8 grep 'a.*h' x2
$
$ LC_ALL=C grep 'a.*h' x2
abcdefgh
$ LC_ALL=C.UTF-8 grep -a 'a.*h' x2
$
$ LC_ALL=C grep -a 'a.*h' x2
abcdefgh
Is that expected behavior, no binary file warning and no matching with
utf-8 locale, even with -a? AFAIK that's not a valid UTF-8 sequence.
$ grep --version x2
grep (GNU grep) 3.12
Copyright (C) 2025 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
<https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Mike Haertel and others; see
<https://git.savannah.gnu.org/cgit/grep.git/tree/AUTHORS>.
grep -P uses PCRE2 10.45 2025-02-05
--
Arkadiusz Miśkiewicz, arekm / ( maven.pl | pld-linux.org )
[Message part 3 (message/rfc822, inline)]
On 2025-05-06 00:37, Arkadiusz Miśkiewicz via Bug reports for GNU grep
wrote:
> Is that expected behavior, no binary file warning and no matching with
> utf-8 locale, even with -a?
It's allowed behavior, as '.' need not match encoding errors.[1] Also,
'grep' need not diagnose encoding errors that don't harm the output.[2]
As you mentioned in your email, using LC_ALL=C should let '.' match any
byte, so that should let you do what you want.
[1]: https://www.gnu.org/software/grep/manual/html_node/Fundamental-Structure.html
[2]: https://www.gnu.org/software/grep/manual/html_node/File-and-Directory-Selection.html
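The LC_ALL=C workaround suggested in the reply could be wrapped up roughly
as follows. This is only a sketch, assuming a POSIX shell; grep_bytes is a
hypothetical helper name, not something grep provides, and the sample data
is the same string used in the report (written with the octal escape \363
for byte 0xF3, since not every printf accepts \x escapes).

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): run grep in the C locale, where
# every byte counts as one character, so '.' can match a stray 0xF3 byte.
grep_bytes() {
    LC_ALL=C grep "$@"
}

# Same bytes as the reporter's x2 file; \363 is the octal form of 0xF3.
# -c prints the number of matching lines instead of the lines themselves.
matches=$(printf 'a\363bcdefgh' | grep_bytes -c 'a.*h')
echo "matching lines: $matches"
```

Note that in the C locale, bracket expressions and case folding also work
byte by byte, so patterns that rely on multibyte characters may behave
differently than in a UTF-8 locale.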
This bug report was last modified 17 days ago.