GNU bug report logs - #30326
grep not searching through a text file (thinking it binary)
Reported by: "L. A. Walsh" <gnu <at> tlinx.org>
Date: Fri, 2 Feb 2018 19:31:02 UTC
Severity: normal
Tags: notabug
Done: Eric Blake <eblake <at> redhat.com>
Bug is archived. No further changes may be made.
Message #33 received at 30326-done <at> debbugs.gnu.org (full text, mbox):
Paul Eggert wrote:
> On 02/02/2018 03:30 PM, L A Walsh wrote:
> > most computer files (vs. user-files) are still single-byte.
>
> That's because so many of them are ASCII. But ASCII files are not the
> issue here. grep's behavior hasn't changed when operating on ASCII files
> in typical locales. The issue is text using a non-ASCII encoding that is
> not compatible with your locale; e.g., if your text file uses ISO 8859-1
> but your locale specifies UTF-8.
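
The mismatch described in the quoted paragraph can be sketched in a few lines of Python. This is only an illustration of why ISO 8859-1 bytes can fail to decode as UTF-8, not a description of how grep itself classifies files; the sample word and the helper name `is_valid_utf8` are assumptions made up for the example.

```python
# Illustrative sketch: text written as ISO 8859-1 is not necessarily valid UTF-8.
# The sample string and the helper name are hypothetical.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "café"
latin1_bytes = text.encode("iso-8859-1")  # the accented letter becomes the single byte 0xE9
utf8_bytes = text.encode("utf-8")         # the accented letter becomes two bytes, 0xC3 0xA9

print(is_valid_utf8(b"plain ASCII"))  # ASCII is valid in both encodings
print(is_valid_utf8(latin1_bytes))    # a lone 0xE9 is not a complete UTF-8 sequence
print(is_valid_utf8(utf8_bytes))
```

Pure ASCII passes either way, which matches the point that ASCII files are not the issue; only the non-ASCII bytes reveal the mismatch.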
----
I've had my locale set to UTF-8 since around 2000. My music collection
needed French, English, Middle Eastern, and now Japanese chars -- so I set
things to UTF-8. I didn't need perfection. For the email, I needed to know
which files the text was in so I could look at those mboxes with a mail reader
or with a text editor. I needed grep to work as a first-level search tool.
It's failed on that score.
Still, if it just searched for the bytes that I put in the search string, I'm
not sure how it would "go wrong".
>
> In my experience, UTF-8 has long been winning this battle, in the sense
> that UTF-8 is by far the dominant encoding for the non-ASCII files I
> regularly use. So I use a UTF-8 locale, and suggest this as a good
> default for most users nowadays.
>
> It's not possible to get direct statistics about encoding for all user
> files. However, we can see what's being published on the web. Currently
> UTF-8 is being used by about 90% of public websites whose character
> encoding can be determined, according to the latest W3Techs survey. ISO
> 8859-1 is in second place, at about 4%. See:
>
> https://w3techs.com/technologies/overview/character_encoding/all
>
Whereas this one was:
Domain: Non-ISO extended-ASCII text, with very long lines
So theoretically, it would never match any locale.
The problem is that in a mailbox, different emails can have different encodings.
But I didn't care -- I typed in an ASCII string -- so let it search in octets
w/no encoding.
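
A search "in octets" of the kind described above could look roughly like the following Python sketch. It opens the file in binary mode and compares raw bytes, so no locale or encoding is involved. The function name `find_bytes` and the sample mailbox content are assumptions invented for illustration; this is not how grep is implemented. (With GNU grep itself, the `-a`/`--text` option asks it to treat a binary file as text.)

```python
import os
import tempfile

def find_bytes(path, needle):
    """Return (line_number, raw_line) pairs whose bytes contain the ASCII needle.

    The file is read in binary mode, so bytes that are invalid in the
    current locale's encoding are never decoded and cannot cause an error.
    """
    pattern = needle.encode("ascii")  # raises if the needle is not pure ASCII
    hits = []
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, start=1):
            if pattern in line:
                hits.append((lineno, line.rstrip(b"\r\n")))
    return hits

# Hypothetical sample: one line contains a Latin-1 byte (0xE9) that is not valid UTF-8.
sample = b"Subject: caf\xe9 menu\nFrom: someone\nSubject: plain text\n"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
try:
    for lineno, raw in find_bytes(tmp.name, "Subject"):
        print(lineno, raw)
finally:
    os.remove(tmp.name)
```

The trade-off is that matching is purely byte-wise: an ASCII needle is found regardless of the surrounding encoding, but a non-ASCII needle would only match text stored in the same encoding as the needle.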
It's also the case that a mailbox is very likely to contain long lines
(maybe "very long lines"), but the text I was searching for
was <80 chars.
I'm really surprised it was decided to break compat -- as I've been
doing searches like this for over 2 decades - not often, mind you, but
it's one of the big advantages for me of keeping mailboxes for my IMAP
server in mbox format. Maildir format or others would kill search ability
with slow file-IO. ;^/
This bug report was last modified 7 years and 34 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.