#30326 - grep not searching through a text file (thinking it binary)

GNU bug report logs - #30326
grep not searching through a text file (thinking it binary)

Package: grep;

Reported by: "L. A. Walsh" <gnu <at> tlinx.org>

Date: Fri, 2 Feb 2018 19:31:02 UTC

Severity: normal

Tags: notabug

Done: Eric Blake <eblake <at> redhat.com>

Bug is archived. No further changes may be made.

Message #33 received at 30326-done <at> debbugs.gnu.org (full text, mbox):

From: L A Walsh <gnu <at> tlinx.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 30326-done <at> debbugs.gnu.org
Subject: Re: bug#30326: grep not searching through a text file (thinking it binary)
Date: Fri, 02 Feb 2018 16:51:55 -0800

Paul Eggert wrote:
> On 02/02/2018 03:30 PM, L A Walsh wrote:
> > most computer files (vs. user-files) are still single-byte.
>
> That's because so many of them are ASCII. But ASCII files are not the
> issue here. grep's behavior hasn't changed when operating on ASCII files
> in typical locales. The issue is text using a non-ASCII encoding that is
> not compatible with your locale; e.g., if your text file uses ISO 8859-1
> but your locale specifies UTF-8.
----
I've had my locale as UTF-8 since around 2000. My music collection
needed French, English, Middle Eastern, and now Japanese characters --
so I set things to UTF-8.

I didn't need perfection. For the email, I needed to know which files
the text was in so I could look at those mboxes with a mail reader or
a text editor. I needed grep to work as a first-level search tool.
It's failed on that score. Still, if it just searched for the bytes
that I put in the search string, I'm not sure how it would "go wrong".

>
> In my experience, UTF-8 has long been winning this battle, in the sense
> that UTF-8 is by far the dominant encoding for the non-ASCII files I
> regularly use. So I use a UTF-8 locale, and suggest this as a good
> default for most users nowadays.
>
> It's not possible to get direct statistics about encoding for all user
> files. However, we can see what's being published on the web. Currently
> UTF-8 is being used by about 90% of public websites whose character
> encoding can be determined, according to the latest W3Techs survey. ISO
> 8859-1 is in second place, at about 4%. See:
>
> https://w3techs.com/technologies/overview/character_encoding/all
>
Whereas this one was:

Domain: Non-ISO extended-ASCII text, with very long lines

So theoretically, it would never match any locale. The problem is that
in a mailbox, different emails can have different encodings. But I
didn't care -- I typed in an ASCII string -- so let it search in
octets with no encoding.

It's also the case that a mailbox is very likely to contain long lines
(maybe "very long lines"), but the text I was searching for was
<80 chars.

I'm really surprised it was decided to break compatibility -- I've
been doing searches like this for over two decades -- not often, mind
you, but it's one of the big advantages for me of keeping mailboxes
for my IMAP server in mbox format. Maildir format or others would
kill search ability with slow file I/O. ;^/
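
The two points argued above, that ISO 8859-1 bytes are not valid UTF-8 and that an ASCII search string can still be matched octet by octet, can be illustrated with a small sketch. This is not code from the thread and not how grep is implemented; the helper names `is_valid_utf8` and `grep_bytes` and the sample mailbox lines are made up for illustration. It is roughly the effect one would expect from running GNU grep with `-a` (`--text`) or under `LC_ALL=C`, though the thread itself does not show those commands.

```python
from typing import Iterable, Iterator, Tuple


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def grep_bytes(lines: Iterable[bytes], needle: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (line_number, line) for lines containing needle as raw bytes.

    No decoding happens, so the encoding of each line is irrelevant.
    """
    for number, line in enumerate(lines, start=1):
        if needle in line:
            yield number, line.rstrip(b"\r\n")


# Made-up mailbox fragment mixing encodings, as the message describes.
sample = [
    b"From: someone@example.org\n",
    b"Subject: caf\xe9 menu\n",      # ISO 8859-1 e-acute: not valid UTF-8
    b"Subject: caf\xc3\xa9 menu\n",  # the same text encoded as UTF-8
    b"Subject: plain ascii\n",
]

if __name__ == "__main__":
    for raw in sample:
        print(is_valid_utf8(raw), raw)
    for number, line in grep_bytes(sample, b"Subject"):
        print(number, line)
```

Opening a real mailbox with `open(path, "rb")` and passing the file object as `lines` would apply the same byte-level match to a file on disk.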

This bug report was last modified 7 years and 84 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
