GNU bug report logs - #21604
grep doesn't match diacritical chars in ISO-8859 files


Package: grep;

Reported by: Santiago Ruano Rincón <santiagorr <at> riseup.net>

Date: Fri, 2 Oct 2015 14:45:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 21604 in the body.
You can then email your comments to 21604 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-grep <at> gnu.org:
bug#21604; Package grep. (Fri, 02 Oct 2015 14:45:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Santiago Ruano Rincón <santiagorr <at> riseup.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 02 Oct 2015 14:45:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Santiago Ruano Rincón <santiagorr <at> riseup.net>
To: bug-grep <at> gnu.org
Subject: grep doesn't match diacritical chars in ISO-8859 files
Date: Fri, 2 Oct 2015 11:43:58 +0200
Hi,

In addition to http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19230 , several
Debian users report that grep doesn't match characters with diacritical
marks in ISO-8859 files, inside a Unicode environment:

% file /tmp/q.h 
/tmp/q.h: ISO-8859 text

% grep c /tmp/q.h
Coincidencia en el fichero binario /tmp/q.h

% grep -a c /tmp/q.h
    struct cara* lcaras; //array de caras, habr� que usar reserva dinamica de memoria.

% grep á /tmp/q.h 

% grep -a á /tmp/q.h

grep does match the "á" pattern if the pattern itself is read from an ISO-8859 file:

% grep -f a q.h 
Coincidencia en el fichero binario q.h

Test files attached

Full report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800670

Regards,

Santiago

-- System Information:
Debian Release: stretch/sid
  APT prefers squeeze-lts
    APT policy: (500, 'squeeze-lts'), (500, 'oldoldstable'), (500, 'unstable'), (500, 'testing'), (500, 'oldstable'), (1, 'experimental')
    Architecture: amd64 (x86_64)
    Foreign Architectures: i386

    Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
    Locale: LANG=es_CO.utf8, LC_CTYPE=es_CO.utf8 (charmap=UTF-8)
    Shell: /bin/sh linked to /bin/dash
    Init: sysvinit (via /sbin/init)

    Versions of packages grep depends on:
    ii  dpkg          1.18.1
    ii  install-info  6.0.0.dfsg.1-3
    ii  libc6         2.19-19
    ii  libpcre3      2:8.35-7
[q.h (text/x-chdr, attachment)]

Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 02 Oct 2015 20:02:01 GMT) Full text and rfc822 format available.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Fri, 02 Oct 2015 20:02:02 GMT) Full text and rfc822 format available.

Notification sent to Santiago Ruano Rincón <santiagorr <at> riseup.net>:
bug acknowledged by developer. (Fri, 02 Oct 2015 20:02:03 GMT) Full text and rfc822 format available.

Message #12 received at 21604-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Santiago Ruano Rincón <santiagorr <at> riseup.net>,
 21604-done <at> debbugs.gnu.org
Subject: Re: bug#21604: grep doesn't match diacritical chars in ISO-8859 files
Date: Fri, 2 Oct 2015 13:01:04 -0700
On 10/02/2015 02:43 AM, Santiago Ruano Rincón wrote:
> grep doesn't match characters with diacritical
> marks in ISO-8859 files, inside a Unicode environment

That is normal and expected behavior.  In a UTF-8 locale, "á" is 
represented by the two bytes 0xC3 and 0xA1.  In an ISO-8859 file, the 
same character is represented by the single byte 0xE1.  The UTF-8 
pattern won't match the ISO-8859 representation.
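
The byte values above can be checked from a shell. This is a minimal sketch, assuming `iconv` and `od` are available and taking ISO-8859-1 as the specific ISO-8859 variant (the report only says "ISO-8859"). The octal escapes `\303\241` and `\341` are the bytes 0xC3 0xA1 and 0xE1, written that way so the snippet does not depend on the terminal's own encoding:

```shell
# "á" encoded as UTF-8: two bytes, 0xC3 0xA1 (octal 303 241).
utf8_hex=$(printf '\303\241' | od -An -tx1 | tr -d ' \n')

# "á" encoded as ISO-8859-1: one byte, 0xE1 (octal 341).
latin1_hex=$(printf '\341' | od -An -tx1 | tr -d ' \n')

# Converting the ISO-8859-1 byte to UTF-8 yields the two-byte form.
converted_hex=$(printf '\341' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_hex"
echo "ISO-8859-1: $latin1_hex"
echo "converted:  $converted_hex"
```

Since the two encodings share no bytes for this character, a byte sequence typed in one cannot be found in text stored in the other.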

To avoid this problem, switch to an ISO-8859 locale before using grep to 
read ISO-8859 text files.  This is true for pretty much any standard 
utility, not just grep.  Alternatively, you can translate the text files 
from ISO-8859 to UTF-8, before giving the resulting text to grep or to 
other utilities.
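
The second suggestion might look like the following sketch. It assumes the file is ISO-8859-1 specifically, builds a small stand-in file instead of using the attached q.h, and runs grep under `LC_ALL=C` so that the comparison is plain byte matching and does not depend on which locales are installed:

```shell
workdir=$(mktemp -d)

# Stand-in for q.h: one line with "á" stored as the single byte 0xE1.
printf 'habr\341 que usar\n' > "$workdir/q.h"

# The pattern as a UTF-8 terminal would send it: 0xC3 0xA1.
pattern=$(printf '\303\241')

# Searching the ISO-8859-1 bytes directly finds no matching line.
# (grep exits non-zero when nothing matches, hence "|| true".)
direct=$(LC_ALL=C grep -c "$pattern" "$workdir/q.h" || true)

# After converting the text to UTF-8, the same pattern matches.
converted=$(iconv -f ISO-8859-1 -t UTF-8 "$workdir/q.h" | LC_ALL=C grep -c "$pattern" || true)

echo "direct: $direct, converted: $converted"
rm -rf "$workdir"
```

For the first suggestion, `locale -a` lists the locales installed on a system; whether an ISO-8859 locale is among them varies by installation.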




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 31 Oct 2015 11:24:04 GMT) Full text and rfc822 format available.




GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.