GNU bug report logs - #21604
grep doesn't match diacritical chars in ISO-8859 files


Package: grep

Reported by: Santiago Ruano Rincón <santiagorr <at> riseup.net>

Date: Fri, 2 Oct 2015 14:45:02 UTC

Severity: normal

Tags: notabug

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Santiago Ruano Rincón <santiagorr <at> riseup.net>
Subject: bug#21604: closed (Re: bug#21604: grep doesn't match diacritical
 chars in ISO-8859 files)
Date: Fri, 02 Oct 2015 20:02:03 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#21604: grep doesn't match diacritical chars in ISO-8859 files

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 21604 <at> debbugs.gnu.org.

-- 
21604: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21604
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Santiago Ruano Rincón <santiagorr <at> riseup.net>,
 21604-done <at> debbugs.gnu.org
Subject: Re: bug#21604: grep doesn't match diacritical chars in ISO-8859 files
Date: Fri, 2 Oct 2015 13:01:04 -0700
On 10/02/2015 02:43 AM, Santiago Ruano Rincón wrote:
> grep doesn't match characters with diacritical
> marks in ISO-8859 files, inside a Unicode environment

That is normal and expected behavior.  In a UTF-8 locale, "á" is 
represented by the two bytes 0xC3 and 0xA1.  In an ISO-8859 file, the 
same character is represented by the single byte 0xE1.  The UTF-8 
pattern won't match the ISO-8859 representation.
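
A minimal sketch of the byte-level mismatch described above, in Python (not part of the original exchange). It assumes the file is ISO-8859-1; the report only says "ISO-8859 text", so the exact variant is a guess, and the sample text is a shortened form of the line in q.h.

```python
# Assumption: the file is ISO-8859-1 (Latin-1); `file` only reported "ISO-8859 text".
pattern_utf8 = "á".encode("utf-8")                    # bytes a UTF-8 locale hands to grep
text_latin1 = "habrá que usar".encode("iso-8859-1")   # bytes stored in the file

print(pattern_utf8.hex())               # c3a1
print("á".encode("iso-8859-1").hex())   # e1

# The two-byte UTF-8 pattern never occurs in the Latin-1 byte sequence.
print(pattern_utf8 in text_latin1)      # False
```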

To avoid this problem, switch to an ISO-8859 locale before using grep to 
read ISO-8859 text files.  This is true for pretty much any standard 
utility, not just grep.  Alternatively, you can translate the text files 
from ISO-8859 to UTF-8, before giving the resulting text to grep or to 
other utilities.
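
The second option (translating the text to a common encoding before searching) could be sketched as follows. This is an illustration only: `grep_latin1` is a hypothetical helper, not anything grep provides, and ISO-8859-1 is again assumed as the file's encoding.

```python
def grep_latin1(data, pattern):
    """Decode ISO-8859-1 bytes to text, then return the lines containing pattern.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    text = data.decode("iso-8859-1")
    return [line for line in text.splitlines() if pattern in line]


# Stand-in for the contents of q.h, stored as ISO-8859-1 bytes.
sample = "struct cara* lcaras; // habrá que usar\nint x;\n".encode("iso-8859-1")

matches = grep_latin1(sample, "á")
print(matches)
```

Once both the pattern and the text are decoded to the same representation, the comparison is between characters rather than between differently encoded bytes, so the line with "á" is found.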

[Message part 3 (message/rfc822, inline)]
From: Santiago Ruano Rincón <santiagorr <at> riseup.net>
To: bug-grep <at> gnu.org
Subject: grep doesn't match diacritical chars in ISO-8859 files
Date: Fri, 2 Oct 2015 11:43:58 +0200
[Message part 4 (text/plain, inline)]
Hi,

In addition to http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19230, several
Debian users report that grep doesn't match characters with diacritical
marks in ISO-8859 files inside a Unicode environment:

% file /tmp/q.h 
/tmp/q.h: ISO-8859 text

% grep c /tmp/q.h
Coincidencia en el fichero binario /tmp/q.h

% grep -a c /tmp/q.h
    struct cara* lcaras; //array de caras, habr� que usar reserva dinamica de memoria.

% grep á /tmp/q.h 

% grep -a á /tmp/q.h

grep does match the "á" pattern if the pattern is read from an ISO-8859 file (with -f):

% grep -f a q.h 
Coincidencia en el fichero binario q.h

Test files attached

Full report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800670

Regards,

Santiago

-- System Information:
Debian Release: stretch/sid
  APT prefers squeeze-lts
    APT policy: (500, 'squeeze-lts'), (500, 'oldoldstable'), (500, 'unstable'), (500, 'testing'), (500, 'oldstable'), (1, 'experimental')
    Architecture: amd64 (x86_64)
    Foreign Architectures: i386

    Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
    Locale: LANG=es_CO.utf8, LC_CTYPE=es_CO.utf8 (charmap=UTF-8)
    Shell: /bin/sh linked to /bin/dash
    Init: sysvinit (via /sbin/init)

    Versions of packages grep depends on:
    ii  dpkg          1.18.1
    ii  install-info  6.0.0.dfsg.1-3
    ii  libc6         2.19-19
    ii  libpcre3      2:8.35-7
[q.h (text/x-chdr, attachment)]

This bug report was last modified 9 years and 297 days ago.


