GNU bug report logs - #21604
grep doesn't match diacritical chars in ISO-8859 files
Report forwarded to bug-grep <at> gnu.org: bug#21604; Package grep. (Fri, 02 Oct 2015 14:45:02 GMT)
Acknowledgement sent to Santiago Ruano Rincón <santiagorr <at> riseup.net>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 02 Oct 2015 14:45:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi,
In addition to http://debbugs.gnu.org/cgi/bugreport.cgi?bug=19230, several
Debian users report that grep doesn't match characters with diacritical
marks in ISO-8859 files in a Unicode environment:
% file /tmp/q.h
/tmp/q.h: ISO-8859 text
% grep c /tmp/q.h
Coincidencia en el fichero binario /tmp/q.h
% grep -a c /tmp/q.h
struct cara* lcaras; //array de caras, habr� que usar reserva dinamica de memoria.
% grep á /tmp/q.h
% grep -a á /tmp/q.h
grep matches the "á" pattern if the pattern is read from an ISO-8859 file:
% grep -f a q.h
Coincidencia en el fichero binario q.h
Test files attached
Full report:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=800670
Regards,
Santiago
-- System Information:
Debian Release: stretch/sid
APT prefers squeeze-lts
APT policy: (500, 'squeeze-lts'), (500, 'oldoldstable'), (500, 'unstable'), (500, 'testing'), (500, 'oldstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores)
Locale: LANG=es_CO.utf8, LC_CTYPE=es_CO.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: sysvinit (via /sbin/init)
Versions of packages grep depends on:
ii dpkg 1.18.1
ii install-info 6.0.0.dfsg.1-3
ii libc6 2.19-19
ii libpcre3 2:8.35-7
[q.h (text/x-chdr, attachment)]
Added tag(s) notabug. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 02 Oct 2015 20:02:01 GMT)
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Fri, 02 Oct 2015 20:02:02 GMT)
Notification sent to Santiago Ruano Rincón <santiagorr <at> riseup.net>: bug acknowledged by developer. (Fri, 02 Oct 2015 20:02:03 GMT)
Message #12 received at 21604-done <at> debbugs.gnu.org (full text, mbox):
On 10/02/2015 02:43 AM, Santiago Ruano Rincón wrote:
> grep doesn't match characters with diacritical
> marks in ISO-8859 files, inside a Unicode environment
That is normal and expected behavior. In a UTF-8 locale, "á" is
represented by the two bytes 0xC3 and 0xA1. In an ISO-8859 file, the
same character is represented by the single byte 0xE1. The UTF-8
pattern won't match the ISO-8859 representation.
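The byte-level mismatch described above can be reproduced outside grep. The following is a minimal Python sketch, not grep's actual matching code; it assumes the file is specifically ISO-8859-1 (the report only says "ISO-8859 text"), and the sample line is a hypothetical fragment modeled on q.h.

```python
# "á" is U+00E1; written as an escape so the example does not depend
# on the encoding of this source file.
A_ACUTE = "\u00e1"

# Bytes a UTF-8 locale hands to grep as the pattern.
pattern_utf8 = A_ACUTE.encode("utf-8")
# Byte actually stored in an ISO-8859-1 file (assumed encoding).
pattern_latin1 = A_ACUTE.encode("iso-8859-1")

# Hypothetical line modeled on q.h, stored as ISO-8859-1 bytes.
line_latin1 = ("habr" + A_ACUTE + " que usar reserva").encode("iso-8859-1")

# A plain byte search: the two-byte UTF-8 pattern is absent from the
# ISO-8859-1 data, while the single-byte pattern is present.
utf8_pattern_matches = pattern_utf8 in line_latin1
latin1_pattern_matches = pattern_latin1 in line_latin1

print(pattern_utf8, pattern_latin1)
print(utf8_pattern_matches, latin1_pattern_matches)
```

This also suggests why `grep -f a q.h` in the original report finds a match: the pattern file presumably carries the same single byte as the data.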
To avoid this problem, switch to an ISO-8859 locale before using grep to
read ISO-8859 text files. This is true for pretty much any standard
utility, not just grep. Alternatively, you can translate the text files
from ISO-8859 to UTF-8, before giving the resulting text to grep or to
other utilities.
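The second suggestion, translating the text before searching it, could look roughly like the sketch below. It is an illustration only: `grep_latin1` is a hypothetical helper, the input is assumed to be ISO-8859-1, and the sample data is invented for the example rather than taken from the attached q.h.

```python
def grep_latin1(data, pattern):
    """Decode ISO-8859-1 bytes to text, then return the lines containing pattern.

    Hypothetical helper: it does a plain substring search, not a regex match.
    """
    text = data.decode("iso-8859-1")  # every byte maps to a code point, so this cannot fail
    return [line for line in text.split("\n") if pattern in line]


# Invented sample, stored as ISO-8859-1 bytes like the file in the report.
sample = (
    "struct cara* lcaras; // habr\u00e1 que usar reserva\n"
    "int n;\n"
).encode("iso-8859-1")

# After decoding, the search runs on characters, so "á" (U+00E1) matches.
matches = grep_latin1(sample, "\u00e1")
print(matches)
```

Decoding first means the pattern and the data are compared as characters rather than as bytes in two different encodings, which is the same effect as converting the file to UTF-8 before running grep in a UTF-8 locale.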
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 31 Oct 2015 11:24:04 GMT)
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.