From unknown Thu Aug 21 14:53:56 2025 X-Loop: help-debbugs@gnu.org Subject: bug#21604: grep doesn't match diacritical chars in ISO-8859 files Resent-From: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 02 Oct 2015 14:45:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 21604 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 21604@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.144379706131130 (code B ref -1); Fri, 02 Oct 2015 14:45:02 +0000 Received: (at submit) by debbugs.gnu.org; 2 Oct 2015 14:44:21 +0000 Received: from localhost ([127.0.0.1]:52277 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zi1ZM-000861-U7 for submit@debbugs.gnu.org; Fri, 02 Oct 2015 10:44:21 -0400 Received: from eggs.gnu.org ([208.118.235.92]:58492) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Zhwsr-0007zO-8Q for submit@debbugs.gnu.org; Fri, 02 Oct 2015 05:44:09 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Zhwsq-0003UO-7Q for submit@debbugs.gnu.org; Fri, 02 Oct 2015 05:44:08 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.0 required=5.0 tests=BAYES_20,T_DKIM_INVALID, UNPARSEABLE_RELAY autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:40417) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Zhwsq-0003U7-55 for submit@debbugs.gnu.org; Fri, 02 Oct 2015 05:44:08 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40231) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Zhwsp-0003Sf-4p for bug-grep@gnu.org; Fri, 02 Oct 2015 05:44:08 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Zhwsl-0003OZ-Od for bug-grep@gnu.org; Fri, 02 Oct 2015 05:44:07 
-0400 Received: from mx1.riseup.net ([198.252.153.129]:55918) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Zhwsl-0003Mq-Im for bug-grep@gnu.org; Fri, 02 Oct 2015 05:44:03 -0400 Received: from piha.riseup.net (unknown [10.0.1.162]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (Client CN "*.riseup.net", Issuer "COMODO RSA Domain Validation Secure Server CA" (verified OK)) by mx1.riseup.net (Postfix) with ESMTPS id 57CF9C2275 for ; Fri, 2 Oct 2015 02:44:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=riseup.net; s=squak; t=1443779042; bh=OXzFaBf8vAc9CIE+1fa3BmsLk3BeOnOYdIAzX6+2Bx0=; h=Date:From:To:Subject:From; b=pmy2q5CDwpqbhbxOTdZJHg6PggV7uTLmyJHLya8DtRGQjuXUnYgo/NRATX90GqXv+ MgGJoSWlA9Q2oOAQQzB8+hfEP/wSVa/zVsrgJNEd6HZOTJIZgYgjfi+JJCbHHMIrHv fJ9HDdE8OfUlOW0+lO77dEhEZMPpVEkJhZB/l2pA= Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: santiagorr) with ESMTPSA id BE18B1407B2 Received: by nomada (sSMTP sendmail emulation); Fri, 02 Oct 2015 11:43:58 +0200 Date: Fri, 2 Oct 2015 11:43:58 +0200 From: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Message-ID: <20151002094358.GD344@nomada> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="QWpDgw58+k1mSFBj" Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) X-Virus-Scanned: clamav-milter 0.98.7 at mx1.riseup.net X-Virus-Status: Clean Content-Transfer-Encoding: 7bit X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). 
X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.3 (----) X-Mailman-Approved-At: Fri, 02 Oct 2015 10:44:20 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.3 (----) --QWpDgw58+k1mSFBj Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Hi, Moreover http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D19230 , several debian users report that grep doesn't match characters with diacritical marks in ISO-8859 files, inside a Unicode enviroment: % file /tmp/q.h=20 /tmp/q.h: ISO-8859 text % grep c /tmp/q.h Coincidencia en el fichero binario /tmp/q.h % grep -a c /tmp/q.h struct cara* lcaras; //array de caras, habr=EF=BF=BD que usar reserva= dinamica de memoria. % grep =C3=A1 /tmp/q.h=20 % grep -a =C3=A1 /tmp/q.h grep matches the "=C3=A1" pattern if it's is input from an ISO-8859 file: % grep -f a q.h=20 Coincidencia en el fichero binario q.h Test files attached Full report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=3D800670 Regards, Santiago -- System Information: Debian Release: stretch/sid APT prefers squeeze-lts APT policy: (500, 'squeeze-lts'), (500, 'oldoldstable'), (500, 'unsta= ble'), (500, 'testing'), (500, 'oldstable'), (1, 'experimental') Architecture: amd64 (x86_64) Foreign Architectures: i386 Kernel: Linux 3.16.0-4-amd64 (SMP w/4 CPU cores) Locale: LANG=3Des_CO.utf8, LC_CTYPE=3Des_CO.utf8 (charmap=3DUTF-8) Shell: /bin/sh linked to /bin/dash Init: sysvinit (via /sbin/init) Versions of packages grep depends on: ii dpkg 1.18.1 ii install-info 6.0.0.dfsg.1-3 ii libc6 2.19-19 ii libpcre3 2:8.35-7 --QWpDgw58+k1mSFBj Content-Type: text/x-chdr; charset=utf-8 Content-Disposition: attachment; filename="q.h" Content-Transfer-Encoding: quoted-printable struct cara* lcaras; //array de caras, 
habr=E1 que usar reserva dinamica de memoria.

--QWpDgw58+k1mSFBj--

From debbugs-submit-bounces@debbugs.gnu.org Fri Oct 02 16:01:46 2015
To: control@debbugs.gnu.org
From: Paul Eggert
Subject: 21604 is not a bug
Organization: UCLA Computer Science Department
Message-ID: <560EE2A6.3060103@cs.ucla.edu>
Date: Fri, 2 Oct 2015 13:01:42 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

tags 21604 notabug
thanks

From unknown Thu Aug 21 14:53:56 2025
MIME-Version: 1.0
X-Loop: help-debbugs@gnu.org
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Santiago Ruano Rincón
Subject: bug#21604: closed (Re: bug#21604: grep doesn't match diacritical chars in ISO-8859 files)
References: <560EE280.1060408@cs.ucla.edu> <20151002094358.GD344@nomada>
X-Gnu-PR-Message: they-closed 21604
X-Gnu-PR-Package: grep
X-Gnu-PR-Keywords: notabug
Reply-To: 21604@debbugs.gnu.org
Date: Fri, 02 Oct 2015 20:02:03 +0000
Content-Type: multipart/mixed; boundary="----------=_1443816123-501-1"

This is a multi-part message in MIME format...

------------=_1443816123-501-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"

Your bug report

#21604: grep doesn't match diacritical chars in ISO-8859 files

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 21604@debbugs.gnu.org.

-- 
21604: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=21604
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1443816123-501-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Subject: Re: bug#21604: grep doesn't match diacritical chars in ISO-8859 files
To: Santiago Ruano Rincón, 21604-done@debbugs.gnu.org
References: <20151002094358.GD344@nomada>
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <560EE280.1060408@cs.ucla.edu>
Date: Fri, 2 Oct 2015 13:01:04 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.1.0
MIME-Version: 1.0
In-Reply-To: <20151002094358.GD344@nomada>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

On 10/02/2015 02:43 AM, Santiago Ruano Rincón wrote:
> grep doesn't match characters with diacritical
> marks in ISO-8859 files, inside a Unicode environment

That is normal and expected behavior. In a UTF-8 locale, "á" is
represented by the two bytes 0xC3 and 0xA1. In an ISO-8859 file, the
same character is represented by the single byte 0xE1. The UTF-8
pattern won't match the ISO-8859 representation.

To avoid this problem, switch to an ISO-8859 locale before using grep to
read ISO-8859 text files. This is true for pretty much any standard
utility, not just grep. Alternatively, you can translate the text files
from ISO-8859 to UTF-8, before giving the resulting text to grep or to
other utilities.

------------=_1443816123-501-1--
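
The byte-level explanation in Paul Eggert's reply (UTF-8 bytes 0xC3 0xA1
versus the single ISO-8859-1 byte 0xE1) can be checked with a small Python
sketch. This is only an illustration and not how grep is implemented: it
models matching as a plain substring search, the sample line is copied from
the attached q.h, and the helper names matches_raw and matches_after_decoding
are hypothetical.

```python
# Illustrative model only: grep's real matching is locale-aware and far more
# involved. Helper names below are hypothetical, not part of any grep API.

PATTERN = "\u00e1"  # the character "á" (U+00E1)

# The same character has different byte representations in the two encodings.
utf8_pattern = PATTERN.encode("utf-8")         # two bytes: 0xC3 0xA1
latin1_pattern = PATTERN.encode("iso-8859-1")  # one byte: 0xE1

# The line from q.h, as it is stored on disk in ISO-8859-1.
file_bytes = (
    "struct cara* lcaras; //array de caras, "
    "habr\u00e1 que usar reserva dinamica de memoria.\n"
).encode("iso-8859-1")


def matches_raw(pattern_bytes: bytes, data: bytes) -> bool:
    """Byte-wise substring search with no charset conversion."""
    return pattern_bytes in data


def matches_after_decoding(pattern: str, data: bytes, encoding: str) -> bool:
    """Decode the data from its actual encoding, then compare as text."""
    return pattern in data.decode(encoding)


# A pattern typed in a UTF-8 locale never appears in the ISO-8859-1 bytes.
print(matches_raw(utf8_pattern, file_bytes))
# A pattern encoded the same way as the file does appear.
print(matches_raw(latin1_pattern, file_bytes))
# Translating the file to text first (the second workaround) also matches.
print(matches_after_decoding(PATTERN, file_bytes, "iso-8859-1"))
```

On the command line, the second workaround from the reply would typically
look like "iconv -f ISO-8859-1 -t UTF-8 q.h | grep á", assuming the file
really is ISO-8859-1 rather than another member of the ISO-8859 family.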