From unknown Fri Aug 15 20:53:44 2025 X-Loop: help-debbugs@gnu.org Subject: bug#43225: Grep treats extended Latin characters like whitespace Resent-From: Mayo Fark Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 05 Sep 2020 16:06:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 43225 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 43225@debbugs.gnu.org X-Debbugs-Original-To: "bug-grep@gnu.org" Received: via spool by submit@debbugs.gnu.org id=B.159932193620978 (code B ref -1); Sat, 05 Sep 2020 16:06:02 +0000 Received: (at submit) by debbugs.gnu.org; 5 Sep 2020 16:05:36 +0000 Received: from localhost ([127.0.0.1]:43939 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kEah6-0005SD-A9 for submit@debbugs.gnu.org; Sat, 05 Sep 2020 12:05:36 -0400 Received: from lists.gnu.org ([209.51.188.17]:58536) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kEZPC-0007TR-35 for submit@debbugs.gnu.org; Sat, 05 Sep 2020 10:43:02 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:36932) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kEZPB-0002p5-SU for bug-grep@gnu.org; Sat, 05 Sep 2020 10:43:01 -0400 Received: from mail-oln040092009037.outbound.protection.outlook.com ([40.92.9.37]:22344 helo=NAM04-BN3-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kEZP9-00064C-To for bug-grep@gnu.org; Sat, 05 Sep 2020 10:43:01 -0400 ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none; 
b=Ipgpcq0k/mI/OGrREvX3F4vyx8BgRGgPWO2cK0Mw9GrBuEtIUpcH9g36sCKo++nXSlaKEwtqnW1/5kGUC6uRruLh63G+KOBrDlUZboGvRNAACLjW5m8Bmz+LfNCxJwiqxqjR9qg+fr2KaiQ3Y97L0XPs6Ns7oqNLRO7ZJRF1nmfp5IvegpiMaMuOCMtN9QLhmPv7Z8X2JSKhrNd3CiqrqsVg5ZoEPfFMkbIcdRpIvyE3wfahXJZH1nAnXmjrOAiHTCWPvhfEt5ySv9fZExVLTvIm/V0WjBsgGenUc6O/PPRHKScrpUJhcFTxdI8wZySuYi30JetUAnwI93hlwZHmUg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=enTt4hJE513eEaxMLaYMDBQTixq5GJsHHezfHCdtIfs=; b=eenIru1Yz+PCa7I6PPgfj63gWTtF1vL+Cw1t1i+vuJD00/K2o1C8ukpCnq6OF+IuvFeOiTBpmoQhiI5Pd5ePB7qVjr9/lasFdpQgh7tcqse6dGRe1020pA1Lz9RZ9sf7v9/dKU2hfD/vsvyIjdZwnMtGlkl7Ccs56v8nBQChIWUzsDRTIlrUqt+nSjWuqOBSYuVBAsnkbyvSDBjg2eNu2HPhp/UYLZr89ScwlnJ30I4TOfpXdxkEC6jWYWPV4ApfGd63lRJp5CB2CZzB4nga4ovlx4rmByhSqy+VTCbVSLX0VQEWNBCKI0eALZKbP+W+qDn0xcPgkQrgv/33SvmQAQ== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none; dkim=none; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=enTt4hJE513eEaxMLaYMDBQTixq5GJsHHezfHCdtIfs=; b=oZ11Gt+FqQgfduXHcPNhs7t6oPYcuZp5tcX6l9oWy5TvwI9JUnyzq1tv2U1lpu5U1nUNMk8PBh39t2pXilVSv5WJCBnkbBaYlQm/ca//8wyWo2MzK3Ldf3YWsyCIjxyC2aW+VCizc3vFtQbjmPMfyZ+eo2rDyBBIIQ609HUof9+mWgSPSnuwgoYF5+e1fY4tGTw/u2b6Z3CLBY6llbAYOExt5fbgRQVAWrkp2nrRJHNXEq9JTHO33oxRsKKowDhNXTa36Zl+v5V4heF5NCMz7Ijobek5rBm8hKn1/kZ2t89rDdYmqTVzlJZxCd67jtLDA9xqZXIU3E28D34oKRJaqQ== Received: from BN8NAM04FT057.eop-NAM04.prod.protection.outlook.com (2a01:111:e400:7e85::4e) by BN8NAM04HT165.eop-NAM04.prod.protection.outlook.com (2a01:111:e400:7e85::118) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.3348.16; Sat, 5 Sep 2020 14:27:56 +0000 Received: from BN7PR15MB2196.namprd15.prod.outlook.com (2a01:111:e400:7e85::41) by 
BN8NAM04FT057.mail.protection.outlook.com (2a01:111:e400:7e85::323) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.3348.16 via Frontend Transport; Sat, 5 Sep 2020 14:27:56 +0000 Received: from BN7PR15MB2196.namprd15.prod.outlook.com ([fe80::9c72:793b:bebb:7884]) by BN7PR15MB2196.namprd15.prod.outlook.com ([fe80::9c72:793b:bebb:7884%6]) with mapi id 15.20.3348.018; Sat, 5 Sep 2020 14:27:56 +0000 From: Mayo Fark Thread-Topic: Grep treats extended Latin characters like whitespace Thread-Index: AQHWg49uoB82QO3x50m2ASdVIkhTqg== Date: Sat, 5 Sep 2020 14:27:56 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-incomingtopheadermarker: OriginalChecksum:3F782AD6FFD205F6D813276B8A06EEC92CAB42D7371923749619D143B9B900BF; UpperCasedChecksum:4F2B0010EC3F1F6C579E484A660E2C292E73383AC9E2EEE5E2F1743E424940B4; SizeAsReceived:6645; Count:41 x-tmn: [HtsXxBJ8MJ7qxIesQEstaJMonycVdHzp] x-ms-publictraffictype: Email x-incomingheadercount: 41 x-eopattributedmessage: 0 x-ms-office365-filtering-correlation-id: 041264e8-24fb-430e-dcac-08d851a7e1e4 x-ms-traffictypediagnostic: BN8NAM04HT165: x-microsoft-antispam: BCL:0; x-microsoft-antispam-message-info: TCSMkolGyrWjuVP4dpsI8ZWQAMYdNDf3TwvVnSq1qTkd2FMsYpnCye0fBWFL3KkbfX1Eaisonm0UdiCeDyxfHV/6lPPEsBumtICSidBnEy3Slf/GLtOM171CfeJtAmDTIAh3BY/taBg4uTpVYXRFF4p77qDeCMyezV0mj0EJhYqKbsdyvuVIggzjfQRUDkh/n2z5BcIk95VdbSWwDCJDKA== x-ms-exchange-antispam-messagedata: Nu7rUzhYLdVJNMhWBL3WsBS1DtZFPdjSOsgyJxB2LN6wgKWlEXOFv5kocI2ZNo2NbRL4xvTqk2eQ5d16s/84tGJmYiR1Fh2X0mYG/+9d0pm8Wx4DGG2GNlydRzfeBWX1Xf6k7mGOrFh2qNnN7YVikA== x-ms-exchange-transport-forked: True Content-Type: multipart/alternative; boundary="_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_" MIME-Version: 1.0 X-OriginatorOrg: outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous X-MS-Exchange-CrossTenant-AuthSource: BN8NAM04FT057.eop-NAM04.prod.protection.outlook.com 
X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-CrossTenant-Network-Message-Id: 041264e8-24fb-430e-dcac-08d851a7e1e4 X-MS-Exchange-CrossTenant-rms-persistedconsumerorg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-CrossTenant-originalarrivaltime: 05 Sep 2020 14:27:56.2886 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Internet X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa X-MS-Exchange-Transport-CrossTenantHeadersStamped: BN8NAM04HT165 Received-SPF: pass client-ip=40.92.9.37; envelope-from=mayofark@outlook.com; helo=NAM04-BN3-obe.outbound.protection.outlook.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/05 10:42:59 X-ACL-Warn: Detected OS = Windows NT kernel [generic] [fuzzy] X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-Mailman-Approved-At: Sat, 05 Sep 2020 12:05:34 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

--_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

What I did:
```
grep -Riw cone *
```

Expected result: lines with the word "cone" surrounded by whitespace, ignoring case.

What I got instead:
```
data/po/pt_BR.po:msgstr "Pressione o ícone de pódio para iniciar o tutorial"
```

Why this is a bug: the word ícone is not the same as cone and should not have been returned in the result set. It appears that grep treats the í character in ícone as whitespace, which affects other extended-Latin characters as well.

--_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
--_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_-- From unknown Fri Aug 15 20:53:44 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: Mayo Fark Subject: bug#43225: closed (Re: bug#43225: Grep treats extended Latin characters like whitespace) Message-ID: References: <87d378cf-2c5b-c0aa-a9c4-1557ecb7c40e@cs.ucla.edu> X-Gnu-PR-Message: they-closed 43225 X-Gnu-PR-Package: grep Reply-To: 43225@debbugs.gnu.org Date: Wed, 09 Sep 2020 19:46:02 +0000 Content-Type: multipart/mixed; boundary="----------=_1599680762-24646-1" This is a multi-part message in MIME format... ------------=_1599680762-24646-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #43225: Grep treats extended Latin characters like whitespace which was filed against the grep package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 43225@debbugs.gnu.org. 
-- 
43225: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=43225
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1599680762-24646-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Received: (at 43225-done) by debbugs.gnu.org; 9 Sep 2020 19:45:24 +0000
Received: from localhost ([127.0.0.1]:34792 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kG61z-0006Oh-MK for submit@debbugs.gnu.org; Wed, 09 Sep 2020 15:45:23 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:46666) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kG61v-0006OM-NP for 43225-done@debbugs.gnu.org; Wed, 09 Sep 2020 15:45:22 -0400
Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 43AD0160052; Wed, 9 Sep 2020 12:45:13 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id gwBSRQOEOL7X; Wed, 9 Sep 2020 12:45:12 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 61CA9160104; Wed, 9 Sep 2020 12:45:12 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id Jl5TBHoMgu5H; Wed, 9 Sep 2020 12:45:12 -0700 (PDT)
Received: from [192.168.1.9] (cpe-75-82-69-226.socal.res.rr.com [75.82.69.226]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 34679160052; Wed, 9 Sep 2020 12:45:12 -0700 (PDT)
Subject: Re: bug#43225: Grep treats extended Latin characters like whitespace
To: Mayo Fark
References:
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <87d378cf-2c5b-c0aa-a9c4-1557ecb7c40e@cs.ucla.edu>
Date: Wed, 9 Sep 2020 12:45:11 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: multipart/mixed; boundary="------------86EEA2D1E70452EC4EEDF107"
Content-Language: en-US
X-Spam-Score: -5.9 (-----)
X-Debbugs-Envelope-To: 43225-done
Cc: 43225-done@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"
X-Spam-Score: -6.9 (------)

This is a multi-part message in MIME format.
--------------86EEA2D1E70452EC4EEDF107
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable

On 9/5/20 7:27 AM, Mayo Fark wrote:
> grep -Riw cone *
> ...
> data/po/pt_BR.po:msgstr "Pressione o ícone de pódio para iniciar o tutorial"

Thanks for the bug report. This bug is due to an overenthusiastic optimization
that I installed in late 2016. I installed the attached patch to fix the bug.

--------------86EEA2D1E70452EC4EEDF107
Content-Type: text/x-patch; charset=UTF-8; name="0001-grep-fix-w-bug-in-UTF-8-locales.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="0001-grep-fix-w-bug-in-UTF-8-locales.patch"

>From 8952431b790b409f4ef2ffdcb564475160548c50 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Wed, 9 Sep 2020 12:43:11 -0700
Subject: [PATCH] grep: fix -w bug in UTF-8 locales

Problem reported by Mayo Fark (Bug#43225).
* src/searchutils.c (wordchar_prev): In a UTF-8 locale, do not
assume that an encoding-error byte cannot be part of a word
constituent, as this assumption is incorrect for the last byte of
a multibyte word constituent.
* tests/word-delim-multibyte: Add a test for the bug.
---
 NEWS                       | 4 ++++
 src/searchutils.c          | 2 +-
 tests/word-delim-multibyte | 8 ++++++++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/NEWS b/NEWS
index acd95dd..28c7835 100644
--- a/NEWS
+++ b/NEWS
@@ -11,6 +11,10 @@ GNU grep NEWS -*- outline -*-

 ** Bug fixes

+  In UTF-8 locales, grep -w no longer ignores a multibyte word
+  constituent just before what would otherwise be a word match.
+  [Bug#43225 introduced in grep 2.28]
+
   A performance regression with many duplicate patterns has been fixed.
   [Bug#43040 introduced in grep 3.4]

diff --git a/src/searchutils.c b/src/searchutils.c
index 84c319c..c4bb802 100644
--- a/src/searchutils.c
+++ b/src/searchutils.c
@@ -195,7 +195,7 @@ wordchar_prev (char const *buf, char const *cur, char const *end)
     return 0;
   unsigned char b = *--cur;
   if (! localeinfo.multibyte
-      || (localeinfo.using_utf8 && localeinfo.sbclen[b] != -2))
+      || (localeinfo.using_utf8 && localeinfo.sbclen[b] == 1))
     return sbwordchar[b];
   char const *p = buf;
   cur -= mb_goback (&p, NULL, cur, end);
diff --git a/tests/word-delim-multibyte b/tests/word-delim-multibyte
index 7d2c433..31190ad 100755
--- a/tests/word-delim-multibyte
+++ b/tests/word-delim-multibyte
@@ -34,4 +34,12 @@ for locale in C en_US.UTF-8; do
   compare /dev/null err || fail=1
 done

+# Bug#43255
+printf 'a \303\255cone b\n' >in
+for flag in '' -i; do
+  returns_ 1 env LC_ALL=en_US.UTF-8 grep -w $flag cone in >out 2>err || fail=1
+  compare /dev/null out || fail=1
+  compare /dev/null err || fail=1
+done
+
 Exit $fail
-- 
2.17.1

--------------86EEA2D1E70452EC4EEDF107--

------------=_1599680762-24646-1
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 7bit

Received: (at submit) by debbugs.gnu.org; 5 Sep 2020 16:05:36 +0000
Received: from localhost ([127.0.0.1]:43939 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kEah6-0005SD-A9 for submit@debbugs.gnu.org; Sat, 05 Sep 2020 12:05:36 -0400
Received: from lists.gnu.org ([209.51.188.17]:58536) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kEZPC-0007TR-35 for submit@debbugs.gnu.org; Sat, 05 Sep 2020 10:43:02 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:36932) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kEZPB-0002p5-SU for bug-grep@gnu.org; Sat, 05 Sep 2020 10:43:01 -0400
Received: from mail-oln040092009037.outbound.protection.outlook.com ([40.92.9.37]:22344 helo=NAM04-BN3-obe.outbound.protection.outlook.com) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1kEZP9-00064C-To for bug-grep@gnu.org; Sat, 05 Sep 2020 10:43:01 -0400
ARC-Seal: i=1; a=rsa-sha256; s=arcselector9901; d=microsoft.com; cv=none;
b=Ipgpcq0k/mI/OGrREvX3F4vyx8BgRGgPWO2cK0Mw9GrBuEtIUpcH9g36sCKo++nXSlaKEwtqnW1/5kGUC6uRruLh63G+KOBrDlUZboGvRNAACLjW5m8Bmz+LfNCxJwiqxqjR9qg+fr2KaiQ3Y97L0XPs6Ns7oqNLRO7ZJRF1nmfp5IvegpiMaMuOCMtN9QLhmPv7Z8X2JSKhrNd3CiqrqsVg5ZoEPfFMkbIcdRpIvyE3wfahXJZH1nAnXmjrOAiHTCWPvhfEt5ySv9fZExVLTvIm/V0WjBsgGenUc6O/PPRHKScrpUJhcFTxdI8wZySuYi30JetUAnwI93hlwZHmUg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector9901; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=enTt4hJE513eEaxMLaYMDBQTixq5GJsHHezfHCdtIfs=; b=eenIru1Yz+PCa7I6PPgfj63gWTtF1vL+Cw1t1i+vuJD00/K2o1C8ukpCnq6OF+IuvFeOiTBpmoQhiI5Pd5ePB7qVjr9/lasFdpQgh7tcqse6dGRe1020pA1Lz9RZ9sf7v9/dKU2hfD/vsvyIjdZwnMtGlkl7Ccs56v8nBQChIWUzsDRTIlrUqt+nSjWuqOBSYuVBAsnkbyvSDBjg2eNu2HPhp/UYLZr89ScwlnJ30I4TOfpXdxkEC6jWYWPV4ApfGd63lRJp5CB2CZzB4nga4ovlx4rmByhSqy+VTCbVSLX0VQEWNBCKI0eALZKbP+W+qDn0xcPgkQrgv/33SvmQAQ== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=none; dmarc=none; dkim=none; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=outlook.com; s=selector1; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=enTt4hJE513eEaxMLaYMDBQTixq5GJsHHezfHCdtIfs=; b=oZ11Gt+FqQgfduXHcPNhs7t6oPYcuZp5tcX6l9oWy5TvwI9JUnyzq1tv2U1lpu5U1nUNMk8PBh39t2pXilVSv5WJCBnkbBaYlQm/ca//8wyWo2MzK3Ldf3YWsyCIjxyC2aW+VCizc3vFtQbjmPMfyZ+eo2rDyBBIIQ609HUof9+mWgSPSnuwgoYF5+e1fY4tGTw/u2b6Z3CLBY6llbAYOExt5fbgRQVAWrkp2nrRJHNXEq9JTHO33oxRsKKowDhNXTa36Zl+v5V4heF5NCMz7Ijobek5rBm8hKn1/kZ2t89rDdYmqTVzlJZxCd67jtLDA9xqZXIU3E28D34oKRJaqQ== Received: from BN8NAM04FT057.eop-NAM04.prod.protection.outlook.com (2a01:111:e400:7e85::4e) by BN8NAM04HT165.eop-NAM04.prod.protection.outlook.com (2a01:111:e400:7e85::118) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.3348.16; Sat, 5 Sep 2020 14:27:56 +0000 Received: from BN7PR15MB2196.namprd15.prod.outlook.com (2a01:111:e400:7e85::41) by 
BN8NAM04FT057.mail.protection.outlook.com (2a01:111:e400:7e85::323) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.3348.16 via Frontend Transport; Sat, 5 Sep 2020 14:27:56 +0000 Received: from BN7PR15MB2196.namprd15.prod.outlook.com ([fe80::9c72:793b:bebb:7884]) by BN7PR15MB2196.namprd15.prod.outlook.com ([fe80::9c72:793b:bebb:7884%6]) with mapi id 15.20.3348.018; Sat, 5 Sep 2020 14:27:56 +0000 From: Mayo Fark To: "bug-grep@gnu.org" Subject: Grep treats extended Latin characters like whitespace Thread-Topic: Grep treats extended Latin characters like whitespace Thread-Index: AQHWg49uoB82QO3x50m2ASdVIkhTqg== Date: Sat, 5 Sep 2020 14:27:56 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-incomingtopheadermarker: OriginalChecksum:3F782AD6FFD205F6D813276B8A06EEC92CAB42D7371923749619D143B9B900BF; UpperCasedChecksum:4F2B0010EC3F1F6C579E484A660E2C292E73383AC9E2EEE5E2F1743E424940B4; SizeAsReceived:6645; Count:41 x-tmn: [HtsXxBJ8MJ7qxIesQEstaJMonycVdHzp] x-ms-publictraffictype: Email x-incomingheadercount: 41 x-eopattributedmessage: 0 x-ms-office365-filtering-correlation-id: 041264e8-24fb-430e-dcac-08d851a7e1e4 x-ms-traffictypediagnostic: BN8NAM04HT165: x-microsoft-antispam: BCL:0; x-microsoft-antispam-message-info: TCSMkolGyrWjuVP4dpsI8ZWQAMYdNDf3TwvVnSq1qTkd2FMsYpnCye0fBWFL3KkbfX1Eaisonm0UdiCeDyxfHV/6lPPEsBumtICSidBnEy3Slf/GLtOM171CfeJtAmDTIAh3BY/taBg4uTpVYXRFF4p77qDeCMyezV0mj0EJhYqKbsdyvuVIggzjfQRUDkh/n2z5BcIk95VdbSWwDCJDKA== x-ms-exchange-antispam-messagedata: Nu7rUzhYLdVJNMhWBL3WsBS1DtZFPdjSOsgyJxB2LN6wgKWlEXOFv5kocI2ZNo2NbRL4xvTqk2eQ5d16s/84tGJmYiR1Fh2X0mYG/+9d0pm8Wx4DGG2GNlydRzfeBWX1Xf6k7mGOrFh2qNnN7YVikA== x-ms-exchange-transport-forked: True Content-Type: multipart/alternative; boundary="_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_" MIME-Version: 1.0 X-OriginatorOrg: outlook.com X-MS-Exchange-CrossTenant-AuthAs: Anonymous 
X-MS-Exchange-CrossTenant-AuthSource: BN8NAM04FT057.eop-NAM04.prod.protection.outlook.com X-MS-Exchange-CrossTenant-RMS-PersistedConsumerOrg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-CrossTenant-Network-Message-Id: 041264e8-24fb-430e-dcac-08d851a7e1e4 X-MS-Exchange-CrossTenant-rms-persistedconsumerorg: 00000000-0000-0000-0000-000000000000 X-MS-Exchange-CrossTenant-originalarrivaltime: 05 Sep 2020 14:27:56.2886 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Internet X-MS-Exchange-CrossTenant-id: 84df9e7f-e9f6-40af-b435-aaaaaaaaaaaa X-MS-Exchange-Transport-CrossTenantHeadersStamped: BN8NAM04HT165 Received-SPF: pass client-ip=40.92.9.37; envelope-from=mayofark@outlook.com; helo=NAM04-BN3-obe.outbound.protection.outlook.com X-detected-operating-system: by eggs.gnu.org: First seen = 2020/09/05 10:42:59 X-ACL-Warn: Detected OS = Windows NT kernel [generic] [fuzzy] X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, FREEMAIL_FROM=0.001, HTML_MESSAGE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Sat, 05 Sep 2020 12:05:34 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

--_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

What I did:
```
grep -Riw cone *
```

Expected result: lines with the word "cone" surrounded by whitespace, ignoring case.

What I got instead:
```
data/po/pt_BR.po:msgstr "Pressione o ícone de pódio para iniciar o tutorial"
```

Why this is a bug: the word ícone is not the same as cone and should not have been returned in the result set. It appears that grep treats the í character in ícone as whitespace, which affects other extended-Latin characters as well.

--_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
--_000_BN7PR15MB2196A96CE0ECA17A58C64B0ED82A0BN7PR15MB2196namp_-- ------------=_1599680762-24646-1--
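
The regression test added in tests/word-delim-multibyte can be adapted into a quick check of an installed grep. What follows is a minimal sketch and not part of the original thread: the temporary file and the variable names are assumptions made for illustration, and the last count is only meaningful if the en_US.UTF-8 locale is installed (if it is not, grep runs in the C locale and the result says nothing about the bug).

```sh
#!/bin/sh
# Sketch: check whether the installed grep shows the -w behaviour of Bug#43225.
# "$input" is a hypothetical scratch file, not something named in the report.
input=$(mktemp)
trap 'rm -f "$input"' EXIT

# Same test line as in tests/word-delim-multibyte: "a ícone b" encoded as UTF-8.
printf 'a \303\255cone b\n' > "$input"

# The two bytes in front of "cone" are 0xC3 0xAD, the UTF-8 encoding of U+00ED.
bytes=$(printf '\303\255' | od -An -tx1 | tr -d ' \n')
echo "bytes before cone: $bytes"

# In the C locale bytes above 0x7F are not word constituents, so a match
# here is expected behaviour and is not the bug.
c_count=$(LC_ALL=C grep -cw cone "$input" || true)
echo "C locale matches: $c_count"

# In a UTF-8 locale the patch's test expects no match (count 0);
# a grep affected by the bug would report 1.
utf8_count=$(LC_ALL=en_US.UTF-8 grep -cw cone "$input" || true)
echo "en_US.UTF-8 matches: $utf8_count"
```

The C-locale match is there for contrast only: it shows what the line looks like to grep when the byte 0xAD is treated as a non-word byte, which is the treatment the old fast path in wordchar_prev applied by mistake in UTF-8 locales.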