GNU bug report logs - #43577
wrong result for grep -io in turkish locale

Package: grep;

Reported by: Norihiro Tanaka <noritnk <at> kcn.ne.jp>

Date: Wed, 23 Sep 2020 13:24:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#43577: closed (wrong result for grep -io in turkish locale)
Date: Thu, 24 Sep 2020 02:58:01 +0000
[Message part 1 (text/plain, inline)]
Your message dated Wed, 23 Sep 2020 19:57:36 -0700
with message-id <566c67b3-062e-d648-2dff-15f8c4b08e36 <at> cs.ucla.edu>
and subject line Re: bug#43577: wrong result for grep -io in turkish locale
has caused the debbugs.gnu.org bug report #43577,
regarding wrong result for grep -io in turkish locale
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
43577: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=43577
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: <bug-grep <at> gnu.org>
Subject: wrong result for grep -io in turkish locale
Date: Wed, 23 Sep 2020 22:23:09 +0900
In a Turkish locale, upper and lower case are mapped as follows.

  U0049 <-> U0131
  U0069 <-> U0130

Both of the following test cases are expected to return U0130, but the
latter returns nothing.

$ printf '\304\260\n' >I  # U0130
$ env LC_ALL=tr_TR.utf8 grep -i i I
İ  # U0130
$ env LC_ALL=tr_TR.utf8 grep -oi i I
$ 

By the way, both of the following test cases work correctly.

$ printf '\304\261\n' >i  # U0131
$ env LC_ALL=tr_TR.utf8 grep -i I i
ı  # U0131
$ env LC_ALL=tr_TR.utf8 grep -oi I i
ı  # U0131
$
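
The octal escapes passed to printf in these test cases are UTF-8 byte sequences. As a rough cross-check, the following Python sketch prints the octal escapes for the four code points in the mapping table. The `TURKISH_PAIRS` table and the `octal_escape` helper are illustrative names, not part of grep, and the table is written out by hand rather than read from any locale.

```python
# Case pairs as listed in the report, written out by hand (no locale lookup).
TURKISH_PAIRS = {"\u0049": "\u0131", "\u0069": "\u0130"}


def octal_escape(ch):
    """Return the UTF-8 bytes of ch as printf-style octal escapes."""
    return "".join("\\{:03o}".format(b) for b in ch.encode("utf-8"))


for ascii_letter, turkish_letter in TURKISH_PAIRS.items():
    print("U{:04X} {}  <->  U{:04X} {}".format(
        ord(ascii_letter), octal_escape(ascii_letter),
        ord(turkish_letter), octal_escape(turkish_letter)))
```

Each ASCII letter is a single byte while its Turkish counterpart takes two bytes, so a one-byte pattern can match a two-byte character in the input.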



[Message part 3 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 43577-done <at> debbugs.gnu.org
Subject: Re: bug#43577: wrong result for grep -io in turkish locale
Date: Wed, 23 Sep 2020 19:57:36 -0700
[Message part 4 (text/plain, inline)]
On 9/23/20 6:47 PM, Norihiro Tanaka wrote:
> I attach the fix for the bug.  Regex is fixed in Paul, thank you.
> 

Thanks, I had written a similar patch, and your patch helped me find a bug in 
what I wrote. The patch I wrote uses an auxiliary ok_fold table that lets 
fgrep_icase_charlen avoid calling mbrtowc for single-byte characters in the 
pattern; this may help performance for long patterns. More important, 
fgrep_icase_charlen does not return -1 for a character like 'a' in an 
en_US.UTF-8 locale merely because 'a' has a case folded counterpart 'A'; the 
idea is that we should be OK if the case folded counterparts are single-byte.
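
The idea in the paragraph above can be modelled roughly as follows. This is a Python sketch, not grep's C code: only the names ok_fold and fgrep_icase_charlen come from the message, while `make_folds`, `build_ok_fold`, `icase_charlen` and the toy fold tables are assumptions made for illustration. The handling of multibyte pattern characters (return -1 whenever the character has any case counterpart) is also an assumption.

```python
import string


def make_folds(turkish):
    """Toy case-counterpart table standing in for a real locale (assumption)."""
    folds = {}
    for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase):
        folds[lo] = {up}
        folds[up] = {lo}
    if turkish:
        folds["i"] = {"\u0130"}
        folds["\u0130"] = {"i"}
        folds["I"] = {"\u0131"}
        folds["\u0131"] = {"I"}
    return folds


def build_ok_fold(folds):
    """ok_fold[b] is True when every case counterpart of byte b is one byte."""
    ok_fold = [True] * 256
    for ch, others in folds.items():
        encoded = ch.encode("utf-8")
        if len(encoded) == 1:
            ok_fold[encoded[0]] = all(
                len(o.encode("utf-8")) == 1 for o in others)
    return ok_fold


def icase_charlen(pat, ok_fold, folds):
    """Byte length of the first character of pat (valid UTF-8 assumed),
    or -1 when a plain byte search would be unsafe."""
    first = pat[0]
    if first < 0x80:
        # Single-byte character: a table lookup, no decoding needed.
        return 1 if ok_fold[first] else -1
    # Multibyte character: decode it (the step mbrtowc performs in C).
    n = 2 if first < 0xE0 else 3 if first < 0xF0 else 4
    ch = pat[:n].decode("utf-8")
    return -1 if ch in folds else n


en = make_folds(turkish=False)
tr = make_folds(turkish=True)
print(icase_charlen(b"a", build_ok_fold(en), en))
print(icase_charlen(b"i", build_ok_fold(tr), tr))
```

With the toy English table, 'a' is accepted because its counterpart 'A' is also a single byte. With the toy Turkish table, 'i' is rejected because its counterpart U0130 needs two bytes.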

I had added more-extensive tests than were in your patch, and some of them found 
a crash in kwsinit that indicated a similar change is needed there. I assume 
this was because the patch I wrote had a more-generous fgrep_icase_charlen. As 
this simplifies kwsinit, this patch does that too.

While looking into this I found a performance glitch I recently introduced (I 
double-counted some regular expressions, messing up later heuristics). I also 
checked this on our old Solaris 10 box and fixed a couple of porting 
glitches. I installed the attached patches into the master branch to make 
it easier for you to compare your changes to mine. Patch 0003 is the enhanced 
version of the patch that you wrote.

Thanks again for working on this.
[0001-grep-fix-recently-introduced-performance-glitch.patch (text/x-patch, attachment)]
[0002-build-update-gnulib-submodule-to-latest.patch (text/x-patch, attachment)]
[0003-grep-fix-more-Turkish-eyes-bugs.patch (text/x-patch, attachment)]
[0004-grep-pacify-Sun-C-5.15.patch (text/x-patch, attachment)]
[0005-grep-don-t-assume-PCRE-in-tests.patch (text/x-patch, attachment)]

This bug report was last modified 4 years and 236 days ago.