GNU bug report logs -
#43577
wrong result for grep -io in turkish locale
[Message part 1 (text/plain, inline)]
Your message dated Wed, 23 Sep 2020 19:57:36 -0700
with message-id <566c67b3-062e-d648-2dff-15f8c4b08e36 <at> cs.ucla.edu>
and subject line Re: bug#43577: wrong result for grep -io in turkish locale
has caused the debbugs.gnu.org bug report #43577,
regarding wrong result for grep -io in turkish locale
to be marked as done.
(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)
--
43577: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=43577
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
In the Turkish locale, upper and lower case are mapped as follows.
U0049 <-> U0131
U0069 <-> U0130
Both of the following test cases are expected to output U0130, but the
latter outputs nothing.
$ printf '\304\260\n' >I # U0130
$ env LC_ALL=tr_TR.utf8 grep -i i I
İ # U0130
$ env LC_ALL=tr_TR.utf8 grep -oi i I
$
By the way, both following test cases work correctly.
$ printf '\304\261\n' >i # U0131
$ env LC_ALL=tr_TR.utf8 grep -i I i
ı # U0131
$ env LC_ALL=tr_TR.utf8 grep -oi I i
ı # U0131
$
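The octal escapes passed to printf in these test cases are the UTF-8 encodings of the two dotted/dotless code points. As a minimal sketch (the helper names utf8_encode_small and octal_escapes are hypothetical and appear nowhere in grep), the escapes can be derived like this:

```c
#include <stddef.h>
#include <stdio.h>

/* Encode code point CP (below U+0800) as UTF-8 into OUT.
   Return the number of bytes written, or 0 if CP is out of range.
   Hypothetical helper for illustration only.  */
static size_t
utf8_encode_small (unsigned int cp, unsigned char out[2])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));    /* 110xxxxx */
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));  /* 10xxxxxx */
      return 2;
    }
  return 0;
}

/* Write the printf(1)-style octal escapes for CP into BUF.
   BUFSIZE must be at least 1; 9 bytes suffice for any CP below U+0800.  */
static void
octal_escapes (unsigned int cp, char *buf, size_t bufsize)
{
  unsigned char bytes[2];
  size_t n = utf8_encode_small (cp, bytes);
  size_t used = 0;

  buf[0] = '\0';
  /* Each escape takes 4 characters plus the terminating NUL.  */
  for (size_t i = 0; i < n && used + 5 <= bufsize; i++)
    used += (size_t) snprintf (buf + used, bufsize - used, "\\%03o",
                               (unsigned int) bytes[i]);
}
```

Calling octal_escapes (0x0130, buf, sizeof buf) fills buf with \304\260, and 0x0131 gives \304\261, so the two code points differ only in the last octal digit of the second byte.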
[Message part 3 (message/rfc822, inline)]
[Message part 4 (text/plain, inline)]
On 9/23/20 6:47 PM, Norihiro Tanaka wrote:
> I attach the fix for the bug. Regex is fixed by Paul, thank you.
>
Thanks, I had written a similar patch, and your patch helped me find a bug in
what I wrote. The patch I wrote uses an auxiliary ok_fold table that lets
fgrep_icase_charlen avoid calling mbrtowc for single-byte characters in the
pattern; this may help performance for long patterns. More important,
fgrep_icase_charlen does not return -1 for a character like 'a' in an
en_US.UTF-8 locale merely because 'a' has a case folded counterpart 'A'; the
idea is that we should be OK if the case folded counterparts are single-byte.
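To illustrate the idea only (this is not the code from the attached patch), here is a minimal sketch of a precomputed fold table. It assumes a UTF-8 encoding, uses a hard-coded hypothetical function turkish_counterparts in place of a real locale query, and replaces fgrep_icase_charlen with a simplified stand-in named icase_charlen_sketch that handles single-byte pattern characters only:

```c
#include <stdbool.h>

enum { NBYTES = 256, MAX_FOLDS = 2 };

/* Hypothetical stand-in for a locale query: store in OUT the case-folded
   counterparts of code point CP under Turkish rules and return their count.
   Only ASCII letters are handled; real code would ask the C library.  */
static int
turkish_counterparts (unsigned int cp, unsigned int out[MAX_FOLDS])
{
  if (cp == 'i') { out[0] = 0x0130; return 1; }   /* i <-> U+0130 */
  if (cp == 'I') { out[0] = 0x0131; return 1; }   /* I <-> U+0131 */
  if ('a' <= cp && cp <= 'z') { out[0] = cp - 'a' + 'A'; return 1; }
  if ('A' <= cp && cp <= 'Z') { out[0] = cp - 'A' + 'a'; return 1; }
  return 0;
}

/* ok_fold[B] is true if byte B is a complete single-byte character and
   all of its case-folded counterparts are single-byte as well.  */
static bool ok_fold[NBYTES];

static void
init_ok_fold (void)
{
  for (int b = 0; b < NBYTES; b++)
    {
      /* In UTF-8 only bytes below 0x80 are whole characters.  */
      bool ok = b < 0x80;
      unsigned int folds[MAX_FOLDS];
      int n = ok ? turkish_counterparts ((unsigned int) b, folds) : 0;

      for (int j = 0; j < n; j++)
        if (folds[j] >= 0x80)
          ok = false;   /* a counterpart needs more than one byte */
      ok_fold[b] = ok;
    }
}

/* Simplified stand-in for the length check: return 1 if the pattern byte
   at P can be matched byte-wise ignoring case, -1 otherwise.  A table
   lookup replaces any multibyte decoding for single-byte characters.  */
static int
icase_charlen_sketch (const char *p)
{
  unsigned char b = (unsigned char) *p;
  return ok_fold[b] ? 1 : -1;
}
```

After init_ok_fold (), a pattern byte such as 'a' is accepted because its only counterpart 'A' is single-byte, whereas 'i' and 'I' are rejected under the assumed Turkish mapping because their counterparts U+0130 and U+0131 each need two bytes.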
I had added more-extensive tests than were in your patch, and some of them found
a crash in kwsinit that indicated a similar change is needed there. I assume
this was because the patch I wrote had a more-generous fgrep_icase_charlen. As
this simplifies kwsinit, this patch does that too.
While looking into this I found a performance glitch I recently introduced (I
double-counted some regular expressions, messing up later heuristics). Plus I
checked on this on our old Solaris 10 box and fixed a couple of porting
glitches. I installed the attached patches into the master branch to help make
it easier for you to compare your changes to mine. Patch 0003 is the enhanced
version of the patch that you wrote.
Thanks again for working on this.
[0001-grep-fix-recently-introduced-performance-glitch.patch (text/x-patch, attachment)]
[0002-build-update-gnulib-submodule-to-latest.patch (text/x-patch, attachment)]
[0003-grep-fix-more-Turkish-eyes-bugs.patch (text/x-patch, attachment)]
[0004-grep-pacify-Sun-C-5.15.patch (text/x-patch, attachment)]
[0005-grep-don-t-assume-PCRE-in-tests.patch (text/x-patch, attachment)]
This bug report was last modified 4 years and 236 days ago.