GNU bug report logs - #56350
UTF-8 LC_CTYPE bug esp when a certain range of Korean characters


Package: grep

Reported by: git <at> taeyeob.kim

Date: Sat, 2 Jul 2022 09:30:02 UTC

Severity: normal

Merged with 56352

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.


From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: git <at> taeyeob.kim
Subject: bug#56350: closed (Re: bug#56350: UTF-8 LC_CTYPE bug esp when a
 certain range of Korean characters)
Date: Sat, 02 Jul 2022 21:29:01 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 56350 <at> debbugs.gnu.org.

-- 
56350: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=56350
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Paul Eggert <eggert <at> cs.ucla.edu>
To: git <at> taeyeob.kim, 김태엽 <owletkonoha <at> gmail.com>
Cc: 56350-done <at> debbugs.gnu.org, 56352-done <at> debbugs.gnu.org
Subject: Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean
 characters
Date: Sat, 2 Jul 2022 16:28:40 -0500
Thanks, that's a Gnulib bug that was fixed here:

https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b19a10775e54f8ed17e3a8c08a72d261d8c26244

This has been propagated to GNU Grep and the fix should appear in the 
next Grep release. I plan to reply separately about GNU Sed.


[Message part 3 (message/rfc822, inline)]
From: KIM Taeyeob <git <at> taeyeob.kim>
To: bug-grep <at> gnu.org
Subject: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Date: Sat, 02 Jul 2022 12:44:29 +0900
Grep (and also Sed) cannot match a certain range of Korean characters
when it runs under LC_CTYPE=C.UTF-8 (or any other locale with UTF-8
encoding, such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

Reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- a character in the range [가-폿] (<UAC00>~<UD3FF>)
                         is matched without any issue
$ echo 퐀 | grep .
$                    <-- but a character in the range [퐀-힣] (<UD400>~<UD7A3>)
                         CANNOT be matched, although it IS SUPPOSED TO be matched.

Sed has the same issue with the period regex.

Sed example:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a                             <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀                            <-- FAILED to match so it doesn't replace
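
The boundary between the two ranges can also be looked at byte by byte. The following is a minimal sketch, not part of the original report: the helper names `utf8_hex` and `dot_matches` are hypothetical, and using `LC_ALL=C.UTF-8` (rather than `LC_CTYPE`) is an assumption made so that an existing `LC_ALL` setting cannot override the locale. It prints the UTF-8 bytes of the four boundary characters and whether the locally installed grep matches `.` against each one. The grep result depends on the installed version and on whether the C.UTF-8 locale is available.

```shell
#!/bin/sh
# Print the UTF-8 bytes of a string as lowercase hex, without separators.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Report whether the local grep matches "." against a single character.
# Assumption: the C.UTF-8 locale exists on this system; the outcome also
# depends on which grep version is installed.
dot_matches() {
    if printf '%s\n' "$1" | LC_ALL=C.UTF-8 grep -q . ; then
        echo "match"
    else
        echo "no match"
    fi
}

# First and last character of each range from the report.
for ch in 가 폿 퐀 힣; do
    printf '%s  bytes=%s  grep=%s\n' "$ch" "$(utf8_hex "$ch")" "$(dot_matches "$ch")"
done
```

The byte column is independent of the grep version: 폿 (U+D3FF) encodes as ed 8f bf and 퐀 (U+D400) as ed 90 80, so the reported failure begins exactly where the second byte after the 0xED lead byte moves from 0x8f to 0x90. This is only an observation about the encoding and says nothing definite about the root cause.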

I think it is related to <regex.h> or <iconv.h> in Glibc, but I
couldn't find a way to reproduce the bug with those, so I am
reporting it against Grep instead.



This bug report was last modified 3 years and 20 days ago.
