GNU bug report logs - #56350
UTF-8 LC_CTYPE bug esp when a certain range of Korean characters

Package: grep;

Reported by: git <at> taeyeob.kim

Date: Sat, 2 Jul 2022 09:30:02 UTC

Severity: normal

Merged with 56352

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#56350: closed (UTF-8 LC_CTYPE bug esp when a certain range of
 Korean characters)
Date: Sat, 02 Jul 2022 21:29:01 +0000

[Message part 1 (text/plain, inline)]

Your message dated Sat, 2 Jul 2022 16:28:40 -0500
with message-id <6dc73457-0b41-ce63-c4c1-9c329848c766 <at> cs.ucla.edu>
and subject line Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
has caused the debbugs.gnu.org bug report #56350,
regarding UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
56350: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=56350
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems

[Message part 2 (message/rfc822, inline)]

From: KIM Taeyeob <git <at> taeyeob.kim>
To: bug-grep <at> gnu.org
Subject: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Date: Sat, 02 Jul 2022 12:44:29 +0900

Grep (and also Sed) cannot match a certain range of Korean characters
when operating under LC_CTYPE=C.UTF-8 (or any other locale with UTF-8
encoding, such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

Reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- a character in the range [가-폿] (<UAC00>~<UD3FF>)
                         is matched without any issue
$ echo 퐀 | grep .
$                    <-- but a character in the range [퐀-힣] (<UD400>~<UD7A3>)
                         CANNOT be matched, though it IS SUPPOSED TO be matched.
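
As a byte-level aside that is not part of the original report: the two
ranges above differ in the second byte of their UTF-8 encoding, which
may help when narrowing down where the matcher goes wrong. The sketch
below assumes a POSIX shell with od and xargs available; utf8_hex is a
hypothetical helper name introduced only for this illustration.

```shell
# utf8_hex (hypothetical helper): print the UTF-8 bytes of its argument
# as space-separated hex. od emits the bytes with padding; xargs
# collapses the whitespace into single spaces.
utf8_hex() {
  printf '%s' "$1" | od -An -tx1 | xargs
}

utf8_hex 폿   # U+D3FF, the last character that matched: ed 8f bf
utf8_hex 퐀   # U+D400, the first character that failed:  ed 90 80
utf8_hex 힣   # U+D7A3, the end of the failing range:     ed 9e a3
```

Both ranges share the lead byte ed; the failing range begins exactly
where the second byte moves from 8f to 90.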

Sed has the same issue with the period regex too.

Sed example:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a                             <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀                            <-- FAILED to match so it doesn't replace

I think it is related to <regex.h> or <iconv.h> in Glibc, but I
couldn't find a way to reproduce the bug with those, so I am reporting
it against Grep instead.

[Message part 3 (message/rfc822, inline)]

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: git <at> taeyeob.kim, 김태엽 <owletkonoha <at> gmail.com>
Cc: 56350-done <at> debbugs.gnu.org, 56352-done <at> debbugs.gnu.org
Subject: Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean
 characters
Date: Sat, 2 Jul 2022 16:28:40 -0500

Thanks, that's a Gnulib bug that was fixed here:

https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b19a10775e54f8ed17e3a8c08a72d261d8c26244

This has been propagated to GNU Grep and the fix should appear in the 
next Grep release. I plan to reply separately about GNU Sed.

This bug report was last modified 3 years and 20 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
