GNU bug report logs - #56350
UTF-8 LC_CTYPE bug esp when a certain range of Korean characters

Package: grep

Date: Sat, 2 Jul 2022 09:30:02 UTC

Severity: normal

Merged with 56352

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Report forwarded to bug-grep <at> gnu.org:
bug#56350; Package grep. (Sat, 02 Jul 2022 09:30:02 GMT)

Acknowledgement sent to git <at> taeyeob.kim:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 02 Jul 2022 09:30:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: KIM Taeyeob <git <at> taeyeob.kim>
To: bug-grep <at> gnu.org
Subject: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Date: Sat, 02 Jul 2022 12:44:29 +0900

Grep (and also Sed) cannot match a certain range of Korean characters
when it operates under LC_CTYPE=C.UTF-8 (or any other locale with
UTF-8 encoding, such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

To reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- a character in the range [가-폿] (<UAC00>~<UD3FF>)
                         is matched without any issue
$ echo 퐀 | grep .
$                    <-- but a character in the range [퐀-힣] (<UD400>~<UD7A3>)
                         CANNOT be matched, although it IS SUPPOSED TO be matched.
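
For reference, the two ranges above split exactly where the second byte
of the UTF-8 encoding changes from 0x8F to 0x90. The following is a
minimal sketch that prints the encoded bytes of both boundary
characters. It assumes a POSIX shell with od and tr and a UTF-8 encoded
script or terminal; utf8_hex is a hypothetical helper name, not part of
Grep or Sed. It only shows the byte-level boundary and does not by
itself establish the cause of the bug.

```shell
# utf8_hex is a hypothetical helper: print the bytes of each argument
# as lowercase hex, one argument per line.
utf8_hex() {
  for ch in "$@"; do
    # od -An -tx1 dumps raw bytes as hex without an offset column;
    # tr removes the spaces and newline that od adds.
    printf '%s' "$ch" | od -An -tx1 | tr -d ' \n'
    printf '\n'
  done
}

utf8_hex 폿 퐀
# prints:
# ed8fbf   (U+D3FF, last character that matches)
# ed9080   (U+D400, first character that fails to match)
```

Both characters share the lead byte 0xED; only the continuation bytes
differ.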

Sed has the same issue with the period regex.

Example with Sed:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a                             <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀                            <-- FAILED to match so it doesn't replace

I think it is related to <regex.h> or <iconv.h> in Glibc, but I
couldn't find a way to reproduce the bug with those, so I am
reporting it against Grep instead.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 02 Jul 2022 21:29:01 GMT)

Notification sent to git <at> taeyeob.kim:
bug acknowledged by developer. (Sat, 02 Jul 2022 21:29:01 GMT)

Message #10 received at 56350-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: git <at> taeyeob.kim, 김태엽 <owletkonoha <at> gmail.com>
Cc: 56350-done <at> debbugs.gnu.org, 56352-done <at> debbugs.gnu.org
Subject: Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean
 characters
Date: Sat, 2 Jul 2022 16:28:40 -0500

Thanks, that's a Gnulib bug that was fixed here:

https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b19a10775e54f8ed17e3a8c08a72d261d8c26244

This has been propagated to GNU Grep and the fix should appear in the 
next Grep release. I plan to reply separately about GNU Sed.

Merged 56350 56352. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 02 Jul 2022 21:36:01 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 31 Jul 2022 11:24:07 GMT)
