GNU bug report logs - #56351
LC_CTYPE=C.UTF-8 causes an matching error on Sed


Package: sed

Reported by: git <at> taeyeob.kim

Date: Sat, 2 Jul 2022 09:30:03 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: KIM Taeyeob <git <at> taeyeob.kim>
To: bug-sed <at> gnu.org
Subject: LC_CTYPE=C.UTF-8 causes an matching error on Sed
Date: Sat, 02 Jul 2022 14:03:10 +0900

Sed (and also Grep) cannot match a certain range of Korean characters
when run under LC_CTYPE=C.UTF-8 (or under any other locale with UTF-8
encoding, such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

Reproducing the bug with Sed:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | sed -e 's/./a/'
a                           <-- matched and replaced without an issue
$ echo 퐀 | sed -e 's/./a/'
퐀                          <-- FAILED to match so it doesn't replace

In detail, a character in the range [가-폿] (<UAC00>~<UD3FF>) is
matched without any issue, but a character in the range [퐀-힣]
(<UD400>~<UD7A3>) is NOT matched, although it SHOULD be.
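
To make the boundary between the two ranges concrete, the following is a minimal sketch that dumps the UTF-8 bytes of the three boundary characters. It is only an observation aid, not a diagnosis. `utf8_hex` is a hypothetical helper name introduced here, and the sketch assumes the script itself is saved as UTF-8 and that `od` and `tr` behave as specified by POSIX.

```shell
# utf8_hex is a hypothetical helper (not part of sed, grep, or glibc):
# it prints the bytes of its argument as lowercase hex, without separators.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

utf8_hex '폿'; echo   # <UD3FF>, last character that matches
utf8_hex '퐀'; echo   # <UD400>, first character that fails to match
utf8_hex '힣'; echo   # <UD7A3>, last character that fails to match
```

With a UTF-8 source file this should print ed8fbf, ed9080, and ed9ea3. The failing range therefore starts exactly where the second byte of the encoding moves from 0x8f to 0x90; whether that is related to the cause is an open question.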

Grep has the same issue with the period regex.

Reproducing the bug with Grep:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- matched successfully
$ echo 퐀 | grep .
$                    <-- failed to match
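
The two manual checks above can be combined into one loop over the locales mentioned earlier. This is a rough sketch under some assumptions: `probe` is a hypothetical helper name, `LC_ALL` is used instead of `LC_CTYPE` so that no other locale variable interferes, and every listed locale is assumed to be installed (a missing locale typically falls back to the C locale silently, which makes the result meaningless for multibyte characters).

```shell
# probe is a hypothetical helper: it reports whether the regex '.' matches
# the given character under the given locale, for both grep and sed.
probe() {
    loc=$1
    ch=$2
    # grep -q exits 0 only if some line matched
    if printf '%s\n' "$ch" | LC_ALL=$loc grep -q .; then g=match; else g=FAIL; fi
    # sed should turn a single-character line into exactly "a"
    out=$(printf '%s\n' "$ch" | LC_ALL=$loc sed -e 's/./a/')
    if [ "$out" = a ]; then s=match; else s=FAIL; fi
    printf '%s %s grep=%s sed=%s\n' "$loc" "$ch" "$g" "$s"
}

# Assumes these locales are installed on the system.
for loc in C.UTF-8 en_US.UTF-8 ko_KR.UTF-8 ja_JP.UTF-8; do
    for ch in 폿 퐀; do
        probe "$loc" "$ch"
    done
done
```

On an affected system, the lines for 퐀 would be expected to show FAIL for both tools while the lines for 폿 show match.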

I think it is related to <regex.h> or <iconv.h> in Glibc, but I
couldn't find a way to reproduce the bug with those directly, so I am
reporting it against Sed instead.

I have also reported this issue to the bug-grep list.

This bug report was last modified 3 years and 20 days ago.


