GNU bug report logs -
#56350
UTF-8 LC_CTYPE bug esp when a certain range of Korean characters
Reported by: git <at> taeyeob.kim
Date: Sat, 2 Jul 2022 09:30:02 UTC
Severity: normal
Merged with 56352
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
Grep (and also Sed) cannot match a certain range of Korean characters
when it runs under LC_CTYPE=C.UTF-8. The same happens in any other
locale with UTF-8 encoding, including en_US.UTF-8, ko_KR.UTF-8, and
ja_JP.UTF-8.
Reproduce the bug:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿    <-- a character that is in the range [가-폿]
          (<UAC00>~<UD3FF>) is matched without any issue
$ echo 퐀 | grep .
$     <-- but a character in the range [퐀-힣]
          (<UD400>~<UD7A3>) CANNOT be matched, although it
          IS SUPPOSED TO be matched.
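For anyone comparing the two boundary characters, the following sketch prints their UTF-8 byte sequences. The helper name `utf8_hex` is made up for illustration, and it assumes the terminal and this text are UTF-8 encoded. It only shows where the encodings differ and does not establish the cause of the bug.

```shell
# Hypothetical helper: print the bytes of its argument as hex.
utf8_hex() {
    # od emits the bytes; word splitting via "set --" normalizes the spacing.
    set -- $(printf '%s' "$1" | od -An -tx1)
    echo "$*"
}

utf8_hex 폿   # U+D3FF, last character that matches: ed 8f bf
utf8_hex 퐀   # U+D400, first character that fails:  ed 90 80
```

The failing range starts exactly where the second UTF-8 byte changes from 0x8f to 0x90.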
Sed has the same issue with the period regex.
An example with Sed:
$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀 <-- FAILED to match, so nothing is replaced
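The two checks above can be wrapped in one small function so that other characters are easy to try. This is only a sketch: `check_dot` is a made-up name, and the result for a given character depends on the locale and library versions installed on the system where it runs.

```shell
# Hypothetical helper: report whether "." matches the argument
# in grep and in sed under the current locale.
check_dot() {
    if printf '%s\n' "$1" | grep -q .; then
        g='match'
    else
        g='no match'
    fi
    # "s/./&/p" with -n prints the line only if the substitution succeeded.
    if [ -n "$(printf '%s\n' "$1" | sed -n -e 's/./&/p')" ]; then
        s='match'
    else
        s='no match'
    fi
    echo "grep: $g, sed: $s"
}

export LC_CTYPE=C.UTF-8
check_dot 폿
check_dot 퐀
```

On an affected system the second call would be expected to report "no match" for both tools, mirroring the transcripts above.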
I think it is related to <regex.h> or <iconv.h> in Glibc, but I could
not find a way to reproduce the bug with those directly, so I am
reporting it against Grep instead.
This bug report was last modified 3 years and 20 days ago.