GNU bug report logs - #56351
LC_CTYPE=C.UTF-8 causes a matching error on Sed


Package: sed;

Reported by: git <at> taeyeob.kim

Date: Sat, 2 Jul 2022 09:30:03 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.




Report forwarded to bug-sed <at> gnu.org:
bug#56351; Package sed. (Sat, 02 Jul 2022 09:30:03 GMT)

Acknowledgement sent to git <at> taeyeob.kim:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Sat, 02 Jul 2022 09:30:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: KIM Taeyeob <git <at> taeyeob.kim>
To: bug-sed <at> gnu.org
Subject: LC_CTYPE=C.UTF-8 causes a matching error on Sed
Date: Sat, 02 Jul 2022 14:03:10 +0900
Sed (and also Grep) fails to match a certain range of Korean characters
when it runs with LC_CTYPE=C.UTF-8, or with any other UTF-8 locale such
as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8.

Reproducing the bug with Sed:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | sed -e 's/./a/'
a                           <-- matched and replaced without an issue
$ echo 퐀 | sed -e 's/./a/'
퐀                          <-- FAILED to match, so nothing was replaced

In detail, a character in the range [가-폿] (<UAC00> to <UD3FF>) is
matched without any issue, but a character in the range [퐀-힣]
(<UD400> to <UD7A3>) is NOT matched, although it SHOULD be.
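
The boundary described above can also be viewed as UTF-8 bytes. The following Python sketch is not part of the original report, and `describe` is a hypothetical helper name. It prints the code point and UTF-8 encoding of the first and last character of each range. It only shows where the boundary falls; the report itself does not state the root cause.

```python
def describe(ch: str) -> str:
    """Return the code point and the UTF-8 bytes (hex) of one character."""
    encoded = ch.encode("utf-8")
    return f"U+{ord(ch):04X} -> {encoded.hex(' ').upper()}"


# First and last character of the matching and the non-matching range.
for ch in ("가", "폿", "퐀", "힣"):
    print(ch, describe(ch))
```

The last character that matches, 폿, encodes as ED 8F BF, and the first one that fails, 퐀, encodes as ED 90 80, so the boundary coincides with the second byte going from 0x8F to 0x90.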

Grep has the same issue with the period (.) regex.

Reproducing the bug with Grep:
$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿                   <-- matched successfully
$ echo 퐀 | grep .
$                    <-- failed to match
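
To check the whole Hangul syllable block rather than two sample characters, a sweep along these lines could be used. This is a minimal sketch, not something from the original report. It assumes `sed` is on the PATH and that the C.UTF-8 locale is installed. The helper names (`sed_dot_flags`, `unmatched_ranges`, `report`) are hypothetical.

```python
import os
import subprocess

HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # 가 .. 힣


def sed_dot_flags(code_points, locale="C.UTF-8"):
    """Run sed 's/./a/' once, one character per line; True where '.' matched."""
    # LC_ALL would override LC_CTYPE, so drop it from the child environment.
    env = {k: v for k, v in os.environ.items() if k != "LC_ALL"}
    env["LC_CTYPE"] = locale
    data = "".join(chr(cp) + "\n" for cp in code_points).encode("utf-8")
    out = subprocess.run(["sed", "-e", "s/./a/"], input=data,
                         stdout=subprocess.PIPE, env=env, check=True).stdout
    # A line that became exactly "a" means the whole character was replaced.
    return [line == b"a" for line in out.split(b"\n")[:-1]]


def unmatched_ranges(code_points, flags):
    """Group the code points whose flag is False into (first, last) runs."""
    ranges = []
    for cp, matched in zip(code_points, flags):
        if matched:
            continue
        if ranges and ranges[-1][1] == cp - 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))  # start a new run
    return ranges


def report(locale="C.UTF-8"):
    """Print every run of Hangul syllables that '.' failed to match."""
    cps = range(HANGUL_FIRST, HANGUL_LAST + 1)
    for first, last in unmatched_ranges(cps, sed_dot_flags(cps, locale)):
        print(f"unmatched: U+{first:04X}..U+{last:04X}")
```

Calling `report()` on an affected build would, if the range given in this report is accurate, print a single run from U+D400 to U+D7A3. A fixed build should print nothing.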

I think it is related to <regex.h> or <iconv.h> in Glibc, but I
couldn't find a way to reproduce the bug with those directly, so I am
reporting it against Sed instead.

I have also reported this issue to the bug-grep list.




Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Sat, 02 Jul 2022 22:58:01 GMT)

Notification sent to git <at> taeyeob.kim:
bug acknowledged by developer. (Sat, 02 Jul 2022 22:58:02 GMT)

Message #10 received at 56351-done <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: git <at> taeyeob.kim
Cc: 56351-done <at> debbugs.gnu.org
Subject: LC_CTYPE=C.UTF-8 causes a matching error on Sed
Date: Sat, 2 Jul 2022 17:57:18 -0500
Thanks for reporting that. This bug was introduced in Sed 4.8. I 
propagated the Gnulib fix into the Sed development tree, here:

https://git.savannah.gnu.org/cgit/sed.git/commit/?id=bfdc4d6ee4811c34d8756fcca7895f5d2eed6946

https://git.savannah.gnu.org/cgit/sed.git/commit/?id=49c90357b9a07fc78904660f68c2e6acd236da9d

and the bug should be fixed in the next Sed release.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 31 Jul 2022 11:24:08 GMT)

This bug report was last modified 3 years and 20 days ago.


