From unknown Tue Aug 19 07:27:33 2025 X-Loop: help-debbugs@gnu.org Subject: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters Resent-From: KIM Taeyeob Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 02 Jul 2022 09:30:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 56350 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 56350@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Reply-To: git@taeyeob.kim Received: via spool by submit@debbugs.gnu.org id=B.16567541477732 (code B ref -1); Sat, 02 Jul 2022 09:30:02 +0000 Received: (at submit) by debbugs.gnu.org; 2 Jul 2022 09:29:07 +0000 Received: from localhost ([127.0.0.1]:39784 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7ZR3-00020Y-Ra for submit@debbugs.gnu.org; Sat, 02 Jul 2022 05:29:07 -0400 Received: from lists.gnu.org ([209.51.188.17]:33100) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7U3n-00078V-JR for submit@debbugs.gnu.org; Fri, 01 Jul 2022 23:44:43 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:45982) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1o7U3m-000237-CI for bug-grep@gnu.org; Fri, 01 Jul 2022 23:44:43 -0400 Received: from mail.vielbein.com ([141.164.61.112]:50146) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1o7U3j-0005Sf-Ls for bug-grep@gnu.org; Fri, 01 Jul 2022 23:44:41 -0400 Received: from authenticated-user (PRIMARY_HOSTNAME [PUBLIC_IP]) by mail.vielbein.com (Postfix) with ESMTPA id 36C7E3E7A67 for ; Sat, 2 Jul 2022 03:44:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=taeyeob.kim; s=dkim; t=1656733470; h=from:from:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: 
content-transfer-encoding:content-transfer-encoding; bh=KHq4/pdAQYGzfNN6MxXnasHq/fbXzBcQC0ilzr5XVDc=; b=Jod2Kbc8vfNOlrQZ+8BtYX/0/SCm/nPYjHDpvr+HtctY//0iztGZCHw3g0AbRHU9vjPiG/ Cy2G+SsGDUBFukbbxLVLAjizauW78ttSX9Xp6SxbTfZSr/WN9ZL++vAQvYGOk66n/frwfR gbitCN6HcbZ2c+TYP/v+jSbrVWn4RLI= MIME-Version: 1.0 Date: Sat, 02 Jul 2022 12:44:29 +0900 From: KIM Taeyeob Mail-Reply-To: git@taeyeob.kim Message-ID: X-Sender: git@taeyeob.kim Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=taeyeob.kim; s=dkim; t=1656733470; h=from:from:reply-to:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding; bh=KHq4/pdAQYGzfNN6MxXnasHq/fbXzBcQC0ilzr5XVDc=; b=eKjgedn/jdzai6hJECrm+CHV2BpLl+PeUr4fYYYspbR/edDgP+mo4vRXowlGxya8j5yYlZ OmQTiUROCL0R26377OA8yTXokPgxmbjnz8GnBLbTlcgpMV007s64R/1xMxo/+szPh0FVHz mtVOel4B7EnLsRvRUY+o+BltrCSbFH0= ARC-Seal: i=1; s=dkim; d=taeyeob.kim; t=1656733470; a=rsa-sha256; cv=none; b=HBoVgBSGBJ3zbt5TNUmDih86epHxCD1sMuEYo8rMo0LJEv4DAZBu08qOB4MNPSDo+NDWog VlCK6XZFfte7TZVmdJmSsKK6XURg/5rVgL0PTVCUj8vuxO/hG3DXMQcyb7ANbaSqHRJAdp zj/3pDs5bmSrmGvO6RSInUe/Lpbt4oE= ARC-Authentication-Results: i=1; mail.vielbein.com; auth=pass smtp.auth=i@taeyeob.kim smtp.mailfrom=git@taeyeob.kim Authentication-Results: mail.vielbein.com; auth=pass smtp.auth=i@taeyeob.kim smtp.mailfrom=git@taeyeob.kim X-Spamd-Bar: / Received-SPF: pass client-ip=141.164.61.112; envelope-from=git@taeyeob.kim; helo=mail.vielbein.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01, UNPARSEABLE_RELAY=0.001 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.4 (-) X-Mailman-Approved-At: Sat, 02 
Jul 2022 05:29:04 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.4 (--)

Grep (and also Sed) fails to match a certain range of Korean characters when run with LC_CTYPE=C.UTF-8 (or with any other UTF-8 locale, such as en_US.UTF-8, ko_KR.UTF-8, or ja_JP.UTF-8).

To reproduce the bug:

$ export LC_CTYPE=C.UTF-8
$ echo 폿 | grep .
폿 <-- a character in the range [가-폿] (~) is matched without any issue
$ echo 퐀 | grep .
$ <-- but a character in the range [퐀-힣] (~) is NOT matched, although it IS SUPPOSED TO be matched.

Sed has the same issue with the period regex. Sed example:

$ export LC_CTYPE=C.UTF-8
$ echo "폿" | sed -e 's/./a/'
a <-- matched and replaced without an issue
$ echo "퐀" | sed -e 's/./a/'
퐀 <-- FAILED to match, so nothing is replaced

I think it is related with or on Glibc, but I couldn't find a way to reproduce the bug with those, so I am reporting it against Grep instead.

From unknown Tue Aug 19 07:27:33 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: git@taeyeob.kim Subject: bug#56350: closed (Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters) Message-ID: References: <6dc73457-0b41-ce63-c4c1-9c329848c766@cs.ucla.edu> X-Gnu-PR-Message: they-closed 56350 X-Gnu-PR-Package: grep Reply-To: 56350@debbugs.gnu.org Date: Sat, 02 Jul 2022 21:29:01 +0000 Content-Type: multipart/mixed; boundary="----------=_1656797341-20353-1" This is a multi-part message in MIME format... 
------------=_1656797341-20353-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters which was filed against the grep package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 56350@debbugs.gnu.org. --=20 56350: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D56350 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1656797341-20353-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 56350-done) by debbugs.gnu.org; 2 Jul 2022 21:28:54 +0000 Received: from localhost ([127.0.0.1]:42971 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7kfd-0005Hs-Tq for submit@debbugs.gnu.org; Sat, 02 Jul 2022 17:28:54 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:59748) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7kfZ-0005HZ-J3; Sat, 02 Jul 2022 17:28:52 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id CBC88160143; Sat, 2 Jul 2022 14:28:43 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id khXGx1qB9Hys; Sat, 2 Jul 2022 14:28:42 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 31CB3160145; Sat, 2 Jul 2022 14:28:42 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 27yEds6_iLhz; Sat, 2 Jul 2022 14:28:41 -0700 (PDT) Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 954F8160143; Sat, 2 Jul 
2022 14:28:41 -0700 (PDT) Message-ID: <6dc73457-0b41-ce63-c4c1-9c329848c766@cs.ucla.edu> Date: Sat, 2 Jul 2022 16:28:40 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Subject: Re: bug#56350: UTF-8 LC_CTYPE bug esp when a certain range of Korean characters Content-Language: en-US To: git@taeyeob.kim, =?UTF-8?B?6rmA7YOc7Je9?= References: From: Paul Eggert In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 56350-done Cc: 56350-done@debbugs.gnu.org, 56352-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Thanks, that's a Gnulib bug that was fixed here: https://git.savannah.gnu.org/cgit/gnulib.git/commit/?id=b19a10775e54f8ed17e3a8c08a72d261d8c26244 This has been propagated to GNU Grep and the fix should appear in the next Grep release. I plan to reply separately about GNU Sed. 
------------=_1656797341-20353-1-- From debbugs-submit-bounces@debbugs.gnu.org Sat Jul 02 17:35:33 2022 Received: (at control) by debbugs.gnu.org; 2 Jul 2022 21:35:33 +0000 Received: from localhost ([127.0.0.1]:42986 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7km4-0005Ty-R7 for submit@debbugs.gnu.org; Sat, 02 Jul 2022 17:35:32 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:60430) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7km3-0005Tk-C8 for control@debbugs.gnu.org; Sat, 02 Jul 2022 17:35:31 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id EE112160143 for ; Sat, 2 Jul 2022 14:35:25 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id cabCBq82CgI5 for ; Sat, 2 Jul 2022 14:35:25 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 5DEB5160145 for ; Sat, 2 Jul 2022 14:35:25 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, 
port 10026) with ESMTP id 2ky6qqoGunf7 for ; Sat, 2 Jul 2022 14:35:25 -0700 (PDT) Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 2040B160143 for ; Sat, 2 Jul 2022 14:35:25 -0700 (PDT) Message-ID: Date: Sat, 2 Jul 2022 16:35:24 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Content-Language: en-US To: control@debbugs.gnu.org From: Paul Eggert Subject: 56350 and 56352 are the same Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) merge 56350 56352
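
The reproduction in the original report can be turned into a small check, which may help when verifying whether an installed grep already carries the Gnulib fix. The sketch below first prints the UTF-8 bytes of the two boundary characters from the report (폿 is U+D3FF, 퐀 is U+D400), then re-runs the reporter's `grep .` test. The helper name `utf8_hex` is made up for this illustration. The thread does not say that the change in the second byte (0x8f to 0x90) is the cause of the bug, so treat that as an observation about the encoding only. It assumes a POSIX shell with `od` and `tr`, a script saved as UTF-8, and an installed C.UTF-8 locale; `LC_ALL` is used here instead of the report's `LC_CTYPE` so that any other locale settings are overridden.

```shell
#!/bin/sh
# utf8_hex is a hypothetical helper (not part of grep, sed, or Gnulib).
# It prints the bytes of its argument as lowercase hex, without separators.
utf8_hex() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

utf8_hex 폿   # U+D3FF, matched in the report; prints ed8fbf
utf8_hex 퐀   # U+D400, not matched in the report; prints ed9080

# Re-run the reporter's check: exit status 0 from grep means '.' matched.
# Assumption: the C.UTF-8 locale exists on this system.
for ch in 폿 퐀; do
  if printf '%s\n' "$ch" | LC_ALL=C.UTF-8 grep -q .; then
    echo "$ch: matched"
  else
    echo "$ch: NOT matched"
  fi
done
```

On a grep build that includes the fix, both characters should be reported as matched; on an affected build, the second one is expected to fail as described in the report.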