From: bug#66236 <66236@debbugs.gnu.org>
To: bug#66236 <66236@debbugs.gnu.org>
Subject: Status: Specific Korean characters break Unicode parsing
Reply-To: bug#66236 <66236@debbugs.gnu.org>
Date: Sun, 15 Jun 2025 15:24:58 +0000

retitle 66236 Specific Korean characters break Unicode parsing
reassign 66236 sed
submitter 66236 kristian.jarventaus@clausal.com
severity 66236 normal
thanks

Message-ID: <5eb1ebf4-2180-0244-7d27-7d1e186fe3c7@clausal.com>
Date: Wed, 27 Sep 2023 14:38:15 +0300
To: bug-sed@gnu.org
From: Kristian Järventaus
Subject: Specific Korean characters break Unicode parsing
Organization: Clausal Computing
Reply-To: kristian.jarventaus@clausal.com

sed (GNU sed) 4.8
Packaged by Debian

Issue:

I have a bunch of data that I want to clean up in the form

====
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash: 86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash: 3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash: f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash: bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash: ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash: 40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5
=====

(title contains basically all article titles from en.wiktionary.org, so tons and tons of Unicode, from everywhere in the Unicode set)

However, certain Hangeul (Korean) characters break *something*. After doing some replacements on data that looks like the above, I am always left with a bunch of lines with Korean titles.

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'

Output:

======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out if there was some kind of weird end-of-line character or something else that would stop the regex from processing. In all the faulty examples (all with Korean titles) I could find one shared byte: what shows up as M-m in `cat -v` output, 237 decimal (0xED, 'm' + 128).

=====
'허공'  'M-mM-^WM-^HM-jM-3M-5'
title='평의회'  title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'  title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로'  'M-mM-^TM-^DM-kM-!M-^\'
기계화  M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다  M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터  M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해  M-jM-3M- M-mM-^UM-4
얼큰하다  M-lM-^VM-
=====

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my initial suspicion was wrong).
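
The M-m observation can be cross-checked outside sed. What follows is only a minimal Python sketch, assuming the input is valid UTF-8; `cat_v` and `chars_with_lead_byte` are hypothetical helper names made up for this illustration, and `cat_v` is a simplified imitation of `cat -v` (it does not special-case tab or newline). It lists which characters of a title are encoded with the lead byte 0xED, which in UTF-8 introduces code points from U+D000 upward.

```python
def cat_v(data: bytes) -> str:
    """Render bytes roughly the way `cat -v` does (simplified sketch)."""
    out = []
    for b in data:
        prefix = ""
        if b >= 0x80:
            # High bit set: print "M-" and continue with the low 7 bits.
            prefix, b = "M-", b - 0x80
        if b < 0x20:
            # Control range: caret notation, e.g. 0x17 -> ^W.
            out.append(prefix + "^" + chr(b + 0x40))
        elif b == 0x7F:
            out.append(prefix + "^?")
        else:
            out.append(prefix + chr(b))
    return "".join(out)


def chars_with_lead_byte(text: str, lead: int = 0xED) -> list:
    """Return the characters whose UTF-8 encoding starts with `lead`."""
    return [ch for ch in text if ch.encode("utf-8")[0] == lead]


if __name__ == "__main__":
    # Titles taken from the report above.
    for title in ["허공", "외출하다", "외출다", "부도덕하다", "부도덕다"]:
        print(title, cat_v(title.encode("utf-8")), chars_with_lead_byte(title))
```

For the sample titles this flags 허 and 하, and nothing in 외출다 or 부도덕다, so a title is affected exactly when it contains a character whose encoding begins with 0xED.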
As far as I can tell, if my Unicode is correct (and I have little reason to believe it is mangled: the file contains the titles of basically every en.wiktionary.org article, so not just Korean and ASCII), the presence of a character containing the M-m byte breaks Unicode parsing for the rest of the line. Any later specific match (like the second ["\x27]) then fails, as if the Unicode 'cursor' were out of sync or something similar.

I can confirm that the presence of specific characters is the cause by eliminating individual characters:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test

=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====

Every single occurrence of this issue that I found (and there were many of them, because the data is very big) had an M-m byte somewhere in the Hangeul.

I can't reproduce this on https://sed.js.org/; there the output is as expected.

-- 
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus@clausal.com
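
For comparison, the expected result of the elimination test can also be produced with a regex engine that matches on decoded characters rather than raw bytes. This is only a hedged cross-check using Python's `re` module, not a statement about how sed works internally; `clean` is a hypothetical helper name, and the pattern is my translation of the sed expression above from BRE syntax (`\(`, `\+`, `\x27`) into Python syntax.

```python
import re

# The sed pattern from the report, translated to Python regex syntax.
PATTERN = re.compile(
    r"""GET_PAGE: (ForkPoolWorker-[0-9]+), title=["'](.+)["'],.*"""
)


def clean(line: str) -> str:
    """Rewrite one GET_PAGE line to 'worker, title'; other lines pass through."""
    return PATTERN.sub(r"\1, \2", line)


if __name__ == "__main__":
    # Lines taken from the kor.txt sample in the report.
    sample = [
        "GET_PAGE: ForkPoolWorker-19, title='외출하다',",
        "GET_PAGE: ForkPoolWorker-19, title='외출다',",
        "GET_PAGE: ForkPoolWorker-20, title='부도덕하다',",
        "GET_PAGE: ForkPoolWorker-20, title='부도덕다',",
    ]
    for line in sample:
        print(clean(line))
```

Here every sample line is rewritten, including the ones containing 하, which is the behaviour I expected from sed as well.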