From unknown Sun Jun 15 08:24:58 2025
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.509 (Entity 5.509)
Content-Type: text/plain; charset=utf-8
From: bug#66236 <66236@debbugs.gnu.org>
To: bug#66236 <66236@debbugs.gnu.org>
Subject: Status: Specific Korean characters break Unicode parsing
Reply-To: bug#66236 <66236@debbugs.gnu.org>
Date: Sun, 15 Jun 2025 15:24:58 +0000

retitle 66236 Specific Korean characters break Unicode parsing
reassign 66236 sed
submitter 66236 kristian.jarventaus@clausal.com
severity 66236 normal

thanks


From debbugs-submit-bounces@debbugs.gnu.org Wed Sep 27 08:43:35 2023
Received: (at submit) by debbugs.gnu.org; 27 Sep 2023 12:43:35 +0000
Received: from localhost ([127.0.0.1]:50835 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1qlTt6-00039e-HL
	for submit@debbugs.gnu.org; Wed, 27 Sep 2023 08:43:35 -0400
Received: from lists.gnu.org ([2001:470:142::17]:46106)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <kristian@clausal.com>) id 1qlSsL-00073d-Fo
 for submit@debbugs.gnu.org; Wed, 27 Sep 2023 07:38:44 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10])
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <kristian@clausal.com>)
 id 1qlSs2-0005N4-2X
 for bug-sed@gnu.org; Wed, 27 Sep 2023 07:38:22 -0400
Received: from imap.clausal.com ([212.16.102.130])
 by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <kristian@clausal.com>)
 id 1qlSrz-0006Lc-FS
 for bug-sed@gnu.org; Wed, 27 Sep 2023 07:38:21 -0400
Received: from [10.21.0.196] (lov.kaikki.org [212.16.102.136])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by imap.clausal.com (Postfix) with ESMTPSA id 5FBFA7A00DF
 for <bug-sed@gnu.org>; Wed, 27 Sep 2023 11:38:15 +0000 (UTC)
Message-ID: <5eb1ebf4-2180-0244-7d27-7d1e186fe3c7@clausal.com>
Date: Wed, 27 Sep 2023 14:38:15 +0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.15.1
Content-Language: en-US
To: bug-sed@gnu.org
From: =?UTF-8?Q?Kristian_J=c3=a4rventaus?= <kristian@clausal.com>
Subject: Specific Korean characters break Unicode parsing
Organization: Clausal Computing
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Received-SPF: pass client-ip=212.16.102.130; envelope-from=kristian@clausal.com;
 helo=imap.clausal.com
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001,
 SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Wed, 27 Sep 2023 08:43:31 -0400
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Reply-To: kristian.jarventaus@clausal.com
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

sed (GNU sed) 4.8
Packaged by Debian


Issue: I have a large amount of data that I want to clean up, in the
following form:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash: 86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash: 3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash: f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash: bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash: ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash: 40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5
=====

(The title field covers essentially every article title on
en.wiktionary.org, so it contains a very wide range of Unicode
characters.)

However, certain Hangeul (Korean) characters break *something*. After
running replacements on data that looks like the above, I am always left
with a number of unprocessed lines, all of them with Korean titles.

 > sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'


Output:
======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out whether some kind of odd end-of-line character or
similar was stopping the regex from matching. In all the faulty examples
(all with Korean titles) I could find one shared byte: what shows up as
M-m in `cat -v` output, 237 decimal (0xED, i.e. 'm' + 128).

=====
'허공''M-mM-^WM-^HM-jM-3M-5'
title='평의회'title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로''M-mM-^TM-^DM-kM-!M-^\'
기계화M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해M-jM-3M- M-mM-^UM-4
얼큰하다M-lM-^VM-<M-mM-^AM-0M-mM-^UM-^XM-kM-^KM-$
추가하다M-lM-6M-^TM-jM-0M-^@M-mM-^UM-^XM-kM-^KM-$
푼체M-mM-^QM-<M-lM-2M-4
목표어M-kM-*M-)M-mM-^QM-^\M-lM-^VM-4
=====
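
As a cross-check on the dump above, here is a minimal Python sketch (not
part of the original investigation; the helper names `cat_v` and
`lead_byte` are made up for illustration). It renders the UTF-8 bytes of
a string in roughly the notation `cat -v` uses and reports the first
byte of each character. It assumes the data is valid UTF-8.

```python
def cat_v(data: bytes) -> str:
    """Render bytes roughly the way `cat -v` does (hypothetical helper)."""
    out = []
    for b in data:
        if b in (9, 10):
            # plain TAB and newline are passed through unchanged
            out.append(chr(b))
            continue
        prefix = ""
        if b >= 128:
            # high bit set: print "M-" followed by the low 7 bits
            prefix = "M-"
            b -= 128
        if b < 32:
            out.append(prefix + "^" + chr(b + 64))
        elif b == 127:
            out.append(prefix + "^?")
        else:
            out.append(prefix + chr(b))
    return "".join(out)


def lead_byte(ch: str) -> int:
    """First byte of the UTF-8 encoding of a single character."""
    return ch.encode("utf-8")[0]


for title in ("허공", "외출하다"):
    print(title, cat_v(title.encode("utf-8")))
    for ch in title:
        print("   ", ch, hex(lead_byte(ch)))
```

In this sketch the characters whose first byte prints as 0xed are the
ones that show up with M-m in the `cat -v` output.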

The version of the above command without anything after the capture group

 >  sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my
initial suspicion was wrong). As far as I can tell, my Unicode is valid:
I don't have much reason to believe it is mangled, since the file
contains basically the titles of every en.wiktionary.org article, so not
just Korean and ASCII. It seems that the presence of a character
containing the M-m byte breaks Unicode parsing for the rest of the line,
so that anything the regex has to match after that point (like the
second ["\x27]) fails, as if the Unicode 'cursor' were out of sync or
something similar.

I can confirm that the presence of specific characters is the cause by 
eliminating individual characters:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====
 > sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test
=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====
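
The elimination test can be summarised with a short Python sketch (an
illustration only; `has_ed_byte` is a hypothetical helper, and the
titles are copied from kor.txt above). It splits the test titles by
whether their UTF-8 encoding contains the byte 0xED (237, M-m in
`cat -v`). The two titles without that byte are the two lines that sed
rewrote in the output shown above.

```python
# Titles copied from the kor.txt test input above.
titles = [
    "외출하다", "출하다", "외하다", "외출다", "외출하",
    "부도덕하다", "도덕하다", "부덕하다", "부도하다", "부도덕다", "부도덕하",
]


def has_ed_byte(text: str) -> bool:
    """True if the UTF-8 encoding of text contains the byte 0xED."""
    return 0xED in text.encode("utf-8")


with_ed = [t for t in titles if has_ed_byte(t)]
without_ed = [t for t in titles if not has_ed_byte(t)]

print("contain 0xED:", with_ed)
print("no 0xED:     ", without_ed)
```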

Every single occurrence of this issue that I found (and there were many
of them, because the data is very big) had an M-m byte somewhere in the
Hangeul.

I can't reproduce this on https://sed.js.org/; there the output is as
expected.
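
For comparison, the same substitution can be written for another regex
engine. The following is a minimal sketch using Python's `re` module,
with the sed expression translated by hand (`\(`, `\)` and `\+` become
`(`, `)` and `+`; the name `clean` is made up for illustration). It only
shows the expected output for one of the failing lines and says nothing
about what happens inside sed.

```python
import re

# Hand-translated equivalent of the sed expression from this report.
PATTERN = re.compile(
    r"""GET_PAGE: (ForkPoolWorker-[0-9]+), title=["'](.+)["'], hash.*"""
)


def clean(line: str) -> str:
    """Apply the substitution to one line (hypothetical helper)."""
    return PATTERN.sub(r"\1, \2", line)


# Sample line taken from the data at the top of this report.
line = (
    "GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: "
    "3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648"
)
print(clean(line))
```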


-- 
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus@clausal.com