GNU bug report logs - #66236
Specific Korean characters break Unicode parsing


Package: sed

Reported by: kristian.jarventaus <at> clausal.com

Date: Wed, 27 Sep 2023 12:44:01 UTC

Severity: normal



Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Kristian Järventaus <kristian <at> clausal.com>
To: bug-sed <at> gnu.org
Subject: Specific Korean characters break Unicode parsing
Date: Wed, 27 Sep 2023 14:38:15 +0300

sed (GNU sed) 4.8
Packaged by Debian


Issue: I have a large amount of data to clean up, in the following form:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 
3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash: 
86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash: 
3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash: 
f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 
8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash: 
bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash: 
ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash: 
40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5
=====

(The title field covers basically every article title on en.wiktionary.org, 
so it contains tons of Unicode from all over the Unicode set.)

However, certain Hangeul (Korean) characters break *something*. After 
running the substitution below on data that looks like the above, I am 
always left with a number of unconverted lines, all with Korean titles.

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'


Output:
======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 
3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out whether some kind of weird end-of-line character 
was stopping the regex from processing. In all the faulty examples (all 
with Korean titles) I could find one shared byte: the one shown as M-m in 
`cat -v` output, which is 237 decimal (0xED, 'm' + 128).

=====
'허공'  'M-mM-^WM-^HM-jM-3M-5'
title='평의회'  title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'  title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로'  'M-mM-^TM-^DM-kM-!M-^\'
기계화  M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다  M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터  M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해  M-jM-3M- M-mM-^UM-4
얼큰하다  M-lM-^VM-<M-mM-^AM-0M-mM-^UM-^XM-kM-^KM-$
추가하다  M-lM-6M-^TM-jM-0M-^@M-mM-^UM-^XM-kM-^KM-$
푼체  M-mM-^QM-<M-lM-2M-4
목표어  M-kM-*M-)M-mM-^QM-^\M-lM-^VM-4
=====
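
The shared byte can be cross-checked outside sed. The following is only an illustrative sketch, not part of the original session: `utf8_hex` is a hypothetical helper name, and it assumes a POSIX shell with `od` and `tr` and a script file saved as UTF-8. It dumps the bytes of one syllable that carries the M-m byte (하) and one that does not (다).

```shell
# Hypothetical helper: print the UTF-8 bytes of "$1" as contiguous lowercase hex.
utf8_hex() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

utf8_hex '하'   # ed9598: 0xED (237) is the byte cat -v shows as M-m
utf8_hex '다'   # eb8ba4: lead byte 0xEB, shown by cat -v as M-k
```

Both byte sequences agree with the `cat -v` output for 맹세하다 above, which ends in M-mM-^UM-^X (하) followed by M-kM-^KM-$ (다).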

The version of the above command without anything after the capture block

>  sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my 
initial suspicion was wrong). As far as I can tell, if my Unicode is correct 
(and I don't have much reason to believe it is mangled; the file contains 
basically the titles of every en.wiktionary.org article, so not just 
Korean and ASCII), the presence of a character with the M-m byte breaks 
Unicode parsing for the rest of the line. Any specific regex element that 
follows (like the second ["\x27]) then fails to match, as if the Unicode 
'cursor' were out of sync or something similar.
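
One way to probe whether multibyte handling is involved is to run the same substitution in the C locale, where GNU sed treats every byte as a character and `.` matches bytes rather than decoded code points. This is a diagnostic sketch under assumptions, not a confirmed fix: `strip_c_locale` is a hypothetical helper name, GNU sed is assumed (for `\+` and `\x27`), and each record is assumed to sit on a single line with its hash.

```shell
# Hypothetical helper: apply the report's substitution with locale-aware
# multibyte decoding switched off (LC_ALL=C), so '.' matches single bytes.
strip_c_locale() {
    LC_ALL=C sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'
}

# Feed one of the failing records through the helper.
printf '%s\n' "GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648" \
    | strip_c_locale
```

If the line converts here but not in a UTF-8 locale, that would point at the multibyte matching path rather than at the data. The byte-wise approach is safe for this pattern only because it never counts characters; a regex that relies on character counts or non-ASCII bracket expressions would behave differently in the C locale.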

I can confirm that the presence of specific characters is the cause by 
eliminating individual characters:

====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====
> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test
=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====
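
The elimination result can be lined up against the bytes involved. The sketch below is an illustration under assumptions (`lead_byte` is a hypothetical helper name; a POSIX shell with `od`, `tr` and `cut` and a UTF-8 script file are assumed). It prints the first UTF-8 byte of each syllable in the first test title.

```shell
# Hypothetical helper: print the first UTF-8 byte of "$1" as two hex digits.
lead_byte() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n' | cut -c1-2
}

# Syllables of 외출하다, one per line with their lead byte.
for ch in 외 출 하 다; do
    printf '%s %s\n' "$ch" "$(lead_byte "$ch")"
done
# 외 ec
# 출 ec
# 하 ed
# 다 eb
```

Only 하 starts with 0xED (M-m), and 하 is also the only syllable whose removal let the substitution succeed in both test titles.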

Every single occurrence of this issue that I found (and there were many 
of them, because the data set is very large) had an M-m byte somewhere in 
the Hangeul.

I can't reproduce this on https://sed.js.org/; there the output is as 
expected.


-- 
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus <at> clausal.com




This bug report was last modified 1 year and 262 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.