Package: sed
Reported by: kristian.jarventaus <at> clausal.com
Date: Wed, 27 Sep 2023 12:44:01 UTC
Severity: normal
From: Kristian Järventaus <kristian <at> clausal.com>
To: 66236 <at> debbugs.gnu.org
Subject: bug#66236: Specific Korean characters break Unicode parsing
Date: Wed, 27 Sep 2023 14:38:15 +0300
sed (GNU sed) 4.8
Packaged by Debian

Issue:
I have a bunch of data that I want to clean up, in the form
====
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:munge text', hash: 86aa20ba5f2a310911fc93b32b7ef14de944b233f2894236ed236350cf467a4d
GET_PAGE: ForkPoolWorker-19, title='Module:ko-translit', hash: 3f795c903dc252d3dedad1f7100c22de324986980a475396aabcdd554b886897
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron', hash: f4dde115a55246e97c0a14ea30f6896d9759e748040b8d45ac9c60ebb073cdcb
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
GET_PAGE: ForkPoolWorker-19, title='Module:ko-pron/data', hash: bd4e173ed2d8f9140b524ba76d7c9862494d8fb798d8e756ea5229a830e815d9
GET_PAGE: ForkPoolWorker-19, title='Template:it-pr', hash: ecdb98dc9ac1387ad4f847c7bc2113fcafd016b2e7b44dc8ae806fcb83c95d62
GET_PAGE: ForkPoolWorker-19, title='traffica', hash: 40728b79d679469e655593a096dbf2780a92b584d1a79d296d3b24a1543832b5
=====
(title contains basically all article titles from en.wiktionary.org, so tons and tons of Unicode, from everywhere in the Unicode set)

However, certain Hangeul (Korean) characters break *something*. After doing some replacements on data that looks like the above, I am always left with a bunch of lines with Korean titles.

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27], hash.*/\1, \2/'

Output:
======
ForkPoolWorker-19, Template:ko-conj/verb
ForkPoolWorker-19, Template:affix
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
ForkPoolWorker-19, Module:munge text
ForkPoolWorker-19, Module:ko-translit
======

I tried to figure out if there was some kind of weird end-of-line character or something that would stop the regex from processing, and in all the faulty examples (all with Korean titles) I could find one shared byte: what is M-m in `cat -v` output, 237 decimal ('m' + 128).
=====
'허공''M-mM-^WM-^HM-jM-3M-5'
title='평의회'title='M-mM-^OM-^IM-lM-^]M-^XM-mM-^ZM-^L'
title='풍년화'title='M-mM-^RM-^MM-kM-^EM-^DM-mM-^YM-^T'
'프로''M-mM-^TM-^DM-kM-!M-^\'
기계화M-jM-8M-0M-jM-3M-^DM-mM-^YM-^T
맹세하다M-kM-'M-9M-lM-^DM-8M-mM-^UM-^XM-kM-^KM-$
애프터M-lM-^UM- M-mM-^TM-^DM-mM-^DM-0
고해M-jM-3M- M-mM-^UM-4
얼큰하다M-lM-^VM-<M-mM-^AM-0M-mM-^UM-^XM-kM-^KM-$
추가하다M-lM-6M-^TM-jM-0M-^@M-mM-^UM-^XM-kM-^KM-$
푼체M-mM-^QM-<M-lM-2M-4
목표어M-kM-*M-)M-mM-^QM-^\M-lM-^VM-4
=====

The version of the above command without anything after the capture block

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=\(.\+\)/\1, \2/'

parses correctly, because the .\+ captures to the end of the line (so my initial suspicion was wrong).

Afaict, if my Unicode is correct (and I don't have much reason to believe it is mangled; the file contains basically the titles of every en.wiktionary.org article, so not just Korean and ASCII), it seems that the presence of a character with the M-m byte causes the rest of the line to be broken Unicode-parsing-wise, which causes any specific regexes (like the second ["\x27]) to fail to match because the Unicode 'cursor' is out of sync or something similar.
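
The M-m observation can be checked at the byte level. `cat -v` prints byte 0xED (237) as M-m, and in UTF-8 0xED is the lead byte for code points U+D000 through U+D7FF, which includes the upper part of the Hangul syllable block (for example 하, U+D558, encoded as ED 95 98). The following is a minimal sketch, not part of the original report: it dumps those bytes and counts the lines of a small sample file that contain 0xED. The temporary sample file is hypothetical (its two lines are copied from the data above), and byte-wise matching via `LC_ALL=C grep` is assumed to behave as it does in GNU grep.

```shell
#!/bin/sh
# Bytes of 하 (U+D558) written as octal escapes: 0xED 0x95 0x98.
ha=$(printf '\355\225\230')

# Hex dump of the character; od spacing varies, so strip whitespace.
hex=$(printf '%s' "$ha" | od -An -tx1 | tr -d ' \n')
echo "bytes of U+D558: $hex"

# Hypothetical sample file holding two lines taken from the report.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GET_PAGE: ForkPoolWorker-19, title='외출하다', hash: 3ff685bdbf1db10566e9c3bbbc680a0ec6656e3b3d9752b5d10c9e6dcc08d648
GET_PAGE: ForkPoolWorker-19, title='Module:ko', hash: 8ebb346f32119102d15f4b464dcf178912f5ca4889ece0cbeed97ae198a6e743
EOF

# Count lines containing the raw byte 0xED; the C locale makes grep
# compare bytes instead of decoding UTF-8 (assumption: GNU grep).
count=$(LC_ALL=C grep -c "$(printf '\355')" "$sample")
echo "lines containing 0xED: $count"

rm -f "$sample"
```

Running the same count against the lines that sed leaves unchanged would show whether every failing line really contains 0xED, and whether any line containing 0xED is processed correctly.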

I can confirm that the presence of specific characters is the cause by eliminating individual characters:
====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
GET_PAGE: ForkPoolWorker-20, title='부도덕다',
GET_PAGE: ForkPoolWorker-20, title='부도덕하',
====

> sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/' kor.txt > kor.test

=====
GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='출하다',
GET_PAGE: ForkPoolWorker-19, title='외하다',
ForkPoolWorker-19, 외출다
GET_PAGE: ForkPoolWorker-19, title='외출하',
GET_PAGE: ForkPoolWorker-20, title='부도덕하다',
GET_PAGE: ForkPoolWorker-20, title='도덕하다',
GET_PAGE: ForkPoolWorker-20, title='부덕하다',
GET_PAGE: ForkPoolWorker-20, title='부도하다',
ForkPoolWorker-20, 부도덕다
=====

Every single occurrence of this issue that I found (and there were many of them, because the data is very big) had an M-m byte somewhere in the hangeul.

I can't reproduce this on https://sed.js.org/; there the output is as expected.

--
Kristian Järventaus
Research Assistant / Tutkimusavustaja
Clausal Computing Oy
kristian.jarventaus <at> clausal.com
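
For the elimination test above, one comparison that may help narrow things down is running the same substitution in the C locale, where sed matches byte by byte instead of decoding multibyte characters. Because the pattern anchors only on ASCII text (the quote characters and the comma), byte-wise matching should capture the same title. This is a sketch of a possible workaround, assuming GNU sed; it has not been verified against the affected Debian 4.8 build and is not a confirmed fix. The two input lines are copied from the test data.

```shell
#!/bin/sh
# Two lines from the elimination test: one that failed, one that worked.
input="GET_PAGE: ForkPoolWorker-19, title='외출하다',
GET_PAGE: ForkPoolWorker-19, title='외출다',"

# Same substitution as in the report, but forced into the C locale so
# that "." matches single bytes rather than decoded characters.
result=$(printf '%s\n' "$input" |
  LC_ALL=C sed 's/GET_PAGE: \(ForkPoolWorker-[0-9]\+\), title=["\x27]\(.\+\)["\x27],.*/\1, \2/')

printf '%s\n' "$result"
```

If both lines are rewritten here but the first one is left unchanged without `LC_ALL=C`, that would point at multibyte decoding in the UTF-8 locale rather than at the regex itself.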