From: bug#77392 <77392@debbugs.gnu.org>
To: bug#77392 <77392@debbugs.gnu.org>
Subject: Status: ‘regexp-exec’ gets match boundaries wrong for multibyte strings
Date: Fri, 20 Jun 2025 14:10:19 +0000

retitle 77392 ‘regexp-exec’ gets match boundaries wrong for multibyte strings
reassign 77392 guile
submitter 77392 Ludovic Courtès
severity 77392 normal
thanks


From: Ludovic Courtès
To: bug-guile@gnu.org
Subject: ‘regexp-exec’ gets match boundaries wrong for multibyte strings
Date: Sun, 30 Mar 2025 22:54:24 +0200
Message-ID: <87iknqdytb.fsf@inria.fr>

‘regexp-exec’ sometimes gets match boundaries wrong when operating on a
Unicode string but in a C locale (this is with
af96820e072d18c49ac03e80c6f3466d568dc77d):

--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (setlocale LC_ALL "C")
$52 = "C"
scheme@(guile-user)> (string-match "start (.*)" (string-append "start " (string (integer->char 1002))))
$53 = #("start \u03ea" (0 . 8) (6 . 8))
scheme@(guile-user)> (match:substring $53 1)
ice-9/boot-9.scm:1683:22: In procedure raise-exception:
Value out of range 6 to< 7: 8

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
--8<---------------cut here---------------end--------------->8---

The attached program produces more failures at random.  (The example
above works well under a UTF-8 locale.)

So I believe ‘fixup_multibyte_match’ isn’t quite correct.

Ludo’.

PS: This originates in .

Attachment: regexp-unicode-ascii.scm

(use-modules (ice-9 regex))

(define rx (make-regexp "^start (.*)"))

(setlocale LC_ALL "C")

(let loop ()
  (let* ((i (+ 256 (random (expt 2 10))))
         (str (string-append "start " (string (integer->char i)))))
    (with-exception-handler
        (lambda (exc)
          (pk 'exc exc '<-- i)
          (display-backtrace (make-stack #t) (current-error-port))
          (exit 1))
      (lambda ()
        (match:substring (regexp-exec rx str) 1)))
    (loop)))
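
A note on the offsets in the transcript: the match structure reports the ranges (0 . 8) and (6 . 8), while the error message complains that 8 is out of range for an upper bound of 7. The sketch below is not part of the original report. It compares the length of the same string in characters with its length in bytes. The choice of UTF-8 is an assumption made for illustration only, since the report does not say which encoding is used internally under the C locale.

```scheme
(use-modules (rnrs bytevectors))

;; Same string as in the report: "start " followed by U+03EA.
(define str
  (string-append "start " (string (integer->char 1002))))

;; Length in characters; `match:substring' indexes the string by character.
(define char-count (string-length str))

;; Length in bytes once encoded as UTF-8 (assumed encoding, for illustration).
(define byte-count (bytevector-length (string->utf8 str)))

(display "characters: ")
(display char-count)
(newline)
(display "UTF-8 bytes: ")
(display byte-count)
(newline)
```

This prints 7 characters and 8 bytes. The reported end offset of 8 is therefore consistent with a byte offset that was not converted back to a character offset, though this alone does not confirm where the fault lies.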