GNU bug report logs - #77392: ‘regexp-exec’ gets match boundaries wrong for multibyte strings
‘regexp-exec’ sometimes gets match boundaries wrong when operating on a
string that contains non-ASCII characters while running in the C locale
(this is with commit af96820e072d18c49ac03e80c6f3466d568dc77d):
--8<---------------cut here---------------start------------->8---
scheme@(guile-user)> ,use(ice-9 regex)
scheme@(guile-user)> (setlocale LC_ALL "C")
$52 = "C"
scheme@(guile-user)> (string-match "start (.*)"
                                   (string-append "start "
                                                  (string (integer->char 1002))))
$53 = #("start \u03ea" (0 . 8) (6 . 8))
scheme@(guile-user)> (match:substring $53 1)
ice-9/boot-9.scm:1683:22: In procedure raise-exception:
Value out of range 6 to< 7: 8
Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue.
--8<---------------cut here---------------end--------------->8---
The attached program triggers more such failures, using randomly chosen
characters. (The example above works well under a UTF-8 locale.)
So I believe ‘fixup_multibyte_match’ isn’t quite correct.
Ludo’.
PS: This originates in <https://issues.guix.gnu.org/77283>.
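For illustration, the offsets in the transcript can be compared with the
sizes of the string. The string has 7 characters, which matches the "6 to< 7"
range in the error. If the match offsets were computed over a UTF-8 encoding
of the string (an assumption; the report does not say which encoding the
C-locale code path uses), it would occupy 8 bytes, which is the end offset in
(0 . 8). The sketch below shows both counts, plus a hypothetical helper,
‘utf8-byte-offset->char-offset’, that maps a byte offset back to a character
index. It is not Guile's ‘fixup_multibyte_match’, only a sketch of the kind
of conversion the boundaries would need.

```scheme
(use-modules (rnrs bytevectors))

;; The string from the transcript: "start " followed by U+03EA.
(define str
  (string-append "start " (string (integer->char 1002))))

;; Number of characters, which is what 'match:substring' indexes by.
(define char-count (string-length str))

;; Number of bytes, ASSUMING a UTF-8 encoding of the string.
(define byte-count (bytevector-length (string->utf8 str)))

;; Hypothetical helper (not part of Guile): convert an offset counted in
;; UTF-8 bytes into an offset counted in characters, by walking the
;; string and summing the encoded size of each character.
(define (utf8-byte-offset->char-offset str byte-offset)
  (let loop ((i 0) (bytes 0))
    (if (or (>= bytes byte-offset)
            (= i (string-length str)))
        i
        (loop (+ i 1)
              (+ bytes
                 (bytevector-length
                  (string->utf8 (string (string-ref str i)))))))))
```

Under that assumption, (utf8-byte-offset->char-offset str 8) yields 7 and
(utf8-byte-offset->char-offset str 6) yields 6, that is, the boundaries
(6 . 7) that ‘match:substring’ would accept for this string.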
[regexp-unicode-ascii.scm (text/plain, inline)]
(use-modules (ice-9 regex))

(define rx
  (make-regexp "^start (.*)"))

(setlocale LC_ALL "C")

(let loop ()
  (let* ((i (+ 256 (random (expt 2 10))))
         (str (string-append "start " (string (integer->char i)))))
    (with-exception-handler
        (lambda (exc)
          (pk 'exc exc '<-- i)
          (display-backtrace (make-stack #t) (current-error-port))
          (exit 1))
      (lambda ()
        (match:substring (regexp-exec rx str) 1)))
    (loop)))
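Since the program above picks code points at random and exits at the first
failure, a deterministic variant may be easier to compare across runs. The
following is a minimal sketch of one, not part of the original attachment:
the procedure name ‘failing-code-points’ is hypothetical, and the sketch
treats any exception raised while extracting the submatch as a failure.
Which code points it reports, if any, depends on the Guile build being
tested.

```scheme
(use-modules (ice-9 regex))

(define rx
  (make-regexp "^start (.*)"))

(setlocale LC_ALL "C")

;; Hypothetical helper: return the list of code points I in [FROM, TO)
;; for which extracting submatch 1 of "start <char I>" raises an
;; exception.  The range must not include surrogates (#xD800-#xDFFF),
;; which 'integer->char' rejects.
(define (failing-code-points from to)
  (let loop ((i from) (failures '()))
    (if (>= i to)
        (reverse failures)
        (let* ((str (string-append "start " (string (integer->char i))))
               ;; 'false-if-exception' returns #f if the body raises.
               (ok? (false-if-exception
                     (begin
                       (match:substring (regexp-exec rx str) 1)
                       #t))))
          (loop (+ i 1)
                (if ok? failures (cons i failures)))))))
```

Calling (failing-code-points 256 1280) covers the same range of code points
that the random program above draws from.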
This bug report was last modified 81 days ago.