GNU bug report logs - #57507
Regular expression matching depends on locale encoding
To reply to this bug, email your comments to 57507 AT debbugs.gnu.org.
Report forwarded to bug-guile <at> gnu.org: bug#57507; Package guile. (Wed, 31 Aug 2022 16:55:02 GMT)
Acknowledgement sent to Jean Abou Samra <jean <at> abou-samra.fr>: New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Wed, 31 Aug 2022 16:55:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Regular expressions do funky things with Unicode if a non-Unicode-aware
locale is set. Yet, they're purely string operations, so I don't think
it's expected that they depend on the locale encoding.
$ LC_ALL=C guile3.0
GNU Guile 3.0.7
Copyright (C) 1995-2021 Free Software Foundation, Inc.
Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.
Enter `,help' for help.
scheme@(guile-user)> (use-modules (ice-9 regex))
scheme@(guile-user)> (match:substring (string-match "\u203f" "\u3091"))
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
In procedure make-regexp: Invalid preceding regular expression
Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> ,q
scheme@(guile-user)> (match:substring (string-match "[\u203f]" "\u3091"))
$1 = "\u3091"
scheme@(guile-user)>
Information forwarded to bug-guile <at> gnu.org: bug#57507; Package guile. (Thu, 01 Sep 2022 19:35:02 GMT)
Message #8 received at 57507 <at> debbugs.gnu.org (full text, mbox):
Also remember that Guile uses the system C library's regex routines, and
that those operate on C strings, not Guile strings.
(sorry for top post, too tired to fight with this web editor)
-Dale
-----------------------------------------
From: "Jean Abou Samra"
To: 57507 <at> debbugs.gnu.org
Cc:
Sent: Wednesday August 31 2022 12:55:13PM
Subject: bug#57507: Regular expression matching depends on locale encoding
Regular expressions do funky things with Unicode if a non-Unicode-aware
locale is set. Yet, they're purely string operations, so I don't think
it's expected that they depend on the locale encoding.
$ LC_ALL=C guile3.0
GNU Guile 3.0.7
Copyright (C) 1995-2021 Free Software Foundation, Inc.
Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.
Enter `,help' for help.
scheme@(guile-user)> (use-modules (ice-9 regex))
scheme@(guile-user)> (match:substring (string-match "\u203f" "\u3091"))
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
In procedure make-regexp: Invalid preceding regular expression
Entering a new prompt. Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> ,q
scheme@(guile-user)> (match:substring (string-match "[\u203f]" "\u3091"))
$1 = "\u3091"
scheme@(guile-user)>
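One plausible reading of the two string-match calls in the session above, offered as a sketch and not as a confirmed account of Guile's internals: if the conversion to the C locale's encoding replaces every unrepresentable character with `?` (an assumption about the conversion strategy, not something stated in this thread), then libc sees the pattern `?` in the first call and `[?]` in the second, both matched against the subject `?`. The C fragment below hands such byte strings directly to the POSIX regcomp/regexec functions. The helper name `try_match` is made up for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile PATTERN as a POSIX extended regex and
   match it against TEXT.  Returns the regcomp() error code if
   compilation fails, otherwise the regexec() result (0 on a match,
   REG_NOMATCH otherwise).  */
static int try_match(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0)
        return rc;              /* e.g. REG_BADRPT for a dangling `?' */

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc;
}
```

With glibc, `try_match("?", "?")` should return REG_BADRPT, the error whose message is "Invalid preceding regular expression", while `try_match("[?]", "?")` returns 0. That would be consistent with the first call failing and the second one matching.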
Information forwarded to bug-guile <at> gnu.org: bug#57507; Package guile. (Mon, 05 Sep 2022 07:49:01 GMT)
Message #11 received at 57507 <at> debbugs.gnu.org (full text, mbox):
Hi Jean,
Jean Abou Samra <jean <at> abou-samra.fr> writes:
> Regular expressions do funky things with Unicode if a non-Unicode-aware
> locale is set. Yet, they're purely string operations, so I don't think
> it's expected that they depend on the locale encoding.
This is the expected behavior: first because (ice-9 regex) is
implemented in terms of the libc regex functions, as Dale put it (but that
could be thought of as an implementation detail), and second because things
such as character classes are necessarily locale-dependent (this has
bitten us in the past, for instance with <https://bugs.gnu.org/35785>).
I hope that makes sense.
Thanks,
Ludo’.
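To make the point about locale-dependent character classes concrete, here is a small C sketch (not taken from Guile's sources) that asks libc whether a string contains a match for `[[:alpha:]]` under a given locale. The function name `alpha_in_locale` and the choice of byte 0xE9 (é in ISO-8859-1) are illustrative assumptions, and the Latin-1 locale name mentioned afterwards may not be installed on a given system.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switch the process to LOCALE_NAME, then test
   whether TEXT contains a match for the class [[:alpha:]].
   Returns 1 on a match, 0 on no match, and -1 if the locale is not
   available or the pattern fails to compile.  */
static int alpha_in_locale(const char *locale_name, const char *text)
{
    regex_t re;
    int rc;

    assert(locale_name != NULL && text != NULL);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;              /* locale not installed */

    /* The class is interpreted according to the locale in effect. */
    if (regcomp(&re, "[[:alpha:]]", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

In the C locale, `alpha_in_locale("C", "\xe9")` returns 0 because that byte is not alphabetic there. On a system that provides an ISO-8859-1 locale (for example one named `fr_FR.ISO-8859-1`, which is an assumption about the installed locales), the same call with that locale name would be expected to return 1.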
Information forwarded to bug-guile <at> gnu.org: bug#57507; Package guile. (Mon, 05 Sep 2022 18:40:01 GMT)
Message #14 received at 57507 <at> debbugs.gnu.org (full text, mbox):
On 05/09/2022 at 09:48, Ludovic Courtès wrote:
> Hi Jean,
>
> Jean Abou Samra <jean <at> abou-samra.fr> writes:
>
>> Regular expressions do funky things with Unicode if a non-Unicode-aware
>> locale is set. Yet, they're purely string operations, so I don't think
>> it's expected that they depend on the locale encoding.
> This is the expected behavior: first because (ice-9 regex) is
> implemented in terms of the libc regex functions, as Dale put it (but that
> could be thought of as an implementation detail), and second because things
> such as character classes are necessarily locale-dependent (this has
> bitten us in the past, for instance with <https://bugs.gnu.org/35785>).
>
> I hope that makes sense.
OK, thanks, but in this case, it should be clearly stated as a limitation
in the (ice-9 regex) documentation IMHO. If you don't know what constraints
there are on the implementation, there is no reason to expect this. Would it
help if I submitted a patch for that?
Information forwarded to bug-guile <at> gnu.org: bug#57507; Package guile. (Mon, 05 Sep 2022 19:25:01 GMT)
Message #17 received at 57507 <at> debbugs.gnu.org (full text, mbox):
Hi,
Jean Abou Samra <jean <at> abou-samra.fr> writes:
> On 05/09/2022 at 09:48, Ludovic Courtès wrote:
>> Hi Jean,
>>
>> Jean Abou Samra <jean <at> abou-samra.fr> writes:
>>
>>> Regular expressions do funky things with Unicode if a non-Unicode-aware
>>> locale is set. Yet, they're purely string operations, so I don't think
>>> it's expected that they depend on the locale encoding.
>> This is the expected behavior: first because (ice-9 regex) is
>> implemented in terms of the libc regex functions, as Dale put it (but that
>> could be thought of as an implementation detail), and second because things
>> such as character classes are necessarily locale-dependent (this has
>> bitten us in the past, for instance with <https://bugs.gnu.org/35785>).
>>
>> I hope that makes sense.
>
> OK, thanks, but in this case, it should be clearly stated as a limitation
> in the (ice-9 regex) documentation IMHO. If you don't know what constraints
> there are on the implementation, there is no reason to expect this. Would it
> help if I submitted a patch for that?
Yes, that’d be welcome. I would not call it a constraint or limitation;
for example, that ‘w’ is not a letter in Swedish is the kind of thing
you’d generally want to take into account. Now, it’d be nice if one
could easily specify the locale to operate under, with an API similar to
that of (ice-9 i18n) and its first-class locale objects.
Thanks,
Ludo’.
Information forwarded to bug-guile <at> gnu.org: bug#57507; Package guile. (Thu, 17 Nov 2022 20:35:01 GMT)
Message #20 received at 57507 <at> debbugs.gnu.org (full text, mbox):
On 05/09/2022 at 21:24, Ludovic Courtès wrote:
> Yes, that’d be welcome. I would not call it a constraint or limitation;
> for example, that ‘w’ is not a letter in Swedish is the kind of thing
> you’d generally want to take into account. Now, it’d be nice if one
> could easily specify the locale to operate under, with an API similar to
> that of (ice-9 i18n) and its first-class locale objects.
Sorry that it took me forever to send this.
From c666ca4f72dc0a00d28b8d7ef1221ebfc9741551 Mon Sep 17 00:00:00 2001
From: Jean Abou Samra <jean <at> abou-samra.fr>
Date: Thu, 17 Nov 2022 21:26:07 +0100
Subject: [PATCH] Doc: clarification on regexes and encodings
* doc/ref/api-regex.texi: make it more obviously clear that regexp
matching supports only characters supported by the locale encoding.
---
doc/ref/api-regex.texi | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
index b14c2b39c..bd1f4079d 100644
--- a/doc/ref/api-regex.texi
+++ b/doc/ref/api-regex.texi
@@ -57,7 +57,11 @@ locale's encoding, and then passed to the C library's regular expression
routines (@pxref{Regular Expressions,,, libc, The GNU C Library
Reference Manual}). The returned match structures always point to
characters in the strings, not to individual bytes, even in the case of
-multi-byte encodings.
+multi-byte encodings. This ensures that the match structures are
+correct when performing matching with characters that have a multi-byte
+representation in the locale encoding. Note, however, that using
+characters which cannot be represented in the locale encoding can lead
+to surprising results.
@deffn {Scheme Procedure} string-match pattern str [start]
Compile the string @var{pattern} into a regular expression and compare
--
2.38.1
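The documentation text in the patch says that match structures point to characters, not bytes. libc's regexec() reports byte offsets (rm_so and rm_eo), so some translation is needed whenever the locale encoding is multi-byte. The following is a minimal sketch of such a translation using mbrlen(). It is not Guile's actual code, and the function name `char_offset` and its interface are invented for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: under LOCALE_NAME, count how many characters
   are encoded by the first BYTE_OFF bytes of S.  Returns -1 if the
   locale is unavailable or the bytes are not valid in its encoding.  */
static long char_offset(const char *locale_name, const char *s,
                        size_t byte_off)
{
    mbstate_t state;
    size_t pos = 0;
    long chars = 0;

    assert(locale_name != NULL && s != NULL);

    if (setlocale(LC_ALL, locale_name) == NULL)
        return -1;              /* locale not installed */

    memset(&state, 0, sizeof state);
    while (pos < byte_off) {
        /* Length in bytes of the next character in the locale encoding. */
        size_t n = mbrlen(s + pos, byte_off - pos, &state);
        if (n == (size_t) -1 || n == (size_t) -2)
            return -1;          /* invalid or truncated sequence */
        if (n == 0)
            n = 1;              /* an embedded NUL is one character */
        pos += n;
        chars++;
    }
    return chars;
}
```

For instance, in a UTF-8 locale the three bytes that encode U+3091 count as one character, so a byte offset of 3 into a string beginning with that character would become a character offset of 1. The locale name to pass (such as `C.UTF-8`) depends on what the system provides.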
This bug report was last modified 2 years and 208 days ago.