GNU bug report logs - #57507
Regular expression matching depends on locale encoding

Package: guile

Reported by: Jean Abou Samra <jean <at> abou-samra.fr>

Date: Wed, 31 Aug 2022 16:55:02 UTC

Severity: normal


From: Jean Abou Samra <jean <at> abou-samra.fr>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 57507 <at> debbugs.gnu.org
Subject: bug#57507: Regular expression matching depends on locale encoding
Date: Thu, 17 Nov 2022 21:33:42 +0100
On 05/09/2022 at 21:24, Ludovic Courtès wrote:
> Yes, that’d be welcome.  I would not call it a constraint or limitation;
> for example, that ‘w’ is not a letter in Swedish is the kind of thing
> you’d generally want to take into account.  Now, it’d be nice if one
> could easily specify the locale to operate under, with an API similar to
> that of (ice-9 i18n) and its first-class locale objects.



Sorry that it took me forever to send this.



From c666ca4f72dc0a00d28b8d7ef1221ebfc9741551 Mon Sep 17 00:00:00 2001
From: Jean Abou Samra <jean <at> abou-samra.fr>
Date: Thu, 17 Nov 2022 21:26:07 +0100
Subject: [PATCH] Doc: clarification on regexes and encodings

* doc/ref/api-regex.texi: State more clearly that regexp matching
  supports only characters representable in the locale encoding.
---
 doc/ref/api-regex.texi | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/doc/ref/api-regex.texi b/doc/ref/api-regex.texi
index b14c2b39c..bd1f4079d 100644
--- a/doc/ref/api-regex.texi
+++ b/doc/ref/api-regex.texi
@@ -57,7 +57,11 @@ locale's encoding, and then passed to the C library's regular expression
 routines (@pxref{Regular Expressions,,, libc, The GNU C Library
 Reference Manual}).  The returned match structures always point to
 characters in the strings, not to individual bytes, even in the case of
-multi-byte encodings.
+multi-byte encodings.  This ensures that the match structures are
+correct when performing matching with characters that have a multi-byte
+representation in the locale encoding.  Note, however, that using
+characters which cannot be represented in the locale encoding can lead
+to surprising results.

 @deffn {Scheme Procedure} string-match pattern str [start]
 Compile the string @var{pattern} into a regular expression and compare
-- 
2.38.1
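
For readers who want to see the byte-versus-character distinction that the
documentation text describes, here is a minimal sketch in C. It is not
Guile's implementation: the helper names utf8_chars_before and match_start
are hypothetical, the input is assumed to be valid UTF-8, and setlocale is
deliberately not called, so the POSIX regex routines run in the default "C"
locale and report byte offsets. The sketch converts the byte offset of a
match into a character index by counting UTF-8 lead bytes.

```c
#include <regex.h>
#include <stddef.h>

/* Count the UTF-8 characters contained in the first nbytes bytes of s.
   Every byte that is not a continuation byte (10xxxxxx) starts a new
   character.  Assumes s holds valid UTF-8. */
static size_t utf8_chars_before(const char *s, size_t nbytes)
{
    size_t count = 0;
    for (size_t i = 0; i < nbytes; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}

/* Hypothetical helper: find the first match of pattern in text.
   On success, store the byte offset reported by regexec and the
   corresponding character index, then return 0.  Return -1 if the
   pattern does not compile or nothing matches. */
static int match_start(const char *pattern, const char *text,
                       size_t *byte_off, size_t *char_idx)
{
    regex_t re;
    regmatch_t m;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    int rc = regexec(&re, text, 1, &m, 0);
    regfree(&re);
    if (rc != 0)
        return -1;

    /* regexec reports offsets in bytes, not characters. */
    *byte_off = (size_t)m.rm_so;
    *char_idx = utf8_chars_before(text, *byte_off);
    return 0;
}
```

Calling match_start with the pattern "b" on a string holding a two-byte
UTF-8 character followed by "b" gives a byte offset one larger than the
character index. A caller that used the byte offset as a string index
would point past the intended character, which is the kind of mismatch the
returned match structures are documented to avoid.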


