GNU bug report logs -
#57507
Regular expression matching depends on locale encoding
Message #17 received at 57507 <at> debbugs.gnu.org:
Hi,
Jean Abou Samra <jean <at> abou-samra.fr> writes:
> On 05/09/2022 at 09:48, Ludovic Courtès wrote:
>> Hi Jean,
>>
>> Jean Abou Samra <jean <at> abou-samra.fr> writes:
>>
>>> Regular expressions do funky things with Unicode if a non-Unicode-aware
>>> locale is set. Yet, they're purely string operations, so I don't think
>>> it's expected that they depend on the locale encoding.
>> This is the expected behavior: first because (ice-9 regex) is
>> implemented in terms of the libc regex functions, as Dale put it (but that
>> could be thought of as an implementation detail), and second because things
>> such as character classes are necessarily locale-dependent (this has
>> bitten us in the past, for instance with <https://bugs.gnu.org/35785>).
>>
>> I hope that makes sense.
>
> OK, thanks, but in this case, it should be clearly stated as a limitation
> in the (ice-9 regex) documentation IMHO. If you don't know what constraints
> there are on the implementation, there is no reason to expect this. Would it
> help if I submitted a patch for that?
Yes, that’d be welcome. I would not call it a constraint or limitation;
for example, that ‘w’ is not a letter in Swedish is the kind of thing
you’d generally want to take into account. Now, it’d be nice if one
could easily specify the locale to operate under, with an API similar to
that of (ice-9 i18n) and its first-class locale objects.
Thanks,
Ludo’.
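
To illustrate the locale dependence discussed above, here is a minimal C sketch that calls the libc regex functions directly, since the message says (ice-9 regex) is implemented in terms of them. The helper names `matches_in_locale` and `print_locale_comparison` are hypothetical, and the locale name "C.UTF-8" is an assumption that may not exist on every system. This is not Guile's actual code, only a way to observe the same underlying behavior.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Guile or libc): returns 1 if the POSIX
   extended regex PATTERN matches TEXT while LOCALE is active, 0 if it does
   not, and -1 if the locale is unavailable or the pattern fails to compile.
   The previous locale is restored before returning. */
int matches_in_locale(const char *locale, const char *pattern, const char *text)
{
    assert(locale != NULL && pattern != NULL && text != NULL);

    /* setlocale() may reuse its return buffer, so copy the current name. */
    const char *current = setlocale(LC_ALL, NULL);
    if (current == NULL)
        return -1;
    char *saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    int result = -1;
    if (setlocale(LC_ALL, locale) != NULL) {
        regex_t re;
        if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) == 0) {
            result = (regexec(&re, text, 0, NULL, 0) == 0) ? 1 : 0;
            regfree(&re);
        }
    }

    setlocale(LC_ALL, saved);
    free(saved);
    return result;
}

/* Prints how two patterns behave on "é" (encoded as UTF-8) per locale. */
void print_locale_comparison(void)
{
    const char *e_acute = "\xc3\xa9";           /* U+00E9 as two UTF-8 bytes */
    const char *locales[] = { "C", "C.UTF-8" }; /* assumed locale names */
    const char *patterns[] = { "^.$", "^[[:alpha:]]$" };

    for (size_t i = 0; i < 2; i++)
        for (size_t j = 0; j < 2; j++)
            printf("%-8s %-16s -> %d\n", locales[i], patterns[j],
                   matches_in_locale(locales[i], patterns[j], e_acute));
}
```

Calling `print_locale_comparison()` from a `main` function should show the same byte string matching differently depending on the active locale. In the "C" locale the two bytes of "é" are treated as two separate characters, so `^.$` is not expected to match. Because `setlocale` changes process-wide state, this approach is not thread-safe, which is one reason a first-class locale argument like that of (ice-9 i18n) would be more convenient.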
This bug report was last modified 2 years and 209 days ago.