GNU bug report logs - #35785
'string->uri' fails in sv_SE locale


Package: guile;

Reported by: Einar Largenius <einar.largenius <at> gmail.com>

Date: Fri, 17 May 2019 21:21:01 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.


From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: Ricardo Wurmus <rekado <at> elephly.net>, 35785 <at> debbugs.gnu.org, Einar Largenius <einar.largenius <at> gmail.com>
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Sun, 02 Jun 2019 20:39:16 -0400
Hi,

Ludovic Courtès <ludo <at> gnu.org> writes:

> Hi Timothy,
>
> Timothy Sample <samplet <at> ngyro.com> skribis:
>
>> A quick reading of RFC 3986 suggests that the host part of a URI can be
>> an IP address (version 4 or 6) or a registered name.  It gives the
>> following rules for registered names:
>>
>> reg-name      = *( unreserved / pct-encoded / sub-delims )
>> unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
>> pct-encoded   = "%" HEXDIG HEXDIG
>> sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
>>               / "*" / "+" / "," / ";" / "="
>>
>> Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
>> just the ASCII ranges you might expect (except that “HEXDIG” only
>> allows uppercase letters).
>
> Do you think you could turn that into a patch for Guile?  I’d happily
> apply it.  :-)
>
> It looks like both [[:alnum:]] & co. and ranges would be
> locale-dependent, so my understanding is that we’ll have to list all the
> characters explicitly, right?

Here’s a patch for Guile that uses explicit lists of characters in the
‘(web uri)’ module instead of character ranges.  It includes two tests
that are pretty verbose, but seem to do the trick.
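
To illustrate the idea outside of Guile, here is a minimal sketch in Python of the “reg-name” rule quoted above, with every allowed character listed explicitly.  This is not the code from the patch: the names (‘char_class’, ‘is_reg_name’, and the constants) are made up for the example, and accepting lowercase hex digits in percent-escapes is a choice of this sketch.

```python
import re
import string

# ASCII-only character sets, spelled out rather than written as ranges.
ALPHA = string.ascii_letters             # a-z and A-Z, ASCII only
DIGIT = string.digits                    # 0-9
HEXDIG = string.digits + "ABCDEFabcdef"  # lowercase accepted: sketch's choice
UNRESERVED = ALPHA + DIGIT + "-._~"
SUB_DELIMS = "!$&'()*+,;="


def char_class(chars):
    """Build a bracket expression that lists each character explicitly."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"


PCT_ENCODED = "%" + char_class(HEXDIG) + "{2}"
REG_NAME = re.compile(
    "(?:" + char_class(UNRESERVED + SUB_DELIMS) + "|" + PCT_ENCODED + ")*"
)


def is_reg_name(host):
    """Return True if HOST matches the reg-name rule as sketched here."""
    return REG_NAME.fullmatch(host) is not None
```

With these definitions, ‘is_reg_name("www.gnu.org")’ holds while ‘is_reg_name("exåmple.se")’ does not, even though Python's ‘"å".isalpha()’ is true.  That is the same distinction the explicit lists are meant to enforce: the set of accepted letters is fixed by the code, not by a character class whose meaning comes from elsewhere.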

I have a bit more background on the problem, mostly coming from a Glibc
bug report: <https://sourceware.org/bugzilla/show_bug.cgi?id=23393>.

It turns out that the problem is well known upstream, and avoiding character
ranges is the recommended approach for now.  Some other GNU tools have
adopted what is being called the “Rational Range Interpretation”
<https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html>.
AIUI, this means they use the underlying encoding numbers for ranges (I
checked the source, but I’m only mostly sure I read it right).  It looks
like the Glibc folks are unsure how to proceed on this (but are maybe
slightly leaning towards the “rational” approach).
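
As a rough illustration of the difference (again in Python, and again only a sketch), compare a range check based on encoding numbers with one based on a collation order.  The collation order below is a toy one invented for the example, in which upper and lower case interleave; real locale data is more involved and differs between locales.

```python
import string


def in_range_by_codepoint(ch, lo, hi):
    """'Rational' reading: compare the characters' encoding numbers."""
    return ord(lo) <= ord(ch) <= ord(hi)


# Toy collation order "aAbBcC...zZ" (hypothetical, for illustration only).
TOY_COLLATION = "".join(c + c.upper() for c in string.ascii_lowercase)


def in_range_by_collation(ch, lo, hi, order=TOY_COLLATION):
    """Collation reading: compare positions in the given ordering."""
    return order.index(lo) <= order.index(ch) <= order.index(hi)
```

Under the code point reading, "B" is outside the range from "a" to "z".  Under the toy collation it is inside, while "Z" is not, because it sorts after "z".  Surprises of that kind are why listing the characters explicitly is the safer choice for now.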

It’s all a pretty big mess, really.  I was hoping there would be some
obvious thing that would fix the problem more generally.  Short of
pulling in the Gnulib regex code or writing something in Scheme, it
looks like Guile is stuck where it is now.

I’m unsure if the changes are considered “trivial” from a copyright
perspective.  It’s pretty close, but I think programmers tend to
underestimate here.  I’ve started the FSF copyright assignment process
either way, since this is likely not my last Guile patch.  :)


-- Tim

[0001-Make-URI-handling-locale-independent.patch (text/x-patch, attachment)]

This bug report was last modified 6 years and 72 days ago.


