#35785 - 'string->uri' fails in sv_SE locale

GNU bug report logs - #35785
'string->uri' fails in sv_SE locale

Package: guile

Reported by: Einar Largenius <einar.largenius <at> gmail.com>

Date: Fri, 17 May 2019 21:21:01 UTC

Severity: important

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.


From: Timothy Sample <samplet <at> ngyro.com>
To: Ricardo Wurmus <rekado <at> elephly.net>
Cc: 35785 <at> debbugs.gnu.org, Ludovic Courtès <ludo <at> gnu.org>, Einar Largenius <einar.largenius <at> gmail.com>
Subject: bug#35785: ‘string->uri’ is locale-dependent and breaks in ‘sv_SE’
Date: Mon, 27 May 2019 09:39:03 -0400

Hello,

Ricardo Wurmus <rekado <at> elephly.net> writes:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>
>> Using the “lower” regexp class instead of “[a-z]” works:
>>
>> --8<---------------cut here---------------start------------->8---
>> scheme@(guile-user)> (string-match "[[:lower:]]" "w")
>> $12 = #("w" (0 . 1))
>> --8<---------------cut here---------------end--------------->8---
>>
>> However, it’s not clear to me whether the “lower” class is supposed to
>> be the same for all locales or if we’re just lucky:
>>
>>   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html
>>
>> Thoughts?
>
> The lower class is much larger than [a-z].  If we only wanted to work
> around this particular problem we could explicitly spell out the range,
> which would be the same in all locales.  (Obviously, that wouldn’t be
> pretty.)

I think that explicitly spelling out the range is the right thing to do
here.  The POSIX spec says that character ranges work in the POSIX
locale, but “in other locales, a range expression has unspecified
behavior.”

> But can’t URI parts contain more than those characters?

A quick reading of RFC 3986 suggests that the host part of a URI can be
an IP address (version 4 or 6) or a registered name.  It gives the
following rules for registered names:

    reg-name    = *( unreserved / pct-encoded / sub-delims )
    unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
    pct-encoded = "%" HEXDIG HEXDIG
    sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
                / "*" / "+" / "," / ";" / "="

Here, “ALPHA”, “DIGIT”, and “HEXDIG” are specified in RFC 2234, and are
just the ASCII ranges you might expect (except that “HEXDIG” only
allows uppercase letters).  It looks like Guile is currently a little
stricter than this, but pretty close (if you take the character ranges
to mean ASCII ranges).

> To circumvent
> the question whether the lower class is locale dependent we could
> generate an explicit range from a charset.

I think this is the right approach.  Using “[:lower:]” would allow
things outside of the RFC, like ‘é’.  Adding support for
internationalized domain names using Punycode would be cool, but well
outside the scope of this bug.  :)


-- Tim
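
The explicit-range idea from the message above could look roughly like the following. This is only a minimal sketch, not the fix that was applied to Guile: the names `char-set->bracket`, `ascii-lower`, and `lower-bracket` are hypothetical, and the helper assumes the character set contains no characters that are special inside a bracket expression (such as "]", "^", or "-").

```scheme
(use-modules (ice-9 regex)     ; string-match
             (srfi srfi-14))   ; character sets

;; Hypothetical helper: list every member of CS inside a bracket
;; expression, so the regexp contains no range such as "a-z" whose
;; meaning could depend on the locale.  Assumes CS has no characters
;; that are special inside brackets ("]", "^", "-").
(define (char-set->bracket cs)
  (string-append "[" (list->string (char-set->list cs)) "]"))

;; The 26 ASCII lowercase letters, derived from standard SRFI-14 sets.
(define ascii-lower
  (char-set-intersection char-set:ascii char-set:lower-case))

;; A bracket expression naming each of the 26 letters individually.
(define lower-bracket (char-set->bracket ascii-lower))

;; "w" is listed explicitly, so it matches; "W" is not in the set.
(display (if (string-match lower-bracket "w") "match" "no match"))
(newline)
(display (if (string-match lower-bracket "W") "match" "no match"))
(newline)
```

Because every letter is named individually, the pattern does not rely on how the current locale orders characters, which is the part POSIX leaves unspecified for range expressions outside the POSIX locale.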

This bug report was last modified 6 years and 72 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
