GNU bug report logs - #20823: argv mangled by locale
To reply to this bug, email your comments to 20823 AT debbugs.gnu.org.
Report forwarded to bug-guile <at> gnu.org: bug#20823; Package guile. (Tue, 16 Jun 2015 04:34:02 GMT)
Acknowledgement sent to Zefram <zefram <at> fysh.org>: New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Tue, 16 Jun 2015 04:34:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
When guile-2.0 stores argv for later access via program-arguments,
it sometimes decodes the underlying octet string according to the
nominal character encoding of the locale suggested by the environment.
This is a problem, because the arguments are not necessarily encoded
that way, and may not even be encodings of character strings at all.
The decoding is lossy, where the octet string isn't consistent with the
character encoding, so the original octet string cannot be recovered
from the mangled form. I don't see any Scheme interface that reliably
retrieves the command line arguments without locale decoding.
The decoding doesn't follow the usual rules for locale control. It is
not at all sensitive to setlocale, which is understandable due to the
arguments being acquired before any of the actual program's code runs.
Empirically, if the environment nominates no locale, "POSIX", or a
non-existent locale, then argv is decoded according to ISO-8859-1, thus
preserving the octets. If the environment nominates an extant locale
other than "POSIX", then argv is decoded according to that locale's
nominal character encoding.
Demos:
$ env - guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on'
(76 195 169 111 110)
$ env - LANG=C guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on'
(76 63 63 111 110)
$ env - LANG=de_DE.utf8 guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on'
(76 233 111 110)
$ env - LANG=de_DE.iso88591 guile-2.0 -c '(write (map char->integer (string->list (cadr (program-arguments))))) (newline)' $'L\xc3\xa9on'
(76 195 169 111 110)
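The outputs above correspond to three different decodings of the same five octets. A minimal sketch, in Python purely to illustrate the byte-level behaviour (it is not Guile's implementation, and the helper names are invented for this example), shows why only the ISO-8859-1 result can be turned back into the original argument. The '?' substitution is an assumed model of what the LANG=C demo shows.

```python
# The argument used in the demos: "L", 0xC3, 0xA9, "o", "n".
RAW = b"L\xc3\xa9on"

def code_points(text):
    """Return the code points of a string, like (map char->integer ...)."""
    return [ord(ch) for ch in text]

def decode_ascii_with_question_marks(octets):
    """Assumed model of the LANG=C result: each non-ASCII octet becomes '?'."""
    return "".join(chr(b) if b < 0x80 else "?" for b in octets)

latin1 = RAW.decode("iso-8859-1")   # one character per octet
utf8 = RAW.decode("utf-8")          # 0xC3 0xA9 becomes the single character U+00E9
ascii_q = decode_ascii_with_question_marks(RAW)

print(code_points(latin1))   # same numbers as the no-locale and iso88591 demos
print(code_points(ascii_q))  # same numbers as the LANG=C demo
print(code_points(utf8))     # same numbers as the de_DE.utf8 demo

# ISO-8859-1 maps every octet to a character, so re-encoding restores the input.
print(latin1.encode("iso-8859-1") == RAW)
# After '?' substitution the two original octets are gone for good.
print(ascii_q.encode("ascii") == RAW)
```

The UTF-8 decoding happens to be reversible for this particular argument because the octets are valid UTF-8. An argument that is not valid UTF-8 would not decode cleanly, which is the lossy case described above.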
The actual data passed between processes is an octet string, and
there really needs to be some reliable way to access that octet string.
My comments about resolution in bug#20822 "environment mangled by locale"
mostly apply here too, with a slight change: it seems necessary to store
the original octet strings and decode at the time program-arguments is
called. With that change, the decoding can be responsive to setlocale
(and in particular can reliably use ISO-8859-1 in the absence of
setlocale).
-zefram
Information forwarded to bug-guile <at> gnu.org: bug#20823; Package guile. (Fri, 04 Mar 2016 23:25:01 GMT)
Message #8 received at 20823 <at> debbugs.gnu.org:
I wrote:
>My comments about resolution in bug#20822 "environment mangled by locale"
>mostly apply here too,
The revised comments that I have just made on that ticket also apply
here. Short version: "absence of setlocale" isn't a useful criterion,
so explicit control of encoding will be necessary.
-zefram
Information forwarded to bug-guile <at> gnu.org: bug#20823; Package guile. (Fri, 24 Jun 2016 06:12:01 GMT)
Message #11 received at 20823 <at> debbugs.gnu.org:
On Tue 16 Jun 2015 06:33, Zefram <zefram <at> fysh.org> writes:
> I don't see any Scheme interface that reliably retrieves the command
> line arguments without locale decoding.
[...]
> The actual data passed between processes is an octet string, and
> there really needs to be some reliable way to access that octet string.
> My comments about resolution in bug#20822 "environment mangled by locale"
> mostly apply here too, with a slight change: it seems necessary to store
> the original octet strings and decode at the time program-arguments is
> called. With that change, the decoding can be responsive to setlocale
> (and in particular can reliably use ISO-8859-1 in the absence of
> setlocale).
Proposal: scm_i_set_boot_program_arguments just copies the bytes, and
scm_program_arguments decodes them. I don't know whether to save the
locale that was current at program start and use that locale to decode
the arguments, or default to the current locale, or what. I also don't
know whether to supply an optional "encoding" argument, and use that
encoding to decode the command line arguments. Thoughts, Mark and
Ludovic?
Andy
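The shape of this proposal (copy the bytes at boot, decode only when asked) can be sketched as follows. This is an illustration in Python, not libguile code: the class and method names are hypothetical, and the ISO-8859-1 default is an assumption made for the example, since the thread leaves the choice of default open.

```python
class BootArguments:
    """Hypothetical stand-in for the proposed split: store octets, decode later."""

    def __init__(self, raw_args):
        # Copy the octets untouched, as proposed for
        # scm_i_set_boot_program_arguments; nothing is decoded here.
        self._raw = [bytes(arg) for arg in raw_args]

    def raw(self):
        """Return the original octet strings (the accessor the report asks for)."""
        return list(self._raw)

    def decoded(self, encoding="iso-8859-1", errors="strict"):
        """Decode on demand; the optional encoding mirrors the proposed argument.

        The default encoding is an assumption for this sketch only.
        """
        return [arg.decode(encoding, errors) for arg in self._raw]


args = BootArguments([b"guile", b"L\xc3\xa9on"])
print(args.raw())             # the original octets are always recoverable
print(args.decoded())         # decoded late, one character per octet
print(args.decoded("utf-8"))  # the same octets under a caller-chosen encoding
```

Because decoding happens at call time, the same stored octets can be read under any encoding, and a bad choice never destroys the original data.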
Information forwarded to bug-guile <at> gnu.org: bug#20823; Package guile. (Fri, 24 Jun 2016 08:43:02 GMT)
Message #14 received at 20823 <at> debbugs.gnu.org:
Andy Wingo wrote:
> I also don't
>know whether to supply an optional "encoding" argument, and use that
>encoding to decode the command line arguments.
That, or something that just retrieves octets, is necessary. Decoding via
the selected locale does not suffice, because there's no guarantee that
there'll be a locale with a cooperative encoding.
-zefram
Information forwarded to bug-guile <at> gnu.org: bug#20823; Package guile. (Sun, 14 Aug 2016 21:37:01 GMT)
Message #17 received at 20823 <at> debbugs.gnu.org:
Andy Wingo wrote:
> I also don't
>know whether to supply an optional "encoding" argument, and use that
>encoding to decode the command line arguments.
If you don't fancy the profusion of extra "encoding" parameters on
argv access (this ticket), environment access (bug#20822), and all
sorts of syscalls (bug#22913), you could bundle them all together in
a fluid. This would be a bit like the %default-port-encoding fluid,
but setlocale should absolutely not modify it. It should follow the
scheme that I laid out in bug#24186: its value can be either a string
naming an encoding, or #:locale-at-io meaning that whenever encoding
is required the currently selected locale is consulted. There should
also be a fluid determining the conversion strategy, like the existing
%default-port-conversion-strategy. These two fluids together would
control the encoding and decoding for all operations that currently
apply the locale encoding to arbitrary data. (Decoding locale-supplied
messages is a different matter.)
-zefram
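The two-fluid scheme described in this message can be sketched roughly as below. Python context variables stand in for Guile fluids. Every name is hypothetical, and the Python error-handler names ("strict", "replace") are only an approximate stand-in for Guile's conversion strategies. Nothing here is an existing Guile interface.

```python
import contextvars
import locale

# Sentinel playing the role of #:locale-at-io in the message above.
LOCALE_AT_IO = object()

# Hypothetical analogues of the two proposed fluids.
data_encoding = contextvars.ContextVar("data_encoding", default=LOCALE_AT_IO)
conversion_strategy = contextvars.ContextVar("conversion_strategy", default="strict")

def effective_encoding():
    """Resolve the encoding at the moment it is needed."""
    encoding = data_encoding.get()
    if encoding is LOCALE_AT_IO:
        # Consult whichever locale is selected right now, not at startup.
        return locale.getpreferredencoding(False)
    return encoding

def decode_argument(octets):
    """Decode one argument using the two settings currently in effect."""
    return octets.decode(effective_encoding(), conversion_strategy.get())

def with_encoding(encoding, strategy, thunk):
    """Run thunk with both settings bound, loosely like with-fluids."""
    def run():
        data_encoding.set(encoding)
        conversion_strategy.set(strategy)
        return thunk()
    # Changes made inside a copied context do not leak back to the caller.
    return contextvars.copy_context().run(run)

text = with_encoding("iso-8859-1", "strict",
                     lambda: decode_argument(b"L\xc3\xa9on"))
print([ord(ch) for ch in text])
```

Changing the locale never touches either setting in this sketch. The locale is consulted only when the encoding is left at the sentinel value, which matches the requirement that setlocale must not modify the fluid.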
This bug report was last modified 8 years and 304 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.