GNU bug report logs - #39970
guix commands broken on Azerbaijani 'az_AZ' and Turkish 'tr_TR' locales
To reply to this bug, email your comments to 39970 AT debbugs.gnu.org.
Report forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Sat, 07 Mar 2020 12:02:02 GMT)
Full text and rfc822 format available.
Acknowledgement sent to "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>: New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Sat, 07 Mar 2020 12:02:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
After running
export LC_ALL=tr_TR.utf8
many important Guix commands like 'guix environment', 'guix install'
and 'guix pull' fail.
$ guix environment --ad-hoc hello
Backtrace:
1 (primitive-load "/home/florian/.config/guix/current/bin…")
In guix/ui.scm:
1826:12 0 (run-guix-command _ . _)
guix/ui.scm:1826:12: In procedure run-guix-command:
In procedure string-length: Wrong type argument in position 1 (expecting string): #f
Running guix via ./pre-inst-env gives a more useful backtrace. The
reason is that in guix/store.scm
(use-modules (ice-9 regex))
(regexp-exec (make-regexp "^/gnu/store/([0-9a-df-np-sv-z]{32})-([^/]+)$")
             "/gnu/store/bv9py3f2dsa5iw0aijqjv9zxwprcy1nb-fontconfig-2.13.1.drv")
evaluates to #f in Turkish, possibly because of the presence of
dotless i (ı) in the range.
The attached patch fixes the issue by including i explicitly, but I
believe enumerating all of [0-9abcdfghijklmnpqrsvwxyz] explicitly
might be more future-proof.
Shall I push the patch modified to list all letters in
[0-9abcdfghijklmnpqrsvwxyz] explicitly? Numbers too? I suppose there
is no downside to listing all without ranges.
I wonder what else is affected; the installer maybe? I have not
tested yet.
Regards,
Florian
[0001-store-Fix-many-guix-commands-failing-on-some-locales.patch (text/plain, attachment)]
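The proposal above, listing every allowed character instead of relying on ranges, can be illustrated with the C library's POSIX regexp functions. This is only a sketch under stated assumptions, not the attached patch: the helper name store_item_p is hypothetical, and the character list is the one quoted in the message.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of Guix: return 1 if PATH looks like an
   item directly under /gnu/store, 0 if not, -1 if the regexp fails to
   compile.  The 32-character hash part is matched against an explicit
   list of characters, so no range such as a-d is left for the locale's
   collation order to reinterpret.  */
int
store_item_p (const char *path)
{
  static const char pattern[] =
    "^/gnu/store/([0123456789abcdfghijklmnpqrsvwxyz]{32})-([^/]+)$";
  regex_t re;
  int result;

  assert (path != NULL);
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  result = (regexec (&re, path, 0, NULL, 0) == 0);
  regfree (&re);
  return result;
}
```

Calling store_item_p on the fontconfig derivation path from the report should return 1. The explicit list keeps the match independent of how a locale orders letters, at the cost of a longer pattern.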
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Sat, 07 Mar 2020 15:21:01 GMT)
Message #8 received at 39970 <at> debbugs.gnu.org (full text, mbox):
On Sat, Mar 07, 2020 at 01:00:52PM +0100, pelzflorian (Florian Pelz) wrote:
> Running guix via ./pre-inst-env gives a more useful backtrace. The
> reason is that in guix/store.scm
>
> (use-modules (ice-9 regex))
> (regexp-exec (make-regexp "^/gnu/store/([0-9a-df-np-sv-z]{32})-([^/]+)$")
> "/gnu/store/bv9py3f2dsa5iw0aijqjv9zxwprcy1nb-fontconfig-2.13.1.drv")
>
> evaluates to #f in Turkish, possibly because of the presence of
> dotless i (ı) in the range.
>
Actually it seems the issue is that i is missing from the range [a-z].
ı and ğ are missing as well, as are non-Turkish letters like ä that
are included when using the en_US.utf8 locale, even though they are not
English letters either.
(use-modules (ice-9 regex))
(regexp-exec (make-regexp "^([a-z]+)$")
             "iyiyim")
fails.
But running a glibc C program
florian <at> florianmacbook ~$ cat iyiyim.c
#include <regex.h>
#include <stdio.h>
#define STR "iyiyim"
int main (int argc,
          char** argv)
{
  regex_t only_letters;
  int r = regcomp (&only_letters, "[a-z]", 0);
  if (r != 0)
    printf ("This error does not happen.\n");
  r = regexec (&only_letters, STR, 0, NULL, 0);
  if (r == 0)
    printf ("The string " STR " matched!\n");
  else
    printf ("No match for " STR ".\n");
}
florian <at> florianmacbook ~$ gcc -o iyiyim iyiyim.c
florian <at> florianmacbook ~$ LANG=tr_TR.utf8 ./iyiyim
The string iyiyim matched!
succeeds on tr_TR.utf8 and en_US.utf8 locales (and a native Turkish
speaker confirmed to me that ı and i should be in the alphabet right after h).
Maybe this is a bug in Guile, somehow?
> […]
> I wonder what else is affected; the installer maybe? I have not
> tested yet.
>
I checked; the graphical installer appears unaffected, but the issue
appears on the installed system.
Regards,
Florian
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Sun, 08 Mar 2020 07:09:02 GMT)
Message #11 received at 39970 <at> debbugs.gnu.org (full text, mbox):
This seems similar to <https://bugs.gnu.org/35785>. I think
enumerating all characters explicitly is a similar fix, whether or not
there is a bug in Guile.
Regards,
Florian
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Mon, 09 Mar 2020 17:03:01 GMT)
Message #14 received at 39970 <at> debbugs.gnu.org (full text, mbox):
Hi Florian,
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
> This seems similar to <https://bugs.gnu.org/35785>.
Yes, same story.
> I think enumerating all characters explicitly is a similar fix,
> whether or not there is a bug in Guile.
To me it’s not a bug in Guile, but simply the fact that regexps, as
implemented by the C library, are locale-dependent.
The patch you proposed looks good to me, though perhaps we could
explicitly list all the alphabet in the regexp?
A better option is to reimplement ‘store-path-package-name’ in a way
similar to ‘store-path-hash-part’, as in commit
35eb77b09d957019b2437e7681bd88013d67d3cd.
Thoughts?
Ludo’.
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Thu, 12 Mar 2020 11:03:02 GMT)
Message #17 received at 39970 <at> debbugs.gnu.org (full text, mbox):
On Mon, Mar 09, 2020 at 06:02:40PM +0100, Ludovic Courtès wrote:
> To me it’s not a bug in Guile, but simply the fact that regexps, as
> implemented by the C library, are locale-dependent.
>
(use-modules (ice-9 regex))
(regexp-exec (make-regexp "^([a-z]+)$")
             "iyiyim")
⇒ #f
Guile’s behavior that i is not among [a-z] has been confirmed as
unexpected by a natively Turkish friend of mine. It is different from
the behavior of current glibc:
florian <at> florianmacbook ~$ cat iyiyim.c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>
#define STR "iyiyım"
int main (int argc,
          char** argv)
{
  regex_t only_letters;
  int r = regcomp (&only_letters, "[a-z]+", REG_EXTENDED);
  if (r != 0)
    printf ("This error does not happen.\n");
  r = regexec (&only_letters, STR, 1, malloc (sizeof (regmatch_t)), 0);
  if (r == 0)
    printf ("The string " STR " matched!\n");
  else
    printf ("No match for " STR ".\n");
}
florian <at> florianmacbook ~$ gcc -o iyiyim iyiyim.c
florian <at> florianmacbook ~$ LANG=tr_TR.utf8 ./iyiyim
The string iyiyım matched!
Apparently Guile uses a bundled regular expression library rather than
glibc. I can try making Guile use a newer GNUlib for its regular
expressions, maybe that helps. Shall I file a separate bug for Guile?
> The patch you proposed looks good to me, though perhaps we could
> explicitly list all the alphabet in the regexp?
>
> A better option is to reimplement ‘store-path-package-name’ in a way
> similar to ‘store-path-hash-part’, as in commit
> 35eb77b09d957019b2437e7681bd88013d67d3cd.
I suppose it would be better to cache the compiled regexp. What is
this mcached syntax inside (guix store)? Or do I use Scheme’s 'delay'
and 'force' for caching?
The attached patch fixes the regexp. Shall I push the attached patch
and then try making it cache the compiled regexp or do you still
prefer an implementation without regexps? Why would not using a
regexp be better?
Regards,
Florian
[0001-store-Fix-many-guix-commands-failing-on-some-locales.patch (text/plain, attachment)]
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Thu, 12 Mar 2020 16:06:01 GMT)
Message #20 received at 39970 <at> debbugs.gnu.org (full text, mbox):
Hi Florian,
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
> On Mon, Mar 09, 2020 at 06:02:40PM +0100, Ludovic Courtès wrote:
>> To me it’s not a bug in Guile, but simply the fact that regexps, as
>> implemented by the C library, are locale-dependent.
>>
>
> (use-modules (ice-9 regex))
> (regexp-exec (make-regexp "^([a-z]+)$")
> "iyiyim")
> ⇒ #f
>
> Guile’s behavior that i is not among [a-z] has been confirmed as
> unexpected by a natively Turkish friend of mine. It is different from
> the behavior of current glibc:
>
> florian <at> florianmacbook ~$ cat iyiyim.c
> #include <regex.h>
> #include <stdio.h>
> #include <stdlib.h>
> #define STR "iyiyım"
> int main (int argc,
> char** argv)
> {
You’re seeing a different behavior because you forgot a:
setlocale (LC_ALL, "");
call here.
>> The patch you proposed looks good to me, though perhaps we could
>> explicitly list all the alphabet in the regexp?
>>
>> A better option is to reimplement ‘store-path-package-name’ in a way
>> similar to ‘store-path-hash-part’, as in commit
>> 35eb77b09d957019b2437e7681bd88013d67d3cd.
>
> I suppose it would be better to cache the compiled regexp. What is
> this mcached syntax inside (guix store)? Or do I use Scheme’s 'delay'
> and 'force' for caching?
I lean towards avoiding regexps altogether, as I wrote above.
WDYT?
> The attached patch fixes the regexp. Shall I push the attached patch
> and then try making it cache the compiled regexp or do you still
> prefer an implementation without regexps? Why would not using a
> regexp be better?
It reduces reliance on libc, reduces complexity, and performs better as
noted in the commit log of 35eb77b09d957019b2437e7681bd88013d67d3cd.
Thanks,
Ludo’.
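To make the point about the missing setlocale call concrete, here is a minimal C sketch of a locale-aware variant of the test. The helper name match_in_locale is hypothetical. The assumption is that the regexp should be compiled only after the locale has been set, so that bracket ranges are read under the requested locale.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: switch to LOCALE_NAME, then test whether STR
   matches the extended regexp PATTERN.  Returns 1 on a match, 0 on no
   match, and -1 if the locale is not installed or PATTERN does not
   compile.  Note that it changes the process-wide locale.  */
int
match_in_locale (const char *locale_name, const char *pattern,
                 const char *str)
{
  regex_t re;
  int result;

  assert (locale_name != NULL && pattern != NULL && str != NULL);

  /* Without this call the process stays in the "C" locale, whatever
     LANG or LC_ALL say.  */
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;

  /* Compile after setlocale so that bracket ranges are interpreted
     under the requested locale.  */
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;
  result = (regexec (&re, str, 0, NULL, 0) == 0);
  regfree (&re);
  return result;
}
```

Passing "" as LOCALE_NAME picks up the locale from the environment, like the setlocale (LC_ALL, "") call mentioned above. The result of match_in_locale ("tr_TR.utf8", "^[a-z]+$", "iyiyim") is deliberately not stated here, since it depends on the installed locales and libc version.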
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Tue, 17 Mar 2020 09:45:01 GMT)
Message #23 received at 39970 <at> debbugs.gnu.org (full text, mbox):
On Thu, Mar 12, 2020 at 05:05:26PM +0100, Ludovic Courtès wrote:
> "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
> > Why would not using a regexp be better?
>
> It reduces reliance on libc, reduces complexity, and performs better as
> noted in the commit log of 35eb77b09d957019b2437e7681bd88013d67d3cd.
Thank you for your wisdom. I hope the attached patch is OK.
`LC_ALL=en_US.utf8 make check` is mostly fine (except tests/pack.scm,
which also failed before).
Manual testing of `./pre-inst-env guix environment` works.
`LC_ALL=tr_TR.utf8 make check` is still very unhappy though.
There are many failures. I will continue to investigate later today.
Regards,
Florian
[0001-store-Fix-many-guix-commands-failing-on-some-locales.patch (text/plain, attachment)]
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Tue, 17 Mar 2020 21:21:02 GMT)
Message #26 received at 39970 <at> debbugs.gnu.org (full text, mbox):
Hi,
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
> On Thu, Mar 12, 2020 at 05:05:26PM +0100, Ludovic Courtès wrote:
>> "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
>> > Why would not using a regexp be better?
>>
>> It reduces reliance on libc, reduces complexity, and performs better as
>> noted in the commit log of 35eb77b09d957019b2437e7681bd88013d67d3cd.
>
> Thank you for your wisdom. I hope the attached patch is OK.
>
> `LC_ALL=en_US.utf8 make check` is mostly fine (except tests/pack.scm,
> which also failed before).
>
> Manual testing of `./pre-inst-env guix environment` works.
Good!
> `LC_ALL=tr_TR.utf8 make check` is still very unhappy though.
> There are many failures. I will continue to investigate later today.
OK.
> From: Florian Pelz <pelzflorian <at> pelzflorian.de>
> Date: Thu, 12 Mar 2020 11:08:16 +0100
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> Subject: [PATCH] store: Fix many guix commands failing on some locales.
>
> Partly fixes bug #39970 (see: https://bugs.gnu.org/39970).
I’d just write:
Partly fixes <https://bugs.gnu.org/39970>.
Concise, clear, greppable. :-)
> At least 'guix environment', 'guix install' and 'guix pull'
> on 'az_AZ.utf8' and 'tr_TR.utf8' were affected.
>
> * guix/store.scm (store-path-hash-part): Move base path detection to ...
> (store-path-base): ... this new exported procedure.
> (store-path-package-name): Use it instead of locale-dependent regexps.
> (store-regexp*): Remove.
LGTM, thank you!
Ludo’.
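For illustration, the regexp-free approach agreed on above could look roughly like the following in C. This is a sketch under assumptions, not the Scheme code from the patch: the store prefix /gnu/store/ and the 32-character hash alphabet are taken from the regexp in the original report, and the C function below is a hypothetical analogue that only borrows the name of the Scheme procedure.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STORE_PREFIX "/gnu/store/"   /* assumed default store directory */
#define HASH_LENGTH 32

/* Hypothetical C analogue, not the Scheme code from the patch: return a
   pointer to the package-name part of PATH (the text after the hash and
   the hyphen), or NULL if PATH is not of the form
   /gnu/store/<32 hash characters>-<name>.  Only byte comparisons are
   used, so the result does not depend on the current locale.  */
const char *
store_path_package_name (const char *path)
{
  static const char alphabet[] = "0123456789abcdfghijklmnpqrsvwxyz";
  size_t prefix_length = strlen (STORE_PREFIX);
  const char *base;
  size_t i;

  assert (path != NULL);
  if (strncmp (path, STORE_PREFIX, prefix_length) != 0)
    return NULL;
  base = path + prefix_length;

  for (i = 0; i < HASH_LENGTH; i++)
    /* strchr would also find the terminating NUL, so reject it first.  */
    if (base[i] == '\0' || strchr (alphabet, base[i]) == NULL)
      return NULL;

  if (base[HASH_LENGTH] != '-')
    return NULL;
  /* Like the ([^/]+)$ part of the old regexp: a non-empty name with no
     further slash.  */
  if (base[HASH_LENGTH + 1] == '\0'
      || strchr (base + HASH_LENGTH + 1, '/') != NULL)
    return NULL;
  return base + HASH_LENGTH + 1;
}
```

For the derivation path from the report, the returned pointer refers to "fontconfig-2.13.1.drv". No regexp is compiled at all, which is the property the discussion above cares about.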
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Wed, 18 Mar 2020 06:48:02 GMT)
Message #29 received at 39970 <at> debbugs.gnu.org (full text, mbox):
On Tue, Mar 17, 2020 at 10:20:01PM +0100, Ludovic Courtès wrote:
> "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
> > `LC_ALL=tr_TR.utf8 make check` is still very unhappy though.
> > There are many failures. I will continue to investigate later today.
>
> OK.
The tests fail due to many other uses of [a-z] in regexps. I will look;
e.g. for guix/import/cran.scm
(if (string-match "^[A-Za-z][^ :]+:( |\n|$)" line)
…)
it would be easier and clearer to just list [a-z] explicitly:
> LGTM, thank you!
:) Pushed as 771c5e155d7862ed91a5d503eecc00c1db1150ad.
Regards,
Florian
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Wed, 18 Mar 2020 08:41:02 GMT)
Message #32 received at 39970 <at> debbugs.gnu.org (full text, mbox):
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
> On Tue, Mar 17, 2020 at 10:20:01PM +0100, Ludovic Courtès wrote:
>> "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
>> > `LC_ALL=tr_TR.utf8 make check` is still very unhappy though.
>> > There are many failures. I will continue to investigate later today.
>>
>> OK.
>
> The tests fail due to many other uses of [a-z] in regexps. I will look;
> e.g. for guix/import/cran.scm
>
> (if (string-match "^[A-Za-z][^ :]+:( |\n|$)" line)
> …)
>
> it would be easier and clearer to just list [a-z] explicitly:
Yes, agreed.
It would be nice if ‘string-match’ & co. could take an optional locale
object (info "(guile) i18n Introduction") but that’s not the case
currently.
Thanks,
Ludo’.
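As a rough illustration of a locale-independent check for the cran.scm case, the regexp "^[A-Za-z][^ :]+:( |\n|$)" can also be written out by hand. The C sketch below is an assumption-laden analogue, not code from Guix: the helper names ascii_letter_p and field_start_p are hypothetical, and the function is only intended to accept the same lines as that regexp.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if C is one of the 52 ASCII letters.  Unlike isalpha, this
   does not consult the current locale.  */
int
ascii_letter_p (char c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper: hand-written counterpart of the regexp
   "^[A-Za-z][^ :]+:( |\n|$)" for a NUL-terminated LINE.  */
int
field_start_p (const char *line)
{
  size_t i = 1;

  assert (line != NULL);
  /* First character: an ASCII letter.  */
  if (!ascii_letter_p (line[0]))
    return 0;
  /* Then at least one character that is neither a space nor a colon.  */
  while (line[i] != '\0' && line[i] != ' ' && line[i] != ':')
    i++;
  if (i < 2 || line[i] != ':')
    return 0;
  /* After the colon: a space, a newline, or the end of the string.  */
  return line[i + 1] == ' ' || line[i + 1] == '\n' || line[i + 1] == '\0';
}
```

A line such as "Package: foo" is accepted, while a line starting with a space is rejected. Because the letter test compares bytes directly, it gives the same answer under tr_TR.utf8 as under any other locale.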
Reply sent to Maxim Cournoyer <maxim.cournoyer <at> gmail.com>: You have taken responsibility. (Wed, 05 May 2021 04:48:02 GMT)
Notification sent to "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de>: bug acknowledged by developer. (Wed, 05 May 2021 04:48:02 GMT)
Message #37 received at 39970-done <at> debbugs.gnu.org (full text, mbox):
"pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> writes:
> On Tue, Mar 17, 2020 at 10:20:01PM +0100, Ludovic Courtès wrote:
>> "pelzflorian (Florian Pelz)" <pelzflorian <at> pelzflorian.de> skribis:
>> > `LC_ALL=tr_TR.utf8 make check` is still very unhappy though.
>> > There are many failures. I will continue to investigate later today.
>>
>> OK.
>
> The tests fail due to many other uses of [a-z] in regexps. I will look;
> e.g. for guix/import/cran.scm
>
> (if (string-match "^[A-Za-z][^ :]+:( |\n|$)" line)
> …)
>
> it would be easier and clearer to just list [a-z] explicitly:
>
>
>> LGTM, thank you!
>
> :) Pushed as 771c5e155d7862ed91a5d503eecc00c1db1150ad.
Closing.
Thank you,
Maxim
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Wed, 05 May 2021 07:05:01 GMT)
Message #40 received at 39970 <at> debbugs.gnu.org (full text, mbox):
On 12.03.2020 12:02, pelzflorian (Florian Pelz) wrote:
>
> Guile’s behavior that i is not among [a-z] has been confirmed as
> unexpected by a natively Turkish friend of mine. It is different from
> the behavior of current glibc:
>
> florian <at> florianmacbook ~$ cat iyiyim.c
> #include <regex.h>
> #include <stdio.h>
> #include <stdlib.h>
> #define STR "iyiyım"
> int main (int argc,
> char** argv)
> {
> regex_t only_letters;
> int r = regcomp (&only_letters, "[a-z]+", REG_EXTENDED);
> if (r != 0)
> printf ("This error does not happen.\n");
> r = regexec (&only_letters, STR, 1, malloc (sizeof (regmatch_t)), 0);
> if (r == 0)
> printf ("The string " STR " matched!\n");
> else
> printf ("No match for " STR ".\n");
> }
> florian <at> florianmacbook ~$ gcc -o iyiyim iyiyim.c
> florian <at> florianmacbook ~$ LANG=tr_TR.utf8 ./iyiyim
> The string iyiyım matched!
>
> Apparently Guile uses a bundled regular expression library rather than
> glibc. I can try making Guile use a newer GNUlib for its regular
> expressions, maybe that helps. Shall I file a separate bug for Guile?
>
Also native Turkish speaker here, and yeah that seems like a clear bug.
By the way, Turkish doesn't have q, w, or x. So if [a-z] is interpreted
by locale, it would fail to match those letters. I suppose that doesn't
matter for the patch you guys used but it might have been part of the
original problem.
The dotless lowercase i / dotted uppercase I mostly bites programmers in
case conversion. The uppercase of i is İ and the lowercase of I is ı.
There was even an exploit in GitHub related to this:
https://eng.getwisdom.io/hacking-github-with-unicode-dotless-i/
- Taylan
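The case-conversion pitfall described above is commonly sidestepped, for identifiers and other non-linguistic strings, by converting only ASCII letters and ignoring the locale. A minimal C sketch of that idea follows; the helper name ascii_upcase is hypothetical, and this is not something the thread itself proposes for Guix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: upper-case the ASCII letters of S in place and
   leave every other byte alone.  It never consults the locale, so 'i'
   always becomes 'I', never the dotted capital İ that Turkish case
   rules call for.  Returns S.  */
char *
ascii_upcase (char *s)
{
  size_t i;

  assert (s != NULL);
  for (i = 0; s[i] != '\0'; i++)
    if (s[i] >= 'a' && s[i] <= 'z')
      s[i] = (char) (s[i] - 'a' + 'A');
  return s;
}
```

This is only appropriate for strings such as protocol keywords or file-name components. Text meant for Turkish readers still needs locale-aware conversion.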
Information forwarded to bug-guix <at> gnu.org: bug#39970; Package guix. (Wed, 05 May 2021 09:23:01 GMT)
Message #43 received at 39970-done <at> debbugs.gnu.org (full text, mbox):
On Wed, May 05, 2021 at 12:47:02AM -0400, Maxim Cournoyer wrote:
> Closing.
>
> Thank you,
>
> Maxim
Sorry for forgetting about this bug. The above
LC_ALL=tr_TR.utf8 make check TESTS=tests/cran.scm
is *not* fixed, but I won’t take the time to really understand and fix
the few remaining troubles, I think. Possibly libc bug
<https://sourceware.org/bugzilla/show_bug.cgi?id=23393> is the real
issue.
Regards,
Florian
Did not alter fixed versions and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 05 May 2021 15:14:02 GMT)
This bug report was last modified 4 years and 39 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.