GNU bug report logs - #31526
Range [a-z] does not follow collate order from locale.
Reported by: Bize Ma <binaryzebra <at> gmail.com>
Date: Sat, 19 May 2018 07:39:02 UTC
Severity: important
Tags: notabug
Found in version 4.4-2
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
Message #12 received at 31526-done <at> debbugs.gnu.org:
tag 31526 notabug
close 31526
thanks
Hello,
On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> With a locale set to en_US.utf8 it is expected that the collating order is
> this:
>
> $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
> `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
While in practice this is correct on all GNU/Linux systems that
use glibc, there is no officially documented collation order for
punctuation marks, so it might differ on other systems. Please see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14
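For contrast, here is a minimal sketch of the one ordering that is fully specified: in the C locale, sort compares raw byte values, so every uppercase ASCII letter sorts before every lowercase one. The four sample letters and the variable name c_sorted are arbitrary choices for illustration.

```shell
# In the C locale, sort compares byte values:
# 'A' (65) < 'B' (66) < 'a' (97) < 'b' (98).
c_sorted=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
printf '%s\n' "$c_sorted"   # prints ABab
```

Running the same pipeline under en_US.utf8 on a glibc system would typically interleave the cases instead, as in the output quoted above.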
> It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> upper letters.
> But it isn't:
It should not be "expected". I don't think it is documented to be
so anywhere in GNU programs. Both sed's and grep's manuals contain
the following text:
In other locales, the sorting sequence is not specified, and ‘[a-d]’
might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
match any character, or the set of characters that it matches might
even be erratic.
https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
Furthermore, in the POSIX 2008 standard, range expressions are
undefined for locales other than "C/POSIX"; see this comment by Eric Blake
(the entire bug report might also be of interest on this topic):
https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24
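If the goal is simply "lowercase ASCII letters only", two approaches avoid the undefined behaviour. This is a sketch rather than an official recommendation: either force the C locale so that a-z is a plain ASCII range, or use the [:lower:] character class, which does not depend on collation order. The sample string and variable names are made up for the example.

```shell
# Option 1: force the C locale, where the range a-z is exactly the
# 26 lowercase ASCII letters.
lower_by_range=$(printf 'aAbBzZ\n' | LC_ALL=C sed 's/[^a-z]//g')

# Option 2: use a character class instead of a range. It is run under
# the C locale here only to keep the result reproducible.
lower_by_class=$(printf 'aAbBzZ\n' | LC_ALL=C sed 's/[^[:lower:]]//g')

printf '%s\n' "$lower_by_range" "$lower_by_class"   # prints abz twice
```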
> However, the range [a-Z] does match all letters, lower or upper:
>
> $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
I would recommend against mixing upper and lower case in regex
ranges, as the result might be unexpected. Compare the following:
$ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[a-Z]/p'
[[ no output, no failure ]]
$ echo '[' | LC_ALL=C sed -n '/[a-Z]/p'
sed: -e expression #1, char 7: Invalid range end
$ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[A-z]/p'
sed: -e expression #1, char 7: Invalid range end
$ echo '[' | LC_ALL=C sed -n '/[A-z]/p'
[
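The last result follows from ASCII order: six punctuation characters sit between 'Z' (0x5A) and 'a' (0x61), so in the C locale [A-z] matches them as well as the letters. A small sketch to confirm this, and to show that writing both ranges out avoids it (the variable names are only for illustration):

```shell
# Each of these six characters lies between 'Z' and 'a' in ASCII,
# so every line matches [A-z] in the C locale.
between=$(printf '%s\n' '[' '\' ']' '^' '_' '`' | LC_ALL=C grep -c '[A-z]')
printf '%s\n' "$between"        # prints 6

# Spelling out both ranges keeps the punctuation out: only q and Q match.
letters_only=$(printf '%s\n' '[' q Q | LC_ALL=C grep -c '[A-Za-z]')
printf '%s\n' "$letters_only"   # prints 2
```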
> If this is the correct way in which sed should work, then, if you please:
Yes, it is.
> - What is the rationale leading to such decision?.
The bug reports linked above contain long discussions about it.
Please also see the following thread, which promoted restricting
ranges to "sane regex ranges", meaning ASCII order alone (this applies to
gawk, grep, sed and other programs using gnulib's regex engine):
https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html
> - Where is it documented?.
See the links above to the sed and grep manuals.
> - Where is it implemented in the code?.
I think a good place to start is gnulib's DFA regex engine,
here:
https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
or here:
http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c
Search for the comment 'build range characters' for a starting point.
Both GNU grep and sed use this code.
> - Why does the manual document otherwise?.
Errors in the manual are always a possibility.
If you spot such an error, or an example showing incorrect
usage/output - please let us know where it is (e.g. a link
to a manual page / section).
As such, I'm marking this as "not a bug" and closing the ticket,
but discussion can continue by replying to this thread.
regards,
- assaf
This bug report was last modified 7 years and 92 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.