#31526 - Range [a-z] does not follow collate order from locale.

GNU bug report logs - #31526
Range [a-z] does not follow collate order from locale.

Package: sed

Reported by: Bize Ma <binaryzebra <at> gmail.com>

Date: Sat, 19 May 2018 07:39:02 UTC

Severity: important

Tags: notabug

Found in version 4.4-2

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

Message #12 received at 31526-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Bize Ma <binaryzebra <at> gmail.com>
Cc: 31526-done <at> debbugs.gnu.org
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Sat, 19 May 2018 20:13:00 -0600

tag 31526 notabug
close 31526
thanks

Hello,

On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> With a locale set to en_US.utf8 it is expected that the collating order is
> this:
>
> $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
> `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

While in practice this is correct on all GNU/Linux systems which use glibc,
there is no officially documented collation order for punctuation marks;
it might differ on other systems. Please see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14

> It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> upper letters.
> But it isn't:

It should not be "expected". I don't think it is documented to be so
anywhere in GNU programs.

Both sed's and grep's manuals contain the following text:

    In other locales, the sorting sequence is not specified, and ‘[a-d]’
    might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail
    to match any character, or the set of characters that it matches
    might even be erratic.

https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html

Furthermore, in the POSIX 2008 standard, range expressions are undefined
for locales other than "C/POSIX". See this comment by Eric Blake (the
entire bug report might also be of interest to this topic):
https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24

> However, the range [a-Z] does match all letters, lower or upper:
>
> $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

I would recommend avoiding mixing upper and lower case in regex ranges,
as the result might be unexpected.

Compare the following:

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[a-Z]/p'
  [[ no output, no failure ]]

  $ echo '[' | LC_ALL=C sed -n '/[a-Z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[A-z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=C sed -n '/[A-z]/p'
  [

> If this is the correct way in which sed should work, then, if you please:

Yes, it is.

> - What is the rationale leading to such decision?

The bug reports linked above contain long discussions about it.
Please also see the following thread, which promoted the restriction of
"sane regex ranges", meaning ASCII order alone (it applies to gawk, grep,
sed and other programs using gnulib's regex engine):
https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

> - Where is it documented?

The links above to the sed and grep manuals.

> - Where is it implemented in the code?

I think a good place to start is gnulib's DFA regex engine, here:
https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
or here:
http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

Search for the comment 'build range characters' for a starting point.
Both GNU grep and sed use this code.

> - Why does the manual document otherwise?

Errors in the manual are always a possibility. If you spot such an error,
or an example showing incorrect usage/output, please let us know where it
is (e.g. a link to a manual page / section).

As such, I'm marking this as "not a bug" and closing the ticket,
but discussion can continue by replying to this thread.

regards,
 - assaf
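
For readers who want to act on the advice above, here is a minimal sketch
of range usage that does not depend on the locale's collation order. It is
not taken from the thread. It assumes a POSIX shell and a sed that honours
LC_ALL, and it pins the C locale, where ranges follow byte values. The
function names keep_lower, keep_letters and keep_alpha are hypothetical
helpers made up for this illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not from the bug report). Each one pins LC_ALL=C
# so that bracket ranges are interpreted by byte value.

# Keep only lowercase ASCII letters.
keep_lower() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^a-z]//g'
}

# Keep all ASCII letters by naming both ranges explicitly,
# instead of a mixed-case range such as [a-Z] or [A-z].
keep_letters() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^a-zA-Z]//g'
}

# Same result via a POSIX character class, which avoids ranges entirely.
keep_alpha() {
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^[:alpha:]]//g'
}

keep_lower   'aAbBzZ[_'   # prints: abz
keep_letters 'aAbBzZ[_'   # prints: aAbBzZ
keep_alpha   'aAbBzZ[_'   # prints: aAbBzZ
```

Outside the C locale, [[:alpha:]] and [[:lower:]] remain well defined,
whereas what a range such as [a-z] matches is, per the manuals quoted
above, not specified.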

This bug report was last modified 7 years and 92 days ago.

