Package: sed;
Reported by: Bize Ma <binaryzebra <at> gmail.com>
Date: Sat, 19 May 2018 07:39:02 UTC
Severity: important
Tags: notabug
Found in version 4.4-2
Done: Assaf Gordon <assafgordon <at> gmail.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 31526 in the body.
You can then email your comments to 31526 AT debbugs.gnu.org in the normal way.
Message #5 received at submit <at> debbugs.gnu.org:
From: Bize Ma <binaryzebra <at> gmail.com>
To: bug-sed <at> gnu.org
Subject: Range [a-z] does not follow collate order from locale.
Date: Fri, 18 May 2018 17:58:05 -0400

Package: sed
Version: 4.4-2
Severity: important

Dear Maintainer,

With a locale set to en_US.utf8 it is expected that the collating order is
this:

$ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
`^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
upper letters.
But it isn't:

$ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-z]//g'
abcdefghijklmnopqrstuvwxyz

However, the range [a-Z] does match all letters, lower or upper:

$ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

If this is the correct way in which sed should work, then, if you please:

- What is the rationale leading to such decision?.
- Where is it documented?.
- Where is it implemented in the code?.
- Why does the manual document otherwise?.

Message #12 received at 31526-done <at> debbugs.gnu.org:
From: Assaf Gordon <assafgordon <at> gmail.com>
To: Bize Ma <binaryzebra <at> gmail.com>
Cc: 31526-done <at> debbugs.gnu.org
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Sat, 19 May 2018 20:13:00 -0600

tag 31526 notabug
close 31526
thanks

Hello,

On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> With a locale set to en_US.utf8 it is expected that the collating order is
> this:
>
> $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
> `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

While in practice this is correct on all GNU/Linux systems which
use glibc, there is no officially documented collation order for
punctuation marks - it might differ on other systems. Please see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14

> It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> upper letters.
> But it isn't:

It should not be "expected". I don't think it is documented to be
so anywhere in GNU programs. Both sed's and grep's manuals contain
the following text:

  In other locales, the sorting sequence is not specified, and ‘[a-d]’
  might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
  match any character, or the set of characters that it matches might
  even be erratic.

https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html

Furthermore, in the POSIX 2008 standard, range expressions are
undefined for locales other than "C/POSIX"; see this comment by Eric Blake
(the entire bug report might also be of interest to this topic):
https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24

> However, the range [a-Z] does match all letters, lower or upper:
>
> $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

I would recommend avoiding mixing upper-lower case in regex
ranges, as the result might be unexpected.
Compare the following:

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[a-Z]/p'
  [[ no output, no failure ]]

  $ echo '[' | LC_ALL=C sed -n '/[a-Z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[A-z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=C sed -n '/[A-z]/p'
  [

> If this is the correct way in which sed should work, then, if you please:

Yes, it is.

> - What is the rationale leading to such decision?.

The bug reports linked above contain long discussions about it.
Please also see the following thread, which promoted the restriction
of "sane regex ranges" - meaning ASCII order alone (and applies to gawk,
grep, sed and other programs using gnulib's regex engine):
https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

> - Where is it documented?.

The links above to the sed and grep manuals.

> - Where is it implemented in the code?.

I think a good place to start is gnulib's DFA regex engine,
here:
https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
or here:
http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

Search for the comment 'build range characters' for a starting point.

Both GNU grep and sed use this code.

> - Why does the manual document otherwise?.

Errors in the manual are always a possibility.
If you spot such an error, or an example showing incorrect
usage/output - please let us know where it is (e.g. a link
to a manual page / section).

As such, I'm marking this as "not a bug" and closing the ticket,
but discussion can continue by replying to this thread.

regards,
 - assaf

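A minimal sketch of two portable alternatives to a mixed-case range such as [a-Z], assuming a POSIX shell and a POSIX sed. The function names are illustrative only and are not part of sed. The first pins the locale to C and spells out both ASCII ranges; the second uses the [[:alpha:]] class, whose result still depends on the active locale.

```shell
# Keep only ASCII letters, regardless of the caller's locale.
# In the C locale, [A-Za-z] is exactly the 52 ASCII letters.
keep_ascii_letters() {
    LC_ALL=C sed 's/[^A-Za-z]//g'
}

# Keep whatever the current locale classifies as alphabetic.
# The result depends on LC_ALL/LC_CTYPE, so no fixed output is claimed.
keep_locale_letters() {
    sed 's/[^[:alpha:]]//g'
}

printf '%s\n' 'a-Z [x] 42_Q' | keep_ascii_letters    # prints: aZxQ
```

The first form gives up non-ASCII letters in exchange for a result that does not change between systems; the second keeps them but inherits whatever the C library's locale data says is a letter.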
Message #15 received at 31526 <at> debbugs.gnu.org:
From: Assaf Gordon <assafgordon <at> gmail.com>
To: Bize Ma <binaryzebra <at> gmail.com>
Cc: 31526 <at> debbugs.gnu.org
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Wed, 23 May 2018 02:49:43 -0600

(adding debbugs mailing list, please use "reply all" to
ensure the thread is public and archived).

Hello,

On 22/05/18 07:48 PM, Bize Ma wrote:
>
> 2018-05-19 22:13 GMT-04:00 Assaf Gordon <assafgordon <at> gmail.com>
>
> Hi!, thanks for your answer, time and detailed references.
>
> In range definitions I believe that there are two goals in conflict:
>
> - A stable, simple, range description for programmers.
> - A clear description (even if long) for multilanguage users.

Why are they in conflict?
Users of sed (programmers or not, using a multibyte locale or not)
should understand that regex ranges are tricky in multibyte locales.

> For a programmer:
> The old wisdom is that [a-d] should match only `abcd` (in C locale).
> The usual recommendation is: "do not use other locales".
> That is making the use of any other locale almost invalid.
> However, [a-z] may also match many accented (Latin) characters.
>
> For a multi language user:
> But if other locales are used, as is a must to allow for most
> languages used on this world, the range has never been clearly
> defined, much less the order in which a range will match. There are
> some clues about "collation order" in GNU sed, but it remains unclear
> as which collation sort order apply to that.
[...]
> Then, the real question is: What order does sed follow?

Exactly because regex ranges in multibyte locales are not well-defined,
the recommendation is not to use them in portable sed scripts.

> **********************************************************************
> 1.- About ASCII character numeric ranges:
>
> Yes, I agree that it may be conceptually unnecessary to give a collation
> order to "punctuation marks".
> However, that it may be "conceptually unnecessary" does not mean that
> such order is "invalid". A practical implementation may define some
> such order.
>
> Please understand that the goal of the code above is to show the practical
> result of using some (locale defined) collation order equivalent to what
> is given by the C function strcoll().

Exactly - and strcoll() is implemented in glibc (with possible replacement
in gnulib). It is outside the scope of 'sed' to define the collation order.
And the order could change from one operating system to the other.

> **********************************************************************
> 2.- About using collating order.
>
> > > It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> > > upper letters.
> > > But it isn't:
> >
> > It should not be "expected". I don't think it is documented to be
> > so anywhere in GNU programs.
>
> Well, yes, 'info sed', in section `5 Regular Expressions: selecting text`
> sub-section `5.5 Character Classes and Bracket Expressions` include:
>
>   Within a bracket expression, a "range expression" consists of two
>   characters separated by a hyphen. It matches any single character
>   that sorts between the two characters, inclusive. In the default
>   C locale, the sorting sequence is the native character order; for
>   example, '[a-d]' is equivalent to '[abcd]'.
>
> From 'info sed' (not man sed) sub-section `5.9 Locale Considerations`:
>
>   In other locales, the sorting sequence is not specified, and '[a-d]'
>   might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
>   to match any character, or the set of characters that it matches
>   might even be erratic.
>
> So, the `[a-d]` expression match characters that sort between `a` and `d`.
> That is defined above for the C locale. In other locales the sorting is
> "undefined".
>
> > … Both sed's and grep's manuals contain
> > the following text:
> >
> >   In other locales, the sorting sequence is not specified, and ‘[a-d]’
> >   might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
> >   match any character, or the set of characters that it matches might
> >   even be erratic.
>
> Yes, It is the exact same text that I also quoted above. But all it
> clearly defines is that the order is based on the definition of each
> locale "in some unspecified way". When the locale change, the order
> may also change.
>
> > https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes

I'm not sure I understand whether you are agreeing with me or not.
It seems (to me) that the text is clear:
In the "C/POSIX" locale, regex range [a-d] matches a,b,c,d.
In other locales, it is not well defined (and can match many variations,
depending on your operating system/libc).

> Yes, At the same page, but at Reporting-Bugs, under the heading
> [a-z] is case insensitive
>
> https://www.gnu.org/software/sed/manual/sed.html#Reporting-Bugs
>
> We can read:
>
>   [a-z] is case insensitive
>   You are encountering problems with locales. POSIX mandates that [a-z]
>   uses the current locale’s collation order – in C parlance, that means
>   using strcoll(3) instead of strcmp(3). Some locales have a
>   case-insensitive collation order, others don’t.
>
> It seems to say: "current locale's collation order" !!

Yes, there is a locale collation order.
It is defined in libc (e.g. glibc, but there are other libc's out there),
not in sed, and it is not well documented.
It can also change from one locale to the next (see example below).

GNU sed has no way to change/determine it, or document what it is.

> > Furthermore, in POSIX 2008 standard range expressions are
> > undefined for locales other than "C/POSIX"
>
> Yes, however: Does undefined also mean invalid, forbidden, banned or
> illegal?

I should have used a more accurate term: "Unspecified" instead of
"undefined" (and thank you for quoting Eric Blake's message about it).

Both terms are explained here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap01.html#tag_01_05_07

In this context, saying "unspecified" means that the results are not
specified by the standard. It could work reliably, it could work and
return unexpected results, it might not work. It does not mean it is
forbidden, but it does mean some implementation can choose to reject
such ranges completely and it would not be considered a violation of
the POSIX standard.

> At the moment, it is not illegal to use a bracket range in some other
> locale.
> Such use does not raise any error (or even warning). As it is not
> illegal, the only aspect that remains to be clearly defined is what is
> the range order that we should expect in every other locale than C.

This is exactly the point of saying "unspecified" - there is (currently)
no definition which GNU sed developers can guarantee will always work in
the specified manner.

> Also, We rely everyday on "not specified" behavior (for some spec):
>
> The -E option is not (yet) defined in current POSIX (The Open Group
> Base Specifications Issue 7, 2018 edition) for sed.
> Yes, It is believed that it will be accepted for the next POSIX version.

Technically speaking, the "-E" option is not "unspecified".
It is an extension beyond the current POSIX standard, and GNU programs
have many such extensions.

But there are two strong cases for "-E":
First, there is an extremely high likelihood it will be accepted to the
next version of the standard.
Second, several other sed implementations (non-GNU) support "-E" with
the same semantics.

> Some elements are undefined in POSIX just to allow implementations to be
> diverse:
>
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html
>
>   The results of giving <tilde> with an unknown login name are undefined
>   because the KornShell "~+" and "~-" constructs make use of this
>   condition …
>
> Read carefully: undefined because it is used !.
> That is, it is undefined in the spec to allow implementations to resolve in
> practical ways that might be different than the specification (or other
> implementations).

While this does not relate directly to sed, "undefined" here means that
according to the POSIX standard, the described input is *invalid*,
and implementations can decide how they want to handle it.

You are correct in saying that often POSIX says something is
"unspecified" or "undefined" because existing systems have had their own
behavior long before POSIX even existed, and POSIX does not want to
contradict or forbid existing behavior.

> In the same "comment by Eric Blake" we can read this:
>
>   The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_
>   "undefined".

What "unspecified" means is: the POSIX standard deems the input *valid*,
but does not force implementations to return specific results.
(Had the input been *invalid*, it would be "undefined" instead of
"unspecified".)

[BTW, I welcome corrections and clarifications if the above is inaccurate].

> Exactly the same I was meaning: "unspecified", but _not_ "invalid".
>
> And, exactly, what I am asking for: "glibc should document and define
> this behavior"

I fully support this: it would be beneficial for GLIBC developers to
document exactly how collation order works in various multibyte locales.

However, GNU sed developers have no way to do so.
This issue should be sent to GLIBC developers (on their mailing list or
bug-tracker website).

>
> > > However, the range [a-Z] does match all letters, lower or upper:
> > >
> > > $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> > > ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
> >
> > I would recommend avoiding mixing upper-lower case in regex
> > ranges, as the result might be unexpected. Compare the following:
>
> In the "comment by Eric Blake" we can also read:
>
>   That is, [A-z] is well-defined in the POSIX locale, and in all other
>   locales where A collates before z (which includes en_US.UTF-8)
>
> Again: "[A-z] is well-defined … "

Yes, in the "C" locale, the range "[A-z]" means ASCII value 65 ("A") to
ASCII value 122 ("z"). It means the range also includes
backslash (ASCII 92) and underscore (ASCII 95).

But how do you treat the range "[a-Z]"?
This is the range ASCII 97 to ASCII 90 ... is an implementation expected
to swap the min/max values, and treat it as ASCII range 90-97?
Or somehow understand these are letters, and change it to
ASCII 65 to 122?

Here's a simpler and more obvious case:
The range [3-8] is intuitively clear, but the reverse is not valid:

  $ echo 7 | LC_ALL=C grep '[3-8]'
  7
  $ echo 7 | LC_ALL=C grep '[8-3]'
  grep: Invalid range end

> Frankly, if I were to follow both main recommendations:
>
> - Any other locale than C is unspecified: do not use them.
> - Any range that does not match the previously known ranges:
>   "recommend avoiding mixing upper-lower case in regex ranges"
>
> The usefulness of a bracket range is reduced to almost nothing.
> Only C and only either [a-z] or [A-Z].

"Almost nothing" is a strong statement...
I would say the following:

1. In the "C" locale, where each character is a single byte
(and assuming an ASCII environment) - ranges are very well defined
and easy to use, not just [a-z] [A-Z], but any ASCII value (including
octal values, etc.).

2. In multibyte locales, ranges of specific letters (e.g. "[A-D]")
are not well specified and should be avoided in portable scripts.
However, the character classes are very usable in multibyte locales,
and can be used to match all letters or all digits, etc.
Example:

  $ echo "Γειά σου 123" | LC_ALL=en_CA.UTF-8 sed 's/[[:alpha:]]/*/g'
  **** *** 123

3. If you always use the same environment (e.g. always GLIBC,
always GNU sed, always the same locale) - then it is very likely
(but still not guaranteed) that the collation order you observe
in regex ranges will remain the same in the future.

> Is it not possible to declare and document what the collation
> order is/should be for other locales?

Again, this is a glibc issue (or any other library that implements
collation order) - outside the scope of sed.

> **********************************************************************
> 3.- Correct exactly how.
>
> > > If this is the correct way in which sed should work, then, if you please:
> >
> > Yes, it is.
>
> Thanks, but: What does it mean exactly? My opinion on the right.
>
> - That [a-z] will always mean 'abcdefghijklmnopqrstuvwxyz' in the C
>   locale?. (Yes)

Correct.

> - That the order in C locale follows the ASCII numeric order?.
>   (Yes)

Correct.

> - That no other locale should be used?
>   (No?)

Non-C locales can be used if one understands the limitations as shown
above. Specifically, portable sed scripts should not use regex ranges in
a non-C locale.
If you are absolutely certain you will always run your sed scripts
under GLIBC, it is very very likely the collation order you observe
now will remain for a long time.

> - That the order in any other locale is secret?
>   (Yes)

Not "secret" as in someone actively trying to hide it,
but unknown/undocumented because the developers of GLIBC have not
documented it.

> - That ranges like [A-z] (valid in C) can not be used in other
>   locales? (No?)

Should not be used in portable sed scripts.

> - That other ranges like [*-d] (valid in C) are a crazy idea?
>   (No?)

Instead of "crazy" let's call it "unspecified" - meaning that each
program can return different results, and there is no single "correct"
result according to the POSIX standard.

In practice, if you always use GLIBC systems, you will very very likely
see the same results every time.

> - References to collation order in the manuals must be stricken out?
>   (No?)

I'm not sure I understand this...

> And we have not even started with more characters as they are possible
> in UNICODE.
[...]
> Yes, there are discussions about what was relevant at the time.
> But none explain in clear simple words what order the characters
> in a bracket range will follow in a locale that is NOT C. (see
> some simple examples above).

Correct - that is not documented anywhere at the moment.

> > > - Why does the manual document otherwise?.
> >
> > Errors in the manual are always a possibility.
> > If you spot such an error, or an example showing incorrect
> > usage/output - please let us know where it is (e.g. a link
> > to a manual page / section).
>
> I have provided a couple of points where "collating order" is used.
> But I suspect that those are not mistakes from your point of view and
> that what is missing is a more detailed description of which collating
> order is being used.

That is a good way to describe the issue.

The term "collation order" is defined in POSIX, e.g. here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02

But the actual order (which character comes before/after another)
is left to implementations to decide.
GLIBC is one such implementation, and GLIBC developers have decided on
such an order. Sadly they have not documented it well.

Here's an example of glibc's strange behavior (or at least strange
to me, as I found no explanation for it):

In most multibyte UTF-8 locales the punctuation order differs from
ASCII order, but is consistently the same (e.g. en_CA.UTF-8 and
fr_FR.UTF-8).
For some reason, ja_JP.UTF-8 order is more like ASCII.
Compare the following:

  $ printf "%s\n" a A b B "á" "あ" "ひ" . , : - = > in
  $ LC_ALL=C sort in > out-C
  $ LC_ALL=en_CA.UTF-8 sort in > out-CA
  $ LC_ALL=ja_JP.UTF-8 sort in > out-JA
  $ paste out-C out-CA out-JA
  ,    =    ,
  -    -    -
  .    ,    .
  :    :    :
  =    .    =
  A    あ   A
  B    ひ   B
  a    A    a
  b    a    b
  á    á    あ
  あ   B    ひ
  ひ   b    á

And that is an example of why we simply can not tell you what is the
"correct" order that you'll get, even if it seems that in all of your
testing you see the same order.

Another example:

  $ echo "あáb" | LC_ALL=ja_JP.utf8 sed 's/[a-z]/*/g'
  あá*
  $ echo "あáb" | LC_ALL=en_CA.utf8 sed 's/[a-z]/*/g'
  あ**

(This is at least the case with GLIBC 2.24-11+deb9u3 on Debian 9).

> > As such, I'm marking this as "not a bug" and closing the ticket,
> > but discussion can continue by replying to this thread.
>
> I still remain in doubt, at the very minimum.

I hope this helps clear things up, but I'm happy to continue
this discussion if there are other questions.

regards,
 - assaf

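One way to see what a bracket range matches on a particular system is to test every printable ASCII character against it under a chosen locale. The following is a minimal sketch, assuming a POSIX shell with grep and tr; range_members is a hypothetical helper name, not an existing tool. Only the C-locale results are predictable; for any other locale the output is whatever the installed C library produces, and a locale that is not installed typically falls back to C behavior.

```shell
# Print the printable ASCII characters (codes 32..126) that match a
# bracket expression under the given locale.
# Usage: range_members LOCALE 'BRACKET-EXPRESSION'
range_members() {
    locale_name=$1
    bracket=$2
    code=32
    while [ "$code" -le 126 ]; do
        # %b expands the \0ooo octal escape into the character itself,
        # so each character is printed on a line of its own.
        printf '%b\n' "\\0$(printf '%o' "$code")"
        code=$((code + 1))
    done | LC_ALL=$locale_name grep -e "^${bracket}\$" | tr -d '\n'
    echo
}

range_members C '[a-d]'    # prints: abcd
range_members C '[Y-d]'    # prints: YZ[\]^_`abcd
```

Running the same call with, for example, en_CA.utf8 and ja_JP.utf8 as the first argument shows directly whether a given range behaves differently between the locales installed on that machine.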
Message #18 received at 31526-done <at> debbugs.gnu.org:
From: Bize Ma <binaryzebra <at> gmail.com>
To: 31526-done <at> debbugs.gnu.org
Subject: Fwd: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Wed, 23 May 2018 19:13:55 -0400

Following your request:

> From: Assaf Gordon
> (adding debbugs mailing list, please use "reply all" to
> ensure the thread is public and archived).

I am sending the message to which you have just answered to the debbugs
mailing list. Sorry for my mistake.

---------- Forwarded message ----------
From: Bize Ma <binaryzebra <at> gmail.com>
Date: 2018-05-22 21:48 GMT-04:00
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
To: Assaf Gordon <assafgordon <at> gmail.com>

> 2018-05-19 22:13 GMT-04:00 Assaf Gordon <assafgordon <at> gmail.com>:
> Hello,

Hi!, thanks for your answer, time and detailed references.

In range definitions I believe that there are two goals in conflict:

- A stable, simple, range description for programmers.
- A clear description (even if long) for multilanguage users.

For a programmer:
The old wisdom is that [a-d] should match only `abcd` (in C locale).
The usual recommendation is: "do not use other locales".
That is making the use of any other locale almost invalid.
However, [a-z] may also match many accented (Latin) characters.

For a multi language user:
But if other locales are used, as is a must to allow for most languages used
on this world, the range has never been clearly defined, much less the order
in which a range will match. There are some clues about "collation order" in
GNU sed, but it remains unclear as which collation sort order apply to that.

Using a range in other locale does not follow ASCII numeric order:

    printf '%b' "$(printf '\\U%x\\n' {32..255})" |
        LC_ALL=C sort | tr -d '\n' | sed 's/[^a-ä]//g'; echo
    abcdªàáâãäåæç

The result above should have ended in a `d`, but `d` falls in the middle.

Nor does it follow the locale collate order in effect (it should end in ä):

    printf '%b' "$(printf '\\U%x\\n' {32..255})" |
        LC_ALL=en_CA.utf8 sort | tr -d '\n' | sed 's/[^a-ä]//g'; echo
    aáàâä㪠

Then, the real question is: What order does sed follow?

> On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> >
> > $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
> > `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> > kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ
>
> While in practice this is correct on all GNU/linux systems which
> use glibc, there is no officially documented collation order for
> punctuation marks - it might differ on other systems. Please see here:
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14

**********************************************************************
1.- About ASCII character numeric ranges:

Yes, I agree that it may be conceptually unnecessary to give a collation
order to "punctuation marks".
However, that it may be "conceptually unnecessary" does not mean that
such order is "invalid". A practical implementation may define some
such order.

Please understand that the goal of the code above is to show the practical
result of using some (locale defined) collation order equivalent to what
is given by the C function strcoll().

The range may be more limited to only letters and numbers:
{48..57} {65..90} {97..122} (in hex: 0x30-0x39 0x41-0x5a 0x61-0x7a).

Let us define and use a function that should work on bash 4.2+:

    # Usage: collorder SED_REGEX FIRST LAST [FIRST LAST]...
    # Prints the characters in each code point range FIRST..LAST, sorted
    # by the current locale, after deleting everything SED_REGEX matches.
    collorder(){
        a=$1; shift 1
        until (($#<2)); do
            # One character per line; \U escapes need bash 4.2+.
            printf '%b' $(printf '\\U%x\\n' $(seq "$1" "$2"))
            shift 2
        done | sort | tr -d '\n' | sed 's/'"$a"'//g'
        echo
    }

That function will allow us to do:

    $ LC_ALL=en_CA.utf8 collorder ' ' 48 57 65 90 97 122
    0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz

And (In C locale the sort is identical to ASCII numeric sort):

    $ LC_ALL=C collorder ' ' 48 57 65 90 97 122
    0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

And filtering by a bracket range:

    $ LC_ALL=C collorder '[^a-z]' 48 57 65 90 97 122
    abcdefghijklmnopqrstuvwxyz

But those ranges avoid the character that you use later (`[`).
Including the characters between Upper-Case and lowercase ASCII:

    $ LC_ALL=C collorder '[^Y-d]' 48 57 65 122
    YZ[\]^_`abcd

That was the reason to include all 95 (126-32+1) ASCII that are not control.
One simple range.
Including such characters allow (perfectly valid) mixed bracket ranges:

    $ LC_ALL=C collorder '[^+-d]' 32 126
    +,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcd

Not because I was interested in diverting the discussion to "punctuation
marks". Just because it was one simple character numeric range.

That is all, the bash function defined here: collorder, is a tool to reveal
the (practical) collation order valid for the applied locale.

**********************************************************************
2.- About using collating order.

> > It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> > upper letters.
> > But it isn't:
>
> It should not be "expected". I don't think it is documented to be
> so anywhere in GNU programs.

Well, yes, 'info sed', in section `5 Regular Expressions: selecting text`
sub-section `5.5 Character Classes and Bracket Expressions` include:

    Within a bracket expression, a "range expression" consists of two
    characters separated by a hyphen. It matches any single character
    that sorts between the two characters, inclusive. In the default
    C locale, the sorting sequence is the native character order; for
    example, '[a-d]' is equivalent to '[abcd]'.

From 'info sed' (not man sed) sub-section `5.9 Locale Considerations`:

    In other locales, the sorting sequence is not specified, and '[a-d]'
    might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
    to match any character, or the set of characters that it matches
    might even be erratic.

So, the `[a-d]` expression match characters that sort between `a` and `d`.
That is defined above for the C locale. In other locales the sorting is
"undefined".

> … Both sed's and grep's manuals contain
> the following text:
>
>   In other locales, the sorting sequence is not specified, and ‘[a-d]’
>   might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
>   match any character, or the set of characters that it matches might
>   even be erratic.

Yes, It is the exact same text that I also quoted above. But all it
clearly defines is that the order is based on the definition of each
locale "in some unspecified way". When the locale change, the order
may also change.

> https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes

Yes, At the same page, but at Reporting-Bugs, under the heading
[a-z] is case insensitive

https://www.gnu.org/software/sed/manual/sed.html#Reporting-Bugs

We can read:

    [a-z] is case insensitive
    You are encountering problems with locales. POSIX mandates that [a-z]
    uses the current locale’s collation order – in C parlance, that means
    using strcoll(3) instead of strcmp(3). Some locales have a
    case-insensitive collation order, others don’t.

It seems to say: "current locale's collation order" !!

> https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html
>
> Furthermore, in POSIX 2008 standard range expressions are
> undefined for locales other than "C/POSIX", see this comment by Eric Blake
> (also the entire bug report might be of interest to this topic):
> https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24

Yes, however: Does undefined also mean invalid, forbidden, banned or illegal?

At the moment, it is not illegal to use a bracket range in some other locale.
Such use does not raise any error (or even warning). As it is not illegal, the
only aspect that remains to be clearly defined is what is the range order that
we should expect in every other locale than C.

Also, We rely everyday on "not specified" behavior (for some spec):

The -E option is not (yet) defined in current POSIX (The Open Group
Base Specifications Issue 7, 2018 edition) for sed.
Yes, It is believed that it will be accepted for the next POSIX version.

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

But it is defined (and used) in GNU sed.

Some elements are undefined in POSIX just to allow implementations to be
diverse:

http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html

    The results of giving <tilde> with an unknown login name are undefined
    because the KornShell "~+" and "~-" constructs make use of this
    condition …

Read carefully: undefined because it is used !.
That is, it is undefined in the spec to allow implementations to resolve in
practical ways that might be different than the specification (or other
implementations).

In the same "comment by Eric Blake" we can read this:

    The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_
    "undefined". A compliant app cannot guarantee what the behavior will
    be, but the behavior should at least be explainable, and as a QoI
    point, glibc should document and define this behavior as an extension
    to POSIX, so that apps relying on glibc can take advantage of this
    extension for known behavior.

Exactly the same I was meaning: "unspecified", but _not_ "invalid".

And, exactly, what I am asking for: "glibc should document and define this
behavior"

> > However, the range [a-Z] does match all letters, lower or upper:
> >
> > $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> > ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
>
> I would recommend avoiding mixing upper-lower case in regex
> ranges, as the result might be unexpected.
> Compare the following:

In the "comment by Eric Blake" we can also read:

    That is, [A-z] is well-defined in the POSIX locale, and in all other
    locales where A collates before z (which includes en_US.UTF-8)

Again: "[A-z] is well-defined … "

Frankly, if I were to follow both main recommendations:

- Any other locale than C is unspecified: do not use them.
- Any range that does not match the previously known ranges:
  "recommend avoiding mixing upper-lower case in regex ranges"

then the usefulness of a bracket range is reduced to almost nothing:
only C, and only either [a-z] or [A-Z].

Is it not possible to declare and document what the collation order
is/should be for other locales?

**********************************************************************
3.- Correct exactly how.

> > If this is the correct way in which sed should work, then, if you please:
>
> Yes, it is.

Thanks, but what does it mean exactly? My opinion is on the right.

- That [a-z] will always mean 'abcdefghijklmnopqrstuvwxyz' in the C locale? (Yes)
- That the order in C locale follows the ASCII numeric order? (Yes)
- That no other locale should be used? (No?)
- That the order in any other locale is secret? (Yes)
- That ranges like [A-z] (valid in C) can not be used in other locales? (No?)
- That other ranges like [*-d] (valid in C) are a crazy idea? (No?)
- References to collation order in the manuals must be stricken out? (No?)

And we have not even started with the many more characters that are
possible in Unicode.

- Is this valid:

      $ LC_ALL=en_CA.utf8 ./collorder '[^a-z]' 32 255
      abcdefghijklmnopqrstuvwxyzªºßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýÿ

  Does it mean that [a-z] is closer to [[:lower:]] than ASCII a-z?

- Is this expected? (phonetic symbols)

      $ LC_ALL=en_CA.utf8 ./collorder '[^a-z]' 0x250 0x2af
      ɓɔɖɗəɛɠɵ

- Should this work? In what order? (phonetic symbols)

      $ LC_ALL=en_CA.utf8 ./collorder '[^ɖ-ɛ]' 0x250 0x2af
      ɖɗəɛ

- Why are all Latin characters being included?
  (Latin extended)

      $ LC_ALL=en_CA.utf8 ./collorder '[^a-z]' 0x1e00 0x1fff
      ḁḃḅḇḉḋḍḏḑḓḕḗḙḛḝḟḡḣḥḧḩḫḭḯḱḳḵḷḹḻḽḿṁṃṅṇṉṋṍṏṑṓṕṗṙṛṝṟṡṣṥṧṩṫṭṯṱṳṵṷṹṻṽṿẁẃẅẇẉẋẍẏẖẗẘẙẚẛạảấầẩẫậắằẳẵặẹẻẽếềểễệỉịọỏốồổỗộớờởỡợụủứừửữựỳỵỷỹ

> > - What is the rationale leading to such decision?.
>
> The bug reports linked above contain long discussions about it.

Yes, there are discussions about what was relevant at the time.
But none explain in clear simple words what order the characters
in a bracket range will follow in a locale that is NOT C (see
some simple examples above).

> Please also see the following thread, which promoted the restriction
> of "sane regex ranges" - meaning ASCII order alone (and applies to gawk,
> grep, sed and other programs using gnulib's regex engine):
>
> https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

ASCII order alone? Only for characters in the numeric range 0x00-0x7f?

- How come an á gets included in the very limited [a-b]?

      $ LC_ALL=en_CA.utf8 ./collorder '[^a-b]' 0x00 0xff
      abªàáâãäåæ

> > - Where is it documented?.
>
> The links above to the sed and grep manuals.

None of the linked documents explains the above result for [^a-b].

> > - Where is it implemented in the code?.
>
> I think a good place to start is gnulib's DFA regex engine,
> here:
> https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
> or here:
> http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

I must admit that I am unable to understand any of those 4000 lines
of code without some detailed explanation of how they work. I am really sorry.

> Search for the comment 'build range characters' for a starting point.
>
> Both gnu grep and sed use this code.
>
> > - Why does the manual document otherwise?.
>
> Errors in the manual are always a possibility.
> If you spot such an error, or an example showing incorrect
> usage/output - please let us know where it is (e.g. a link
> to a manual page / section).

I have provided a couple of points where "collating order" is used.
But I suspect that those are not mistakes from your point of view and that
what is missing is a more detailed description of which collating order is
being used. I may be perfectly wrong, of course.

> As such, I'm marking this as "not a bug" and closing the ticket,
> but discussion can continue by replying to this thread.

I still remain in doubt, at the very minimum.

> regards,
> - assaf

Many thanks and regards
- Bize
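
The one behavior that every part of this exchange agrees on is the C locale, where a range follows byte values. A minimal sketch of pinning that locale for a single command, assuming a POSIX shell with awk and sed available; the variable names `ascii`, `lower` and `letters` are illustrative only and do not come from the thread:

```shell
# Build the printable ASCII characters (codes 32..126) without relying on
# bash brace expansion or on the \U escape of printf.
ascii=$(LC_ALL=C awk 'BEGIN { for (i = 32; i < 127; i++) printf "%c", i }')

# In the C locale a range follows the byte values, so [a-z] is exactly the
# 26 lowercase ASCII letters.
lower=$(printf '%s\n' "$ascii" | LC_ALL=C sed 's/[^a-z]//g')

# Listing both ranges explicitly avoids mixed-case ranges such as [a-Z] or [A-z].
letters=$(printf '%s\n' "$ascii" | LC_ALL=C sed 's/[^A-Za-z]//g')

printf '%s\n%s\n' "$lower" "$letters"
```

This prints the lowercase alphabet on the first line and the uppercase alphabet followed by the lowercase alphabet on the second. It says nothing about what the same ranges match in any other locale, which is the open question of this report.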
bug-sed <at> gnu.org: bug#31526; Package sed.
(Fri, 25 May 2018 04:49:02 GMT) Message #21 received at 31526 <at> debbugs.gnu.org:
From: Bize Ma <binaryzebra <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 31526 <at> debbugs.gnu.org
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
Date: Fri, 25 May 2018 00:48:33 -0400
[Message part 1 (text/plain, inline)]
I believe that these lines carry the essence of the answer:

> It is outside the scope of 'sed' to define the collation order.
> Yes, there is a locale collation order.
> It is defined in libc not in sed, and it is not well documented.
> GNU sed has no way to change/determine it, or document what it is.

>> - That the order in any other locale is secret?
>
> Not "secret" as in someone actively trying to hide it,
> but unknown/undocumented because the developers of GLIBC have not
> documented it.

>> But none explain in clear simple words what order the characters
>> in a bracket range will follow in a locale that is NOT C. (see
>> some simple examples above).
>
> Correct - that is not documented anywhere at the moment.

So:

- This is not a bug that sed developers could or would resolve.
- The sort order needs to be documented by glibc.

In fact, sed developers do not support bracket ranges in
a locale that is not C:

> Any other locale than C is unspecified: do not use them.

Best Regards
Bize Ma

-----------------------------------------------------------------------------

Some general clarifications follow:

>> In range definitions I believe that there are two goals in conflict:
>>
>> - A stable, simple, range description for programmers.
>> - A clear description (even if long) for multilanguage users.
>>
> Why are they in conflict? …

Because if a long description is required, then it is "not simple".

> Exactly because regex ranges in multibyte locales are not well-defined,
> the recommendation is not to use them in portable sed scripts.

Portable? That is a new word. It did not appear in previous e-mails.
Why do you assume that I want/need to have only "portable" ranges?

>> **********************************************************************
>> 1.- About ASCII character numeric ranges:

[...]

> In "C/POSIX" locale, regex range [a-d] matches a,b,c,d.
> In other locales, it is not well defined (and can match many variations,
> depending on your operating system/libc).
Yes, simple: sed defers to glibc (or another libc) the responsibility to define
and implement such an order. Thus, sed developers could not support any specific
range order.

[...]

>> The -E option is not (yet) defined in current POSIX (The Open Group
>> Base Specifications Issue 7, 2018 edition) for sed.
>> Yes, It is believed that it will be accepted for the next POSIX version.
>>
> Technically speaking, the "-E" option is not "unspecified".

I did not use the word "unspecified"; I said "not (yet) defined".
Please do not put words in my mouth.

> It is an extension beyond the current POSIX standard, and GNU programs
> have many such extensions.

And, as an extension, it is something that the POSIX standard has not (yet)
defined.

[...]

> But how do you treat range "[a-Z]" ?

If the collating order sorts `a` before `Z`, the range is valid and should give
a "reasonable" result.

    $ echo '0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz' |
    > LC_ALL=en_CA.utf8 sed 's/[^a-Z]//g'
    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

As you can see above, glibc (and thus sed) does not claim the range to be
invalid, and thus it returns a "reasonable" result (whatever "reasonable"
means here).

> This is range ASCII 97 to ASCII 90 ... is an implementation expected
> to swap the min/max values, and treat it as ASCII range 90-97 ?
> or somehow understand these are letters, and change it to ASCII 65 to 122 ?

ASCII values only have an exact meaning in the C locale (and (maybe) in
C.UTF-8), and that is only because ASCII order is the collating sort order of
the C locale. In other locales, the sort order is usually (very) different from
the ASCII numeric values.

[...]

> 2. In multibyte locales, ranges of specific letters (e.g. "[A-D]")
> are not well specified and should be avoided in portable scripts.

That word again: portable. Only in portable scripts?
What should happen in all other scripts?

[...]
>> **********************************************************************
>> 3.- Correct exactly how.

[...]

>> - That other ranges like [*-d] (valid in C) are a crazy idea? (No?)
>
> Instead of "crazy" let's call it "unspecified" …

Let's call it what it is: unsupported by sed.

>> - References to collation order in the manuals must be stricken out? (No?)
> I'm not sure I understand this...

You said:

    I don't think it is documented to be so anywhere in GNU programs.

[...]

> The term "collation order" is defined in POSIX, e.g. here:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02

I have NOT asked what the term means in POSIX, but what it means to sed.

[...]

> Here's an example of glibc's strange behavior (or at least
> strange to me, as I found no explanation for it):
>
> In most multibyte UTF-8 locales the punctuation order
> differs from ASCII order,

Collation order is a language issue: each language has its own, and often
conflicting, views of what "the correct order" should be. That is how we
humans think. Consider very "simple" everyday dates: there are as many month
names as there are languages, and as many weekday names as there are
languages. That is what an individual of any culture has learnt to expect as
the "natural order". All we can do, when confronted with diverse expectations,
is accept that they exist and adapt to them.

Please take a look at the Unicode Collation page:
http://unicode.org/reports/tr10/

> … but is consistently the same (e.g. en_CA.UTF-8 and fr_FR.UTF-8).
> For some reason, ja_JP.UTF-8 order is more like ASCII.
>
> Compare the following:
>
> $ printf "%s\n" a A b B "á" "あ" "ひ" . , : - = > in
> $ LC_ALL=C sort in > out-C
> $ LC_ALL=en_CA.UTF-8 sort in > out-CA
> $ LC_ALL=ja_JP.UTF-8 sort in > out-JA
> $ paste out-C out-CA out-JA
> ,   =   ,
> -   -   -
> .   ,   .
> :   :   :
> =   .   =
> A   あ  A
> B   ひ  B
> a   A   a
> b   a   b
> á   á   あ
> あ  B   ひ
> ひ  b   á

All that the above reveals is one order: the order that sort follows.
But you are still failing to get it: that is entirely different from
what glibc follows.

Try:

    $ LC_ALL=C sed 's/[A-B]/x/g' out-C >out-C-sed
    $ LC_ALL=en_CA.utf8 sed 's/[A-B]/x/g' out-CA >out-CA-sed
    $ LC_ALL=en_JP.utf8 sed 's/[A-B]/x/g' out-JA >out-JA-sed
    $ paste out-C-sed out-CA-sed out-JA-sed
    ,   =   ,
    -   -   -
    .   ,   .
    :   :   :
    =   .   =
        あ
        ひ
    a       a
    b   a   b
    á   á   あ
    あ      ひ
    ひ  b   á

The `a` and the `á` were sorted between `A` and `B` in the en_CA.utf8 locale.
But sed did NOT match them.

Yes, that is just one particular example in the en_CA.utf8 locale.

[...]

>>> As such, I'm marking this as "not a bug" and closing the ticket,
>>> but discussion can continue by replying to this thread.
>>
>> I still remain in doubt, at the very minimum.
>
> I hope this helps clears things out, but I'm happy to continue
> this discussion if there are other questions.

I am clear now that this is unsupported by sed, thanks.
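
The gap between the order that sort reports and the characters that a bracket range matches can be checked one character at a time. A minimal sketch, assuming a POSIX shell with grep, sort and tr; `probe` is a hypothetical helper written for illustration (it is not part of sed, grep or glibc, and it is not the `./collorder` script used earlier in the thread):

```shell
# probe LOCALE BRACKET-EXPRESSION CHAR...
# Hypothetical helper: print, on one line, those CHAR arguments that the
# bracket expression matches when grep runs under LOCALE.
probe() {
    probe_locale=$1
    probe_expr=$2
    shift 2
    probe_hits=
    for probe_ch in "$@"; do
        if printf '%s\n' "$probe_ch" | LC_ALL=$probe_locale grep -q -e "$probe_expr"; then
            probe_hits="$probe_hits$probe_ch"
        fi
    done
    printf '%s\n' "$probe_hits"
}

# In the C locale, sorting and range matching agree: both follow byte values.
printf '%s\n' a A b B | LC_ALL=C sort | tr -d '\n'; echo
probe C '[A-B]' a A b B
```

In the C locale this prints `ABab` for the sort and `AB` for the range. Running the same two commands with another locale name, for example `probe en_CA.utf8 '[A-B]' a A b B á`, shows whether the two orders agree on a given system; that result depends on the libc and on which locales are installed, so no output is claimed for it here.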
Debbugs Internal Request <help-debbugs <at> gnu.org>
to internal_control <at> debbugs.gnu.org
.
(Fri, 22 Jun 2018 11:24:03 GMT) Full text and rfc822 format available.