From unknown Sat Sep 20 10:49:24 2025
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.509 (Entity 5.509)
Content-Type: text/plain; charset=utf-8
From: bug#31526 <31526@debbugs.gnu.org>
To: bug#31526 <31526@debbugs.gnu.org>
Subject: Status: Range [a-z] does not follow collate order from locale.
Reply-To: bug#31526 <31526@debbugs.gnu.org>
Date: Sat, 20 Sep 2025 17:49:24 +0000

retitle 31526 Range [a-z] does not follow collate order from locale.
reassign 31526 sed
submitter 31526 Bize Ma <binaryzebra@gmail.com>
severity 31526 important
tag 31526 notabug

thanks


From debbugs-submit-bounces@debbugs.gnu.org Sat May 19 03:38:40 2018
Received: (at submit) by debbugs.gnu.org; 19 May 2018 07:38:40 +0000
Received: from localhost ([127.0.0.1]:40426 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fJwRr-0003uL-MW
	for submit@debbugs.gnu.org; Sat, 19 May 2018 03:38:39 -0400
Received: from eggs.gnu.org ([208.118.235.92]:36213)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <binaryzebra@gmail.com>) id 1fJnOB-0007hf-Rw
 for submit@debbugs.gnu.org; Fri, 18 May 2018 17:58:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <binaryzebra@gmail.com>) id 1fJnO5-0003dY-MO
 for submit@debbugs.gnu.org; Fri, 18 May 2018 17:58:10 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 HTML_MESSAGE,HTML_NONELEMENT_30_40,T_DKIM_INVALID autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:33633)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <binaryzebra@gmail.com>)
 id 1fJnO5-0003dP-J6
 for submit@debbugs.gnu.org; Fri, 18 May 2018 17:58:09 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:55408)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <binaryzebra@gmail.com>) id 1fJnO4-0000cI-B5
 for bug-sed@gnu.org; Fri, 18 May 2018 17:58:09 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <binaryzebra@gmail.com>) id 1fJnO3-0003c2-J5
 for bug-sed@gnu.org; Fri, 18 May 2018 17:58:08 -0400
Received: from mail-ot0-x22d.google.com ([2607:f8b0:4003:c0f::22d]:46742)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <binaryzebra@gmail.com>)
 id 1fJnO3-0003bw-Bv
 for bug-sed@gnu.org; Fri, 18 May 2018 17:58:07 -0400
Received: by mail-ot0-x22d.google.com with SMTP id t1-v6so10765681ott.13
 for <bug-sed@gnu.org>; Fri, 18 May 2018 14:58:06 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:from:date:message-id:subject:to;
 bh=CL3TPITxOpqQ1YZXnCoD3S6Um4kVRW45Z14ENAWFSEk=;
 b=nhh664CUISHxllaxKGfdo8znLlj+oHXKjKbSF8xijHmP7mLtFww67yw+6+opvt7UkC
 q+RodwKIfXmw4oy2xotTiFOIDnHO4pNPFWeyjh/K9/XcNGonLnocpr1FcdITQNObnRp7
 C87RV6x7SS8yX6ztveuNBh8li0Vv5BpKBMfKu9a86HvC3xT5iWIVqtCITRTfuikuUd8q
 zZglrVvJrFrz5QzQ17eE2KWBr1TEnJm3Kz2NfCnygpPseqGsuSNyS7QsZTSD8FXmyZub
 L6KOM1hKFUCEHUNV4Ti6gw1R8x3QztVn7hvzlDx94D+Xa/FeLsUth98I/27lRXEUIGOA
 SQbw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=CL3TPITxOpqQ1YZXnCoD3S6Um4kVRW45Z14ENAWFSEk=;
 b=E89eD79JUmHvBrDnuo/b7wtgswysFAwJSctPrOl/RSTnJDKGSP8jaWGmee9JqO7cNZ
 TifQ5z84+f2d3mV9UZ75b2jhMyBixZZlr1nnu3Fz64NpGOaf6pg2yjIrPecqqsnfdp3P
 Af8paMMGQbYWRM9iDK3axbHB/YKrFwkll3h749mQG5KGnF2OcnSJlu5BF3tzWNC0jSTH
 9FLwu/HrOPIiElkTz3dLmiU9DDWg9wVNFdEbSoF3kxQvLnvl87Z80T3vzz1Ci9a7uK3f
 ojyb5+W1eEWYPu2XO/3xD9gfuDX+Z8/Wd8kokerhlu2qg/YKyDSmT4j4HIvzh3FOp3+s
 ziMA==
X-Gm-Message-State: ALKqPwdvDMvEkXQRFtrIEij55+vWGJ3GYPVKPikM/OjVECDLh054KeV2
 VpB3vrigVUHsCtVRM2z60GWAn3ns1JHHNQlh6oicj/zT
X-Google-Smtp-Source: AB8JxZrSosCk9SMt90ayfE1YB5eVhWep7jqj7U0EngxfCRf5TFsU3RYpfTQCKnWkBDYlxA81mW0bhmNQxD/wVHOAE1g=
X-Received: by 2002:a9d:3c3b:: with SMTP id
 q56-v6mr8004477otc.226.1526680686142; 
 Fri, 18 May 2018 14:58:06 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a9d:2f04:0:0:0:0:0 with HTTP; Fri, 18 May 2018 14:58:05
 -0700 (PDT)
From: Bize Ma <binaryzebra@gmail.com>
Date: Fri, 18 May 2018 17:58:05 -0400
Message-ID: <CAFra36hoTNH0s6oOHtNQfk6M_67Zk24tVZ5QTN67NfVwhpdD6A@mail.gmail.com>
Subject: Range [a-z] does not follow collate order from locale.
To: bug-sed@gnu.org
Content-Type: multipart/alternative; boundary="000000000000fba835056c820b20"
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Sat, 19 May 2018 03:38:37 -0400
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

--000000000000fba835056c820b20
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

Package: sed
Version: 4.4-2
Severity: important

Dear Maintainer,

With a locale set to en_US.utf8 it is expected that the collating order is
this:

    $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
    `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
upper letters.
But it isn't:

    $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-z]//g'
    abcdefghijklmnopqrstuvwxyz

However, the range [a-Z] does match all letters, lower or upper:

    $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

If this is the correct way in which sed should work, then, if you please:

    - What is the rationale leading to such a decision?
    - Where is it documented?
    - Where is it implemented in the code?
    - Why does the manual document otherwise?

--000000000000fba835056c820b20--


From debbugs-submit-bounces@debbugs.gnu.org Sat May 19 22:13:12 2018
Received: (at control) by debbugs.gnu.org; 20 May 2018 02:13:12 +0000
Received: from localhost ([127.0.0.1]:41511 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fKDqS-0008Lz-4s
	for submit@debbugs.gnu.org; Sat, 19 May 2018 22:13:12 -0400
Received: from mail-pg0-f41.google.com ([74.125.83.41]:43100)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <assafgordon@gmail.com>)
 id 1fKDqQ-0008Lh-2H; Sat, 19 May 2018 22:13:10 -0400
Received: by mail-pg0-f41.google.com with SMTP id p8-v6so4862926pgq.10;
 Sat, 19 May 2018 19:13:10 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=date:from:to:cc:subject:message-id:references:mime-version
 :content-disposition:content-transfer-encoding:in-reply-to
 :user-agent; bh=QK1RZZBNuZ6FBQPlyilH1exbeZd1HXJ22kpZrk0biuE=;
 b=c7sbOgA8lPqr4M/+cQySIlIxWW5DvLAjMt8bmD9rOlYHrwGEJFltucByl1KWpyhEB1
 nIpzjNpUi9XeKgoeUACBS/DFzbRSzmvmIjr+Prf1ZCLdY6fN9v5jhO1PHzG6lib6S7Bu
 V4V9bxzkYXQeSKsowK6JH0YEKdGMPZ+bJBdE7sKuPu/CG226RSlDLqklprnszv6AwXk1
 iXiskKsBYojhpV1FY2jvfFnP1GPVmwaIW/P1pi5R06KIv9+cJfZz3zI8DkZB5n/oIvHi
 vWdLKW8kyC48qGvee9isY7SXRAkPLlARUco4opxugH6lW0ErhLa8mINV46QBeJjEW0Su
 i9yA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:date:from:to:cc:subject:message-id:references
 :mime-version:content-disposition:content-transfer-encoding
 :in-reply-to:user-agent;
 bh=QK1RZZBNuZ6FBQPlyilH1exbeZd1HXJ22kpZrk0biuE=;
 b=O7a2tfKR9UR/mYcuAoY2l95X3dn9tCBI7X3IAvhjPsIqc+F3WFgWUoKdGHbV2AN7am
 ry0s+X1vdq646tfp4dO9BtE/uuVIpCM582/sM/U9z9/EUlsNIuJF3gmMweICKMNjaTgi
 8sxCuBk+fLW0uRAZ9iGdxP5vYvdkRUH2O9XfYQjTIosVSwUtnnMEDi47nWcwMARXskHc
 DamJh+uWgnUVZGngNqwhuIfXt9eNavV5ARtIQr590yWdOlevcOZzG+BpUFb4DIbY59DP
 Zv8lqK9qNgDELnER54XOpSNobQNIu+LpVHYS10tlbJ7+oiUc9A0qGdb8GLgOvZGUA+Yw
 /YBw==
X-Gm-Message-State: ALKqPweOft2hMIWZNjMHeN3WabdHTMsHnPlb69hJkLNJJXP5yjHgYaQh
 WbgJscy1Y9MWgIfeti+/flY=
X-Google-Smtp-Source: AB8JxZoTJYMm/trzdydjUYCYzKxtPD4mZ1Y2kOKe+sB2kjXq9gpKEkSGS+PzqXsFPH9QS8CyLVTgwg==
X-Received: by 2002:a63:7314:: with SMTP id
 o20-v6mr7056047pgc.156.1526782384254; 
 Sat, 19 May 2018 19:13:04 -0700 (PDT)
Received: from tomato (moose.housegordon.com. [184.68.105.38])
 by smtp.gmail.com with ESMTPSA id g64-v6sm20448830pfd.50.2018.05.19.19.13.02
 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256);
 Sat, 19 May 2018 19:13:03 -0700 (PDT)
Date: Sat, 19 May 2018 20:13:00 -0600
From: Assaf Gordon <assafgordon@gmail.com>
To: Bize Ma <binaryzebra@gmail.com>
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
Message-ID: <20180520021300.myh2m4njtaq4nz3u@tomato>
References: <CAFra36hoTNH0s6oOHtNQfk6M_67Zk24tVZ5QTN67NfVwhpdD6A@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <CAFra36hoTNH0s6oOHtNQfk6M_67Zk24tVZ5QTN67NfVwhpdD6A@mail.gmail.com>
User-Agent: NeoMutt/20170113 (1.7.2)
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: control
Cc: 31526-done@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

tag 31526 notabug
close 31526
thanks

Hello,

On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> With a locale set to en_US.utf8 it is expected that the collating order is
> this:
> 
>     $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
>     `^~<=>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

While in practice this is correct on all GNU/Linux systems that
use glibc, there is no officially documented collation order for
punctuation marks, so it might differ on other systems. Please see here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=23677#14

> It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
> upper letters.
> But it isn't:

It should not be "expected". I don't think it is documented to be
so anywhere in GNU programs. Both sed's and grep's manuals contain
the following text:

    In other locales, the sorting sequence is not specified, and ‘[a-d]’
    might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
    match any character, or the set of characters that it matches might
    even be erratic.

https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes
https://www.gnu.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressions.html

Furthermore, in the POSIX 2008 standard, range expressions are
undefined for locales other than "C/POSIX"; see this comment by Eric Blake
(the entire bug report might also be of interest on this topic):
https://bugzilla.redhat.com/show_bug.cgi?id=583011#c24
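
If a script really needs a range, one way to keep it predictable is to pin
the locale for that one command. A minimal sketch, assuming ASCII input (the
helper name keep_ascii_lower is made up for illustration and is not part of
sed); it relies on the C locale using the native character order for ranges:

```shell
# Hypothetical helper: keep only the ASCII lowercase letters of each line.
# LC_ALL=C applies to this sed invocation only, so [a-z] means exactly a..z.
keep_ascii_lower() {
  LC_ALL=C sed 's/[^a-z]//g'
}

printf 'aAbBcC123\n' | keep_ascii_lower    # prints: abc
```

The trade-off is that under LC_ALL=C sed works on bytes, so accented letters
such as 'é' fall outside [a-z] as well.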

> However, the range [a-Z] does match all letters, lower or upper:
> 
>     $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
>     ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

I would recommend against mixing upper and lower case in regex
ranges, as the result might be unexpected. Compare the following:

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[a-Z]/p'
  [[ no output, no failure ]]

  $ echo '[' | LC_ALL=C sed -n '/[a-Z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=en_CA.utf8 sed -n '/[A-z]/p'
  sed: -e expression #1, char 7: Invalid range end

  $ echo '[' | LC_ALL=C sed -n '/[A-z]/p'
  [
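
When the intent of [a-Z] or [A-z] is simply "any letter", a character class
says so directly and does not depend on where the range endpoints sort.
A small sketch, with has_letter as a hypothetical helper name:

```shell
# Hypothetical helper: print only the input lines that contain a letter.
# [[:alpha:]] is a POSIX character class, so no range endpoints are involved.
has_letter() {
  sed -n '/[[:alpha:]]/p'
}

printf '[\n' | has_letter    # prints nothing: '[' is not a letter
printf 'x\n' | has_letter    # prints: x
printf 'Q\n' | has_letter    # prints: Q
```

Which non-ASCII characters count as letters still depends on the locale's
LC_CTYPE setting.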


> If this is the correct way in which sed should work, then, if you please:

Yes, it is.

>     - What is the rationale leading to such a decision?

The bug reports linked above contain long discussions about it.

Please also see the following thread, which promoted the restriction
of "sane regex ranges" - meaning ASCII order alone (and applies to gawk,
grep, sed and other programs using gnulib's regex engine):

https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

>     - Where is it documented?

In the sed and grep manuals linked above.

>     - Where is it implemented in the code?

I think a good place to start is gnulib's DFA regex engine,
here:
https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
or here:
http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

Search for the comment 'build range characters' for a starting point.

Both GNU grep and sed use this code.

>     - Why does the manual document otherwise?

Errors in the manual are always a possibility.
If you spot such an error, or an example showing incorrect
usage or output, please let us know where it is (e.g. a link
to a manual page or section).

As such, I'm marking this as "not a bug" and closing the ticket,
but discussion can continue by replying to this thread.

regards,
 - assaf


From debbugs-submit-bounces@debbugs.gnu.org Wed May 23 04:49:54 2018
Received: (at 31526) by debbugs.gnu.org; 23 May 2018 08:49:54 +0000
Received: from localhost ([127.0.0.1]:44779 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fLPSz-0006JE-Sb
	for submit@debbugs.gnu.org; Wed, 23 May 2018 04:49:54 -0400
Received: from mail-pl0-f52.google.com ([209.85.160.52]:43954)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <assafgordon@gmail.com>) id 1fLPSy-0006Ix-Gj
 for 31526@debbugs.gnu.org; Wed, 23 May 2018 04:49:53 -0400
Received: by mail-pl0-f52.google.com with SMTP id c41-v6so12598273plj.10
 for <31526@debbugs.gnu.org>; Wed, 23 May 2018 01:49:52 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=subject:to:references:cc:from:message-id:date:user-agent
 :mime-version:in-reply-to:content-language:content-transfer-encoding;
 bh=SLAFkE9tQg7hB8ev5ywFqisO1oDQP2IAEfZqjEstXHY=;
 b=HL33A+2V+dY3MepKuPszDkrncnCSvTtAd0uHoG3M4P/G+2XrdovpoLAa/sJFnTFSZG
 n+JFTrIvSvKY6RRlcU8tPIiD7f9IP1TloL7Kb+11H0pDgmefO4+ij4is1nxFcGmR00UC
 tbOd82C/NP+lg4IU5ueRUNieg8lQjKBL+JneIAzXiYKwEqYSbVQYcp8OvLgW69p8SquK
 TOQl8qCu+28xfUU7sdEbY/vWTdqpF7wGeXuudlKHUX5WhZ/BgR7d3epyBIjH3EzD5HtJ
 DOCj0Sf/eNrnx8YWtSnmHO/91zSPXjacvHWgv3fN3x2yMWX/WLx83crdcqL/a8VAiJTN
 Enxw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:subject:to:references:cc:from:message-id:date
 :user-agent:mime-version:in-reply-to:content-language
 :content-transfer-encoding;
 bh=SLAFkE9tQg7hB8ev5ywFqisO1oDQP2IAEfZqjEstXHY=;
 b=VDgVFcTMY3FI+qhmnZ8VdAOGgE1+rXxTBd96RtLhPCh3TYlI6olN8Thsh8IYMXRNm0
 4aadL13O8vFssA2OTqEOwxFsJs8RbBg2KKIukZT/oGxk2RgHmJ3it751Wui/uvk1wpn3
 A9TRHyekOSnx8P0cUXuodIOy2h7inO2SFfE99p9Roepr1j7mXP5kLmSKx7tlAl8aVe4w
 jKPatVZJ/Zo8B4PJEfHts1REDtnm3VwxYeHH1oTe8UH05auzN+vMnfvIwYAEEguj4wY1
 XFtE5495WQvRwQGAlwC/T680CY+atTHomZD///lMpyTCDBgYsBcV11iKI0so7pcN11n/
 82pA==
X-Gm-Message-State: ALKqPwf5tqy6QgStgKInZ5Xung1btdL6buQxkbdez4onrqi1nphlJFGS
 sj2u06YgEyV8fFLYL6GeQm1S2w0R
X-Google-Smtp-Source: AB8JxZr8f7hxrW6E8FRpE9WGNN6+NL4ozcDIunL4DzB+EkNL+emKGBthuj/a6hKBaI0aCgcDbQg4sg==
X-Received: by 2002:a17:902:6686:: with SMTP id
 e6-v6mr2052132plk.35.1527065385722; 
 Wed, 23 May 2018 01:49:45 -0700 (PDT)
Received: from [192.168.88.239] (moose.housegordon.com. [184.68.105.38])
 by smtp.googlemail.com with ESMTPSA id
 a11-v6sm26035321pgn.64.2018.05.23.01.49.44
 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
 Wed, 23 May 2018 01:49:44 -0700 (PDT)
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
To: Bize Ma <binaryzebra@gmail.com>
References: <CAFra36hoTNH0s6oOHtNQfk6M_67Zk24tVZ5QTN67NfVwhpdD6A@mail.gmail.com>
 <20180520021300.myh2m4njtaq4nz3u@tomato>
 <CAFra36hQ_aj518zxwXgJQ4b9FDGiFWobS4ZoteLn6FZW0KfeTg@mail.gmail.com>
From: Assaf Gordon <assafgordon@gmail.com>
Message-ID: <fc1793e7-0344-8c6a-3690-80398b8744d7@gmail.com>
Date: Wed, 23 May 2018 02:49:43 -0600
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.7.0
MIME-Version: 1.0
In-Reply-To: <CAFra36hQ_aj518zxwXgJQ4b9FDGiFWobS4ZoteLn6FZW0KfeTg@mail.gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 31526
Cc: 31526@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

(adding debbugs mailing list, please use "reply all" to
ensure the thread is public and archived).


Hello,

On 22/05/18 07:48 PM, Bize Ma wrote:
>  > 2018-05-19 22:13 GMT-04:00 Assaf Gordon <assafgordon@gmail.com 
> 
> Hi!, thanks for your answer, time and detailed references.
> 
> In range definitions I believe that there are two goals in conflict:
> 
>      - A stable, simple range description for programmers.
>      - A clear description (even if long) for multilanguage users.

Why are they in conflict? Users of sed (programmers or not, using a
multibyte locale or not) should understand that regex ranges are tricky
in multibyte locales.

> For a programmer:
>      The old wisdom is that [a-d] should match only `abcd` (in C locale).
>      The usual recommendation is: "do not use other locales".
>      That is making the use of any other locale almost invalid.
>      However, [a-z] may also match many accented (Latin) characters.
> 
> For a multi language user:
>      But if other locales are used, as is a must to allow for most languages
>      used in this world, the range has never been clearly defined, much less
>      the order in which a range will match. There are some clues about
>      "collation order" in GNU sed, but it remains unclear as to which
>      collation sort order applies to that.
[...]
> Then, the real question is: What order does sed follow?

Exactly because regex ranges in multibyte locales are not well-defined,
the recommendation is not to use them in portable sed scripts.
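
As a minimal sketch of what that looks like in practice (assuming a
POSIX shell and sed; the two function names are made up for this
example), either list the characters explicitly or force the C locale
for the one command that uses a range:

```shell
# Hypothetical helpers for illustration, not part of sed.

# Option 1: list the characters explicitly. There is no range, so the
# locale's collation order does not enter the picture.
mask_abcd_explicit() {
    sed 's/[abcd]/*/g'
}

# Option 2: keep the range, but force the C locale for this command,
# where [a-d] is documented to be equivalent to [abcd].
mask_abcd_c_locale() {
    LC_ALL=C sed 's/[a-d]/*/g'
}

printf 'aAbBcCdDeE\n' | mask_abcd_explicit
printf 'aAbBcCdDeE\n' | mask_abcd_c_locale
```

Both calls print "*A*B*C*DeE". Keep in mind that under LC_ALL=C sed
treats the input as single bytes, so the second form is best kept for
cases where the characters being matched are ASCII.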


> **********************************************************************
> 1.- About ASCII character numeric ranges:
> 
> Yes, I agree that it may be conceptually unnecessary to give a collation
> order to "punctuation marks".
> However, that it may be "conceptually unnecessary" does not mean that
> such order is "invalid". A practical implementation may define some
> such order.
> Please understand that the goal of the code above is to show the practical
> result of using some (locale defined) collation order equivalent to what
> is given by the c function strcoll().

Exactly: strcoll() is implemented in glibc (with a possible
replacement in gnulib). Defining the collation order is outside the
scope of 'sed', and the order could differ from one operating system
to another.

> **********************************************************************
> 2.- About using collating order.
> 
>  > > It is expected that a range [a-z] will match 'aAbBcCdD…', all lower and
>  > > upper letters.
>  > > But it isn't:
>  >
>  > It should not be "expected". I don't think it is documented to be
>  > so anywhere in GNU programs.
> 
> Well, yes, 'info sed', in section `5 Regular Expressions: selecting text`
> sub-section `5.5 Character Classes and Bracket Expressions` include:
> 
>      Within a bracket expression, a "range expression" consists of two
>      characters separated by a hyphen.  It matches any single character
>      that sorts between the two characters, inclusive.  In the default
>      C locale, the sorting sequence is the native character order; for
>      example, '[a-d]' is equivalent to '[abcd]'.
> 
>  From 'info sed' (not man sed) sub-section `5.9 Locale Considerations`:
> 
>      In other locales, the sorting sequence is not specified, and '[a-d]'
>      might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
>      to match any character, or the set of characters that it matches
>      might even be erratic.
> 
> So, the `[a-d]` expression match characters that sort between `a` and `d`.
> That is defined above for the C locale. In other locales the sorting is
> "undefined".
> 
> 
>  > … Both sed's and grep's manuals contain
>  > the following text:
>  >
>  >     In other locales, the sorting sequence is not specified, and ‘[a-d]’
>  >     might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to
>  >     match any character, or the set of characters that it matches might
>  >     even be erratic.
> 
> Yes, It is the exact same text that I also quoted above. But all it
> clearly defines is that the order is based on the definition of each
> locale "in some unspecified way". When the locale change, the order
> may also change.
> 
>  > https://www.gnu.org/software/sed/manual/sed.html#Multibyte-regexp-character-classes


I'm not sure whether you are agreeing with me or not.
It seems (to me) that the text is clear:
In the "C/POSIX" locale, the regex range [a-d] matches a,b,c,d.
In other locales, it is not well defined (and can match many variations,
depending on your operating system/libc).


> Yes, At the same page, but at Reporting-Bugs, under the heading
>       [a-z] is case insensitive
> 
> https://www.gnu.org/software/sed/manual/sed.html#Reporting-Bugs
> 
> We can read:
> 
>      [a-z] is case insensitive
>      You are encountering problems with locales. POSIX mandates that [a-z]
>      uses the current locale’s collation order – in C parlance, that means
>      using strcoll(3) instead of strcmp(3). Some locales have a case-
>      insensitive collation order, others don’t.
> 
> It seems to say: "current locale's collation order" !!

Yes, there is a locale collation order.

It is defined in libc (e.g. glibc, but there are other libc's out 
there), not in sed, and it is not well documented.

It can also change from one locale to the next (see example below).
GNU sed has no way to change/determine it, or document what it is.

>  > Furthermore, in POSIX 2008 standard range expressions are
>  > undefined for locales other than "C/POSIX"
> 
> Yes, however: Does undefined also mean invalid, forbidden, banned or 
> illegal?

I should have used a more accurate term: "Unspecified" instead of 
"undefined" (and thank you for quoting Eric Blake's message about it).

Both terms are explained here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap01.html#tag_01_05_07


In this context, saying "unspecified" means that the results are not
specified by the standard. It could work reliably, it could work and
return unexpected results, or it might not work at all.
It does not mean it is forbidden, but it does mean that an implementation
can choose to reject such ranges completely and it would not be
considered a violation of the POSIX standard.

> At the moment, it is not illegal to use a bracket range in some other
> locale. Such use does not raise any error (or even warning). As it is
> not illegal, the only aspect that remains to be clearly defined is what
> is the range order that we should expect in every other locale than C.

This is exactly the point of saying "unspecified" - there is (currently)
no definition which GNU sed developers can guarantee will always work in
the specified manner.

> Also, We rely everyday on "not specified" behavior (for some spec):
> 
> The -E option is not (yet) defined in current POSIX (The Open Group
> Base Specifications Issue 7, 2018 edition) for sed.
> Yes, It is believed that it will be accepted for the next POSIX version.

Technically speaking, the "-E" option is not "unspecified".
It is an extension beyond the current POSIX standard, and GNU programs
have many such extensions.

But there are two strong cases for "-E":
First, there is an extremely high likelihood it will be accepted into
the next version of the standard.
Second, several other sed implementations (non-gnu) support "-E" with
the same semantics.
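
As a small illustration of that shared behaviour (a sketch only; it
assumes a sed that accepts -E, such as GNU sed or the BSD seds, and the
sample input is made up):

```shell
# With -E the pattern is an extended regular expression, so ( ) | +
# act as operators without backslashes.
printf 'aab-c\n' | sed -E 's/(a|b)+/X/'

# The same substitution in the default (basic) syntax needs the
# backslashed forms. Note that \| and \+ are GNU extensions, so this
# second line is not portable to other sed implementations.
printf 'aab-c\n' | sed 's/\(a\|b\)\+/X/'
```

On GNU sed both commands print "X-c".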


> Some elements are undefined in POSIX just to allow implementations to be 
> diverse:
> 
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html
> 
>      The results of giving <tilde> with an unknown login name are undefined
>      because the KornShell "˜+" and "˜-" constructs make use of this
>      condition …
> 
> Read carefully: undefined because it is used !.
> That is, it is undefined in the spec to allow implementations to resolve in
> practical ways that might be different than the specification (or other
> implementations).

While this does not relate directly to sed,
"undefined" here means that according to the POSIX standard,
the described input is *invalid*, and implementations can decide
how they want to handle it.

You are correct in saying that often POSIX says something is
"unspecified" or "undefined" because existing systems have had
their own behavior long before POSIX even existed, and POSIX does
not want to contradict or forbid existing behavior.

> In the same "comment by Eric Blake" we can read this:
> 
>      The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_
>      "undefined".

What "unspecified" means is: the POSIX standard deems the input *valid*,
but does not force implementations to return specific results.
(Had the input been *invalid*, it would be "undefined" instead of
"unspecified").

[BTW, I welcome corrections and clarifications if the above is
inaccurate].


> Exactly the same I was meaning:  "unspecified", but _not_ "invalid".
> 
> And, exactly, what I am asking for: "glibc should document and define 
> this behavior"

I fully support this: it would be beneficial for GLIBC developers to
document exactly how collation order works in various multibyte
locales.

However, GNU Sed developers have no way to do so.

This issue should be sent to GLIBC developers
(on their mailing list or bug-tracker website).

>  >
>  > > However, the range [a-Z] does match all letters, lower or upper:
>  > >
>  > >     $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
>  > >     ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
>  >
>  > I would recommend avoiding mixing upper-lower case in regex
>  > ranges, as the result might be unexpected. Compare the following:
> 
> In the "comment by Eric Blake" we can also read:
> 
>      That is, [A-z] is well-defined in the POSIX locale, and in all other
>      locales where A collates before z (which includes en_US.UTF-8)
> 
> Again: "[A-z] is well-defined … "

Yes, in "C" locale, the range "[A-z]" means ASCII value 65 ("A")
to ASCII value 122 ("z"). It means the range also includes
backslash (ASCII 92) and underscore (ASCII 95).
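
To see this on your own system, here is a rough sketch (assuming a
POSIX shell, awk and sed; the helper name ascii_matched_by is made up
for this example) that prints which printable ASCII characters a range
matches under a given locale:

```shell
# Hypothetical helper: print the printable ASCII characters (32..126)
# that the bracket range "$1" matches under the locale "$2".
ascii_matched_by() {
    range=$1
    loc=$2
    # awk emits the characters 32..126 on one line; sed then deletes
    # everything that the range does NOT match.
    awk 'BEGIN { for (i = 32; i < 127; i++) printf "%c", i; print "" }' |
        LC_ALL=$loc sed "s/[^$range]//g"
}

ascii_matched_by 'A-z' C
ascii_matched_by 'a-z' C
```

In the C locale the first call prints the upper-case letters, then the
six characters [ \ ] ^ _ ` and then the lower-case letters; the second
prints only the lower-case letters. Passing another locale name as the
second argument shows whatever your libc does there, which is exactly
the part that is not specified.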

But how should the range "[a-Z]" be treated?
This is the range ASCII 97 to ASCII 90. Is an implementation expected
to swap the min/max values and treat it as the ASCII range 90-97?
Or somehow understand that these are letters, and change it to ASCII 65 to 122?


Here's a simpler and more obvious case: The range [3-8] is intuitively
clear, but the reverse is not valid:

   $ echo 7 | LC_ALL=C grep '[3-8]'
   7

   $ echo 7 | LC_ALL=C grep '[8-3]'
   grep: Invalid range end
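
If a script needs to check this rather than fail half-way, a sketch
along these lines could be used (assuming a POSIX shell and grep; the
function name range_is_valid is made up for this example). It relies
only on grep's exit status: 0 or 1 for a valid pattern, greater than 1
on error:

```shell
# Hypothetical helper: succeed if grep accepts the bracket range "$1"
# in the C locale, fail if grep reports an error for it.
range_is_valid() {
    status=0
    # Feed a single empty line; whether it matches is irrelevant,
    # only the exit status is inspected.
    printf '\n' | LC_ALL=C grep "[$1]" >/dev/null 2>&1 || status=$?
    [ "$status" -le 1 ]
}

if range_is_valid '3-8'; then echo '[3-8] accepted'; fi
if ! range_is_valid '8-3'; then echo '[8-3] rejected'; fi
```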


> Frankly, if I were to follow both main recommendations:
> 
>      - Any other locale than C is unspecified: do not use them.
>      - Any range that does not match the previously known ranges:
>        "recommend avoiding mixing upper-lower case in regex ranges"
> 
> The usefulness of a bracket range is reduced to almost nothing.
> Only C and only either [a-z] or [A-Z].

"Almost nothing" is a strong statement... I would say the following:

1. In "C" locale, where each character is a single byte (and assuming
an ASCII environment) - ranges are very well defined and easy to use,
not just [a-z] [A-Z], but any ASCII value (including octal values, etc.).

2. In multibyte locales, ranges of specific letters (e.g. "[A-D]")
are not well specified and should be avoided in portable scripts.
However, the character classes are very usable in multibyte locale,
and can be used to match all letters or all digits, etc.
Example:

   $ echo "Γειά σου 123" | LC_ALL=en_CA.UTF-8 sed 's/[[:alpha:]]/*/g'
   **** *** 123

3. If you always use the same environment (e.g. always GLIBC, always GNU 
SED, always the same locale) - then it is very likely (but still not 
guaranteed) that the collation order you observe in regex ranges will
remain the same in the future.
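
Regarding point 2, the same approach works for the other POSIX
character classes. A short sketch (the sample input is made up, and the
C locale is used here only to keep the result predictable):

```shell
# Replace upper-case letters with U and digits with # using character
# classes instead of ranges such as [A-Z] or [0-9].
printf 'abc DEF 123\n' |
    LC_ALL=C sed 's/[[:upper:]]/U/g; s/[[:digit:]]/#/g'
```

This prints "abc UUU ###". In a multibyte locale the classes follow
that locale's character classification instead, as in the [[:alpha:]]
example in point 2.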


> Is it not possible to declare and document what the collation
> order is/should be for other locales?

Again, this is a glibc issue (or any other library that implements
collation order) - outside the scope of SED.


> **********************************************************************
> 3.- Correct exactly how.
> 
>  > > If this is the correct way in which sed should work, then, if you please:
>  >
>  > Yes, it is.
> 
> Thanks, but: What does it mean exactly?   My opinion in the right.
> 
>    - That [a-z] will always mean 'abcdefghijklmnopqrstuvwxyz' in the C locale?. (Yes)

Correct.

>    - That the order in C locale follows the ASCII numeric order?. (Yes)

Correct.

>    - That no other locale should be used? (No?)

Non-C locales can be used if one understands the limitations
as shown above.
Specifically, portable SED scripts should not use regex
ranges in non-C locales.
If you are absolutely certain you will always run your SED scripts
under GLIBC, it is very very likely the collation order you observe
now will remain for a long time.

>    - That the order in any other locale is secret? (Yes)

Not "secret" as in someone actively trying to hide it,
but unknown/undocumented because the developers of GLIBC have not
documented it.

>    - That ranges like [A-z] (valid in C) can not be used in other locales? (No?)

Should not be used in portable SED scripts.

>    - That other ranges like [*-d] (valid in C) are a crazy idea? (No?)

Instead of "crazy" let's call it "unspecified" - meaning that each 
program can return different results, and there is no single "correct"
result according to the POSIX standard.

In practice, if you always use GLIBC systems, you will very very likely
see the same results every time.

>    - References to collation order in the manuals must be stricken out? (No?)

I'm not sure I understand this...


> And we have not even started with more characters as they are possible 
> in UNICODE.
[...]
> Yes, there are discussions about what was relevant at the time.
> But none explain in clear simple words what order the characters
> in a bracket range will follow in a locale that is NOT C. (see
> some simple examples above).

Correct - that is not documented anywhere at the moment.

>  > >     - Why does the manual document otherwise?.
>  >
>  > Errors in the manual are always a possibility.
>  > If you spot such an error, or an example showing incorrect
>  > usage/output - please let us know where it is (e.g. a link
>  > to a manual page  / section).
> 
> I have provided a couple of points where "collating order" is used.
> But I suspect that those are not mistakes from your point of view and
> that what is missing is a more detailed description of which collating
> order is being used.

That is a good way to describe the issue.

The term "collation order" is defined in POSIX, e.g. here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag_07_03_02

But the actual order (which character comes before/after another) is
left to implementations to decide.

GLIBC is one such implementation, and GLIBC developers have decided
on such an order. Sadly, they have not documented it well.

Here's an example of glibc's strange behavior (or at least
strange to me, as I found no explanation for it):

In most multibyte UTF-8 locales the punctuation order
differs from ASCII order, but is consistently the same
(e.g. en_CA.UTF-8 and fr_FR.UTF-8).
For some reason, ja_JP.UTF-8 order is more like ASCII.

Compare the following:

   $ printf "%s\n" a A b B "á" "あ" "ひ" . , : - = > in
   $ LC_ALL=C           sort in > out-C
   $ LC_ALL=en_CA.UTF-8 sort in > out-CA
   $ LC_ALL=ja_JP.UTF-8 sort in > out-JA
   $ paste out-C out-CA out-JA
   ,	=	,
   -	-	-
   .	,	.
   :	:	:
   =	.	=
   A	あ	A
   B	ひ	B
   a	A	a
   b	a	b
   á	á	あ
   あ	B	ひ
   ひ	b	á


And that is an example of why we simply cannot tell you
what the "correct" order is that you'll get, even if it
seems that in all of your testing you see the same order.
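
To repeat this kind of comparison on another machine, here is a rough
sketch (assuming a POSIX shell and the `locale -a` command; the function
name compare_sort_order and the shortened sample are made up for this
example). It skips locales that are not installed instead of silently
falling back to another order:

```shell
# Hypothetical helper: sort the same small sample under each locale
# given as an argument and print the result on one line per locale.
compare_sort_order() {
    for loc in "$@"; do
        # The C locale always exists; anything else must be listed by
        # `locale -a`, spelled the same way as on the command line.
        if [ "$loc" != C ] && ! locale -a 2>/dev/null | grep -qxF "$loc"; then
            printf '%s: not installed, skipped\n' "$loc"
            continue
        fi
        printf '%s: ' "$loc"
        printf '%s\n' a A b B . , : - = | LC_ALL=$loc sort | paste -sd ' ' -
    done
}

compare_sort_order C en_CA.utf8 ja_JP.utf8
```

For the C locale this prints "C: , - . : = A B a b", matching the first
column above. What the other lines show depends on the libc and the
locale data installed, which is the whole point.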

Another example:

   $ echo "あáb" | LC_ALL=ja_JP.utf8 sed 's/[a-z]/*/g'
   あá*
   $ echo "あáb" | LC_ALL=en_CA.utf8 sed 's/[a-z]/*/g'
   あ**

(This is at least the case with GLIBC 2.24-11+deb9u3 on Debian 9).

>  > As such, I'm marking this as "not a bug" and closing the ticket,
>  > but discussion can continue by replying to this thread.
> 
> I still remain in doubt, at the very minimum.

I hope this helps clear things up, but I'm happy to continue
this discussion if there are other questions.

regards,
  - assaf


From debbugs-submit-bounces@debbugs.gnu.org Wed May 23 19:14:05 2018
Received: (at 31526-done) by debbugs.gnu.org; 23 May 2018 23:14:05 +0000
Received: from localhost ([127.0.0.1]:45764 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fLcxI-0007RW-DA
	for submit@debbugs.gnu.org; Wed, 23 May 2018 19:14:05 -0400
Received: from mail-ot0-f171.google.com ([74.125.82.171]:41214)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <binaryzebra@gmail.com>) id 1fLcxF-0007Qv-MJ
 for 31526-done@debbugs.gnu.org; Wed, 23 May 2018 19:14:03 -0400
Received: by mail-ot0-f171.google.com with SMTP id t1-v6so27223580oth.8
 for <31526-done@debbugs.gnu.org>; Wed, 23 May 2018 16:14:01 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:in-reply-to:references:from:date:message-id:subject:to;
 bh=sU0Cd4VoimyjSjrpI533Q5F44nC/cGWaRTwk0L/CZ0Y=;
 b=AKVNULS3lUlXF+kVxmZW6FyLxMkPqQIakCq9ehF1PmmzBi3Aj397N8cJq6dF2MIljL
 LaWx0pJ9W/R61hvnTlXIAOv8jSSCaePSucN1tW7BdWvaH9ImB1ii8kfOW5O2R7gGrZXq
 hnJcULI94U9xNC7W/saTbNMx9zEr15+vFJcGNusX3Mr1SOYm/TUaT4KeamAAiLNJB4/N
 80kBcZ1UHaTIlIWULVVb33R1KrOoXTHw1SwUgJ8sIFC+NXjunowjxoqgm7nuurH++fAN
 r2bCo/mf5oqRcgkguOYRZC5ZvpC6jgaSJsetKQPJob2MrlgpnYQ7yIh8SDtlySSjQ5m1
 HfqA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to;
 bh=sU0Cd4VoimyjSjrpI533Q5F44nC/cGWaRTwk0L/CZ0Y=;
 b=LcKV0zxamjT22bHmQsxVPggQ2QIolmYhW5GnmxlI/Ag6XC28q1rwUX57ldccM0JvFh
 F3Dsv1MTmp7oxiza3wzTPkSrqbvFawmFxXC8tlOMO90GpW4AuephtaGtDquI0FHDwR+I
 mKDGUnf9/q1zjGLJLTyM030mOs/mJtSh7tZX5tdwqcGCVPnh4aYp1+FHRBx5K1Dw3RM1
 NYXbNG+yHIu8myJlgItvhpkVzid/+2WHoF8YFYZuK92o4cR4uPpg1+E7qPXmUkAaYwQl
 YgPKihCTjJ6cDTFRX01fRHJZzpOHPoYl0NQ8RCyYPDJhdYuTXiKRYaIAz/4/4T/a+cuv
 gUuQ==
X-Gm-Message-State: ALKqPwdemCOVgY6lwi4t0n7fPtPF7+NOyoLigpIpVQbKdP/w6sZSAInC
 wlUlOsesqr3+eHHcDiLXDA80H+IwVP6+kscjV/HBOPOt
X-Google-Smtp-Source: AB8JxZob9AWssUGrXFJRjEa6afhM1YIx6SmBU7n9u304H3BrQmqQJPwM9JMqC4rsEBm+/j7IBstmxdmL5GVUdHi0mfo=
X-Received: by 2002:a9d:4a9c:: with SMTP id
 i28-v6mr2903571otf.19.1527117235592; 
 Wed, 23 May 2018 16:13:55 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a9d:2f04:0:0:0:0:0 with HTTP; Wed, 23 May 2018 16:13:55
 -0700 (PDT)
In-Reply-To: <CAFra36hQ_aj518zxwXgJQ4b9FDGiFWobS4ZoteLn6FZW0KfeTg@mail.gmail.com>
References: <CAFra36hoTNH0s6oOHtNQfk6M_67Zk24tVZ5QTN67NfVwhpdD6A@mail.gmail.com>
 <20180520021300.myh2m4njtaq4nz3u@tomato>
 <CAFra36hQ_aj518zxwXgJQ4b9FDGiFWobS4ZoteLn6FZW0KfeTg@mail.gmail.com>
From: Bize Ma <binaryzebra@gmail.com>
Date: Wed, 23 May 2018 19:13:55 -0400
Message-ID: <CAFra36gTC23D5PPzQ-_eFYxRt_emRUVOg5gJ9c6uRfh1hsfFBw@mail.gmail.com>
Subject: Fwd: bug#31526: Range [a-z] does not follow collate order from locale.
To: 31526-done@debbugs.gnu.org
Content-Type: multipart/alternative; boundary="0000000000005ba3e4056ce7b0a7"
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 31526-done
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--0000000000005ba3e4056ce7b0a7
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Following your request:

>  From: Assaf Gordon
>  (adding debbugs mailing list, please use "reply all" to
>  ensure the thread is public and archived).

 I am sending the message to which you have just answered
to the debbugs mailing list. Sorry for my mistake.


---------- Forwarded message ----------
From: Bize Ma <binaryzebra@gmail.com>
Date: 2018-05-22 21:48 GMT-04:00
Subject: Re: bug#31526: Range [a-z] does not follow collate order from
locale.
To: Assaf Gordon <assafgordon@gmail.com>


> 2018-05-19 22:13 GMT-04:00 Assaf Gordon <assafgordon@gmail.com>:
> Hello,

Hi!, thanks for your answer, time and detailed references.

In range definitions I believe that there are two goals in conflict:

    - A stable, simple, range description for programmers.
    - A clear description (even if long) for multilanguage users.

For a programmer:
    The old wisdom is that [a-d] should match only `abcd` (in C locale).
    The usual recommendation is: "do not use other locales".
    That is making the use of any other locale almost invalid.
    However, [a-z] may also match many accented (Latin) characters.

For a multi language user:
    But if other locales are used, as is a must to allow for most languages
used
    on this world, the range has never been clearly defined, much less the
order
    in which a range will match. There are some clues about "collation
order" in
    GNU sed, but it remains unclear as which collation sort order apply to
that.

    Using a range in other locale does not follow ASCII numeric order:

        printf '%b' "$(printf '\\U%x\\n' {32..255})" |
            LC_ALL=3DC sort |
        tr -d '\n' |
            sed 's/[^a-=C3=A4]//g'; echo

        abcd=C2=AA=C3=A0=C3=A1=C3=A2=C3=A3=C3=A4=C3=A5=C3=A6=C3=A7

    The result above should have ended in a `d`, but `d` falls in the
middle.
    Nor it follows the locale collate order in effect (it should end in =C3=
=A4):

        printf '%b' "$(printf '\\U%x\\n' {32..255})" |
            LC_ALL=3Den_CA.utf8 sort |
        tr -d '\n' |
                    sed 's/[^a-=C3=A4]//g'; echo

        a=C3=A1=C3=A0=C3=A2=C3=A4=C3=A3=C2=AA

Then, the real question is: What order does sed follow?


> On Fri, May 18, 2018 at 05:58:05PM -0400, Bize Ma wrote:
> >
> >     $ printf '%b' $(printf '\\U%x\\n' {32..127}) | sort | tr -d '\n'
> >     `^~<=3D>| _-,;:!?/.'"()[]{}@$*\&#%+0123456789aAbBcCdDeEfFgGhHiIjJ
> > kKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ

> While in practice this is correct on all GNU/linux systems which
> use glibc, there is no officially documented collation order for
> punctuation marks - it might differ on other systems. Please see here:
> https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D23677#14

**********************************************************************
1.- About ASCII character numeric ranges:

Yes, I agree that it may be conceptually unnecessary to give a collation
order to "punctuation marks".
However, that it may be "conceptually unnecessary" does not mean that
such order is "invalid". A practical implementation may define some
such order.
Please understand that the goal of the code above is to show the practical
result of using some (locale defined) collation order equivalent to what
is given by the c function strcoll().

The range may be more limited to only letters and numbers:
{48..57} {65..90} {97..122} (in hex: 0x30-0x39 0x41-0x5a 0x61-0x7a).

Let us define and use a function that should work on bash 4.2+:

collorder(){
    a=3D$1; shift 1;
    until (($#<2)); do
        printf '%b' $(printf '\\U%x\\n' $(seq "$1" "$2"))
        shift 2
    done | sort | tr -d '\n' | sed 's/'"$a"'//g'
    echo
    }

That function will allow us to do:

    $ LC_ALL=3Den_CA.utf8   collorder   ' '    48 57   65 90   97 122
    0123456789AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz

And (In C locale the sort is identical to ASCII numeric sort):

    $ LC_ALL=3DC            collorder   ' '    48 57   65 90   97 122
    0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

And filtering by a bracket range:

    $ LC_ALL=3DC           collorder  '[^a-z]' 48 57   65 90   97 122
    abcdefghijklmnopqrstuvwxyz

But those ranges avoid the character that you use later (`[`).
Including the characters between Upper-Case and lowercase ASCII:

    $ LC_ALL=3DC   collorder   '[^Y-d]'  48 57   65 122
    YZ[\]^_`abcd

That was the reason to include all 95 (126-32+1) ASCII that are not control=
.
One simple range. Including such characters allow (perfectly valid) mixed
bracket ranges:

    $ LC_ALL=3DC   collorder   '[^+-d]'  32 126
    +,-./0123456789:;<=3D>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcd

Not because I was interested to deviate the discussion to "punctuation
marks". Just because it was one simple character numeric range.
That is all, the bash function defined here: collorder, is a tool to reveal
the (practical) collation order valid for the applied locale.


**********************************************************************
2.- About using collating order.

> > It is expected that a range [a-z] will match 'aAbBcCdD=E2=80=A6', all l=
ower and
> > upper letters.
> > But it isn't:
>
> It should not be "expected". I don't think it is documented to be
> so anywhere in GNU programs.

Well, yes, 'info sed', in section `5 Regular Expressions: selecting text`
sub-section `5.5 Character Classes and Bracket Expressions` include:

    Within a bracket expression, a "range expression" consists of two
    characters separated by a hyphen.  It matches any single character
    that sorts between the two characters, inclusive.  In the default
    C locale, the sorting sequence is the native character order; for
    example, '[a-d]' is equivalent to '[abcd]'.

>From 'info sed' (not man sed) sub-section `5.9 Locale Considerations`:

    In other locales, the sorting sequence is not specified, and '[a-d]'
    might be equivalent to '[abcd]' or to '[aBbCcDd]', or it might fail
    to match any character, or the set of characters that it matches
    might even be erratic.

So, the `[a-d]` expression match characters that sort between `a` and `d`.
That is defined above for the C locale. In other locales the sorting is
"undefined".


> =E2=80=A6 Both sed's and grep's manuals contain
> the following text:
>
>     In other locales, the sorting sequence is not specified, and =E2=80=
=98[a-d]=E2=80=99
>     might be equivalent to =E2=80=98[abcd]=E2=80=99 or to =E2=80=98[aBbCc=
Dd]=E2=80=99, or it might fail to
>     match any character, or the set of characters that it matches might
>     even be erratic.

Yes, It is the exact same text that I also quoted above. But all it
clearly defines is that the order is based on the definition of each
locale "in some unspecified way". When the locale change, the order
may also change.

> https://www.gnu.org/software/sed/manual/sed.html#Multibyte-
regexp-character-classes

Yes, At the same page, but at Reporting-Bugs, under the heading
     [a-z] is case insensitive

  https://www.gnu.org/software/sed/manual/sed.html#Reporting-Bugs

We can read:

    [a-z] is case insensitive
    You are encountering problems with locales. POSIX mandates that [a-z]
    uses the current locale=E2=80=99s collation order =E2=80=93 in C parlan=
ce, that means
    using strcoll(3) instead of strcmp(3). Some locales have a case-
    insensitive collation order, others don=E2=80=99t.

It seems to say: "current locale's collation order" !!


> https://www.gnu.org/software/grep/manual/html_node/
Character-Classes-and-Bracket-Expressions.html
>
> Furthermore, in POSIX 2008 standard range expressions are
> undefined for locales other than "C/POSIX", see this comment by Eric Blak=
e
> (also the entire bug report might be of interest to this topic):
> https://bugzilla.redhat.com/show_bug.cgi?id=3D583011#c24

Yes, however: Does undefined also mean invalid, forbidden, banned or
illegal?

At the moment, it is not illegal to use a bracket range in some other
locale.
Such use does not raise any error (or even warning). As it is not illegal,
the
only aspect that remains to be clearly defined is what is the range order
that
we should expect in every other locale than C.

Also, We rely everyday on "not specified" behavior (for some spec):

The -E option is not (yet) defined in current POSIX (The Open Group
Base Specifications Issue 7, 2018 edition) for sed.
Yes, It is believed that it will be accepted for the next POSIX version.

    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html

But it is defined (and used) in GNU sed.

Some elements are undefined in POSIX just to allow implementations to be
diverse:

    http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_chap02.html

    The results of giving <tilde> with an unknown login name are undefined
    because the KornShell "=CB=9C+" and "=CB=9C-" constructs make use of th=
is
condition =E2=80=A6

Read carefully: undefined because it is used !.
That is, it is undefined in the spec to allow implementations to resolve in
practical ways that might be different than the specification (or other
implementations).


In the same "comment by Eric Blake" we can read this:

    The behavior of [A-z] in en_US.UTF-8 is "unspecified", but _not_
"undefined".
    A compliant app cannot guarantee what the behavior will be, but the
behavior
    should at least be explainable, and as a QoI point, glibc should
document
    and define this behavior as an extension to POSIX, so that apps relying
on
    glibc can take advantage of this extension for known behavior.

Exactly the same I was meaning:  "unspecified", but _not_ "invalid".

And, exactly, what I am asking for: "glibc should document and define this
behavior"

>
> > However, the range [a-Z] does match all letters, lower or upper:
> >
> >     $ printf '%b' $(printf '\\U%x' {32..127}) | sed 's/[^a-Z]//g'
> >     ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
>
> I would recommend avoiding mixing upper-lower case in regex
> ranges, as the result might be unexpected. Compare the following:

In the "comment by Eric Blake" we can also read:

    That is, [A-z] is well-defined in the POSIX locale, and in all other
    locales where A collates before z (which includes en_US.UTF-8)

Again: "[A-z] is well-defined =E2=80=A6 "

Frankly, if I were to follow both main recommendations:

    - Any other locale than C is unspecified: do not use them.
    - Any range that does not match the previously known ranges:
      "recommend avoiding mixing upper-lower case in regex ranges"

The usefulness of a bracket range is reduced to almost nothing.
Only C and only either [a-z] or [A-Z].

Is it not possible to declare and document what the collation
order is/should be for other locales?

**********************************************************************
3.- Correct exactly how.

> > If this is the correct way in which sed should work, then, if you
please:
>
> Yes, it is.

Thanks, but: What does it mean exactly?   My opinion in the right.

  - That [a-z] will always mean 'abcdefghijklmnopqrstuvwxyz' in the C
locale?. (Yes)
  - That the order in C locale follows the ASCII numeric order?.
   (Yes)
  - That no other locale should be used?
   (No?)
  - That the order in any other locale is secret?
    (Yes)
  - That ranges like [A-z] (valid in C) can not be used in other locales?
    (No?)
  - That other ranges like [*-d] (valid in C) are a crazy idea?
    (No?)
  - References to collation order in the manuals must be stricken out?
   (No?)

And we have not even started with more characters as they are possible in
UNICODE.

   - Is this valid:
   $ LC_ALL=3Den_CA.utf8 ./collorder '[^a-z]' 32 255
   abcdefghijklmnopqrstuvwxyz=C2=AA=C2=BA=C3=9F=C3=A0=C3=A1=C3=A2=C3=A3=C3=
=A4=C3=A5=C3=A6=C3=A7=C3=A8=C3=A9=C3=AA=C3=AB=C3=AC=C3=AD=C3=AE=C3=AF=C3=B0=
=C3=B1=C3=B2=C3=B3=C3=B4=C3=B5=C3=B6=C3=B8=C3=B9=C3=BA=C3=BB=C3=BC=C3=BD=C3=
=BF

   Does it mean that [a-z] is closer to [[:lower:]] than ASCII a-z?

   - Is this expected? (phonetic symbols)
   $ LC_ALL=3Den_CA.utf8 ./collorder '[^a-z]' 0x250 0x2af
   =C9=93=C9=94=C9=96=C9=97=C9=99=C9=9B=C9=A0=C9=B5

   - Should this work? In what order? (phonetic symbols)
   $ LC_ALL=3Den_CA.utf8 ./collorder '[^=C9=96-=C9=9B]' 0x250 0x2af
   =C9=96=C9=97=C9=99=C9=9B

   - Why all Latin characters are being included? (Latin extended)
   $ LC_ALL=3Den_CA.utf8 ./collorder '[^a-z]' 0x1e00 0x1fff
   =E1=B8=81=E1=B8=83=E1=B8=85=E1=B8=87=E1=B8=89=E1=B8=8B=E1=B8=8D=E1=B8=8F=
=E1=B8=91=E1=B8=93=E1=B8=95=E1=B8=97=E1=B8=99=E1=B8=9B=E1=B8=9D=E1=B8=9F=E1=
=B8=A1=E1=B8=A3=E1=B8=A5=E1=B8=A7=E1=B8=A9=E1=B8=AB=E1=B8=AD=E1=B8=AF=E1=B8=
=B1=E1=B8=B3=E1=B8=B5=E1=B8=B7=E1=B8=B9=E1=B8=BB=E1=B8=BD=E1=B8=BF=E1=B9=81=
=E1=B9=83=E1=B9=85=E1=B9=87=E1=B9=89=E1=B9=8B=E1=B9=8D=E1=B9=8F=E1=B9=91=E1=
=B9=93=E1=B9=95=E1=B9=97=E1=B9=99=E1=B9=9B=E1=B9=9D=E1=B9=9F=E1=B9=A1=E1=B9=
=A3=E1=B9=A5=E1=B9=A7=E1=B9=A9=E1=B9=AB=E1=B9=AD=E1=B9=AF=E1=B9=B1=E1=B9=B3=
=E1=B9=B5=E1=B9=B7=E1=B9=B9=E1=B9=BB=E1=B9=BD=E1=B9=BF=E1=BA=81=E1=BA=83=E1=
=BA=85=E1=BA=87=E1=BA=89
   =E1=BA=8B=E1=BA=8D=E1=BA=8F=E1=BA=96=E1=BA=97=E1=BA=98=E1=BA=99=E1=BA=9A=
=E1=BA=9B=E1=BA=A1=E1=BA=A3=E1=BA=A5=E1=BA=A7=E1=BA=A9=E1=BA=AB=E1=BA=AD=E1=
=BA=AF=E1=BA=B1=E1=BA=B3=E1=BA=B5=E1=BA=B7=E1=BA=B9=E1=BA=BB=E1=BA=BD=E1=BA=
=BF=E1=BB=81=E1=BB=83=E1=BB=85=E1=BB=87=E1=BB=89=E1=BB=8B=E1=BB=8D=E1=BB=8F=
=E1=BB=91=E1=BB=93=E1=BB=95=E1=BB=97=E1=BB=99=E1=BB=9B=E1=BB=9D=E1=BB=9F=E1=
=BB=A1=E1=BB=A3=E1=BB=A5=E1=BB=A7=E1=BB=A9=E1=BB=AB=E1=BB=AD=E1=BB=AF=E1=BB=
=B1=E1=BB=B3=E1=BB=B5=E1=BB=B7=E1=BB=B9

> >     - What is the rationale leading to such decision?.
>
> The bug reports linked above contain long discussions about it.

Yes, there are discussions about what was relevant at the time.
But none of them explains in clear, simple words what order the
characters in a bracket range will follow in a locale that is NOT C
(see some simple examples above).

> Please also see the following thread, which promoted the restriction
> of "sane regex ranges" - meaning ASCII order alone (and applies to gawk,
> grep, sed and other programs using gnulib's regex engine):
>
> https://lists.gnu.org/archive/html/bug-gnulib/2011-06/msg00200.html

ASCII order alone? Only for characters in the numeric range 0x00-0x7f?

    - How come an =C3=A1 gets included in the very limited [a-b]?
    $ LC_ALL=3Den_CA.utf8 ./collorder '[^a-b]' 0x00 0xff
    ab=C2=AA=C3=A0=C3=A1=C3=A2=C3=A3=C3=A4=C3=A5=C3=A6

> >     - Where is it documented?.
>
> The links above to the sed and grep manuals.

None of the linked documents explain the above result for [^a-b].
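
One way to make the comparison concrete is to compute what the sort order
of the locale alone would predict for a range, and to set that next to
what sed actually matches. The following is only a sketch under stated
assumptions: collrange is a hypothetical helper of mine (it is not part
of sed, grep or gnulib), it assumes a POSIX shell with sort and awk, and
it says nothing about how the regex engine decides a range internally.

```shell
# collrange LO HI
# Hypothetical helper, for illustration only.
# Reads candidate characters from stdin, one per line, sorts them with
# `sort` in the locale currently in effect, and prints (on one line)
# every character that falls between LO and HI, inclusive.
collrange() {
    sort | awk -v lo="$1" -v hi="$2" '
        ($0 "") == lo { on = 1 }          # start printing at LO
        on            { printf "%s", $0 }
        ($0 "") == hi { on = 0 }          # stop after HI
        END           { print "" }'
}
```

With the characters e d c b a B A on stdin (one per line), running
collrange a d in the C locale prints abcd. Running the same input under
another locale and comparing with the sed result for the same range
would show whether the two agree for that locale.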

> >     - Where is it implemented in the code?.
>
> I think a good place to start is gnulib's DFA regex engine,
> here:
> https://opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c
> or here:
> http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.c

I have to admit that I am unable to understand those 4000 lines of
code without some detailed help on how they work.
I am really sorry.

> Search for the comment 'build range characters' for a starting point.
>
> Both gnu grep and sed use this code.
>
> >     - Why does the manual document otherwise?.
>
> Errors in the manual are always a possibility.
> If you spot such an error, or an example showing incorrect
> usage/output - please let us know where it is (e.g. a link
> to a manual page  / section).

I have pointed out a couple of places where "collating order" is used.
But I suspect that those are not mistakes from your point of view, and
that what is missing is a more detailed description of which collating
order is being used. I may be completely wrong, of course.

> As such, I'm marking this as "not a bug" and closing the ticket,
> but discussion can continue by replying to this thread.

At the very least, I still remain in doubt.

> regards,
>  - assaf

Many thanks and regards
- Bize

--0000000000005ba3e4056ce7b0a7
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Following your request:<div><br></div><div>&gt;=C2=A0 From=
:=C2=A0<span style=3D"color:rgb(34,34,34);font-family:arial,sans-serif;font=
-size:12.8px;font-style:normal;font-variant-ligatures:normal;font-variant-c=
aps:normal;font-weight:700;letter-spacing:normal;text-align:left;text-inden=
t:0px;text-transform:none;white-space:nowrap;word-spacing:0px;background-co=
lor:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:in=
itial;float:none;display:inline">Assaf Gordon</span></div><div><span style=
=3D"font-size:12.8px"><span style=3D"white-space:nowrap"><b>&gt;=C2=A0</b><=
/span>(adding debbugs mailing list, please use &quot;reply all&quot; to</sp=
an></div><div><div class=3D"gmail-" style=3D"color:rgb(34,34,34);font-famil=
y:arial,sans-serif;font-size:medium;font-style:normal;font-variant-ligature=
s:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;tex=
t-align:start;text-indent:0px;text-transform:none;white-space:normal;word-s=
pacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;=
text-decoration-color:initial"><div id=3D"gmail-:88" class=3D"gmail-ii gmai=
l-gt" style=3D"font-size:12.8px;direction:ltr;margin:5px 15px 0px 0px;paddi=
ng-bottom:5px"><div id=3D"gmail-:87" class=3D"gmail-a3s gmail-aXjCH gmail-m=
1638c309c6c14be5" style=3D"overflow:hidden">&gt;=C2=A0 ensure the thread is=
 public and archived).<br></div></div></div></div><div><br></div><div>=C2=
=A0I am sending the message to which you just have answered<div>to the debb=
ugs mailing list, Sorry for my mistake.</div><div><br></div><div><br></div>=
<div><br><div class=3D"gmail_quote">---------- Forwarded message ----------=
<br>From: <b class=3D"gmail_sendername">Bize Ma</b> <span dir=3D"ltr">&lt;<=
a href=3D"mailto:binaryzebra@gmail.com">binaryzebra@gmail.com</a>&gt;</span=
><br>Date: 2018-05-22 21:48 GMT-04:00<br>Subject: Re: bug#31526: Range [a-z=
] does not follow collate order from locale.<br>To: Assaf Gordon &lt;<a hre=
f=3D"mailto:assafgordon@gmail.com">assafgordon@gmail.com</a>&gt;<br><br><br=
><div dir=3D"ltr"><div class=3D"gmail_extra"><div class=3D"gmail_quote">&gt=
; 2018-05-19 22:13 GMT-04:00 Assaf Gordon <span dir=3D"ltr">&lt;<a href=3D"=
mailto:assafgordon@gmail.com" target=3D"_blank">assafgordon@gmail.com</a>&g=
t;</span>:</div><div class=3D"gmail_quote"><div class=3D"gmail_quote">&gt; =
Hello,</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote"=
>Hi!, thanks for your answer, time and detailed references.</div><div class=
=3D"gmail_quote"><br></div><div class=3D"gmail_quote">In range definitions =
I believe that there are two goals in conflict:</div><div class=3D"gmail_qu=
ote"><br></div><div class=3D"gmail_quote">=C2=A0 =C2=A0 - A stable, simple=
 range description for programmers.</div><div class=3D"gmail_quote">=C2=A0=
 =C2=A0 - A clear description (even if long) for multilingual users.</div><=
div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">For a progra=
mmer:</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 The old wisdom is that =
[a-d] should match only `abcd` (in C locale).</div><div class=3D"gmail_quot=
e">=C2=A0 =C2=A0 The usual recommendation is: &quot;do not use other locale=
s&quot;.</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 That is making the u=
se of any other locale almost invalid.</div><div class=3D"gmail_quote">=C2=
=A0 =C2=A0 However, [a-z] may also match many accented (Latin) characters.<=
/div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">For a =
multi language user:</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 But if o=
ther locales are used, as is a must to allow for most languages used</div><=
div class=3D"gmail_quote">=C2=A0 =C2=A0 on this world, the range has never =
been clearly defined, much less the order</div><div class=3D"gmail_quote">=
=C2=A0 =C2=A0 in which a range will match. There are some clues about &quot=
;collation order&quot; in</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 GNU=
 sed, but it remains unclear as which collation sort order apply to that.</=
div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=A0 =
=C2=A0 Using a range in other locale does not follow ASCII numeric order:</=
div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 printf &#39;%b&#39; &quot;$(printf &#39;\\U%x\\n&#39; =
{32..255})&quot; |</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 LC_ALL=3DC sort |</div><div class=3D"gmail_quote"><span s=
tyle=3D"white-space:pre-wrap">	</span>=C2=A0 =C2=A0 =C2=A0 =C2=A0 tr -d =
9;\n&#39; |</div><div class=3D"gmail_quote"><span style=3D"white-space:pre-=
wrap">	</span>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 sed &#39;s/[^a-=C3=
=A4]//g&#39;; echo</div><div class=3D"gmail_quote"><br></div><div class=3D"=
gmail_quote">=C2=A0 =C2=A0 =C2=A0 =C2=A0 abcd=C2=AA=C3=A0=C3=A1=C3=A2=C3=A3=
=C3=A4=C3=A5=C3=A6=C3=A7</div><div class=3D"gmail_quote">=C2=A0 =C2=A0=C2=
=A0</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 The result above should h=
ave ended in a `d`, but `d` falls in the middle.</div><div class=3D"gmail_q=
uote">=C2=A0 =C2=A0 Nor it follows the locale collate order in effect (it s=
hould end in =C3=A4):</div><div class=3D"gmail_quote"><br></div><div class=
=3D"gmail_quote">=C2=A0 =C2=A0 =C2=A0 =C2=A0 printf &#39;%b&#39; &quot;$(pr=
intf &#39;\\U%x\\n&#39; {32..255})&quot; |</div><div class=3D"gmail_quote">=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 LC_ALL=3Den_CA.utf8 sort |</div><=
div class=3D"gmail_quote"><span style=3D"white-space:pre-wrap">	</span>=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 tr -d &#39;\n&#39; |</div><div class=3D"gmail_quot=
e">=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 se=
d &#39;s/[^a-=C3=A4]//g&#39;; echo</div><div class=3D"gmail_quote">=C2=A0 =
=C2=A0=C2=A0</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 =C2=A0 =C2=A0 a=
=C3=A1=C3=A0=C3=A2=C3=A4=C3=A3=C2=AA</div><div class=3D"gmail_quote"><br></=
div><div class=3D"gmail_quote">Then, the real question is: What order does =
sed follow?</div><span class=3D""><div class=3D"gmail_quote"><br></div><div=
 class=3D"gmail_quote"><br></div><div class=3D"gmail_quote"><span style=3D"=
white-space:pre-wrap">	</span></div><div class=3D"gmail_quote">=C2=A0 =C2=
=A0=C2=A0</div><div class=3D"gmail_quote">&gt; On Fri, May 18, 2018 at 05:5=
8:05PM -0400, Bize Ma wrote:</div><div class=3D"gmail_quote">&gt; &gt;=C2=
=A0</div></span><span class=3D""><div class=3D"gmail_quote">&gt; &gt;=C2=A0=
 =C2=A0 =C2=A0$ printf &#39;%b&#39; $(printf &#39;\\U%x\\n&#39; {32..127}) =
| sort | tr -d &#39;\n&#39;</div><div class=3D"gmail_quote">&gt; &gt;=C2=A0=
 =C2=A0 =C2=A0`^~&lt;=3D&gt;| _-,;:!?/.&#39;&quot;()[]{}@$*\&amp;#%+<wbr>01=
23456789aAbBcCdDeEfFgGhHiIjJ</div><div class=3D"gmail_quote">&gt; &gt; kKlL=
mMnNoOpPqQrRsStTuUvVwWxXyY<wbr>zZ</div><div class=3D"gmail_quote">=C2=A0</d=
iv><div class=3D"gmail_quote">&gt; While in practice this is correct on all=
 GNU/linux systems which</div><div class=3D"gmail_quote">&gt; use glibc, th=
ere is no officially documented collation order for</div><div class=3D"gmai=
l_quote">&gt; punctuation marks - it might differ on other systems. Please =
see here:</div><div class=3D"gmail_quote">&gt; <a href=3D"https://debbugs.g=
nu.org/cgi/bugreport.cgi?bug=3D23677#14" target=3D"_blank">https://debbugs.=
gnu.org/cgi/<wbr>bugreport.cgi?bug=3D23677#14</a></div><div class=3D"gmail_=
quote"><br></div></span><div class=3D"gmail_quote">************************=
******<wbr>******************************<wbr>**********</div><div class=3D=
"gmail_quote">1.- About ASCII character numeric ranges:</div><div class=3D"=
gmail_quote"><br></div><div class=3D"gmail_quote">Yes, I agree that it may =
be conceptually unnecessary to give a collation</div><div class=3D"gmail_qu=
ote">order to &quot;punctuation marks&quot;.</div><div class=3D"gmail_quote=
">However, that it may be &quot;conceptually unnecessary&quot; does not mea=
n that</div><div class=3D"gmail_quote">such order is &quot;invalid&quot;. A=
 practical implementation may define some</div><div class=3D"gmail_quote">s=
uch order.</div><div class=3D"gmail_quote">Please understand that the goal =
of the code above is to show the practical</div><div class=3D"gmail_quote">=
result of using some (locale defined) collation order equivalent to what</d=
iv><div class=3D"gmail_quote">is given by the c function strcoll().</div><d=
iv class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">The range may=
 be more limited to only letters and numbers:</div><div class=3D"gmail_quot=
e">{48..57} {65..90} {97..122} (in hex: 0x30-0x39 0x41-0x5a 0x61-0x7a).</di=
v><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">Let us de=
fine and use a function that should work on bash 4.2+:</div><div class=3D"g=
mail_quote"><br></div><div class=3D"gmail_quote">collorder(){</div><div cla=
ss=3D"gmail_quote">=C2=A0 =C2=A0 a=3D$1; shift 1;</div><div class=3D"gmail_=
quote">=C2=A0 =C2=A0 until (($#&lt;2)); do</div><div class=3D"gmail_quote">=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 printf &#39;%b&#39; $(printf &#39;\\U%x\\n&#39;=
 $(seq &quot;$1&quot; &quot;$2&quot;))=C2=A0</div><div class=3D"gmail_quote=
"><span style=3D"white-space:pre-wrap">	</span>shift 2</div><div class=3D"g=
mail_quote">=C2=A0 =C2=A0 done | sort | tr -d &#39;\n&#39; | sed &#39;s/=
9;&quot;$a&quot;&#39;//g&#39;</div><div class=3D"gmail_quote">=C2=A0 =C2=A0=
 echo</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 }</div><div class=3D"gm=
ail_quote"><br></div><div class=3D"gmail_quote">That function will allow us=
 to do:</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote=
">=C2=A0 =C2=A0 $ LC_ALL=3Den_CA.utf8=C2=A0 =C2=A0collorder=C2=A0 =C2=A0=
9; &#39;=C2=A0 =C2=A0 48 57=C2=A0 =C2=A065 90=C2=A0 =C2=A097 122</div><div =
class=3D"gmail_quote">=C2=A0 =C2=A0 0123456789AaBbCcDdEeFfGgHhIiJj<wbr>KkLl=
MmNnOoPpQqRrSsTtUuVvWwXxYy<wbr>Zz</div><div class=3D"gmail_quote"><br></div=
><div class=3D"gmail_quote">And (In C locale the sort is identical to ASCII=
 numeric sort):</div><div class=3D"gmail_quote"><br></div><div class=3D"gma=
il_quote">=C2=A0 =C2=A0 $ LC_ALL=3DC=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 collorder=C2=A0 =C2=A0&#39; &#39;=C2=A0 =C2=A0 48 57=C2=A0 =C2=A065 90=
=C2=A0 =C2=A097 122</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 012345678=
9ABCDEFGHIJKLMNOPQRST<wbr>UVWXYZabcdefghijklmnopqrstuvwx<wbr>yz</div><div c=
lass=3D"gmail_quote"><br></div><div class=3D"gmail_quote">And filtering by =
a bracket range:</div><div class=3D"gmail_quote"><br></div><div class=3D"gm=
ail_quote">=C2=A0 =C2=A0 $ LC_ALL=3DC=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0collorder=C2=A0 &#39;[^a-z]&#39; 48 57=C2=A0 =C2=A065 90=C2=A0 =C2=A097 =
122</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 abcdefghijklmnopqrstuvwxy=
z</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">But =
those ranges avoid the character that you use later (`[`).</div><div class=
=3D"gmail_quote">Including the characters between Upper-Case and lowercase =
ASCII:</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote"=
>=C2=A0 =C2=A0 $ LC_ALL=3DC=C2=A0 =C2=A0collorder=C2=A0 =C2=A0&#39;[^Y-d]&#=
39;=C2=A0 48 57=C2=A0 =C2=A065 122</div><div class=3D"gmail_quote">=C2=A0 =
=C2=A0 YZ[\]^_`abcd</div><div class=3D"gmail_quote"><br></div><div class=3D=
"gmail_quote">That was the reason to include all 95 (126-32+1) ASCII that a=
re not control.</div><div class=3D"gmail_quote">One simple range. Including=
 such characters allow (perfectly valid) mixed</div><div class=3D"gmail_quo=
te">bracket ranges:</div><div class=3D"gmail_quote"><br></div><div class=3D=
"gmail_quote">=C2=A0 =C2=A0 $ LC_ALL=3DC=C2=A0 =C2=A0collorder=C2=A0 =C2=A0=
&#39;[^+-d]&#39;=C2=A0 32 126</div><div class=3D"gmail_quote">=C2=A0 =C2=A0=
 +,-./0123456789:;&lt;=3D&gt;?@<wbr>ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^<wbr>_`ab=
cd</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">Not=
 because I wanted to divert the discussion to &quot;punctuation</di=
v><div class=3D"gmail_quote">marks&quot;. Just because it was one simple ch=
aracter numeric range.</div><div class=3D"gmail_quote">That is all, the bas=
h function defined here: collorder, is a tool to reveal</div><div class=3D"=
gmail_quote">the (practical) collation order valid for the applied locale.<=
/div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote"><br></=
div><div class=3D"gmail_quote">******************************<wbr>*********=
*********************<wbr>**********</div><div class=3D"gmail_quote">2.- Ab=
out using collating order.</div><span class=3D""><div class=3D"gmail_quote"=
><br></div><div class=3D"gmail_quote">&gt; &gt; It is expected that a range=
 [a-z] will match &#39;aAbBcCdD=E2=80=A6&#39;, all lower and</div><div clas=
s=3D"gmail_quote">&gt; &gt; upper letters.</div><div class=3D"gmail_quote">=
&gt; &gt; But it isn&#39;t:</div><div class=3D"gmail_quote">&gt;=C2=A0</div=
><div class=3D"gmail_quote">&gt; It should not be &quot;expected&quot;. I d=
on&#39;t think it is documented to be</div><div class=3D"gmail_quote">&gt; =
so anywhere in GNU programs.</div><div class=3D"gmail_quote"><br></div></sp=
an><div class=3D"gmail_quote">Well, yes, &#39;info sed&#39;, in section `5 =
Regular Expressions: selecting text`</div><div class=3D"gmail_quote">sub-se=
ction `5.5 Character Classes and Bracket Expressions` include:</div><div cl=
ass=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=A0 =C2=A0 With=
in a bracket expression, a &quot;range expression&quot; consists of two</di=
v><div class=3D"gmail_quote">=C2=A0 =C2=A0 characters separated by a hyphen=
.=C2=A0 It matches any single character</div><div class=3D"gmail_quote">=C2=
=A0 =C2=A0 that sorts between the two characters, inclusive.=C2=A0 In the d=
efault</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 C locale, the sorting =
sequence is the native character order; for</div><div class=3D"gmail_quote"=
>=C2=A0 =C2=A0 example, &#39;[a-d]&#39; is equivalent to &#39;[abcd]&#39;.<=
/div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">From &=
#39;info sed&#39; (not man sed) sub-section `5.9 Locale Considerations`:</d=
iv><span class=3D""><div class=3D"gmail_quote"><br></div><div class=3D"gmai=
l_quote">=C2=A0 =C2=A0 In other locales, the sorting sequence is not specif=
ied, and &#39;[a-d]&#39;</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 migh=
t be equivalent to &#39;[abcd]&#39; or to &#39;[aBbCcDd]&#39;, or it might =
fail</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 to match any character, =
or the set of characters that it matches</div><div class=3D"gmail_quote">=
=C2=A0 =C2=A0 might even be erratic.</div><div class=3D"gmail_quote"><br></=
div></span><div class=3D"gmail_quote">So, the `[a-d]` expression match char=
acters that sort between `a` and `d`.</div><div class=3D"gmail_quote">That =
is defined above for the C locale. In other locales the sorting is</div><di=
v class=3D"gmail_quote">&quot;undefined&quot;.</div><div class=3D"gmail_quo=
te"><br></div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quot=
e">&gt; =E2=80=A6 Both sed&#39;s and grep&#39;s manuals contain</div><span =
class=3D""><div class=3D"gmail_quote">&gt; the following text:</div><div cl=
ass=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt;=C2=A0 =
=C2=A0 =C2=A0In other locales, the sorting sequence is not specified, and =
=E2=80=98[a-d]=E2=80=99</div><div class=3D"gmail_quote">&gt;=C2=A0 =C2=A0 =
=C2=A0might be equivalent to =E2=80=98[abcd]=E2=80=99 or to =E2=80=98[aBbCc=
Dd]=E2=80=99, or it might fail to</div><div class=3D"gmail_quote">&gt;=C2=
=A0 =C2=A0 =C2=A0match any character, or the set of characters that it matc=
hes might</div><div class=3D"gmail_quote">&gt;=C2=A0 =C2=A0 =C2=A0even be e=
rratic.</div><div class=3D"gmail_quote"><br></div></span><div class=3D"gmai=
l_quote">Yes, It is the exact same text that I also quoted above. But all i=
t</div><div class=3D"gmail_quote">clearly defines is that the order is base=
d on the definition of each</div><div class=3D"gmail_quote">locale &quot;in=
 some unspecified way&quot;. When the locale change, the order</div><div cl=
ass=3D"gmail_quote">may also change.</div><div class=3D"gmail_quote"><br></=
div><div class=3D"gmail_quote">&gt; <a href=3D"https://www.gnu.org/software=
/sed/manual/sed.html#Multibyte-regexp-character-classes" target=3D"_blank">=
https://www.gnu.org/software/<wbr>sed/manual/sed.html#Multibyte-<wbr>regexp=
-character-classes</a></div><div class=3D"gmail_quote"><br></div><div class=
=3D"gmail_quote">Yes, At the same page, but at Reporting-Bugs, under the he=
ading</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 =C2=A0[a-z] is case ins=
ensitive</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quot=
e">=C2=A0 <a href=3D"https://www.gnu.org/software/sed/manual/sed.html#Repor=
ting-Bugs" target=3D"_blank">https://www.gnu.org/software/<wbr>sed/manual/s=
ed.html#Reporting-<wbr>Bugs</a></div><div class=3D"gmail_quote"><br></div><=
div class=3D"gmail_quote">We can read:</div><div class=3D"gmail_quote"><br>=
</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 [a-z] is case insensitive</d=
iv><div class=3D"gmail_quote">=C2=A0 =C2=A0 You are encountering problems w=
ith locales. POSIX mandates that [a-z]</div><div class=3D"gmail_quote">=C2=
=A0 =C2=A0 uses the current locale=E2=80=99s collation order =E2=80=93 in C=
 parlance, that means</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 using s=
trcoll(3) instead of strcmp(3). Some locales have a case-</div><div class=
=3D"gmail_quote">=C2=A0 =C2=A0 insensitive collation order, others don=E2=
=80=99t.</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quot=
e">It seems to say: &quot;current locale&#39;s collation order&quot; !!</di=
v><span class=3D""><div class=3D"gmail_quote"><br></div><div class=3D"gmail=
_quote"><br></div><div class=3D"gmail_quote">&gt; <a href=3D"https://www.gn=
u.org/software/grep/manual/html_node/Character-Classes-and-Bracket-Expressi=
ons.html" target=3D"_blank">https://www.gnu.org/software/<wbr>grep/manual/h=
tml_node/<wbr>Character-Classes-and-Bracket-<wbr>Expressions.html</a></div>=
<div class=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt; =
Furthermore, in POSIX 2008 standard range expressions are</div></span><div =
class=3D"gmail_quote">&gt; undefined for locales other than &quot;C/POSIX&q=
uot;, see this comment by Eric Blake</div><span class=3D""><div class=3D"gm=
ail_quote">&gt; (also the entire bug report might be of interest to this to=
pic):</div><div class=3D"gmail_quote">&gt; <a href=3D"https://bugzilla.redh=
at.com/show_bug.cgi?id=3D583011#c24" target=3D"_blank">https://bugzilla.red=
hat.com/<wbr>show_bug.cgi?id=3D583011#c24</a></div><div class=3D"gmail_quot=
e"><br></div></span><div class=3D"gmail_quote">Yes, however: Does undefined=
 also mean invalid, forbidden, banned or illegal?</div><div class=3D"gmail_=
quote"><br></div><div class=3D"gmail_quote">At the moment, it is not illega=
l to use a bracket range in some other locale.</div><div class=3D"gmail_quo=
te">Such use does not raise any error (or even warning). As it is not illeg=
al, the</div><div class=3D"gmail_quote">only aspect that remains to be clea=
rly defined is what is the range order that</div><div class=3D"gmail_quote"=
>we should expect in every other locale than C.</div><div class=3D"gmail_qu=
ote"><br></div><div class=3D"gmail_quote">Also, We rely everyday on &quot;n=
ot specified&quot; behavior (for some spec):</div><div class=3D"gmail_quote=
"><br></div><div class=3D"gmail_quote">The -E option is not (yet) defined i=
n current POSIX (The Open Group</div><div class=3D"gmail_quote">Base Specif=
ications Issue 7, 2018 edition) for sed.</div><div class=3D"gmail_quote">Ye=
s, It is believed that it will be accepted for the next POSIX version.</div=
><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=A0 =C2=
=A0 <a href=3D"http://pubs.opengroup.org/onlinepubs/9699919799/utilities/se=
d.html" target=3D"_blank">http://pubs.opengroup.org/<wbr>onlinepubs/9699919=
799/<wbr>utilities/sed.html</a></div><div class=3D"gmail_quote"><br></div><=
div class=3D"gmail_quote">But it is defined (and used) in GNU sed.</div><di=
v class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">Some elements =
are undefined in POSIX just to allow implementations to be diverse:</div><d=
iv class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=A0 =C2=A0=
 <a href=3D"http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xcu_cha=
p02.html" target=3D"_blank">http://pubs.opengroup.org/<wbr>onlinepubs/96999=
19799/xrat/V4_<wbr>xcu_chap02.html</a></div><div class=3D"gmail_quote"><br>=
</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 The results of giving &lt;ti=
lde&gt; with an unknown login name are undefined</div><div class=3D"gmail_q=
uote">=C2=A0 =C2=A0 because the KornShell &quot;=CB=9C+&quot; and &quot;=CB=
=9C-&quot; constructs make use of this condition =E2=80=A6</div><div class=
=3D"gmail_quote"><br></div><div class=3D"gmail_quote">Read carefully: undef=
ined because it is used !.</div><div class=3D"gmail_quote">That is, it is u=
ndefined in the spec to allow implementations to resolve in</div><div class=
=3D"gmail_quote">practical ways that might differ from the specificati=
on (or other</div><div class=3D"gmail_quote">implementations).</div><div cl=
ass=3D"gmail_quote"><br></div><div class=3D"gmail_quote"><br></div><div cla=
ss=3D"gmail_quote"><br></div><div class=3D"gmail_quote">In the same &quot;c=
omment by Eric Blake&quot; we can read this:</div><div class=3D"gmail_quote=
"><br></div><div class=3D"gmail_quote">=C2=A0 =C2=A0 The behavior of [A-z] =
in en_US.UTF-8 is &quot;unspecified&quot;, but _not_ &quot;undefined&quot;.=
</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 A compliant app cannot guara=
ntee what the behavior will be, but the behavior</div><div class=3D"gmail_q=
uote">=C2=A0 =C2=A0 should at least be explainable, and as a QoI point, gli=
bc should document</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 and define=
 this behavior as an extension to POSIX, so that apps relying on</div><div =
class=3D"gmail_quote">=C2=A0 =C2=A0 glibc can take advantage of this extens=
ion for known behavior.</div><div class=3D"gmail_quote"><br></div><div clas=
s=3D"gmail_quote">Exactly the same I was meaning:=C2=A0 &quot;unspecified&q=
uot;, but _not_ &quot;invalid&quot;.</div><div class=3D"gmail_quote"><br></=
div><div class=3D"gmail_quote">And, exactly, what I am asking for: &quot;gl=
ibc should document and define this behavior&quot;</div><span class=3D""><d=
iv class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">&gt;=C2=A0</d=
iv><div class=3D"gmail_quote">&gt; &gt; However, the range [a-Z] does match=
 all letters, lower or upper:</div><div class=3D"gmail_quote">&gt; &gt;=C2=
=A0</div><div class=3D"gmail_quote">&gt; &gt;=C2=A0 =C2=A0 =C2=A0$ printf &=
#39;%b&#39; $(printf &#39;\\U%x&#39; {32..127}) | sed &#39;s/[^a-Z]//g&#39;=
</div><div class=3D"gmail_quote">&gt; &gt;=C2=A0 =C2=A0 =C2=A0<wbr>ABCDEFGH=
IJKLMNOPQRSTUVWXYZabcd<wbr>efghijklmnopqrstuvwxyz</div><div class=3D"gmail_=
quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt; I would recommend av=
oiding mixing upper-lower case in regex</div><div class=3D"gmail_quote">&gt=
; ranges, as the result might be unexpected. Compare the following:</div><d=
iv class=3D"gmail_quote"><br></div></span><div class=3D"gmail_quote">In the=
 &quot;comment by Eric Blake&quot; we can also read:</div><div class=3D"gma=
il_quote"><br></div><div class=3D"gmail_quote">=C2=A0 =C2=A0 That is, [A-z]=
 is well-defined in the POSIX locale, and in all other</div><div class=3D"g=
mail_quote">=C2=A0 =C2=A0 locales where A collates before z (which includes=
 en_US.UTF-8)</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail=
_quote">Again: &quot;[A-z] is well-defined =E2=80=A6 &quot;</div><div class=
=3D"gmail_quote"><br></div><div class=3D"gmail_quote">Frankly, if I were to=
 follow both main recommendations:</div><div class=3D"gmail_quote"><br></di=
v><div class=3D"gmail_quote">=C2=A0 =C2=A0 - Any other locale than C is uns=
pecified: do not use them.</div><div class=3D"gmail_quote">=C2=A0 =C2=A0 - =
Any range that does not match the previously known ranges:</div><span class=
=3D""><div class=3D"gmail_quote">=C2=A0 =C2=A0 =C2=A0 &quot;recommend avoid=
ing mixing upper-lower case in regex ranges&quot;</div><div class=3D"gmail_=
quote"><br></div></span><div class=3D"gmail_quote">The usefulness of a brac=
ket range is reduced to almost nothing.</div><div class=3D"gmail_quote">Onl=
y C and only either [a-z] or [A-Z].</div><div class=3D"gmail_quote"><br></d=
iv><div class=3D"gmail_quote">Is it not possible to declare and document wh=
at the collation</div><div class=3D"gmail_quote">order is/should be for oth=
er locales?</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_q=
uote">******************************<wbr>******************************<wbr=
>**********</div><div class=3D"gmail_quote">3.- Correct exactly how.</div>=
<s=
pan class=3D""><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quo=
te">&gt; &gt; If this is the correct way in which sed should work, then, if=
 you please:</div><div class=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"=
gmail_quote">&gt; Yes, it is.</div><div class=3D"gmail_quote"><br></div></s=
pan><div class=3D"gmail_quote">Thanks, but: What does it mean exactly?=C2=
=A0 =C2=A0My opinion is on the right.</div><div class=3D"gmail_quote"><br>=
</di=
v><div class=3D"gmail_quote">=C2=A0 - That [a-z] will always mean &#39;abcd=
efghijklmnopqrstuvwxyz&#39; in the C locale? (Yes)</div><div class=3D"gmai=
l_quote">=C2=A0 - That the order in C locale follows the ASCII numeric orde=
r?=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0(Yes)</div><div c=
lass=3D"gmail_quote">=C2=A0 - That no other locale should be used?=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0(No?)</div><div =
class=3D"gmail_quote">=C2=A0 - That the order in any other locale is secret=
?=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (Yes)</div><div class=3D"gmail_quote">=C2=
=A0 - That ranges like [A-z] (valid in C) can not be used in other locales?=
=C2=A0 =C2=A0 =C2=A0 (No?)</div><div class=3D"gmail_quote">=C2=A0 - That ot=
her ranges like [*-d] (valid in C) are a crazy idea?=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (No?)</div><div class=3D"gmail_quote">=
=C2=A0 - References to collation order in the manuals must be stricken out?=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0(No?)</div><div class=3D"gmail_quote"><br=
></div><div class=3D"gmail_quote">And we have not even started with more ch=
aracters as they are possible in UNICODE.</div><div class=3D"gmail_quote"><=
br></div><div class=3D"gmail_quote">=C2=A0 =C2=A0- Is this valid:</div><div=
 class=3D"gmail_quote">=C2=A0 =C2=A0$ LC_ALL=3Den_CA.utf8 ./collorder &#39;=
[^a-z]&#39; 32 255</div><div class=3D"gmail_quote">=C2=A0 =C2=A0<wbr>abcdef=
ghijklmnopqrstuvwxyz=C2=AA=C2=BA=C3=9F=C3=A0<wbr>=C3=A1=C3=A2=C3=A3=C3=A4=
=C3=A5=C3=A6=C3=A7=C3=A8=C3=A9=C3=AA=C3=AB=C3=AC=C3=AD=C3=AE=C3=AF=C3=B0=C3=
=B1=C3=B2=C3=B3=C3=B4=C3=B5=C3=B6=C3=B8=C3=B9=C3=BA=C3=BB=C3=BC=C3=BD=C3=BF=
</div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=
=A0 =C2=A0Does it mean that [a-z] is closer to [[:lower:]] than ASCII a-z?<=
/div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">=C2=A0=
 =C2=A0- Is this expected? (phonetic symbols)</div><div class=3D"gmail_quot=
e">=C2=A0 =C2=A0$ LC_ALL=3Den_CA.utf8 ./collorder &#39;[^a-z]&#39; 0x250 0x=
2af</div><div class=3D"gmail_quote">=C2=A0 =C2=A0=C9=93=C9=94=C9=96=C9=97=
=C9=99=C9=9B=C9=A0=C9=B5</div><div class=3D"gmail_quote"><br></div><div cla=
ss=3D"gmail_quote">=C2=A0 =C2=A0- Should this work? In what order? (phoneti=
c symbols)</div><div class=3D"gmail_quote">=C2=A0 =C2=A0$ LC_ALL=3Den_CA.ut=
f8 ./collorder &#39;[^=C9=96-=C9=9B]&#39; 0x250 0x2af</div><div class=3D"gm=
ail_quote">=C2=A0 =C2=A0=C9=96=C9=97=C9=99=C9=9B</div><div class=3D"gmail_q=
uote">=C2=A0 =C2=A0</div><div class=3D"gmail_quote">=C2=A0 =C2=A0- Why are =
all Latin characters being included? (Latin extended)</div><div class=3D"gm=
ail_quote">=C2=A0 =C2=A0$ LC_ALL=3Den_CA.utf8 ./collorder &#39;[^a-z]&#39; =
0x1e00 0x1fff</div><div class=3D"gmail_quote">=C2=A0 =C2=A0<wbr>=E1=B8=81=
=E1=B8=83=E1=B8=85=E1=B8=87=E1=B8=89=E1=B8=8B=E1=B8=8D=E1=B8=8F=E1=B8=91=E1=
=B8=93=E1=B8=95=E1=B8=97=E1=B8=99=E1=B8=9B=E1=B8=9D=E1=B8=9F=E1=B8=A1=E1=B8=
=A3=E1=B8=A5=E1=B8=A7=E1=B8=A9=E1=B8=AB=E1=B8=AD=E1=B8=AF=E1=B8=B1=E1=B8=B3=
=E1=B8=B5=E1=B8=B7=E1=B8=B9=E1=B8=BB<wbr>=E1=B8=BD=E1=B8=BF=E1=B9=81=E1=B9=
=83=E1=B9=85=E1=B9=87=E1=B9=89=E1=B9=8B=E1=B9=8D=E1=B9=8F=E1=B9=91=E1=B9=93=
=E1=B9=95=E1=B9=97=E1=B9=99=E1=B9=9B=E1=B9=9D=E1=B9=9F=E1=B9=A1=E1=B9=A3=E1=
=B9=A5=E1=B9=A7=E1=B9=A9=E1=B9=AB=E1=B9=AD=E1=B9=AF=E1=B9=B1=E1=B9=B3=E1=B9=
=B5=E1=B9=B7<wbr>=E1=B9=B9=E1=B9=BB=E1=B9=BD=E1=B9=BF=E1=BA=81=E1=BA=83=E1=
=BA=85=E1=BA=87=E1=BA=89</div><div class=3D"gmail_quote">=C2=A0 =C2=A0<wbr>=
=E1=BA=8B=E1=BA=8D=E1=BA=8F=E1=BA=96=E1=BA=97=E1=BA=98=E1=BA=99=E1=BA=9A=E1=
=BA=9B=E1=BA=A1=E1=BA=A3=E1=BA=A5=E1=BA=A7=E1=BA=A9=E1=BA=AB=E1=BA=AD=E1=BA=
=AF=E1=BA=B1=E1=BA=B3=E1=BA=B5=E1=BA=B7=E1=BA=B9=E1=BA=BB=E1=BA=BD=E1=BA=BF=
=E1=BB=81=E1=BB=83=E1=BB=85=E1=BB=87=E1=BB=89<wbr>=E1=BB=8B=E1=BB=8D=E1=BB=
=8F=E1=BB=91=E1=BB=93=E1=BB=95=E1=BB=97=E1=BB=99=E1=BB=9B=E1=BB=9D=E1=BB=9F=
=E1=BB=A1=E1=BB=A3=E1=BB=A5=E1=BB=A7=E1=BB=A9=E1=BB=AB=E1=BB=AD=E1=BB=AF=E1=
=BB=B1=E1=BB=B3=E1=BB=B5=E1=BB=B7=E1=BB=B9</div><span class=3D""><div class=
=3D"gmail_quote"><br></div><div class=3D"gmail_quote">&gt; &gt;=C2=A0 =C2=
=A0 =C2=A0- What is the rationale leading to such decision?.</div><div clas=
s=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt; The bug r=
eports linked above contain long discussions about it.</div><div class=3D"g=
mail_quote"><br></div></span><div class=3D"gmail_quote">Yes, there are disc=
ussions about what was relevant at the time.</div><div class=3D"gmail_quote=
">But none explain in clear simple words what order the characters</div><di=
v class=3D"gmail_quote">in a bracket range will follow in a locale that is =
NOT C. (see</div><div class=3D"gmail_quote">some simple examples above).</d=
iv><span class=3D""><div class=3D"gmail_quote"><br></div><div class=3D"gmai=
l_quote">&gt; Please also see the following thread, which promoted the rest=
riction</div><div class=3D"gmail_quote">&gt; of &quot;sane regex ranges&quo=
t; - meaning ASCII order alone (and applies to gawk,</div><div class=3D"gma=
il_quote">&gt; grep, sed and other programs using gnulib&#39;s regex engine=
):</div><div class=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quot=
e">&gt; <a href=3D"https://lists.gnu.org/archive/html/bug-gnulib/2011-06/ms=
g00200.html" target=3D"_blank">https://lists.gnu.org/archive/<wbr>html/bug-=
gnulib/2011-06/<wbr>msg00200.html</a></div><div class=3D"gmail_quote"><br><=
/div></span><div class=3D"gmail_quote">ASCII order alone? Only for characte=
rs in numeric range 0x00-0x7f ????</div><div class=3D"gmail_quote"><br></di=
v><div class=3D"gmail_quote">=C2=A0 =C2=A0 - How comes that an =C3=A1 gets =
included in the very limited [a-b]?</div><div class=3D"gmail_quote">=C2=A0 =
=C2=A0 $ LC_ALL=3Den_CA.utf8 ./collorder &#39;[^a-b]&#39; 0x00 0xff</div><d=
iv class=3D"gmail_quote">=C2=A0 =C2=A0 ab=C2=AA=C3=A0=C3=A1=C3=A2=C3=A3=C3=
=A4=C3=A5=C3=A6</div><span class=3D""><div class=3D"gmail_quote"><br></div>=
<div class=3D"gmail_quote">&gt; &gt;=C2=A0 =C2=A0 =C2=A0- Where is it docum=
ented?.</div><div class=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail=
_quote">&gt; The links above to the sed and grep manuals.</div><div class=
=3D"gmail_quote"><br></div></span><div class=3D"gmail_quote">None of the li=
nked documents explain the above result for [^a-b].</div><span class=3D""><=
div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">&gt; &gt;=C2=
=A0 =C2=A0 =C2=A0- Where is it implemented in the code?.</div><div class=3D=
"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt; I think a goo=
d place to start is gnulib&#39;s DFA regex engine,</div><div class=3D"gmail=
_quote">&gt; here:</div><div class=3D"gmail_quote">&gt; <a href=3D"https://=
opengrok.housegordon.com/source/xref/gnulib/lib/dfa.c" target=3D"_blank">ht=
tps://opengrok.housegordon.<wbr>com/source/xref/gnulib/lib/<wbr>dfa.c</a></=
div><div class=3D"gmail_quote">&gt; or here:</div><div class=3D"gmail_quote=
">&gt; <a href=3D"http://git.savannah.gnu.org/cgit/gnulib.git/tree/lib/dfa.=
c" target=3D"_blank">http://git.savannah.gnu.org/<wbr>cgit/gnulib.git/tree/=
lib/dfa.c</a></div><div class=3D"gmail_quote"><br></div></span><div class=
=3D"gmail_quote">I have to recognize that I am unable to understand any of =
those</div><div class=3D"gmail_quote">4000 lines of code without some detai=
led help of how it works.</div><div class=3D"gmail_quote">I am really sorry=
.</div><span class=3D""><div class=3D"gmail_quote"><br></div><div class=3D"=
gmail_quote">&gt; Search for the comment &#39;build range characters&#39; f=
or a starting point.</div><div class=3D"gmail_quote">&gt;=C2=A0</div><div c=
lass=3D"gmail_quote">&gt; Both gnu grep and sed use this code.</div><div cl=
ass=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt; &gt;=C2=
=A0 =C2=A0 =C2=A0- Why does the manual document otherwise?.</div><div class=
=3D"gmail_quote">&gt;=C2=A0</div><div class=3D"gmail_quote">&gt; Errors in =
the manual are always a possibility.</div><div class=3D"gmail_quote">&gt; I=
f you spot such an error, or an example showing incorrect</div><div class=
=3D"gmail_quote">&gt; usage/output - please let us know where it is (e.g. a=
 link</div><div class=3D"gmail_quote">&gt; to a manual page=C2=A0 / section=
).</div><div class=3D"gmail_quote"><br></div></span><div class=3D"gmail_quo=
te">I have provided a couple of points where &quot;collating order&quot; is=
 used.</div><div class=3D"gmail_quote">But I suspect that those are not mis=
takes from your point of view and</div><div class=3D"gmail_quote">that what=
 is missing is a more detailed description of which collating</div><div cla=
ss=3D"gmail_quote">order is being used. I may be perfectly wrong, of course=
.</div><span class=3D""><div class=3D"gmail_quote">=C2=A0</div><div class=
=3D"gmail_quote">&gt; As such, I&#39;m marking this as &quot;not a bug&quot=
; and closing the ticket,</div><div class=3D"gmail_quote">&gt; but discussi=
on can continue by replying to this thread.</div><div class=3D"gmail_quote"=
><br></div></span><div class=3D"gmail_quote">I still remain in doubt, at th=
e very minimum.</div><div class=3D"gmail_quote"><br></div><div class=3D"gma=
il_quote">&gt; regards,</div><div class=3D"gmail_quote">&gt;=C2=A0 - assaf<=
/div><div class=3D"gmail_quote"><br></div><div class=3D"gmail_quote">Many t=
hanks and regards</div><span class=3D"HOEnZb"><font color=3D"#888888"><div =
class=3D"gmail_quote">- Bize</div><div><br></div></font></span></div></div>=
</div>
</div><br></div></div></div>

--0000000000005ba3e4056ce7b0a7--


From debbugs-submit-bounces@debbugs.gnu.org Fri May 25 00:48:46 2018
Received: (at 31526) by debbugs.gnu.org; 25 May 2018 04:48:47 +0000
Received: from localhost ([127.0.0.1]:47236 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces@debbugs.gnu.org>)
	id 1fM4ei-0003Fv-1t
	for submit@debbugs.gnu.org; Fri, 25 May 2018 00:48:44 -0400
Received: from mail-oi0-f53.google.com ([209.85.218.53]:34213)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <binaryzebra@gmail.com>) id 1fM4ed-0003Fh-Fd
 for 31526@debbugs.gnu.org; Fri, 25 May 2018 00:48:40 -0400
Received: by mail-oi0-f53.google.com with SMTP id l1-v6so3494863oii.1
 for <31526@debbugs.gnu.org>; Thu, 24 May 2018 21:48:39 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:in-reply-to:references:from:date:message-id:subject:to
 :cc; bh=ojl/9pNvK2z7hPwajRmbxgTFyR3gTviBd8R1CcwMs2c=;
 b=NeSf6kjBWBqfBi3As4LSIhRaUkFY9mm5SurMmN8GoNbsafny0IgyFIHYWfx7oEo9tV
 pC00ses7c67FRdEh6T7CdC4oNzumBHx21ejbo/a/GFqnBdjD65lAHRGAJv0zgglmx6h3
 GifCLgjdNitndKKhDlShO15oEVsu6CCnDxp+DQ1y78JapWoe6bNRtyLBONwpKFh0gEYP
 n9ECgU0RTH0Lm3eZWnF0e5WOiGoaLgzdK9BD7IFtZO3WqAtlnQdDWGAKFx6qHEtzk8uo
 nGzeeDGJyfy3mCtVcNQ2rsfvu4zp2w+pYo6SYFZAFFaz32DK8ukPHtFbEJJA5mttLEJo
 54Fw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:in-reply-to:references:from:date
 :message-id:subject:to:cc;
 bh=ojl/9pNvK2z7hPwajRmbxgTFyR3gTviBd8R1CcwMs2c=;
 b=nia/UU8a23PaGvzgaB+CJnKNt+tT6qfoZgNp3MEqj+6q48tks7ex0m0P3MRWcAsVOz
 DZ5kn/PeAFJRqLaTyqvPrxAX8M2vrdT9uQhoI+GVUtj2+ubxH9rJdgDChBVLTfMlCQQo
 YW9+P+f1IH9GA2uO7molW+4tS+3/Ak7ueNadKiGweVufnd+30SPgybWwlNTCJ3NYTHMD
 OiYYBtiX/PnjOTxZpxu2BwVRcbOHWcct4A7DioBfeZUr7rZSvMX77tr6x7cw78OC4G9e
 Aq3GhWUMbVEbQ+Ribd9QEBp5m+FevZIXnAs5NOWQMtvtxpiB+Mjp7OO8bafauGg2smlm
 ZvRg==
X-Gm-Message-State: ALKqPweAe5h0RjhsWJYNHURSdrMIdftJfzQZ4gZ+u+QObpGgY1HZxc/F
 P8wbyYukQ9Gg9oUxuFyg0f5Va8YjhLSb+NCssh8=
X-Google-Smtp-Source: ADUXVKJTbcVXlVxE0MrEIWhXySIv2pwyI0gMUv7D+JvGXehl10bJ38YqBV5NslWXnX6Zko7JBPWymRKPUYIbw9ZnfVU=
X-Received: by 2002:aca:5855:: with SMTP id m82-v6mr443108oib.24.1527223713774; 
 Thu, 24 May 2018 21:48:33 -0700 (PDT)
MIME-Version: 1.0
Received: by 2002:a9d:2f04:0:0:0:0:0 with HTTP; Thu, 24 May 2018 21:48:33
 -0700 (PDT)
In-Reply-To: <fc1793e7-0344-8c6a-3690-80398b8744d7@gmail.com>
References: <CAFra36hoTNH0s6oOHtNQfk6M_67Zk24tVZ5QTN67NfVwhpdD6A@mail.gmail.com>
 <20180520021300.myh2m4njtaq4nz3u@tomato>
 <CAFra36hQ_aj518zxwXgJQ4b9FDGiFWobS4ZoteLn6FZW0KfeTg@mail.gmail.com>
 <fc1793e7-0344-8c6a-3690-80398b8744d7@gmail.com>
From: Bize Ma <binaryzebra@gmail.com>
Date: Fri, 25 May 2018 00:48:33 -0400
Message-ID: <CAFra36ixGjP3_gapbFTrT6XgxWLye58UT48CZtOftF4JTfgxiw@mail.gmail.com>
Subject: Re: bug#31526: Range [a-z] does not follow collate order from locale.
To: Assaf Gordon <assafgordon@gmail.com>
Content-Type: multipart/alternative; boundary="000000000000f3c10b056d007ae4"
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 31526
Cc: 31526@debbugs.gnu.org
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>
X-Spam-Score: -1.0 (-)

--000000000000f3c10b056d007ae4
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I believe that these lines carry the essence of the answer:

        > It is outside the scope of 'sed' to define the collation order.

        > Yes, there is a locale collation order.
        > It is defined in libc not in sed, and it is not well documented.
        > GNU sed has no way to change/determine it, or document what it is=
.

        >>    - That the order in any other locale is secret?
        >
        > Not "secret" as in someone actively trying to hide it,
        > but unknown/undocumented because the developers of GLIBC have not
        > documented it.

        >> But none explain in clear simple words what order the characters
        >> in a bracket range will follow in a locale that is NOT C. (see
        >> some simple examples above).
        >
        > Correct - that is not documented anywhere at the moment.

So:

    - This is not a bug that sed developers could or would resolve.
    - The sort order needs to be documented by glibc.

In fact, sed developers do not support bracket ranges in a locale that is
not C:

        > Any other locale than C is unspecified: do not use them.

Best Regards

Bize Ma


---------------------------------------------------------------------------=
--
Some general clarifications follow:


>> In range definitions I believe that there are two goals in conflict:
>>
>>      - An stable, simple, range description for programmers.
>>      - A clear descrition (even if long) for multilanguage users.
>>
> Why are they in conflict? =E2=80=A6

Because if a long description is required, then it is "not simple".


> Exactly because regex ranges in multibyte locales are not well-defined,
> the recommendation is not to use them in portable sed scripts.

Portable? That is a new word. It did not appear in previous e-mails.
Why do you assume that I want/need to have only "portable" ranges?


>> **********************************************************************
>> 1.- About ASCII character numeric ranges:
[...]
> In "C/POSIX" locale, regex range [a-d] matches a,b,c,d.
> In other locales, it is not well defined (and can match many variations,
> depending on your operating system/libc).

Yes, simple: sed defers to glibc (or another libc) the responsibility
of defining and implementing such an order.
Thus sed developers cannot support any specific range order.
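
A minimal sketch of the one case that nobody disputes, assuming a POSIX
shell and any sed: in the C locale a range follows byte values. The helper
name keep_range is hypothetical, made up for this illustration only.

```shell
# keep_range RANGE TEXT
# Delete every character of TEXT that is outside the bracket range [RANGE].
# LC_ALL=C is forced (an assumption of this sketch), so the range is
# compared by byte value and does not depend on libc collation tables.
keep_range() {
    printf '%s\n' "$2" | LC_ALL=C sed "s/[^$1]//g"
}

# In the C locale only a, b, c and d survive.
keep_range a-d 'abcdeABCDE'
```

Replacing C with any other locale name in the helper is exactly the point
where the result stops being defined by sed.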


[...]
>> The -E option is not (yet) defined in current POSIX (The Open Group
>> Base Specifications Issue 7, 2018 edition) for sed.
>> Yes, It is believed that it will be accepted for the next POSIX version.
>>
> Technically speaking, the "-E" option is not "unspecified".

I did not use the word "unspecified", I said: "not (yet) defined".
Please do not put words in my mouth.

> It is an extension beyond the current POSIX standard, and GNU programs
> have many such extensions.

And, as an extension, it is something that the POSIX standard has not
(yet) defined.


[...]
> But how do you treat range "[a-Z]" ?

If the collating order sorts `a` before `Z`, the range is valid and
should give a "reasonable" result.

    $ echo '0123456789:;<=3D>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijkl=
mnopqrstuvwxyz' |
    >     LC_ALL=3Den_CA.utf8   sed 's/[^a-Z]//g'
    ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz

As you can see above, glibc (and thus sed) does not claim the range
to be invalid, and thus it returns a "reasonable" result
(whatever "reasonable" means here).

> This is range ASCII 97 to ASCII 90 ... is an implementation expected
> to swap the min/max values, and treat it as ASCII range 90-97 ?
> or somehow understand these are letters, and change it to ASCII 65 to 122?

ASCII values only have an exact meaning in the C locale (and maybe in
C.UTF-8), and only because they are the collating order of that locale.

In other locales, the sort order is usually (very) different
from the ASCII numeric values.


[...]
> 2. In multibyte locales, ranges of specific letters (e.g. "[A-D]")
> are not well specified and should be avoided in portable scripts.

That word again: portable. Only in portable scripts?
What should happen in all other scripts?

[...]
>> **********************************************************************
>> 3.- Correct exactly how.
[...]
>>    - That other ranges like [*-d] (valid in C) are a crazy idea?
         (No?)
>
> Instead of "crazy" let's call it "unspecified" =E2=80=A6

Let's call it what it is: unsupported by sed.

>>    - References to collation order in the manuals must be stricken out?
        (No?)
> I'm not sure I understand this...

You said:
        I don't think it is documented to be so anywhere in GNU programs.

[...]
> The term "collation order" is defined in POSIX, e.g. here:
>
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html#tag=
_07_03_02

I have NOT asked what the term means in POSIX, but what it means to sed.

 [...]
> Here's an example of glibc's strange behavior (or at least
> strange to me, as I found no explanation for it):
>
> In most multibyte UTF-8 locales the punctuation order
> differs from ASCII order,

Collation order is a language issue: each language has its own, often
conflicting, view of what "the correct order" should be.
That is how we humans think. Consider very "simple" everyday dates:
there are as many sets of month names as there are languages, and as
many sets of weekday names. That is what an individual of any
culture has learnt to expect as the "natural order". All we can do,
when confronted with diverse expectations, is to accept that they
exist and adapt to them.

Please take a look at the Unicode Collation page:

    http://unicode.org/reports/tr10/

> =E2=80=A6 but is consistently the same (e.g. en_CA.UTF-8 and fr_FR.UTF-8)=
.
> For some reason, ja_JP.UTF-8 order is more like ASCII.
>
> Compare the following:
>
>   $ printf "%s\n" a A b B "=C3=A1" "=E3=81=82" "=E3=81=B2" . , : - =3D > =
in
>   $ LC_ALL=3DC           sort in > out-C
>   $ LC_ALL=3Den_CA.UTF-8 sort in > out-CA
>   $ LC_ALL=3Dja_JP.UTF-8 sort in > out-JA
>   $ paste out-C out-CA out-JA
> , =3D ,
> - - -
> . , .
> : : :
> =3D . =3D
> A =E3=81=82 A
> B =E3=81=B2 B
> a A a
> b a b
> =C3=A1 =C3=A1 =E3=81=82
> =E3=81=82 B =E3=81=B2
> =E3=81=B2 b =C3=A1

What all of the above reveals is one order: the order that sort follows.
But you are still failing to get it:

    That is entirely different from what glibc follows. Try:

    $ LC_ALL=3DC                   sed 's/[A-B]/x/g'  out-C     >out-C-sed
    $ LC_ALL=3Den_CA.utf8   sed 's/[A-B]/x/g'   out-CA  >out-CA-sed
    $ LC_ALL=3Den_JP.utf8     sed 's/[A-B]/x/g'   out-JA  >out-JA-sed
    $ paste     out-C-sed      out-CA-sed      out-JA-sed
    ,   =3D   ,
    -   -   -
    .   ,   .
    :   :   :
    =3D   .   =3D
        =E3=81=82
        =E3=81=B2
    a       a
    b   a   b
    =C3=A1   =C3=A1   =E3=81=82
    =E3=81=82       =E3=81=B2
    =E3=81=B2   b   =C3=A1

    The `a` and the `=C3=A1` were sorted between `A` and `B` in the
    en_CA.utf8 locale, but sed did NOT match them.
    Yes, this is just one particular example in the en_CA.utf8 locale.
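
The comparison above can be sketched as two small helpers, one asking
sort and one asking sed about the same sample characters. The names
sorted_samples and range_members are hypothetical, and the locale given
as first argument is assumed to be installed. Only the C locale result
is predictable; that is the whole point.

```shell
# sorted_samples LOCALE
# Print the sample characters A a B b in the order sort gives under LOCALE.
sorted_samples() {
    printf '%s\n' A a B b | LC_ALL="$1" sort | tr -d '\n'
    echo
}

# range_members LOCALE RANGE
# Print those sample characters that sed matches with [RANGE] under LOCALE.
range_members() {
    printf '%s\n' A a B b | LC_ALL="$1" sed -n "/^[$2]\$/p" | tr -d '\n'
    echo
}

# In the C locale both tools agree with the byte values.
sorted_samples C
range_members C A-B
```

With en_CA.utf8 as the first argument the two helpers need not agree:
what sort places between `A` and `B` is not necessarily what sed matches
with [A-B].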

[...]
>>> As such, I'm marking this as "not a bug" and closing the ticket,
>>> but discussion can continue by replying to this thread.
>>
>> I still remain in doubt, at the very minimum.
>
> I hope this helps clears things out, but I'm happy to continue
> this discussion if there are other questions.

I am clear now that this is unsupported by sed, thanks.

--000000000000f3c10b056d007ae4--


From unknown Sat Sep 20 10:49:24 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request <help-debbugs@gnu.org>
Subject: Internal Control
Message-Id: bug archived.
Date: Fri, 22 Jun 2018 11:24:03 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator