From unknown Sat Jun 14 03:48:05 2025
Content-Type: text/plain; charset=utf-8
From: bug#33837 <33837@debbugs.gnu.org>
To: bug#33837 <33837@debbugs.gnu.org>
Subject: Status: Unexpected result for regex with non-ascii range
Reply-To: bug#33837 <33837@debbugs.gnu.org>
Date: Sat, 14 Jun 2025 10:48:05 +0000

retitle 33837 Unexpected result for regex with non-ascii range
reassign 33837 grep
submitter 33837 Reinis Danne
severity 33837 normal
tag 33837 notabug
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 22 16:33:44 2018
From: Reinis Danne
Date: Sat, 22 Dec 2018 21:43:46 +0200
Subject: Unexpected result for regex with non-ascii range
To: bug-grep@gnu.org
Content-Type: text/plain; charset="UTF-8"

Hi!

grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
of yY for lv_LV.UTF-8 locale (by implementing rational range
interpretation?) [1].

[1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774

However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:

$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
Ž
$ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
a
āĀb
c
čČd
e
ēĒf
g
ģĢh
i
īĪy
j
k
ķĶl
ļĻm
n
ņŅo
ōŌp
q
r
ŗŖs
šŠt
u
ūŪv
w
x
z
žŽ

For the uppercase the result is completely bogus, but for the lowercase range
it seems that accented uppercase letters are interleaved with the
lowercase ones.

I would expect all letters to have their uppercase variants de-interleaved here.

I don't know if grep alters the collation rules or it is done by glibc (2.28).
strxfrm() gives me this result:

Using LC_COLLATE=lv_LV.UTF-8
char    strxfrm
i       c2b7010201020101e29b96
I       c2b7010201070101e2afb7
ī       c2b70102140102020101e29bb7
Ī       c2b70102140107020101e2b096
y       c2b701030102
Y       c2b701030107
j       c382010201020101e29c96
J       c382010201070101e2b0a4

Using LC_COLLATE=C.UTF-8
char    strxfrm
i       6b
I       4b
ī       c4ad
Ī       c4ac
y       7b
Y       5b
j       6c
J       4c

Reinis

From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 23 15:18:12 2018
From: Jim Meyering
Date: Sun, 23 Dec 2018 12:17:52 -0800
Subject: Re: bug#33837: Unexpected result for regex with non-ascii range
To: rei4dan@gmail.com
Cc: 33837@debbugs.gnu.org
Content-Type: text/plain; charset="UTF-8"

tags 33873 notabug
close 33873
stop

On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne wrote:
> grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> of yY for lv_LV.UTF-8 locale (by implementing rational range
> interpretation?) [1].
>
> [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
>
> However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> Ž
> $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> a
> āĀb
> c
> čČd
...
>
> For the uppercase the result is completely bogus, but for the lowercase range
> it seems that accented uppercase letters are interleaved with the
> lowercase ones.
>
> I would expect all letters to have their uppercase variants de-interleaved here.
>
> I don't know if grep alters the collation rules or it is done by glibc (2.28).
> strxfrm() gives me this result:
> Using LC_COLLATE=lv_LV.UTF-8
> char strxfrm
> i c2b7010201020101e29b96
> I c2b7010201070101e2afb7
...

Thanks for the report. However, ...
Using a multi-byte character as a range endpoint elicits what the
standards documents call "unspecified behavior".

Quoting grep's own manual,

> Within a bracket expression, a "range expression" consists of two
> characters separated by a hyphen. It matches any single character that
> sorts between the two characters, inclusive. In the default C locale,
> the sorting sequence is the native character order; for example,
> '[a-d]' is equivalent to '[abcd]'. In other locales, the sorting
> sequence is not specified, and '[a-d]' might be equivalent to '[abcd]'
> or to '[aBbCcDd]', or it might fail to match any character, or the set
> of characters that it matches might even be erratic. To obtain the
> traditional interpretation of bracket expressions, you can use the 'C'
> locale by setting the 'LC_ALL' environment variable to the value 'C'.

For the record, POSIX says this:
http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:

> Range expressions are, historically, an integral part of REs. However,
> the requirements of "natural language behavior" and portability do
> conflict. In the POSIX locale, ranges must be treated according to the
> collating sequence and include such characters that fall within the
> range based on that collating sequence, regardless of character values.
> In other locales, ranges have unspecified behavior.

I am marking the auto-created issue as "not-a-bug", and can't even
(reasonably) label it as "wishlist", because allowing what your usage
implies is fundamentally contradictory.

You're welcome to continue the discussion here.

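The workaround the grep manual describes, forcing the C locale, can be sketched as below. This is only an illustration, not something posted in the thread: the helper name c_locale_matches is made up for the example, and it assumes ASCII-only input, because in the C locale grep treats each byte as a character and a multi-byte endpoint such as ž would no longer be a single character.

```shell
# Hypothetical helper (not part of grep): print every non-empty match of
# pattern $2 in string $1, with the C locale forced for this one command.
c_locale_matches() {
  printf '%s\n' "$1" | LC_ALL=C grep -o "$2"
}

# In the C locale '[a-c]' is equivalent to '[abc]', so each uppercase
# letter ends the current match and the lowercase letters are printed
# one per line.
c_locale_matches 'aAbBcC' '[a-c]*'
```

Setting LC_ALL on the command itself, rather than exporting it, keeps the rest of the shell session in the user's own locale.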
From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 23 16:06:59 2018
From: Reinis Danne
Date: Sun, 23 Dec 2018 23:06:40 +0200
Subject: Re: bug#33837: Unexpected result for regex with non-ascii range
To: Jim Meyering
Cc: 33837@debbugs.gnu.org
Content-Type: text/plain; charset="UTF-8"

On Sun, 23 Dec 2018 at 22:18, Jim Meyering wrote:
>
> tags 33873 notabug
> close 33873
> stop
>
> On Sat, Dec 22, 2018 at 1:34 PM Reinis Danne wrote:
> > grep-3.3 and sed-4.6 seem to have fixed issue with incorrect collation
> > of yY for lv_LV.UTF-8 locale (by implementing rational range
> > interpretation?) [1].
> >
> > [1] https://sourceware.org/bugzilla/show_bug.cgi?id=23774
> >
> > However, it seems that for ranges [a-ž] and [A-Ž] there are unexpected results:
> > $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> > | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[A-Ž]*'
> > aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZ
> > Ž
> > $ echo aAāĀbBcCčČdDeEēĒfFgGģĢhHiIīĪyYjJkKķĶlLļĻmMnNņŅoOōŌpPqQrRŗŖsSšŠtTuUūŪvVwWxXzZžŽ
> > | LC_COLLATE=lv_LV.UTF-8 grep -Eo '[a-ž]*'
> > a
> > āĀb
> > c
> > čČd
> ...
> >
> > For the uppercase the result is completely bogus, but for the lowercase range
> > it seems that accented uppercase letters are interleaved with the
> > lowercase ones.
> >
> > I would expect all letters to have their uppercase variants de-interleaved here.
> >
> > I don't know if grep alters the collation rules or it is done by glibc (2.28).
> > strxfrm() gives me this result:
> > Using LC_COLLATE=lv_LV.UTF-8
> > char strxfrm
> > i c2b7010201020101e29b96
> > I c2b7010201070101e2afb7
> ...
>
> Thanks for the report. However, ...
> Using a multi-byte character as a range endpoint elicits what the
> standards documents call "unspecified behavior".
>
> Quoting grep's own manual,
>
> > Within a bracket expression, a "range expression" consists of two
> > characters separated by a hyphen. It matches any single character that
> > sorts between the two characters, inclusive. In the default C locale,
> > the sorting sequence is the native character order; for example,
> > '[a-d]' is equivalent to '[abcd]'. In other locales, the sorting
> > sequence is not specified, and '[a-d]' might be equivalent to '[abcd]'
> > or to '[aBbCcDd]', or it might fail to match any character, or the set
> > of characters that it matches might even be erratic. To obtain the
> > traditional interpretation of bracket expressions, you can use the 'C'
> > locale by setting the 'LC_ALL' environment variable to the value 'C'.
>
> For the record, POSIX says this:
> http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html:
>
> > Range expressions are, historically, an integral part of REs. However,
> > the requirements of "natural language behavior" and portability do
> > conflict. In the POSIX locale, ranges must be treated according to the
> > collating sequence and include such characters that fall within the
> > range based on that collating sequence, regardless of character values.
> > In other locales, ranges have unspecified behavior.
>
> I am marking the auto-created issue as "not-a-bug", and can't even
> (reasonably) label it as "wishlist", because allowing what your usage
> implies is fundamentally contradictory.
>
> You're welcome to continue the discussion here.

Thank you for the response. I had read that document before. I didn't
realize that sorting order and collation order are two different
things, or rather that alphabetic sorting would imply collation, while
the sorting order the manual was talking about refers to comparison of
code point numerical values.

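Reading a range as a comparison of code point values would account for both outputs in the original report. The sketch below emulates that reading with sort; it is not how grep is implemented, and the helper name in_codepoint_range is invented for the example. It relies on two assumptions: the script is saved as UTF-8, and sort in the C locale compares raw bytes, which for UTF-8 text gives the same order as comparing code points.

```shell
# Hypothetical helper (not part of grep): succeed if character $3 lies
# between $1 and $2, inclusive, in byte order (code point order for UTF-8).
in_codepoint_range() {
  # "sort -c" only checks whether its input is already sorted.
  printf '%s\n' "$1" "$3" "$2" | LC_ALL=C sort -c 2>/dev/null
}

# Classify a few characters from the report against the range a..ž.
for ch in A Ā a ā Z Ž z ž; do
  if in_codepoint_range a ž "$ch"; then
    printf '%s: inside [a-ž]\n' "$ch"
  else
    printf '%s: outside [a-ž]\n' "$ch"
  fi
done
```

Under this reading only the ASCII uppercase letters (U+0041 to U+005A) fall outside a..ž (U+0061 to U+017E), which would explain why accented uppercase letters appear in the matches for '[a-ž]*'. For the range A..Ž, the one character left out is ž (U+017E), because it sits just past Ž (U+017D).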
From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 04:00:37 2020
From: Paul Eggert
Organization: UCLA Computer Science Department
Date: Thu, 2 Jan 2020 01:00:28 -0800
Subject: 33837 is not a bug
To: control@debbugs.gnu.org
Content-Type: text/plain; charset=utf-8

tags 33837 notabug
close 33837
stop

From unknown Sat Jun 14 03:48:05 2025
From: Debbugs Internal Request
To: internal_control@debbugs.gnu.org
Subject: Internal Control
Message-Id: bug archived.
Date: Thu, 30 Jan 2020 12:24:05 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator