From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:01:34 2019
From: jan h
Date: Thu, 5 Dec 2019 18:30:58 +0000
Subject: Locale can cause incorrect number parsing in binary files
To: bug-grep@gnu.org
Content-Type: text/plain; charset="UTF-8"

grep 3.3

I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047 and 1033 and 1036 etc, so they're multi-byte characters
meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...
it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...

worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...
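
The /dev/urandom pipeline in the report gives a different count on every
run, which makes results hard to compare across machines. A minimal sketch
of a deterministic variant follows, assuming a grep that supports -o and
-a. The helper names sample and count_match_bytes, and the choice of U+0663
as the non-ASCII digit, are illustrative assumptions and not part of the
original report. A count of 3 means only the ASCII digits matched; 5 would
mean the two-byte digit matched as well. Outside the C locale the result
depends on the system's locale data and regex implementation, so no output
is claimed for en_US.UTF-8.

```shell
#!/bin/sh
# Emit a fixed sample: ASCII digits 1, 2, 3 plus U+0663 ARABIC-INDIC DIGIT
# THREE (UTF-8 bytes 0xD9 0xA3, written as octal escapes for portability).
# The choice of U+0663 is an assumption made for illustration only.
sample() {
    printf 'a1b2\331\243c3\n'
}

# count_match_bytes LOCALE PATTERN  (hypothetical helper)
# Print how many bytes grep -o emits for PATTERN under LOCALE,
# not counting the newline that -o places after each match.
count_match_bytes() {
    sample | LC_ALL="$1" grep -a -o "$2" | tr -d '\n' | wc -c | tr -d ' '
}

# Compare the range expression with the character class in two locales.
# If en_US.UTF-8 is not installed, the C library falls back to the C locale.
for loc in C en_US.UTF-8; do
    for pat in '[0-9]' '[[:digit:]]'; do
        printf '%-12s %-12s %s bytes\n' "$loc" "$pat" \
            "$(count_match_bytes "$loc" "$pat")"
    done
done
```

If the [0-9] row for a UTF-8 locale reports more bytes than the
[[:digit:]] row, the range is matching something beyond the ten ASCII
digits on that system.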
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:29:34 2019
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary files
To: jan h, 38503-done@debbugs.gnu.org
From: Eric Blake
Organization: Red Hat, Inc.
Message-ID: <756269ef-ec82-f723-1bc8-b784bfbabad9@redhat.com>
Date: Thu, 5 Dec 2019 14:29:19 -0600
Content-Type: text/plain; charset=utf-8; format=flowed

tag 38503 notabug
thanks

On 12/5/19 12:30 PM, jan h wrote:
> grep 3.3
>
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047 and 1033 and 1036 etc, so they're multi-byte characters

It's important to note that POSIX says that the regex [0-9] has
locale-dependent effects. Outside of the C/POSIX locale, it matches
whatever the locale definition says it should. For example, some
locales allow [A-Z] to match non-ASCII letters like Á. Similarly, as
you have found, on your system, the en_US.UTF-8 locale is defined to
match non-ASCII Unicode digits when a range expression for [0-9] is in
force.

Note that the Rational Range Interpretation of ranges claims that [0-9]
should have the expansion [0123456789] in ALL locales; and more and more
versions of GNU utilities are starting to move to RRI (even newer glibc
is trying to move towards RRI for more regex operations). If this
example is run where RRI is in force, then it should not match non-ASCII
Unicode digits. But you didn't mention which version of grep you are
using, let alone which version of libc is providing your locale
definitions, to make that determination; and POSIX does not require RRI.

> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...

Well, in the POSIX locale (C.UTF-8 is not the POSIX locale, but follows
enough of the same rules), [0-9] _is_ required to match the same as
[0123456789]. That's the only locale where you get RRI for free,
rather than having to worry if your choice of program version and locale
definition provide it.

> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...

en_US.iso88591 does not have the problem because in that encoding, there
aren't any non-ASCII digits. So [0-9] will never match any non-ASCII
Unicode digits because the charset in use doesn't have such characters.

>
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...

POSIX requires [[:digit:]] to expand to the same 10 characters in ALL
locales, regardless of what the implementation does with [0-9], and
regardless of whether an implementation uses RRI. (This is true for
[[:digit:]], but not for other named classes; for example, [[:alpha:]] is
still locale-dependent and may expand to more than 26 characters).

Since the problem you reported is due to your locale, I'm closing this
as a non-bug. We may reopen it if additional details show that your
version of grep was supposed to be using RRI but failed to do so.
And feel free to continue the conversation, even if we don't reopen the bug.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
+1-919-301-3226
Virtualization: qemu.org | libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:40:48 2019
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary files
From: Eric Blake
To: jan h, 38503-done@debbugs.gnu.org
References: <756269ef-ec82-f723-1bc8-b784bfbabad9@redhat.com>
In-Reply-To: <756269ef-ec82-f723-1bc8-b784bfbabad9@redhat.com>
Organization: Red Hat, Inc.
Date: Thu, 5 Dec 2019 14:40:42 -0600
Content-Type: text/plain; charset=utf-8; format=flowed

On 12/5/19 2:29 PM, Eric Blake wrote:
> tag 38503 notabug
> thanks
>
> On 12/5/19 12:30 PM, jan h wrote:
>> grep 3.3
>>
>
> Note that the Rational Range Interpretation of ranges claims that [0-9]
> should have the expansion [0123456789] in ALL locales; and more and more
> versions of GNU utilities are starting to move to RRI (even newer glibc
> is trying to move towards RRI for more regex operations). If this
> example is run where RRI is in force, then it should not match non-ASCII
> Unicode digits. But you didn't mention which version of grep you are
> using, let alone which version of libc is providing your locale
> definitions, to make that determination; and POSIX does not require RRI.

Sorry, I missed that you did mention grep 3.3. And the NEWS for grep
does not mention 'RRI' or 'Rational Range Interpretation' (compare that
to bash 4.2 introducing globasciiranges, or gawk introducing RRI in
4.0.1).
So I'm not sure of the current state of whether grep tries to
use RRI on all systems or only on systems where it relies on gnulib's
regcomp instead of libc. So we may still need to reopen this if we
decide grep needs more RRI fixes.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
+1-919-301-3226
Virtualization: qemu.org | libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:43:59 2019
From: jan h
Date: Thu, 5 Dec 2019 18:40:21 +0000
Subject: Re: Locale can cause incorrect number parsing in binary files
To: bug-grep@gnu.org
Content-Type: text/plain; charset="UTF-8"

On another machine with grep 3.1 this does not appear to be the case,
so, regression?

On Thu, 5 December 2019 at 18:30, jan h wrote:
>
> grep 3.3
>
> I get a few weird symbols (seems valid utf-8), along with normal
> numbers with the following simple snippet (.UTF-8 and .utf8 result in
> same, even .UtF---8 is the same):
> LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
> wc -c counts 1047 and 1033 and 1036 etc, so they're multi-byte characters
> meanwhile, with LC_ALL being C.UTF-8 this is not the case,
> LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
> consistently results in 1024 characters/bytes, as it's supposed to be...
> it's not just en_US, it seems ANY utf-8 locale, other than C results
> in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
> show this bug, nor does en_US.iso88591...
>
> worthy of note is that [[:digit:]] works correctly, while [0-9] does
> not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
> doesn't change anything either...
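
Telling a regression in grep apart from a difference in the C library or
locale data needs the same facts recorded on both machines. A minimal
sketch of such a report follows. The function name report_versions is an
illustrative assumption, and getconf GNU_LIBC_VERSION only answers on
glibc systems, so the sketch falls back to "unknown" elsewhere.

```shell
#!/bin/sh
# report_versions (hypothetical helper): print the grep version, the C
# library version as reported by getconf, and the current LC_ALL setting.
report_versions() {
    grep_v=$(grep --version 2>/dev/null | head -n 1)
    libc_v=$(getconf GNU_LIBC_VERSION 2>/dev/null)
    printf 'grep: %s\n' "${grep_v:-unknown}"
    printf 'libc: %s\n' "${libc_v:-unknown}"
    printf 'LC_ALL: %s\n' "${LC_ALL:-unset}"
}

report_versions
```

Running this on the grep 3.1 machine and on the grep 3.3 machine shows
whether the two differ only in grep or also in the library underneath it.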
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:44:00 2019
From: jan h
Date: Thu, 5 Dec 2019 18:55:01 +0000
Subject: Re: Locale can cause incorrect number parsing in binary files
To: bug-grep@gnu.org
Content-Type: text/plain; charset="UTF-8"

compiling from scratch resulted in a normal, working version
apparently Arch's package was somehow badly made?
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:50:19 2019
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary files
To: jan h, 38503@debbugs.gnu.org
From: Eric Blake
Organization: Red Hat, Inc.
Message-ID: <8206172d-1dfc-4509-5f21-e6a24d01830b@redhat.com>
Date: Thu, 5 Dec 2019 14:50:12 -0600
Content-Type: text/plain; charset=utf-8; format=flowed

On 12/5/19 12:55 PM, jan h wrote:
> compiling from scratch resulted in a normal, working version
> apparently Arch's package was somehow badly made?

You also need to check whether your builds were using gnulib's regcomp
replacement, or sticking with the one from glibc; and in turn which
version of glibc is in use (as it was glibc 2.28 that tried to use RRI
in more locales, although work is still not complete there - and the
presence or absence of particular historical glibc regcomp bugs
determines whether configure decides to use gnulib's version instead).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
+1-919-301-3226
Virtualization: qemu.org | libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 05 15:56:13 2019
Subject: Re: bug#38503: Locale can cause incorrect number parsing in binary files
To: Eric Blake, jan h, 38503-done@debbugs.gnu.org
References: <756269ef-ec82-f723-1bc8-b784bfbabad9@redhat.com>
From: Paul Eggert
Organization: UCLA Computer Science Department
Message-ID: <248b9c64-5cdf-f2f2-a902-187f68f99a4e@cs.ucla.edu>
Date: Thu, 5 Dec 2019 12:56:03 -0800
Content-Type: text/plain; charset=utf-8; format=flowed

On 12/5/19 12:40 PM, Eric Blake wrote:
> I'm not sure of the current state of whether grep tries to use RRI on
> all systems or only on systems where it relies on gnulib's regcomp
> instead of libc.

As I recall, grep doesn't make any special effort to use RRI. That is,
if the underlying library uses RRI, then grep does so as well; otherwise
it doesn't.

From unknown Tue Aug 19 21:02:52 2025
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Fri, 03 Jan 2020 12:24:04 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
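
Since the thread concludes that [0-9] follows whatever the underlying
regex library does, a script that wants exactly the ten ASCII digits can
avoid the question by pinning the locale for that one command. A minimal
sketch, assuming a grep with -o and -a; the function name ascii_digits is
an illustrative assumption. Per the discussion above, [[:digit:]] or an
explicit [0123456789] list are alternatives when the locale must stay
unchanged.

```shell
#!/bin/sh
# ascii_digits (hypothetical helper): read stdin and print each ASCII
# digit on its own line. LC_ALL=C applies to this grep invocation only,
# so the rest of the script keeps its locale.
ascii_digits() {
    LC_ALL=C grep -a -o '[0-9]'
}

# Sample input: digits 1 and 2 around U+0663 (UTF-8 bytes 0xD9 0xA3).
printf 'a1\331\243b2\n' | ascii_digits | tr -d '\n'
printf '\n'
# prints "12"
```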