From unknown Sun Jun 22 00:47:58 2025
From: Stephane Chazelas
To: 24975@debbugs.gnu.org
Subject: bug#24975: Matching issues with characters whose encoding ends in some other character
Date: Sun, 20 Nov 2016 21:50:28 +0000
Message-ID: <20161120215028.GB25881@chaz.gmail.com>
X-Debbugs-Original-To: bug-grep@gnu.org
X-GNU-PR-Message: report 24975
X-GNU-PR-Package: grep
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

$ locale charmap
GB18030
$ printf '\uC9\n' | grep '.*7' | hd
00000000  81 30 87 37 0a                                    |.0.7.|
00000005

U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).

$ printf '\uC9\n' | grep '.*0'

fails.

$ printf '\uC9\n' | grep -o '.*7'

returns with a zero exit status but outputs nothing. It's as if
.*7 matched an empty string somewhere.

printf '\uC9\n' | grep '\(.*7\)\1'

fails.

so do:

grep 7
grep '7$'
grep '.7'
grep '[^x]*7'
printf 'x\uC9\n' | grep -E '.+7'

These match:

grep '.\{0,1\}7'
grep -E '.?7'
printf '\uC9x\n' | grep '.*7x' # still outputs nothing with -o

That's not confined to GB18030. You get similar issues with
BIG5-HKSCS, BIG5 or GBK.

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\n' | grep '.*m' | hd
00000000  88 6d 0a                                          |.m.|
00000003

Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.

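The byte-level reason for the false match can be checked without a GB18030 locale. The sketch below is an illustration, not part of the report: it writes the four bytes shown in the hd output with octal escapes, and the function name u00c9_gb18030 is made up for this example. In the C locale every byte is its own character, so finding "7" there is expected. In a GB18030 locale the same byte is only the tail of U+00C9 and should not match on its own.

```shell
# Emit U+00C9 as encoded in GB18030 (bytes 81 30 87 37) plus a newline.
# Octal escapes are used so that no GB18030 locale has to be installed.
u00c9_gb18030() {
    printf '\2010\2077\n'
}

# Show the raw bytes; the last byte before the newline is 0x37, ASCII "7".
u00c9_gb18030 | od -An -tx1

# Byte-oriented matching (C locale) counts one line containing the byte "7".
u00c9_gb18030 | LC_ALL=C grep -c 7
```

The C-locale result is what a purely byte-based matcher sees; the report above is about the multibyte locale producing the same answer for patterns such as '.*7'.
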
-- 
Stephane

From unknown Sun Jun 22 00:47:58 2025
From: Stephane Chazelas
To: 24975@debbugs.gnu.org
Subject: bug#24975: Matching issues with characters whose encoding ends in some other character
Date: Sun, 20 Nov 2016 22:59:10 +0000
Message-ID: <20161120225910.GA27109@chaz.gmail.com>
In-Reply-To: <20161120215028.GB25881@chaz.gmail.com>
References: <20161120215028.GB25881@chaz.gmail.com>
X-Debbugs-Original-To: bug-grep@gnu.org
X-GNU-PR-Message: followup 24975
X-GNU-PR-Package: grep
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii

2016-11-20 21:50:28 +0000, Stephane Chazelas:
> $ locale charmap
> GB18030
> $ printf '\uC9\n' | grep '.*7' | hd
> 00000000  81 30 87 37 0a                                    |.0.7.|
> 00000005
>
> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
[...]
> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
[...]

Same behaviour with 2.26 on Solaris 11.

-- 
Stephane

From unknown Sun Jun 22 00:47:58 2025
From: Jim Meyering
To: Stephane Chazelas
Cc: 24975@debbugs.gnu.org
Subject: bug#24975: Matching issues with characters whose encoding ends in some other character
Date: Sun, 20 Nov 2016 21:53:29 -0800
In-Reply-To: <20161120225910.GA27109@chaz.gmail.com>
References: <20161120215028.GB25881@chaz.gmail.com> <20161120225910.GA27109@chaz.gmail.com>
X-GNU-PR-Message: followup 24975
X-GNU-PR-Package: grep
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas wrote:
> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>> $ locale charmap
>> GB18030
>> $ printf '\uC9\n' | grep '.*7' | hd
>> 00000000  81 30 87 37 0a                                    |.0.7.|
>> 00000005
>>
>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
> [...]
>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
> [...]
>
> Same behaviour with 2.26 on Solaris 11.

Thank you for the report.
I can reproduce that error on Fedora 25 with this:

$ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
5

I confirmed that the problem does not arise (i.e., no match, with exit
status of 1) when we force the use of glibc's regex matcher by
inserting a trivial back-reference:

$ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E '()\1.*7' k); echo $?
1

This bisected to v2.18-54-g3ef4c8e, but that commit was just the
messenger: it exposed the latent bug by making it so this case was no
longer handled by glibc's regexp matcher, but rather by grep's dfa.c.

From unknown Sun Jun 22 00:47:58 2025
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Stephane Chazelas
Subject: bug#24975: closed (Re: bug#24975: Matching issues with characters whose encoding ends in some other character)
Date: Mon, 28 Nov 2016 00:00:03 +0000
References: <20161120215028.GB25881@chaz.gmail.com>
Reply-To: 24975@debbugs.gnu.org
X-Gnu-PR-Message: they-closed 24975
X-Gnu-PR-Package: grep
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_1480291203-16791-1"

This is a multi-part message in MIME format...

------------=_1480291203-16791-1
Content-Disposition: inline
Content-Type: text/plain; charset="utf-8"

Your bug report

#24975: Matching issues with characters whose encoding ends in some other character

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 24975@debbugs.gnu.org.

-- 
24975: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24975
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1480291203-16791-1
Content-Type: message/rfc822
Content-Disposition: inline

From: Jim Meyering
To: Stephane Chazelas
Cc: "bug-gnulib@gnu.org List", 24975-done@debbugs.gnu.org
Subject: Re: bug#24975: Matching issues with characters whose encoding ends in some other character
Date: Sun, 27 Nov 2016 15:59:05 -0800
References: <20161120215028.GB25881@chaz.gmail.com> <20161120225910.GA27109@chaz.gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary=001a113ed60a31e2ff05425125a9

--001a113ed60a31e2ff05425125a9
Content-Type: text/plain; charset=UTF-8

On Sun, Nov 20, 2016 at 9:53 PM, Jim Meyering wrote:
> On Sun, Nov 20, 2016 at 2:59 PM, Stephane Chazelas
> wrote:
>> 2016-11-20 21:50:28 +0000, Stephane Chazelas:
>>> $ locale charmap
>>> GB18030
>>> $ printf '\uC9\n' | grep '.*7' | hd
>>> 00000000  81 30 87 37 0a                                    |.0.7.|
>>> 00000005
>>>
>>> U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).
>> [...]
>>> Reproduced with 2.25, 2.26 and the current git head on ubuntu 16.04 amd64.
>> [...]
>>
>> Same behaviour with 2.26 on Solaris 11.
>
> Thank you for the report.
> I can reproduce that error on Fedora 25 with this:
>
> $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep '.*7' k)|wc -c
> 5
>
> I confirmed that the problem does not arise (i.e., no match, with exit
> status of 1) when we force the use of glibc's regex matcher by
> inserting a trivial back-reference:
>
> $ (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep -E
> '()\1.*7' k); echo $?
> 1
>
> This bisected to v2.18-54-g3ef4c8e, but that commit was just the
> messenger: it exposed the latent bug by making it so this case was no
> longer handled by glibc's regexp matcher, but rather by grep's dfa.c.

I've fixed this by forcing any non-UTF8 multibyte locale to use regex
rather than DFA matcher with the following.
The gnulib/dfa patch makes that change, and the grep change updates to
latest gnulib, adds tests and NEWS.

I suspect this won't be the last word in this area, because it feels
like we should be able to adjust DFA's tables so that people using
such locales can retain DFA's efficiency without the bug in the
current implementation.

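For grep releases that still contain the DFA bug, the back-reference trick quoted above can be wrapped in a small shell function. This is only a sketch: grep_via_regex is a hypothetical helper name, not something grep provides, and it relies on GNU grep accepting back-references with -E. That such a pattern is handed to the regex matcher is taken from the message above, not verified here.

```shell
# Hypothetical helper: prefix an extended regular expression with the
# empty group and back-reference "()\1" before passing it to grep -E.
# The prefix matches the empty string, so the set of matching lines is
# meant to stay the same.  Caveats: back-references in the caller's
# pattern shift by one, and a top-level alternation such as 'a|b' needs
# its own parentheses.
grep_via_regex() {
    pattern=$1
    shift
    grep -E "()\\1$pattern" "$@"
}

# Usage, mirroring the command from the message above:
#   (export LC_ALL=zh_CN.gb18030; printf '\uC9\n' > k; grep_via_regex '.*7' k)
```

With the fix described above applied, grep makes the equivalent choice itself in non-UTF8 multibyte locales, so a wrapper like this would only matter for the affected releases.
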
--001a113ed60a31e2ff05425125a9 Content-Type: text/plain; charset=US-ASCII; name="gnulib-dfa-mb-non-UTF8-fix.diff" Content-Disposition: attachment; filename="gnulib-dfa-mb-non-UTF8-fix.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_iw1b3g7e0 RnJvbSBiZDZkNjZlNTAyNzg2ZGYyMWQyZGNhYTdiNDczZWU4NTFmODQwYWFhIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U3VuLCAyNyBOb3YgMjAxNiAxNTozNjo1MSAtMDgwMApTdWJqZWN0OiBbUEFUQ0hdIGRmYTogYXZv aWQgZmFsc2UgbWF0Y2ggaW4gbm9uLVVURjggbXVsdGlieXRlIGxvY2FsZXMKCiogbGliL2RmYS5j IChkZmFfc3VwcG9ydGVkKTogVHJlYXQgYW55IG5vbi1VVEY4IG11bHRpYnl0ZSBsb2NhbGUKYXMg Im5vdCBzdXBwb3J0ZWQiIHNvIHRoYXQgY2FsbGVycyB3aWxsIHJlc29ydCB0byB1c2luZyByZWdl eC1iYXNlZAptYXRjaGVyLiAgVGhpcyB3aWxsIHN1cmVseSBodXJ0IHBlcmZvcm1hbmNlLCBidXQg Y29ycmVjdG5lc3MgdHJ1bXBzCnBlcmZvcm1hbmNlIGhlcmUsIGFuZCB0aGUgYWZmZWN0ZWQgbG9j YWxlcyBhcmUgbGVzcyBhbmQgbGVzcyByZWxldmFudCwKdGhlc2UgZGF5cy4gIFNlZSBncmVwJ3Mg YnVnIHJlcG9ydCBodHRwczovL2J1Z3MuZ251Lm9yZy8yNDk3NS4KLS0tCiBDaGFuZ2VMb2cgfCA5 ICsrKysrKysrKwogbGliL2RmYS5jIHwgNiArKysrKysKIDIgZmlsZXMgY2hhbmdlZCwgMTUgaW5z ZXJ0aW9ucygrKQoKZGlmZiAtLWdpdCBhL0NoYW5nZUxvZyBiL0NoYW5nZUxvZwppbmRleCAwZGIz ZGE4Li5mZWM0ZmI5IDEwMDY0NAotLS0gYS9DaGFuZ2VMb2cKKysrIGIvQ2hhbmdlTG9nCkBAIC0x LDMgKzEsMTIgQEAKKzIwMTYtMTEtMjcgIEppbSBNZXllcmluZyAgPG1leWVyaW5nQGZiLmNvbT4K KworCWRmYTogYXZvaWQgZmFsc2UgbWF0Y2ggaW4gbm9uLVVURjggbXVsdGlieXRlIGxvY2FsZXMK KwkqIGxpYi9kZmEuYyAoZGZhX3N1cHBvcnRlZCk6IFRyZWF0IGFueSBub24tVVRGOCBtdWx0aWJ5 dGUgbG9jYWxlCisJYXMgIm5vdCBzdXBwb3J0ZWQiIHNvIHRoYXQgY2FsbGVycyB3aWxsIHJlc29y dCB0byB1c2luZyByZWdleC1iYXNlZAorCW1hdGNoZXIuICBUaGlzIHdpbGwgc3VyZWx5IGh1cnQg cGVyZm9ybWFuY2UsIGJ1dCBjb3JyZWN0bmVzcyB0cnVtcHMKKwlwZXJmb3JtYW5jZSBoZXJlLCBh bmQgdGhlIGFmZmVjdGVkIGxvY2FsZXMgYXJlIGxlc3MgYW5kIGxlc3MgcmVsZXZhbnQsCisJdGhl c2UgZGF5cy4gIFNlZSBncmVwJ3MgYnVnIHJlcG9ydCBodHRwczovL2J1Z3MuZ251Lm9yZy8yNDk3 NS4KKwogMjAxNi0xMS0yNyAgTWlrZSBGcnlzaW5nZXIgIDx2YXBpZXJAZ2VudG9vLm9yZz4KCiAJ 
cHRzbmFtZV9yOiBsZXZlcmFnZSBBQ19IRUFERVJfTUFKT1IgdG8gcHJvdmlkZSBtYWpvcigpCmRp ZmYgLS1naXQgYS9saWIvZGZhLmMgYi9saWIvZGZhLmMKaW5kZXggNTU3ODIzMi4uZjBlZDEzOSAx MDA2NDQKLS0tIGEvbGliL2RmYS5jCisrKyBiL2xpYi9kZmEuYwpAQCAtMzI3Miw2ICszMjcyLDEy IEBAIGZyZWVfbWJkYXRhIChzdHJ1Y3QgZGZhICpkKQogc3RhdGljIGJvb2wgX0dMX0FUVFJJQlVU RV9QVVJFCiBkZmFfc3VwcG9ydGVkIChzdHJ1Y3QgZGZhIGNvbnN0ICpkKQogeworICAvKiBEZWNs YXJlIGFueSBub24tVVRGOCBtdWx0aWJ5dGUgbG9jYWxlICJub3Qgc3VwcG9ydGVkLiIgIE90aGVy d2lzZSwgYQorICAgICByZWdleHAgbGlrZSAiLio3IiB3b3VsZCBtaXN0YWtlbmx5IG1hdGNoIFx1 QzksIGUuZy4sIHZpYSB0aGlzIGNvbW1hbmQ6CisgICAgIChleHBvcnQgTENfQUxMPXpoX0NOLmdi MTgwMzA7IHByaW50ZiAnXHVDOVxuJyB8IGdyZXAgJy4qNycpICAqLworICBpZiAoZC0+bG9jYWxl aW5mby5tdWx0aWJ5dGUgJiYgIWQtPmxvY2FsZWluZm8udXNpbmdfdXRmOCkKKyAgICByZXR1cm4g ZmFsc2U7CisKICAgc2l6ZV90IGk7CiAgIGZvciAoaSA9IDA7IGkgPCBkLT50aW5kZXg7IGkrKykK ICAgICB7Ci0tIAoyLjkuMwoK --001a113ed60a31e2ff05425125a9 Content-Type: text/plain; charset=US-ASCII; name="grep-fix-false-matches-mb-non-UTF8.diff" Content-Disposition: attachment; filename="grep-fix-false-matches-mb-non-UTF8.diff" Content-Transfer-Encoding: base64 X-Attachment-Id: f_iw1b3uhi1 RnJvbSBmY2U2NDM4ODY5ODFhYjE0YzFkNGM4ZmQ4ZjBmNGQzM2Y1N2M1ZWY5IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U3VuLCAyNyBOb3YgMjAxNiAxNTozMTozNSAtMDgwMApTdWJqZWN0OiBbUEFUQ0hdIGdyZXA6IGF2 b2lkIGZhbHNlIG1hdGNoZXMgaW4gbm9uLVVURjggbXVsdGlieXRlIGxvY2FsZXMKCiogZ251bGli OiBVcGRhdGUgdG8gbGF0ZXN0LCBmb3IgdGhlIGRmYS5jIGZpeC4KKiBORVdTIChCdWcgZml4ZXMp OiBNZW50aW9uIGl0LgoqIHRlc3RzL2ZhbHNlLW1hdGNoLW1iLW5vbi11dGY4OiBOZXcgZmlsZSwg d2l0aCB0ZXN0cyBmb3IgdGhpcy4KQmFzZWQgb24gdGVzdHMgZnJvbSBTdGVwaGFuZSBDaGF6ZWxh cy4KKiB0ZXN0cy9NYWtlZmlsZS5hbSAoVEVTVFMpOiBBZGQgaXQuCkludHJvZHVjZWQgYnkgY29t bWl0IHYyLjE4LTU0LWczZWY0YzhlLCBhIGNoYW5nZSB0aGF0IG1hZGUgZ3JlcCB1c2UKaXRzIERG QSBtYXRjaGVyIG1vcmUgYWdncmVzc2l2ZWx5LiAgVGhlIG1hbGZ1bmN0aW9uIGFyaXNlcyBvbmx5 IHdpdGgKdGhlIERGQSBtYXRjaGVyLCBub3Qgd2l0aCByZWdleC4KUmVwb3J0ZWQgYnkgU3RlcGhh 
bmUgQ2hhemVsYXMgaW4gaHR0cHM6Ly9idWdzLmdudS5vcmcvMjQ5NzUKLS0tCiBORVdTICAgICAg ICAgICAgICAgICAgICAgICAgICB8ICA3ICsrKysrKysKIGdudWxpYiAgICAgICAgICAgICAgICAg ICAgICAgIHwgIDIgKy0KIHRlc3RzL01ha2VmaWxlLmFtICAgICAgICAgICAgIHwgIDEgKwogdGVz dHMvZmFsc2UtbWF0Y2gtbWItbm9uLXV0ZjggfCAzOCArKysrKysrKysrKysrKysrKysrKysrKysr KysrKysrKysrKysrKwogNCBmaWxlcyBjaGFuZ2VkLCA0NyBpbnNlcnRpb25zKCspLCAxIGRlbGV0 aW9uKC0pCiBjcmVhdGUgbW9kZSAxMDA3NTUgdGVzdHMvZmFsc2UtbWF0Y2gtbWItbm9uLXV0ZjgK CmRpZmYgLS1naXQgYS9ORVdTIGIvTkVXUwppbmRleCBiZDFhMjAxLi45NzFjYmQ5IDEwMDY0NAot LS0gYS9ORVdTCisrKyBiL05FV1MKQEAgLTQsNiArNCwxMyBAQCBHTlUgZ3JlcCBORVdTICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgLSotIG91dGxpbmUgLSotCgogKiogQnVnIGZp eGVzCgorICBncmVwIG5vIGxvbmdlciByZXBvcnRzIGEgZmFsc2UgbWF0Y2ggaW4gYSBtdWx0aWJ5 dGUsIG5vbi1VVEY4IGxvY2FsZQorICBsaWtlIHpoX0NOLmdiMTgwMzAsIHdpdGggYSByZWd1bGFy IGV4cHJlc3Npb24gbGlrZSAiLio3IiB0aGF0IGp1c3QKKyAgaGFwcGVucyB0byBtYXRjaCB0aGUg NC1ieXRlIHJlcHJlc2VudGF0aW9uIG9mIGdiMTgwMzAncyBcdUM5LCB0aGUKKyAgZmluYWwgYnl0 ZSBvZiB3aGljaCBpcyB0aGUgZGlnaXQgIjciLiAgVGhpcyAiZml4IiBpcyB0byBtYWtlIGdyZXAK KyAgYWx3YXlzIHVzZSB0aGUgc2xvd2VyIHJlZ2V4IG1hdGNoZXIgaW4gc3VjaCBsb2NhbGVzLgor ICBbYnVnIGludHJvZHVjZWQgaW4gZ3JlcC0yLjE5XQorCiAgIGdyZXAgYnkgZGVmYXVsdCBub3cg cmVhZHMgYWxsIG9mIHN0YW5kYXJkIGlucHV0IGlmIGl0IGlzIGEgcGlwZSwKICAgZXZlbiBpZiB0 aGlzIGNhbm5vdCBhZmZlY3QgZ3JlcCdzIG91dHB1dCBvciBleGl0IHN0YXR1cy4gIFRoaXMgd29y a3MKICAgYmV0dGVyIHdpdGggbm9ucG9ydGFibGUgc2NyaXB0cyB0aGF0IHJ1biAiUFJPR1JBTSB8 IGdyZXAgUEFUVEVSTgpkaWZmIC0tZ2l0IGEvZ251bGliIGIvZ251bGliCmluZGV4IDYwZThmZmMu LmJkNmQ2NmUgMTYwMDAwCi0tLSBhL2dudWxpYgorKysgYi9nbnVsaWIKQEAgLTEgKzEgQEAKLVN1 YnByb2plY3QgY29tbWl0IDYwZThmZmNhMDJkZDRlYWMzYTg3Yjc0NGY0ZjllZjY4ZjNkZmZhMzUK K1N1YnByb2plY3QgY29tbWl0IGJkNmQ2NmU1MDI3ODZkZjIxZDJkY2FhN2I0NzNlZTg1MWY4NDBh YWEKZGlmZiAtLWdpdCBhL3Rlc3RzL01ha2VmaWxlLmFtIGIvdGVzdHMvTWFrZWZpbGUuYW0KaW5k ZXggNTZlODYwZi4uNDQyZTg1YSAxMDA2NDQKLS0tIGEvdGVzdHMvTWFrZWZpbGUuYW0KKysrIGIv 
dGVzdHMvTWFrZWZpbGUuYW0KQEAgLTk0LDYgKzk0LDcgQEAgVEVTVFMgPQkJCQkJCVwKICAgZXF1 aXYtY2xhc3NlcwkJCQkJXAogICBlcmUJCQkJCQlcCiAgIGV1Yy1tYgkJCQkJXAorICBmYWxzZS1t YXRjaC1tYi1ub24tdXRmOAkJCVwKICAgZmVkb3JhCQkJCQlcCiAgIGZncmVwLWluZmxvb3AJCQkJ CVwKICAgZmlsZQkJCQkJCVwKZGlmZiAtLWdpdCBhL3Rlc3RzL2ZhbHNlLW1hdGNoLW1iLW5vbi11 dGY4IGIvdGVzdHMvZmFsc2UtbWF0Y2gtbWItbm9uLXV0ZjgKbmV3IGZpbGUgbW9kZSAxMDA3NTUK aW5kZXggMDAwMDAwMC4uNmRmZDEwYQotLS0gL2Rldi9udWxsCisrKyBiL3Rlc3RzL2ZhbHNlLW1h dGNoLW1iLW5vbi11dGY4CkBAIC0wLDAgKzEsMzggQEAKKyMhIC9iaW4vc2gKKyMgVGVzdCBmb3Ig ZmFsc2UgbWF0Y2hlcyBpbiBncmVwIDIuMTkuLjIuMjYgaW4gbXVsdGlieXRlLCBub24tVVRGOCBs b2NhbGVzCisjCisjIENvcHlyaWdodCAoQykgMjAxNiBGcmVlIFNvZnR3YXJlIEZvdW5kYXRpb24s IEluYy4KKyMKKyMgQ29weWluZyBhbmQgZGlzdHJpYnV0aW9uIG9mIHRoaXMgZmlsZSwgd2l0aCBv ciB3aXRob3V0IG1vZGlmaWNhdGlvbiwKKyMgYXJlIHBlcm1pdHRlZCBpbiBhbnkgbWVkaXVtIHdp dGhvdXQgcm95YWx0eSBwcm92aWRlZCB0aGUgY29weXJpZ2h0CisjIG5vdGljZSBhbmQgdGhpcyBu b3RpY2UgYXJlIHByZXNlcnZlZC4KKworLiAiJHtzcmNkaXI9Ln0vaW5pdC5zaCI7IHBhdGhfcHJl cGVuZF8gLi4vc3JjCisKKyMgQWRkICIuIiB0byBQQVRIIGZvciB0aGUgdXNlIG9mIGdldC1tYi1j dXItbWF4LgorcGF0aF9wcmVwZW5kXyAuCisKK2ZhaWw9MAorCitsb2M9emhfQ04uZ2IxODAzMAor dGVzdCAiJChnZXQtbWItY3VyLW1heCAkbG9jKSIgPSA0IHx8IHNraXBfICJubyBzdXBwb3J0IGZv ciB0aGUgJGxvYyBsb2NhbGUiCisKKyMgVGhpcyBtdXN0IG5vdCBtYXRjaDogdGhlIGlucHV0IGlz IGEgc2luZ2xlIGNoYXJhY3RlciwgXHVDOSBmb2xsb3dlZAorIyBieSBhIG5ld2xpbmUuICBCdXQg aXQganVzdCBzbyBoYXBwZW5zIHRoYXQgdGhhdCBjaGFyYWN0ZXIgaXMgbWFkZSB1cAorIyBvZiBm b3VyIGJ5dGVzLCB0aGUgbGFzdCBvZiB3aGljaCBpcyB0aGUgZGlnaXQsIDcsIGFuZCBncmVwJ3Mg REZBCisjIG1hdGNoZXIgd291bGQgbWlzdGFrZW5seSByZXBvcnQgdGhhdCAiLio3IiBtYXRjaGVz IHRoYXQgaW5wdXQgbGluZS4KK3ByaW50ZiAnXDIwMTBcMjA3N1xuJyA+IGluIHx8IGZyYW1ld29y a19mYWlsdXJlXworTENfQUxMPSRsb2MgcmV0dXJuc18gMSBncmVwIC1FICcuKjcnIGluIHx8IGZh aWw9MQorCitMQ19BTEw9JGxvYyByZXR1cm5zXyAxIGdyZXAgLUUgJy57MCwxfTcnIGluIHx8IGZh aWw9MQorCitMQ19BTEw9JGxvYyByZXR1cm5zXyAxIGdyZXAgLUUgJy4/NycgaW4gfHwgZmFpbD0x 
CisKKyMgU2ltaWxhciBmb3IgdGhlIFx1ZTkgY29kZSBwb2ludCwgd2hpY2ggZW5kcyBpbiBhbiAi bSIgYnl0ZS4KK2xvYz16aF9ISy5iaWc1aGtzY3MKK3Rlc3QgIiQoZ2V0LW1iLWN1ci1tYXggJGxv YykiID0gMiB8fCBza2lwXyAibm8gc3VwcG9ydCBmb3IgdGhlICRsb2MgbG9jYWxlIgorCitwcmlu dGYgJ1wyMTBtXG4nID4gaW4gfHwgZnJhbWV3b3JrX2ZhaWx1cmVfCitMQ19BTEw9JGxvYyByZXR1 cm5zXyAxIGdyZXAgJy4qbScgaW4gfHwgZmFpbD0xCisKK0V4aXQgJGZhaWwKLS0gCjIuOS4zCgo= --001a113ed60a31e2ff05425125a9-- ------------=_1480291203-16791-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at submit) by debbugs.gnu.org; 20 Nov 2016 21:50:45 +0000 Received: from localhost ([127.0.0.1]:36474 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8a0b-0007j9-9r for submit@debbugs.gnu.org; Sun, 20 Nov 2016 16:50:45 -0500 Received: from eggs.gnu.org ([208.118.235.92]:42521) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1c8a0Z-0007iw-L8 for submit@debbugs.gnu.org; Sun, 20 Nov 2016 16:50:43 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1c8a0T-0005rn-G9 for submit@debbugs.gnu.org; Sun, 20 Nov 2016 16:50:38 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:32833) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1c8a0T-0005rj-7j for submit@debbugs.gnu.org; Sun, 20 Nov 2016 16:50:37 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:33492) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1c8a0S-0008BJ-4j for bug-grep@gnu.org; Sun, 20 Nov 2016 16:50:36 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1c8a0N-0005rU-40 for bug-grep@gnu.org; Sun, 20 Nov 2016 16:50:36 -0500 Received: from 
mail-wm0-x230.google.com ([2a00:1450:400c:c09::230]:37001) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1c8a0M-0005rQ-Sz for bug-grep@gnu.org; Sun, 20 Nov 2016 16:50:31 -0500 Received: by mail-wm0-x230.google.com with SMTP id t79so115134237wmt.0 for ; Sun, 20 Nov 2016 13:50:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=date:from:to:subject:message-id:mime-version:content-disposition :user-agent; bh=RxYzmnrjHZVfoW4M9d1HP1KBWggDBG7yN6ykY5399ww=; b=lpgb/uBxoYtL1mwZH/EXYGXkn+8E90kx8TAQ6EomLr/OtvHlrM7HoFJrkMlSn0GQWQ oFRmBoiARR+0bRcN1lgzVr1R7UY0RsAwfskjSQ/JIZ6WY8NnlpLb9LZBrGI9EGhvznyC AYPpGHm+frtEGh1e87Vjsr/tptQ3gV61EjERvpHVXBGsv5A9bcm594ZKPyBdvMqEJj6z P5SuAJjQ4j/mBjiXK7NQQvMpEgxTB6JuEvuWCFrOgwGz3SlkT2uhcUu8g++ukC/a9FLM witTMTdwHNH1y8OtUj1VBDYgM4HBFzKnZwdcqW7peYJ0o+mxCgs7F/VdY55Id7G+3bPS QwjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:date:from:to:subject:message-id:mime-version :content-disposition:user-agent; bh=RxYzmnrjHZVfoW4M9d1HP1KBWggDBG7yN6ykY5399ww=; b=JAtpUHHrHhacXGnRTVQjhjd0+Vm5boZzTHFukl/kPKrfnALJK7u2+WWpmCIJ3iAwQn ljMmIwx2+rSQ12v00t6SlTlLUBknLGEa753l4DkigThXIrDAlXvZasGivNlIw5DkAhPv PseD4GyFIUNl+Mf4wNbUNH+G/43MaYf+II14eWeNIEiN1t1I4VpeItBmJs59LpzMpDHI MwHFsxKZepEtEypLsMIjk1yL65YZzcbxtVvBI+FiQliBjqULftMvFYQyPiPSBV1vaOgA OFJqf4UR4vO2yNQCwTlZksyLHnu6f1Py3UOS3HBX5dqiR70gpj4UPFAMoqtJAHi4ibhT OWmQ== X-Gm-Message-State: AKaTC03c1ngGIacFPzIImcE52mYlpMI+S4nvQjrY6R6wHbdPtWkBztAEouasOAi7UZRiLQ== X-Received: by 10.28.21.21 with SMTP id 21mr10280642wmv.132.1479678629460; Sun, 20 Nov 2016 13:50:29 -0800 (PST) Received: from chaz.gmail.com ([90.201.137.34]) by smtp.gmail.com with ESMTPSA id x188sm15883151wmx.4.2016.11.20.13.50.28 for (version=TLS1_2 cipher=AES128-SHA bits=128/128); Sun, 20 Nov 2016 13:50:28 -0800 (PST) Date: Sun, 20 Nov 2016 21:50:28 +0000 From: Stephane Chazelas To: bug-grep@gnu.org Subject: 
Matching issues with characters whose encoding ends in some other character Message-ID: <20161120215028.GB25881@chaz.gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----)

$ locale charmap
GB18030
$ printf '\uC9\n' | grep '.*7' | hd
00000000  81 30 87 37 0a  |.0.7.|
00000005

U+00C9's encoding does end in the 0x37 byte (7 in ASCII and GB18030).

$ printf '\uC9\n' | grep '.*0'

fails.

$ printf '\uC9\n' | grep -o '.*7'

returns with a zero exit status but outputs nothing. It's as if
.*7 matched an empty string somewhere.

printf '\uC9\n' | grep '\(.*7\)\1'

fails. so do:

grep 7
grep '7$'
grep '.7'
grep '[^x]*7'
printf 'x\uC9\n' | grep -E '.+7'

These match:

grep '.\{0,1\}7'
grep -E '.?7'
printf '\uC9x\n' | grep '.*7x' # still outputs nothing with -o

That's not confined to GB18030. You get similar issues with
BIG5-HKSCS, BIG5 or GBK.

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\n' | grep '.*m' | hd
00000000  88 6d 0a  |.m.|
00000003

Reproduced with 2.25, 2.26 and the current git head on ubuntu
16.04 amd64.

-- Stephane ------------=_1480291203-16791-1-- From unknown Sun Jun 22 00:47:58 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24975: Matching issues with characters whose encoding ends in some other character Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 28 Nov 2016 13:50:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24975 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: jim@meyering.net Cc: 24975@debbugs.gnu.org, stephane.chazelas@gmail.com Received: via spool by 24975-submit@debbugs.gnu.org id=B24975.14803409824138 (code B ref 24975); Mon, 28 Nov 2016 13:50:01 +0000 Received: (at 24975) by debbugs.gnu.org; 28 Nov 2016 13:49:42 +0000 Received: from localhost ([127.0.0.1]:44285 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBMJS-00014f-IF for submit@debbugs.gnu.org; Mon, 28 Nov 2016 08:49:42 -0500 Received: from mailgw05.kcn.ne.jp ([61.86.7.212]:56294) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBMJQ-00014Q-3o for 24975@debbugs.gnu.org; Mon, 28 Nov 2016 08:49:41 -0500 Received: from mxs02-s (mailgw2.kcn.ne.jp [61.86.15.234]) by mailgw05.kcn.ne.jp (Postfix) with ESMTP id 9D21A8806A3 for <24975@debbugs.gnu.org>; Mon, 28 Nov 2016 22:49:32 +0900 (JST) X-matriXscan-loop-detect: 935e8dd52b28d5e33315a7a3d2de02bdce1502a5 Received: from mail04.kcn.ne.jp ([61.86.6.183]) by mxs02-s with ESMTP; Mon, 28 Nov 2016 22:49:29 +0900 (JST) Received: from [10.120.1.73] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail04.kcn.ne.jp (Postfix) with ESMTPA id 382BD129009B; Mon, 28 Nov 2016 22:49:29 +0900 (JST) Date: Mon, 28 Nov 2016 22:49:27 +0900 From: Norihiro Tanaka In-Reply-To: References: Message-Id: <20161128224926.B874.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_583C343100000000B871_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.73 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -2.9 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--)

--------_583C343100000000B871_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Jim Meyering wrote:
> I suspect this won't be the last word in this area, because it feels
> like we should be able to adjust DFA's tables so that people using
> such locales can retain DFA's efficiency without the bug in the
> current implementation.

Hi Jim,

It is a bug in dfa's handling of the period expression in non-UTF8
locales. dfa calculates the transition for single-byte characters and
the transition for a multibyte character separately, then merges both
results. However, if the single-byte transition goes back to an initial
state, we should stop matching single-byte characters.

Thanks, Norihiro --------_583C343100000000B871_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-avoid-match-middle-in-multibyte-character.patch" Content-Disposition: attachment; filename="0001-dfa-avoid-match-middle-in-multibyte-character.patch" Content-Transfer-Encoding: base64 RnJvbSA2NzQ4NGE2N2Q3ZDMxMGQ3NmEyZWI4MGI2OGE4ZWM4ZWI1YzZhN2ZjIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDI4IE5vdiAyMDE2IDIyOjI2OjA3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBhdm9pZCBtYXRjaCBtaWRkbGUgaW4gbXVsdGlieXRlIGNoYXJhY3RlcgoKKiBsaWIvZGZhLmMg KHRyYW5zaXRfc3RhdGUpOiBJZiBmYWlscyBpbiBtYXRjaGluZyBzaW5nbGUgYnl0ZSBjaGFyYWN0 ZXJzCm9uIGEgc3RhdGUgaW5jbHVkaW5nIHBlcmlvZCBleHByZXNzaW9uIGluIG5vbi1VVEY4IG11 bHRpYnl0ZSBsb2NhbGVzLApza2lwIHRyYWlsaW5nIGJ5dGVzLgooZGZhX3N1cHBvcnRlZCk6IFJl dmVydCBwcmV2aW91cyBjaGFuZ2UuCi0tLQogQ2hhbmdlTG9nIHwgICAgOCArKysrKysrKwogbGli L2RmYS5jIHwgICAgOCArLS0tLS0tLQogMiBmaWxlcyBjaGFuZ2VkLCA5IGluc2VydGlvbnMoKyks IDcgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvQ2hhbmdlTG9nIGIvQ2hhbmdlTG9nCmluZGV4 IGZlYzRmYjkuLmZkMDYyYWUgMTAwNjQ0Ci0tLSBhL0NoYW5nZUxvZworKysgYi9DaGFuZ2VMb2cK QEAgLTEsMyArMSwxMSBAQAorMjAxNi0xMS0yNyAgTm9yaWhpcm8gVGFuYWthIDxub3JpdG5rQGtj bi5uZS5qcD4KKworICAgICAgICBkZmE6IGF2b2lkIG1hdGNoIG1pZGRsZSBpbiBtdWx0aWJ5dGUg Y2hhcmFjdGVyCisgICAgICAgICogbGliL2RmYS5jICh0cmFuc2l0X3N0YXRlKTogSWYgZmFpbHMg aW4gbWF0Y2hpbmcgc2luZ2xlIGJ5dGUgY2hhcmFjdGVycworICAgICAgICBvbiBhIHN0YXRlIGlu Y2x1ZGluZyBwZXJpb2QgZXhwcmVzc2lvbiBpbiBub24tVVRGOCBtdWx0aWJ5dGUgbG9jYWxlcywK KyAgICAgICAgc2tpcCB0cmFpbGluZyBieXRlcy4KKyAgICAgICAgKGRmYV9zdXBwb3J0ZWQpOiBS ZXZlcnQgcHJldmlvdXMgY2hhbmdlLgorCiAyMDE2LTExLTI3ICBKaW0gTWV5ZXJpbmcgIDxtZXll cmluZ0BmYi5jb20+CiAKIAlkZmE6IGF2b2lkIGZhbHNlIG1hdGNoIGluIG5vbi1VVEY4IG11bHRp Ynl0ZSBsb2NhbGVzCmRpZmYgLS1naXQgYS9saWIvZGZhLmMgYi9saWIvZGZhLmMKaW5kZXggZjBl ZDEzOS4uNjczZWY5NSAxMDA2NDQKLS0tIGEvbGliL2RmYS5jCisrKyBiL2xpYi9kZmEuYwpAQCAt 
MjkxMyw3ICsyOTEzLDcgQEAgdHJhbnNpdF9zdGF0ZSAoc3RydWN0IGRmYSAqZCwgc3RhdGVfbnVt IHMsIHVuc2lnbmVkIGNoYXIgY29uc3QgKipwcCwKICAgLyogQ2FsY3VsYXRlIHRoZSBzdGF0ZSB3 aGljaCBjYW4gYmUgcmVhY2hlZCBmcm9tIHRoZSBzdGF0ZSAncycgYnkKICAgICAgY29uc3VtaW5n ICdtYmNsZW4nIHNpbmdsZSBieXRlcyBmcm9tIHRoZSBidWZmZXIuICAqLwogICBzMSA9IHM7Ci0g IGZvciAoaSA9IDA7IGkgPCBtYmNsZW4gJiYgMCA8PSBzOyBpKyspCisgIGZvciAoaSA9IDA7IGkg PCBtYmNsZW4gJiYgKGkgPT0gMCB8fCBkLT5taW5fdHJjb3VudCA8PSBzKTsgaSsrKQogICAgIHMg PSB0cmFuc2l0X3N0YXRlX3NpbmdsZWJ5dGUgKGQsIHMsIHBwKTsKICAgKnBwICs9IG1iY2xlbiAt IGk7CiAKQEAgLTMyNzIsMTIgKzMyNzIsNiBAQCBmcmVlX21iZGF0YSAoc3RydWN0IGRmYSAqZCkK IHN0YXRpYyBib29sIF9HTF9BVFRSSUJVVEVfUFVSRQogZGZhX3N1cHBvcnRlZCAoc3RydWN0IGRm YSBjb25zdCAqZCkKIHsKLSAgLyogRGVjbGFyZSBhbnkgbm9uLVVURjggbXVsdGlieXRlIGxvY2Fs ZSAibm90IHN1cHBvcnRlZC4iICBPdGhlcndpc2UsIGEKLSAgICAgcmVnZXhwIGxpa2UgIi4qNyIg d291bGQgbWlzdGFrZW5seSBtYXRjaCBcdUM5LCBlLmcuLCB2aWEgdGhpcyBjb21tYW5kOgotICAg ICAoZXhwb3J0IExDX0FMTD16aF9DTi5nYjE4MDMwOyBwcmludGYgJ1x1QzlcbicgfCBncmVwICcu KjcnKSAgKi8KLSAgaWYgKGQtPmxvY2FsZWluZm8ubXVsdGlieXRlICYmICFkLT5sb2NhbGVpbmZv LnVzaW5nX3V0ZjgpCi0gICAgcmV0dXJuIGZhbHNlOwotCiAgIHNpemVfdCBpOwogICBmb3IgKGkg PSAwOyBpIDwgZC0+dGluZGV4OyBpKyspCiAgICAgewotLSAKMS43LjEKCg== --------_583C343100000000B871_MULTIPART_MIXED_-- From unknown Sun Jun 22 00:47:58 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24975: Matching issues with characters whose encoding ends in some other character Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 28 Nov 2016 14:49:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24975 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: jim@meyering.net Cc: bug-gnulib@gnu.org, 24975@debbugs.gnu.org, stephane.chazelas@gmail.com Received: via spool by 24975-submit@debbugs.gnu.org id=B24975.14803444919464 (code B ref 24975); Mon, 28 Nov 2016 14:49:02 +0000 Received: (at 24975) by debbugs.gnu.org; 28 Nov 2016 14:48:11 +0000 Received: 
from localhost ([127.0.0.1]:44312 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBNE3-0002SX-EB for submit@debbugs.gnu.org; Mon, 28 Nov 2016 09:48:11 -0500 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:49825) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBNE0-0002S8-MX for 24975@debbugs.gnu.org; Mon, 28 Nov 2016 09:48:09 -0500 Received: from mxs01-s (mailgw1.kcn.ne.jp [61.86.15.233]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id 9355480668 for <24975@debbugs.gnu.org>; Mon, 28 Nov 2016 23:48:01 +0900 (JST) X-matriXscan-loop-detect: b2d3902ffe875d819cfef8d3d43d974575145467 Received: from mail05.kcn.ne.jp ([61.86.6.184]) by mxs01-s with ESMTP; Mon, 28 Nov 2016 23:48:00 +0900 (JST) Received: from [10.120.1.73] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail05.kcn.ne.jp (Postfix) with ESMTPA id C46577D0099; Mon, 28 Nov 2016 23:47:59 +0900 (JST) Date: Mon, 28 Nov 2016 23:47:57 +0900 From: Norihiro Tanaka In-Reply-To: References: Message-Id: <20161128234756.B878.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_583C343100000000B871_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.73 [ja] X-matriXscan-Sophos-AV: Clean X-matriXscan-Action: Approve X-matriXscan: Uncategorized X-Spam-Score: -2.9 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) --------_583C343100000000B871_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit Jim Meyering wrote: > I suspect this won't be the last word in this area, because it feels > like we should be able to adjust DFA's tables so that people using > such locales can retain DFA's efficiency without the bug in the > current implementation. 
Hi Jim, It is a bug in dfa for period expression in non-UTF8 locales. dfa calculates transition for single byte characters and a multibyte character separately and merge both results. However, if backs to an initial state in transition for single byte characters, we should stop matching single byte characters. Thanks, Norihiro --------_583C343100000000B871_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-avoid-match-middle-in-multibyte-character.patch" Content-Disposition: attachment; filename="0001-dfa-avoid-match-middle-in-multibyte-character.patch" Content-Transfer-Encoding: base64 RnJvbSA2NzQ4NGE2N2Q3ZDMxMGQ3NmEyZWI4MGI2OGE4ZWM4ZWI1YzZhN2ZjIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDI4IE5vdiAyMDE2IDIyOjI2OjA3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBhdm9pZCBtYXRjaCBtaWRkbGUgaW4gbXVsdGlieXRlIGNoYXJhY3RlcgoKKiBsaWIvZGZhLmMg KHRyYW5zaXRfc3RhdGUpOiBJZiBmYWlscyBpbiBtYXRjaGluZyBzaW5nbGUgYnl0ZSBjaGFyYWN0 ZXJzCm9uIGEgc3RhdGUgaW5jbHVkaW5nIHBlcmlvZCBleHByZXNzaW9uIGluIG5vbi1VVEY4IG11 bHRpYnl0ZSBsb2NhbGVzLApza2lwIHRyYWlsaW5nIGJ5dGVzLgooZGZhX3N1cHBvcnRlZCk6IFJl dmVydCBwcmV2aW91cyBjaGFuZ2UuCi0tLQogQ2hhbmdlTG9nIHwgICAgOCArKysrKysrKwogbGli L2RmYS5jIHwgICAgOCArLS0tLS0tLQogMiBmaWxlcyBjaGFuZ2VkLCA5IGluc2VydGlvbnMoKyks IDcgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvQ2hhbmdlTG9nIGIvQ2hhbmdlTG9nCmluZGV4 IGZlYzRmYjkuLmZkMDYyYWUgMTAwNjQ0Ci0tLSBhL0NoYW5nZUxvZworKysgYi9DaGFuZ2VMb2cK QEAgLTEsMyArMSwxMSBAQAorMjAxNi0xMS0yNyAgTm9yaWhpcm8gVGFuYWthIDxub3JpdG5rQGtj bi5uZS5qcD4KKworICAgICAgICBkZmE6IGF2b2lkIG1hdGNoIG1pZGRsZSBpbiBtdWx0aWJ5dGUg Y2hhcmFjdGVyCisgICAgICAgICogbGliL2RmYS5jICh0cmFuc2l0X3N0YXRlKTogSWYgZmFpbHMg aW4gbWF0Y2hpbmcgc2luZ2xlIGJ5dGUgY2hhcmFjdGVycworICAgICAgICBvbiBhIHN0YXRlIGlu Y2x1ZGluZyBwZXJpb2QgZXhwcmVzc2lvbiBpbiBub24tVVRGOCBtdWx0aWJ5dGUgbG9jYWxlcywK KyAgICAgICAgc2tpcCB0cmFpbGluZyBieXRlcy4KKyAgICAgICAgKGRmYV9zdXBwb3J0ZWQpOiBS 
ZXZlcnQgcHJldmlvdXMgY2hhbmdlLgorCiAyMDE2LTExLTI3ICBKaW0gTWV5ZXJpbmcgIDxtZXll cmluZ0BmYi5jb20+CiAKIAlkZmE6IGF2b2lkIGZhbHNlIG1hdGNoIGluIG5vbi1VVEY4IG11bHRp Ynl0ZSBsb2NhbGVzCmRpZmYgLS1naXQgYS9saWIvZGZhLmMgYi9saWIvZGZhLmMKaW5kZXggZjBl ZDEzOS4uNjczZWY5NSAxMDA2NDQKLS0tIGEvbGliL2RmYS5jCisrKyBiL2xpYi9kZmEuYwpAQCAt MjkxMyw3ICsyOTEzLDcgQEAgdHJhbnNpdF9zdGF0ZSAoc3RydWN0IGRmYSAqZCwgc3RhdGVfbnVt IHMsIHVuc2lnbmVkIGNoYXIgY29uc3QgKipwcCwKICAgLyogQ2FsY3VsYXRlIHRoZSBzdGF0ZSB3 aGljaCBjYW4gYmUgcmVhY2hlZCBmcm9tIHRoZSBzdGF0ZSAncycgYnkKICAgICAgY29uc3VtaW5n ICdtYmNsZW4nIHNpbmdsZSBieXRlcyBmcm9tIHRoZSBidWZmZXIuICAqLwogICBzMSA9IHM7Ci0g IGZvciAoaSA9IDA7IGkgPCBtYmNsZW4gJiYgMCA8PSBzOyBpKyspCisgIGZvciAoaSA9IDA7IGkg PCBtYmNsZW4gJiYgKGkgPT0gMCB8fCBkLT5taW5fdHJjb3VudCA8PSBzKTsgaSsrKQogICAgIHMg PSB0cmFuc2l0X3N0YXRlX3NpbmdsZWJ5dGUgKGQsIHMsIHBwKTsKICAgKnBwICs9IG1iY2xlbiAt IGk7CiAKQEAgLTMyNzIsMTIgKzMyNzIsNiBAQCBmcmVlX21iZGF0YSAoc3RydWN0IGRmYSAqZCkK IHN0YXRpYyBib29sIF9HTF9BVFRSSUJVVEVfUFVSRQogZGZhX3N1cHBvcnRlZCAoc3RydWN0IGRm YSBjb25zdCAqZCkKIHsKLSAgLyogRGVjbGFyZSBhbnkgbm9uLVVURjggbXVsdGlieXRlIGxvY2Fs ZSAibm90IHN1cHBvcnRlZC4iICBPdGhlcndpc2UsIGEKLSAgICAgcmVnZXhwIGxpa2UgIi4qNyIg d291bGQgbWlzdGFrZW5seSBtYXRjaCBcdUM5LCBlLmcuLCB2aWEgdGhpcyBjb21tYW5kOgotICAg ICAoZXhwb3J0IExDX0FMTD16aF9DTi5nYjE4MDMwOyBwcmludGYgJ1x1QzlcbicgfCBncmVwICcu KjcnKSAgKi8KLSAgaWYgKGQtPmxvY2FsZWluZm8ubXVsdGlieXRlICYmICFkLT5sb2NhbGVpbmZv LnVzaW5nX3V0ZjgpCi0gICAgcmV0dXJuIGZhbHNlOwotCiAgIHNpemVfdCBpOwogICBmb3IgKGkg PSAwOyBpIDwgZC0+dGluZGV4OyBpKyspCiAgICAgewotLSAKMS43LjEKCg== --------_583C343100000000B871_MULTIPART_MIXED_-- From unknown Sun Jun 22 00:47:58 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24975: Matching issues with characters whose encoding ends in some other character Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 28 Nov 2016 16:49:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24975 X-GNU-PR-Package: grep X-GNU-PR-Keywords: 
To: Norihiro Tanaka , jim@meyering.net Cc: 24975@debbugs.gnu.org, bug-gnulib@gnu.org, stephane.chazelas@gmail.com Received: via spool by 24975-submit@debbugs.gnu.org id=B24975.148035172221408 (code B ref 24975); Mon, 28 Nov 2016 16:49:01 +0000 Received: (at 24975) by debbugs.gnu.org; 28 Nov 2016 16:48:42 +0000 Received: from localhost ([127.0.0.1]:45142 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBP6g-0005ZE-3f for submit@debbugs.gnu.org; Mon, 28 Nov 2016 11:48:42 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:33662) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBP6e-0005Yz-KV for 24975@debbugs.gnu.org; Mon, 28 Nov 2016 11:48:41 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 818E8160060; Mon, 28 Nov 2016 08:48:34 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id SxbdNjaEQDY2; Mon, 28 Nov 2016 08:48:32 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id CE3E3160066; Mon, 28 Nov 2016 08:48:32 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id ZvJ3_7XLEX2y; Mon, 28 Nov 2016 08:48:32 -0800 (PST) Received: from Penguin.CS.UCLA.EDU (Penguin.CS.UCLA.EDU [131.179.64.200]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id B106D160060; Mon, 28 Nov 2016 08:48:32 -0800 (PST) References: <20161128234756.B878.27F6AC2D@kcn.ne.jp> From: Paul Eggert Organization: UCLA Computer Science Department Message-ID: <5345264d-a801-9705-509b-4d527a9cc37d@cs.ucla.edu> Date: Mon, 28 Nov 2016 08:48:29 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <20161128234756.B878.27F6AC2D@kcn.ne.jp> Content-Type: 
multipart/mixed; boundary="------------8636136A9B4CB83B3BFC3F04" X-Spam-Score: -2.9 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) This is a multi-part message in MIME format. --------------8636136A9B4CB83B3BFC3F04 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Thanks for that DFA fix, which should be much better than the previous workaround. I installed it into gnulib and installed the attached patch into grep. --------------8636136A9B4CB83B3BFC3F04 Content-Type: application/x-patch; name="0001-build-update-gnulib-submodule-to-latest.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="0001-build-update-gnulib-submodule-to-latest.patch" RnJvbSA3NjM0OGM4N2U3M2IzN2Q0NGNhZWMzYmI1YjI0YzMzYzE0NTVlZDk2IE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBNb24sIDI4IE5vdiAyMDE2IDA4OjM5OjM3IC0wODAwClN1YmplY3Q6IFtQQVRD SF0gYnVpbGQ6IHVwZGF0ZSBnbnVsaWIgc3VibW9kdWxlIHRvIGxhdGVzdAoKLS0tCiBORVdT ICAgfCAzICstLQogZ251bGliIHwgMiArLQogMiBmaWxlcyBjaGFuZ2VkLCAyIGluc2VydGlv bnMoKyksIDMgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvTkVXUyBiL05FV1MKaW5kZXgg OTcxY2JkOS4uOWQ5YjBlYyAxMDA2NDQKLS0tIGEvTkVXUworKysgYi9ORVdTCkBAIC03LDgg KzcsNyBAQCBHTlUgZ3JlcCBORVdTICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgLSotIG91dGxpbmUgLSotCiAgIGdyZXAgbm8gbG9uZ2VyIHJlcG9ydHMgYSBmYWxzZSBt YXRjaCBpbiBhIG11bHRpYnl0ZSwgbm9uLVVURjggbG9jYWxlCiAgIGxpa2UgemhfQ04uZ2Ix ODAzMCwgd2l0aCBhIHJlZ3VsYXIgZXhwcmVzc2lvbiBsaWtlICIuKjciIHRoYXQganVzdAog ICBoYXBwZW5zIHRvIG1hdGNoIHRoZSA0LWJ5dGUgcmVwcmVzZW50YXRpb24gb2YgZ2IxODAz MCdzIFx1QzksIHRoZQotICBmaW5hbCBieXRlIG9mIHdoaWNoIGlzIHRoZSBkaWdpdCAiNyIu ICBUaGlzICJmaXgiIGlzIHRvIG1ha2UgZ3JlcAotICBhbHdheXMgdXNlIHRoZSBzbG93ZXIg 
cmVnZXggbWF0Y2hlciBpbiBzdWNoIGxvY2FsZXMuCisgIGZpbmFsIGJ5dGUgb2Ygd2hpY2gg aXMgdGhlIGRpZ2l0ICI3Ii4KICAgW2J1ZyBpbnRyb2R1Y2VkIGluIGdyZXAtMi4xOV0KIAog ICBncmVwIGJ5IGRlZmF1bHQgbm93IHJlYWRzIGFsbCBvZiBzdGFuZGFyZCBpbnB1dCBpZiBp dCBpcyBhIHBpcGUsCmRpZmYgLS1naXQgYS9nbnVsaWIgYi9nbnVsaWIKaW5kZXggYmQ2ZDY2 ZS4uOWNiYTQyZiAxNjAwMDAKLS0tIGEvZ251bGliCisrKyBiL2dudWxpYgpAQCAtMSArMSBA QAotU3VicHJvamVjdCBjb21taXQgYmQ2ZDY2ZTUwMjc4NmRmMjFkMmRjYWE3YjQ3M2VlODUx Zjg0MGFhYQorU3VicHJvamVjdCBjb21taXQgOWNiYTQyZjg3ZTFlODhhYzc0NmUyMzQxYzUx ZTc4ZjlmNjQwZmVmYQotLSAKMi43LjQKCg== --------------8636136A9B4CB83B3BFC3F04-- From unknown Sun Jun 22 00:47:58 2025 X-Loop: help-debbugs@gnu.org Subject: bug#24975: Matching issues with characters whose encoding ends in some other character Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Mon, 28 Nov 2016 17:13:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 24975 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: Norihiro Tanaka Cc: 24975@debbugs.gnu.org, Stephane Chazelas Received: via spool by 24975-submit@debbugs.gnu.org id=B24975.148035314723708 (code B ref 24975); Mon, 28 Nov 2016 17:13:01 +0000 Received: (at 24975) by debbugs.gnu.org; 28 Nov 2016 17:12:27 +0000 Received: from localhost ([127.0.0.1]:45175 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBPTe-0006AJ-Nz for submit@debbugs.gnu.org; Mon, 28 Nov 2016 12:12:26 -0500 Received: from mail-io0-f178.google.com ([209.85.223.178]:34784) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1cBPTc-0006A4-U2 for 24975@debbugs.gnu.org; Mon, 28 Nov 2016 12:12:25 -0500 Received: by mail-io0-f178.google.com with SMTP id c21so233032634ioj.1 for <24975@debbugs.gnu.org>; Mon, 28 Nov 2016 09:12:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; 
bh=NnDeeCgDDXv3ZnCKxfbMqixqy7D7RjoS4xaTfqMPj10=; b=n2lnCavSxHih+G8JPI6l7pI8EatHtaLKm7I6AdCbI3mEP4gtw2HQpaX14CajXHALmg GmQyIwtHHzr77mM2jMLY71TLThPdo7jWRqC24EpFAcHqhBVh62QKsJbQLsn/cEZ3/cAY 91ZHQT43ezDaoo7ELoMQE3Oll1mylkoUHPkMEm7rWlGvGgrY5qnIJRhX9XmSjr1hb1d6 iI1sqiJ1ji8RvO+CObU81oDRXhkHO+M8P1qsRD5ShY/Ksw7+P5RLnRLXu2+7lQ3UqufO 1KH7bvD6MsTm5w18P1AurO6hF2ll4XcXC/zfEUSv9EV7qURD59Ay1ypPFdPVkfFzRcW1 vMWw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=NnDeeCgDDXv3ZnCKxfbMqixqy7D7RjoS4xaTfqMPj10=; b=iSkhUc0o5lxEl5rEwJE3vFWmc+OehDF9b/1SupJC4WGqzqfeXQ47zHufIqneO1oZfN y0ZQoxEDotae+VTkug9bcKaYOg+OENe7jpVHGTs6iD5C9MqiuTZ3a7QgLqVBWM4j/rQx s+lbqt/ZnJiDC0NekhPm9O9RS2hbn1T2/nUtxlEY9EFcog36XY890sEQ7XsQmfQiNope ozWL9/DW765QbKr97zYS4vx5Yz9cHe816RtMXFsKXdL05THVyAqvW8ZKBo2hhsssvWx+ V3LsSFFrZyn5CA+p1lvE7Qc5deyQa0qLHDFl8/4CZ1ONhukNSYo0I7uHLVhZQz3WXqdp 72eA== X-Gm-Message-State: AKaTC00Z5iUx9BfqgPti2U7Zqbb6Dt4gXlChWZBkKHyUqBIKT6Au7I5zp7wpCDKKLrDV+UHGElj0R78xNAjthA== X-Received: by 10.107.128.75 with SMTP id b72mr20187368iod.192.1480353136445; Mon, 28 Nov 2016 09:12:16 -0800 (PST) MIME-Version: 1.0 Received: by 10.107.146.66 with HTTP; Mon, 28 Nov 2016 09:11:55 -0800 (PST) In-Reply-To: <20161128224926.B874.27F6AC2D@kcn.ne.jp> References: <20161128224926.B874.27F6AC2D@kcn.ne.jp> From: Jim Meyering Date: Mon, 28 Nov 2016 09:11:55 -0800 X-Google-Sender-Auth: Y_Kkjf4-j7gLoCDHo-Ks5ulVwIg Message-ID: Content-Type: text/plain; charset=UTF-8 X-Spam-Score: 0.5 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.5 (/) On Mon, Nov 28, 2016 at 5:49 AM, Norihiro Tanaka wrote: > Jim Meyering wrote: > >> I suspect this won't be the last word in this area, 
because it feels
>> like we should be able to adjust DFA's tables so that people using
>> such locales can retain DFA's efficiency without the bug in the
>> current implementation.
>
> Hi Jim,
>
> It is a bug in dfa's handling of the period expression in non-UTF8
> locales. dfa calculates the transition for single-byte characters and
> the transition for a multibyte character separately, then merges both
> results. However, if the single-byte transition goes back to an initial
> state, we should stop matching single-byte characters.

Nice work. Thank you.
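
For readers following the thread, the false match from the original report can be illustrated without grep or a GB18030 locale. The sketch below is only an illustration in Python, not the dfa.c logic: the helper names `bytewise_match` and `charwise_match` are hypothetical, and Python's `re` module stands in for a matcher that either looks at raw bytes or at decoded characters. It relies on the fact, shown in the report's hd output, that U+00C9 is encoded in GB18030 as the bytes 81 30 87 37, the last of which is ASCII "7".

```python
import re

# U+00C9 followed by a newline, encoded as GB18030.
# Per the hd output in the report, U+00C9 becomes the bytes 81 30 87 37.
encoded = "\u00c9\n".encode("gb18030")


def bytewise_match(pattern: bytes, data: bytes) -> bool:
    """Hypothetical helper: match while treating every byte as a character.

    This mimics a matcher that steps into the middle of a multibyte
    character, so a trailing byte such as 0x37 can be mistaken for "7".
    """
    return re.search(pattern, data) is not None


def charwise_match(pattern: str, data: bytes, encoding: str = "gb18030") -> bool:
    """Hypothetical helper: decode first, then match on whole characters."""
    return re.search(pattern, data.decode(encoding)) is not None


if __name__ == "__main__":
    print(encoded.hex(" "))
    print("bytewise .*7:", bytewise_match(rb".*7", encoded))
    print("charwise .*7:", charwise_match(r".*7", encoded))
```

The bytewise search reports a match because the final byte of the character is 0x37, while the character-wise search does not, since the line contains no "7" character. The fix discussed above addresses the same mismatch inside dfa by not continuing single-byte matching through the trailing bytes of a multibyte character.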