From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Vincent Lefevre
To: 18454@debbugs.gnu.org
X-Debbugs-Original-To: bug-grep@gnu.org
Date: Fri, 12 Sep 2014 03:24:49 +0200
Message-ID: <20140912012449.GB18162@xvii.vinc17.org>
X-GNU-PR-Message: report 18454
X-GNU-PR-Package: grep

With the patch that fixes bug 18266, grep -P works again on binary
files (with invalid UTF-8 sequences), but it is now significantly
slower than old versions (which could yield undefined behavior).

Timings with the Debian packages on my personal svn working copy
(binary + text files):

  2.18-2    0.9s with -P, 0.4s without -P
  2.20-3   11.6s with -P, 0.4s without -P

On this example, that's a 13x slowdown! Though the performance issue
would better be fixed in libpcre3, I suppose that it is not so simple
and won't occur any time soon. Things could be done in grep:

1. Ignore -P when the pattern would have the same meaning without -P
   (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
   at least for the simplest cases).

2. Call PCRE in the C locale when this is equivalent.

3. Transform invalid bytes to null bytes in-place before the PCRE
   call. This changes the current semantic, but:
   * the semantic on invalid bytes has never been specified, AFAIK;
   * the best *practical* behavior may not be the current one (I
     personally prefer to be able to match invalid bytes, just like
     one can match top-bit-set characters in the C locale, and seeing
     such invalid bytes as equivalent to null bytes would not be a
     problem for most users, IMHO -- things can also be configurable).

-- 
Vincent Lefèvre - Web:
100% accessible validated (X)HTML - Blog:
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Vincent Lefevre, 18454@debbugs.gnu.org
Date: Thu, 11 Sep 2014 19:53:23 -0700
Message-ID: <54126023.8020005@cs.ucla.edu>
References: <20140912012449.GB18162@xvii.vinc17.org>
In-Reply-To: <20140912012449.GB18162@xvii.vinc17.org>
X-GNU-PR-Message: followup 18454
X-GNU-PR-Package: grep

Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>    at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but
they would require grep to parse and analyze the Perl regular
expression. I don't know the PCRE syntax and it would take some time
to write a parser. And even if I wrote one, the next PCRE release
would likely change the syntax. It sounds very painful to maintain.

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>    call. This changes the current semantic, but:
>    * the semantic on invalid bytes has never been specified, AFAIK;
>    * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how
invalid bytes are treated by other matchers. And would have
undesirable practical effects, e.g., the pattern 'a..*b' would match
data that would look like "ab" on many screens (because the null byte
would vanish). It's a real kludge that will bite users.

Even if we went along with the kludge, grep does not know what bytes
PCRE considers to be invalid without invoking PCRE, which is what it's
doing now. (Yes, PCRE says it's parsing UTF-8, but there are different
ways to do that and they don't all agree.) I suppose grep could
reengineer libpcre's internals, to exactly duplicate the algorithm
that libpcre uses to decide when bytes are invalid (except to do it
10X faster :-), but then that'd be another thing to maintain in
parallel with libpcre.

All of these changes sound like a lot of work, which nobody is willing
to do.

Here's a different idea. How about invoking grep with the
--binary-files=without-match option? This should avoid much of the
libpcre performance problem, without having to change 'grep'.
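
A minimal sketch of the --binary-files=without-match suggestion above, assuming GNU grep. The scratch directory and the file names are hypothetical and exist only for this illustration. A plain pattern is used so that the sketch does not depend on grep being built with PCRE support.

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)
printf 'error: needle found\n' > "$dir/notes.txt"
# A file with a NUL byte and invalid UTF-8 bytes, which grep treats as binary.
printf 'needle\0\377\376 binary payload\n' > "$dir/blob.bin"

# List files that match, assuming that binary files never match.
# -I is the short form of --binary-files=without-match.
matches=$(grep -rl --binary-files=without-match 'needle' "$dir")
printf '%s\n' "$matches"

rm -rf "$dir"
```

Only the text file should be listed. Adding -P to the same command line applies the same filter to a Perl-style pattern, at the cost of never reporting matches inside binary files.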
From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 12 02:29:56 2014
Subject: grep bug maintenance
From: Paul Eggert
Organization: UCLA Computer Science Department
To: control@debbugs.gnu.org
Date: Thu, 11 Sep 2014 23:29:43 -0700
Message-ID: <541292D7.6010605@cs.ucla.edu>

severity 18454 wishlist
severity 18097 wishlist
close 18424

From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Vincent Lefevre
To: Paul Eggert
Cc: 18454@debbugs.gnu.org
Date: Fri, 12 Sep 2014 10:20:08 +0200
Message-ID: <20140912082008.GC4404@xvii.vinc17.org>
References: <20140912012449.GB18162@xvii.vinc17.org> <54126023.8020005@cs.ucla.edu>
In-Reply-To: <54126023.8020005@cs.ucla.edu>
X-GNU-PR-Message: followup 18454
X-GNU-PR-Package: grep

On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >Things could be done in grep:
> >
> >1. Ignore -P when the pattern would have the same meaning without -P
> >   (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> >   at least for the simplest cases).
> >
> >2. Call PCRE in the C locale when this is equivalent.
>
> I had already considered these ideas along with several others, but they
> would require grep to parse and analyze the Perl regular expression. I
> don't know the PCRE syntax and it would take some time to write a parser.
> And even if I wrote one, the next PCRE release would likely change the
> syntax. It sounds very painful to maintain.

I think that (1) is rather simple, even though optimization could be
missed on some patterns: ERE and PCRE have a large equivalent subclass.
The pattern could be examined left to right and would consist of:
  - Normal characters.
  - ".", "^" at the beginning, "$" (alone) at the end.
  - [] with normal characters inside.
  - "*", "+", "?", "{...}" form not followed by one of "*+?{".
  - "|" and "(" not followed by one of these 4 characters.
  - "\" followed by one of ".^$[*+?{".
  - Some "\" + letter sequences could be recognised as well.
Something like that (I haven't checked carefully). There could be
another option to allow such an optimization or not.

> >3. Transform invalid bytes to null bytes in-place before the PCRE
> >   call. This changes the current semantic, but:
> >   * the semantic on invalid bytes has never been specified, AFAIK;
> >   * the best *practical* behavior may not be the current one
>
> As we've already discussed, this would be incompatible with how invalid
> bytes are treated by other matchers.

The same thing could be done with other matchers (in an optional way).

> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish).
> It's a real kludge that will bite users.

But this is already the case:

$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches

The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".

> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)

It would be a bug in PCRE. Parsing UTF-8 is standard. This is
summarized by:

  0x00000000 - 0x0000007F:  0xxxxxxx
  0x00000080 - 0x000007FF:  110xxxxx 10xxxxxx
  0x00000800 - 0x0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
  0x00010000 - 0x001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  0x00200000 - 0x03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
  0x04000000 - 0x7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(from the Linux utf-8(7) man page), everything else being invalid.
Note that the pcre_exec(3) man page even says:

  PCRE_NO_UTF8_CHECK  Do not check the subject for UTF-8 validity

assuming that the check can be done on the user's side, i.e. in a
standard way.

> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.

I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.
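
The remark above that UTF-8 validity can be checked on the user's side can be tried from the shell. This is only a sketch: is_valid_utf8 is a hypothetical helper name, it assumes an iconv(1) utility is installed, and the rules iconv applies need not coincide with the table quoted above or with what libpcre accepts, which is the very point under dispute in this thread.

```shell
# is_valid_utf8 is a hypothetical helper, not part of grep or libpcre.
# It reads standard input and succeeds only if iconv accepts the data
# as UTF-8 (iconv exits with a non-zero status on an invalid sequence).
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "café" with a correctly encoded two-byte "é" (0xC3 0xA9).
if printf 'caf\303\251\n' | is_valid_utf8; then
  echo "first input: valid"
fi

# A lone 0x91 byte, as in the "a\x91b" case discussed in this thread.
if ! printf 'a\221b\n' | is_valid_utf8; then
  echo "second input: invalid"
fi
```

Such a check only classifies a whole input. It does not say which bytes are invalid, so it would not by itself be enough to implement the byte replacement of proposal 3.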
-- 
Vincent Lefèvre - Web:
100% accessible validated (X)HTML - Blog:
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Vincent Lefevre
Cc: 18454@debbugs.gnu.org
Date: Fri, 12 Sep 2014 09:48:08 -0700
Message-ID: <541323C8.10500@cs.ucla.edu>
References: <20140912012449.GB18162@xvii.vinc17.org> <54126023.8020005@cs.ucla.edu> <20140912082008.GC4404@xvii.vinc17.org>
In-Reply-To: <20140912082008.GC4404@xvii.vinc17.org>
X-GNU-PR-Message: followup 18454
X-GNU-PR-Package: grep

Vincent Lefevre wrote:
> I think that (1) is rather simple

You may think it simple for the REs you're interested in, but someone
else might say "hey! that doesn't cover the REs *I'm* interested in!".
Solving the problem in general is nontrivial.

> But this is already the case:

I was assuming the case where the input data contains an encoding
error (not a null byte) that is transformed to a null byte before the
user sees it.

Really, this null-byte-replacement business would be just too weird. I
don't see it as a viable general-purpose solution.

> Parsing UTF-8 is standard.

It's a standard that keeps evolving, different releases of libpcre
have done it differently, and I expect things to continue to evolve.
It's not something I would want to maintain separately from libpcre
itself.

Have you investigated why libpcre is so *slow* when doing UTF-8
checking? Why would libpcre be 10x slower than grep's checking by
hand?!? I don't get it. Surely there's a simple fix on the libpcre
side.

> I often want to take binary files into account

In those cases I suggest using a unibyte C locale. This should solve
the performance problem. Really, unibyte is the way to go here; it's
gonna be faster for large binary scanning no matter what is done about
this UTF-8 business.
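
To illustrate what searching in a unibyte C locale means in practice, here is a small sketch, assuming a reasonably recent GNU grep. count_c is a hypothetical helper name. The sketch uses plain patterns; the same LC_ALL=C prefix can be combined with -P.

```shell
# count_c PATTERN: hypothetical helper that counts the matching lines on
# standard input in the C locale, where "." matches exactly one byte.
count_c() {
  LC_ALL=C grep -c -- "$1"
}

# A lone 0x91 byte is not valid UTF-8, but in the C locale it is just a
# byte, so a single "." can match it.
invalid_one=$(printf 'a\221b\n' | count_c 'a.b')

# "é" encoded as UTF-8 is two bytes (0xC3 0xA9): one "." is not enough,
# two are needed.
utf8_one=$(printf 'a\303\251b\n' | count_c 'a.b')
utf8_two=$(printf 'a\303\251b\n' | count_c 'a..b')

echo "$invalid_one $utf8_one $utf8_two"
```

The three counts show the trade-off: invalid bytes become matchable without any UTF-8 checking, but "." no longer stands for one multibyte character.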
From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Vincent Lefevre
To: Paul Eggert
Cc: 18454@debbugs.gnu.org
Date: Sat, 13 Sep 2014 00:08:55 +0200
Message-ID: <20140912220855.GK4404@xvii.vinc17.org>
References: <20140912012449.GB18162@xvii.vinc17.org> <54126023.8020005@cs.ucla.edu> <20140912082008.GC4404@xvii.vinc17.org> <541323C8.10500@cs.ucla.edu>
In-Reply-To: <541323C8.10500@cs.ucla.edu>
X-GNU-PR-Message: followup 18454
X-GNU-PR-Package: grep

On 2014-09-12 09:48:08 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I think that (1) is rather simple
>
> You may think it simple for the REs you're interested in, but someone else
> might say "hey! that doesn't cover the REs *I'm* interested in!". Solving
> the problem in general is nontrivial.

This is still better than no optimization at all.

> >But this is already the case:
>
> I was assuming the case where the input data contains an encoding error (not
> a null byte) that is transformed to a null byte before the user sees it.
>
> Really, this null-byte-replacement business would be just too weird. I
> don't see it as a viable general-purpose solution.

Anyway since the problem can exist with null bytes, the problem needs
to be solved for null bytes. But this is also already the case:

$ printf "a\0b\n" | grep -a 'a..*b'
a^@b

(where the "^@" is in reverse video). So, the only "issue" would be
that

$ printf "a\x91b\n" | grep -a 'a..*b'

would output "a^@b" instead of... possibly something worse. Indeed,
outputting invalid UTF-8 sequences to the terminal is bad. Ideally
you would output "a<91>b" with "<91>" in reverse video. At some price
(this would be slower).

Now, if the behavior is chosen by an option, the user would be aware
of the meaning of the output, so that this won't really matter.

> >Parsing UTF-8 is standard.
>
> It's a standard that keeps evolving, different releases of libpcre
> have done it differently, and I expect things to continue to evolve.

Could you give some reference? IMHO, this looks more like a bug.

Anyway, UTF-8 sequences that are valid today will still be valid in
the future. The only possible change is that new sequences become
valid in the future. So, the only possible problem is that such new
sequences would be converted to null bytes while this shouldn't be
done. This doesn't introduce undefined behavior, just a different
behavior (note that this difference would also exist between two
libpcre versions, thus not a big problem, and this will be fixable).

> Have you investigated why libpcre is so *slow* when doing UTF-8 checking?

AFAIK, this is not due to libpcre UTF-8 checking, otherwise it would
also be very slow on valid text files too. I suppose that this is due
to the many retries from the pcresearch.c code on binary files (the
line is split into many sublines, many often consisting of a single
byte), i.e. the problem is on the grep side. I don't see how this
could be solved except by doing the UTF-8 check on the grep side.

> >I often want to take binary files into account
>
> In those cases I suggest using a unibyte C locale.

But I still want "." to match a single (valid) UTF-8 character.

Well, using the C locale on binary files and UTF-8 on text files
might be acceptable. But how can one do that with a recursive grep?
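
One way to keep a character-level wildcard while searching in a unibyte locale is to spell out, at the byte level, what one UTF-8 character looks like. The sketch below assumes a GNU grep built with PCRE support. U is a hypothetical variable name, and the pattern is deliberately simplified: it checks only lead and continuation bytes for sequences of up to four bytes, and does not reject overlong forms, surrogates, or code points above U+10FFFF.

```shell
# U: hypothetical byte-level pattern for one UTF-8-encoded character.
# Simplified on purpose; see the caveats in the text above.
U='(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3})'

# "é" (0xC3 0xA9) between "a" and "b": one UTF-8 character, so a match.
valid=$(printf 'a\303\251b\n' | LC_ALL=C grep -c -P "a${U}b")

# A lone 0x91 between "a" and "b": not a UTF-8 character, so no match.
invalid=$(printf 'a\221b\n' | LC_ALL=C grep -c -P "a${U}b")

echo "$valid $invalid"
```

With such a pattern, "." stays available for single bytes and ${U} stands for a character, so one command line can cover text and binary files alike, including in a recursive search.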
-- 
Vincent Lefèvre - Web:
100% accessible validated (X)HTML - Blog:
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Paul Eggert
Organization: UCLA Computer Science Department
To: Vincent Lefevre
Cc: 18454@debbugs.gnu.org
Date: Fri, 12 Sep 2014 17:59:41 -0700
Message-ID: <541396FD.3080001@cs.ucla.edu>
References: <20140912012449.GB18162@xvii.vinc17.org> <54126023.8020005@cs.ucla.edu> <20140912082008.GC4404@xvii.vinc17.org> <541323C8.10500@cs.ucla.edu> <20140912220855.GK4404@xvii.vinc17.org>
In-Reply-To: <20140912220855.GK4404@xvii.vinc17.org>
X-GNU-PR-Message: followup 18454
X-GNU-PR-Package: grep

Vincent Lefevre wrote:
> This is still better than no optimization at all.

We'd have to see; not every optimization is worth the trouble.

> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.

It'd be better if there wasn't a new grep option simply to avoid a
libpcre performance bug.

> Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity
of UTF-8 string". My impression is that the actual history of behavior
changes is more complicated than what that simple summary would
suggest.

> This doesn't introduce undefined behavior, just a different
> behavior

Again, it'd be better if grep Just Worked.

> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, many often consisting of a single
> byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can
contain encoding errors. This does not mean "the problem is on the
grep side".

> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary
data, without the need for prescreening the data. That's the
fundamental problem here.

>>> I often want to take binary files into account
>>
>> In those cases I suggest using a unibyte C locale.
>
> I still want "." to match a single (valid) UTF-8 character.

How about this idea instead? Use a unibyte C locale, and write a
unibyte regular expression C that matches a single valid UTF-8
character (using whatever definition you like for UTF-8). Then, you
can use . to match single bytes and C to match characters. This gives
you all the power you need, without the slowdown due to UTF-8
processing, a slowdown that will be inevitable no matter how we change
grep or libpcre.

From unknown Sun Jun 22 08:03:45 2025
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
From: Paul Eggert
Organization: UCLA Computer Science Department
To: 18454@debbugs.gnu.org, Vincent Lefevre
Date: Tue, 16 Sep 2014 18:43:26 -0700
Message-ID: <5418E73E.2050002@cs.ucla.edu>
References: <20140912012449.GB18162@xvii.vinc17.org>
In-Reply-To: <20140912012449.GB18162@xvii.vinc17.org>
X-GNU-PR-Message: followup 18454
X-GNU-PR-Package: grep
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------------090004040002010007040709"

This is a multi-part message in MIME format.
--------------090004040002010007040709
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

I worked on this some more, and came up with the attached patches
proposed against the current grep Savannah master (commit
9ea9254ea58456b84ed2f0c1481ca91cdd325bf7).

For years I've been wanting to write that last patch and I finally got
around to it. It improves grep -P's performance by a factor of 1.2
trillion on one (admittedly artificial) benchmark. I hope its 1 ZB/s
scan rate is some kind of record.

The last patch probably won't help your test cases, though I hope the
other patches do help somewhat.
--------------090004040002010007040709 Content-Type: text/plain; charset=UTF-8; name="0001-grep-refactor-binary-vs-unknown-vs-text-flags-for-cl.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename*0="0001-grep-refactor-binary-vs-unknown-vs-text-flags-for-cl.pa"; filename*1="tch" RnJvbSA3OGU3ZGQ1Nzk4YTJiYmYwZWI3MTdkN2U2NTE1MGE1ZTU1YzViYTkxIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE1OjUwOjQ3IC0wNzAwClN1YmplY3Q6IFtQQVRD SCAxLzZdIGdyZXA6IHJlZmFjdG9yIGJpbmFyeS12cy11bmtub3duLXZzLXRleHQgZmxhZ3Mg Zm9yCiBjbGFyaXR5CgoqIHNyYy9ncmVwLmMgKGVudW0gdGV4dGJpbik6IE5ldyBlbnVtLgoo dGV4dGJpbl9pc19iaW5hcnkpOiBOZXcgZnVuY3Rpb24uCihidWZmZXJfdGV4dGJpbiwgZmls ZV90ZXh0YmluLCBncmVwKTogVXNlIHRoZW0sIGZvciBjbGFyaXR5LgotLS0KIHNyYy9ncmVw LmMgfCA4NiArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrLS0tLS0t LS0tLS0tLS0tLS0tLS0tLQogMSBmaWxlIGNoYW5nZWQsIDU1IGluc2VydGlvbnMoKyksIDMx IGRlbGV0aW9ucygtKQoKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmlu ZGV4IGU0Mzc5YmMuLjFhYTY0ZGIgMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3Jj L2dyZXAuYwpAQCAtNDM3LDE2ICs0MzcsMzggQEAgY2xlYW5fdXBfc3Rkb3V0ICh2b2lkKQog ICAgIGNsb3NlX3N0ZG91dCAoKTsKIH0KIAotLyogUmV0dXJuIDEgaWYgQlVGIChvZiBzaXpl IFNJWkUpIGNvbnRhaW5zIHRleHQsIC0xIGlmIGl0IGNvbnRhaW5zCi0gICBiaW5hcnkgZGF0 YSwgYW5kIDAgaWYgdGhlIGFuc3dlciBkZXBlbmRzIG9uIHdoYXQgY29tZXMgaW1tZWRpYXRl bHkKLSAgIGFmdGVyIEJVRi4gICovCi1zdGF0aWMgaW50CisvKiBBbiBlbnVtIHRleHRiaW4g ZGVzY3JpYmVzIHRoZSBmaWxlJ3MgdHlwZSwgaW5mZXJyZWQgZnJvbSBkYXRhIHJlYWQKKyAg IGJlZm9yZSB0aGUgZmlyc3QgbGluZSBpcyBzZWxlY3RlZCBmb3Igb3V0cHV0LiAgKi8KK2Vu dW0gdGV4dGJpbgorICB7CisgICAgLyogQmluYXJ5LCBhcyBpdCBjb250YWlucyBudWxsIGJ5 dGVzIGFuZCB0aGUgLXogb3B0aW9uIGlzIG5vdCBpbiBlZmZlY3QsCisgICAgICAgb3IgaXQg Y29udGFpbnMgZW5jb2RpbmcgZXJyb3JzLiAgKi8KKyAgICBURVhUQklOX0JJTkFSWSA9IC0x LAorCisgICAgLyogTm90IGtub3duIHlldC4gIE9ubHkgdGV4dCBoYXMgYmVlbiBzZWVuIHNv IGZhci4gICovCisgICAgVEVYVEJJTl9VTktOT1dOID0gMCwKKworICAgIC8qIFRleHQuICAq 
LworICAgIFRFWFRCSU5fVEVYVCA9IDEKKyAgfTsKKworc3RhdGljIGJvb2wKK3RleHRiaW5f aXNfYmluYXJ5IChlbnVtIHRleHRiaW4gdGV4dGJpbikKK3sKKyAgcmV0dXJuIHRleHRiaW4g PCBURVhUQklOX1VOS05PV047Cit9CisKKy8qIFJldHVybiB0aGUgdGV4dCB0eXBlIG9mIGRh dGEgaW4gQlVGLCBvZiBzaXplIFNJWkUuICAqLworc3RhdGljIGVudW0gdGV4dGJpbgogYnVm ZmVyX3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiB7CiAgIGNoYXIg YmFkYnl0ZSA9IGVvbGJ5dGUgPyAnXDAnIDogJ1wyMDAnOwogCiAgIGlmIChNQl9DVVJfTUFY IDw9IDEpCi0gICAgcmV0dXJuIG1lbWNociAoYnVmLCBiYWRieXRlLCBzaXplKSA/IC0xIDog MTsKKyAgICB7CisgICAgICBpZiAobWVtY2hyIChidWYsIGJhZGJ5dGUsIHNpemUpKQorICAg ICAgICByZXR1cm4gVEVYVEJJTl9CSU5BUlk7CisgICAgfQogICBlbHNlCiAgICAgewogICAg ICAgbWJzdGF0ZV90IG1icyA9IHsgMCB9OwpAQCAtNDU2LDM1ICs0NzgsMzMgQEAgYnVmZmVy X3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAgICAgICBmb3IgKHAg PSBidWY7IHAgPCBidWYgKyBzaXplOyBwICs9IGNsZW4pCiAgICAgICAgIHsKICAgICAgICAg ICBpZiAoKnAgPT0gYmFkYnl0ZSkKLSAgICAgICAgICAgIHJldHVybiAtMTsKKyAgICAgICAg ICAgIHJldHVybiBURVhUQklOX0JJTkFSWTsKICAgICAgICAgICBjbGVuID0gbWJfY2xlbiAo cCwgYnVmICsgc2l6ZSAtIHAsICZtYnMpOwogICAgICAgICAgIGlmICgoc2l6ZV90KSAtMiA8 PSBjbGVuKQotICAgICAgICAgICAgcmV0dXJuIGNsZW4gPT0gKHNpemVfdCkgLTIgPyAwIDog LTE7CisgICAgICAgICAgICByZXR1cm4gY2xlbiA9PSAoc2l6ZV90KSAtMiA/IFRFWFRCSU5f VU5LTk9XTiA6IFRFWFRCSU5fQklOQVJZOwogICAgICAgICB9Ci0KLSAgICAgIHJldHVybiAx OwogICAgIH0KKworICByZXR1cm4gVEVYVEJJTl9URVhUOwogfQogCi0vKiBSZXR1cm4gMSBp ZiBhIGZpbGUgaXMga25vd24gdG8gYmUgdGV4dCBmb3IgdGhlIHB1cnBvc2Ugb2YgJ2dyZXAn LgotICAgUmV0dXJuIC0xIGlmIGl0IGlzIGtub3duIHRvIGJlIGJpbmFyeSwgMCBpZiB1bmtu b3duLgotICAgQlVGLCBvZiBzaXplIEJVRlNJWkUsIGlzIHRoZSBpbml0aWFsIGJ1ZmZlciBy ZWFkIGZyb20gdGhlIGZpbGUgd2l0aAotICAgZGVzY3JpcHRvciBGRCBhbmQgc3RhdHVzIFNU LiAgKi8KLXN0YXRpYyBpbnQKKy8qIFJldHVybiB0aGUgdGV4dCB0eXBlIG9mIGEgZmlsZS4g IEJVRiwgb2Ygc2l6ZSBCVUZTSVpFLCBpcyB0aGUgaW5pdGlhbAorICAgYnVmZmVyIHJlYWQg ZnJvbSB0aGUgZmlsZSB3aXRoIGRlc2NyaXB0b3IgRkQgYW5kIHN0YXR1cyBTVC4gICovCitz dGF0aWMgZW51bSB0ZXh0YmluCiBmaWxlX3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6 
ZV90IGJ1ZnNpemUsIGludCBmZCwgc3RydWN0IHN0YXQgY29uc3QgKnN0KQogewogICAjaWZu ZGVmIFNFRUtfSE9MRQogICBlbnVtIHsgU0VFS19IT0xFID0gU0VFS19FTkQgfTsKICAgI2Vu ZGlmCiAKLSAgaW50IHRleHRiaW4gPSBidWZmZXJfdGV4dGJpbiAoYnVmLCBidWZzaXplKTsK LSAgaWYgKHRleHRiaW4gPCAwKQorICBlbnVtIHRleHRiaW4gdGV4dGJpbiA9IGJ1ZmZlcl90 ZXh0YmluIChidWYsIGJ1ZnNpemUpOworICBpZiAodGV4dGJpbl9pc19iaW5hcnkgKHRleHRi aW4pKQogICAgIHJldHVybiB0ZXh0YmluOwogCiAgIGlmICh1c2FibGVfc3Rfc2l6ZSAoc3Qp KQogICAgIHsKICAgICAgIGlmIChzdC0+c3Rfc2l6ZSA8PSBidWZzaXplKQotICAgICAgICBy ZXR1cm4gMiAqIHRleHRiaW4gLSAxOworICAgICAgICByZXR1cm4gdGV4dGJpbiA9PSBURVhU QklOX1VOS05PV04gPyBURVhUQklOX0JJTkFSWSA6IHRleHRiaW47CiAKICAgICAgIC8qIElm IHRoZSBmaWxlIGhhcyBob2xlcywgaXQgbXVzdCBjb250YWluIGEgbnVsbCBieXRlIHNvbWV3 aGVyZS4gICovCiAgICAgICBpZiAoU0VFS19IT0xFICE9IFNFRUtfRU5EICYmIGVvbGJ5dGUp CkBAIC00OTQsNyArNTE0LDcgQEAgZmlsZV90ZXh0YmluIChjaGFyIGNvbnN0ICpidWYsIHNp emVfdCBidWZzaXplLCBpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgICAgICAg ICAgIHsKICAgICAgICAgICAgICAgY3VyID0gbHNlZWsgKGZkLCAwLCBTRUVLX0NVUik7CiAg ICAgICAgICAgICAgIGlmIChjdXIgPCAwKQotICAgICAgICAgICAgICAgIHJldHVybiAwOwor ICAgICAgICAgICAgICAgIHJldHVybiBURVhUQklOX1VOS05PV047CiAgICAgICAgICAgICB9 CiAKICAgICAgICAgICAvKiBMb29rIGZvciBhIGhvbGUgYWZ0ZXIgdGhlIGN1cnJlbnQgbG9j YXRpb24uICAqLwpAQCAtNTA0LDEyICs1MjQsMTIgQEAgZmlsZV90ZXh0YmluIChjaGFyIGNv bnN0ICpidWYsIHNpemVfdCBidWZzaXplLCBpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpz dCkKICAgICAgICAgICAgICAgaWYgKGxzZWVrIChmZCwgY3VyLCBTRUVLX1NFVCkgPCAwKQog ICAgICAgICAgICAgICAgIHN1cHByZXNzaWJsZV9lcnJvciAoZmlsZW5hbWUsIGVycm5vKTsK ICAgICAgICAgICAgICAgaWYgKGhvbGVfc3RhcnQgPCBzdC0+c3Rfc2l6ZSkKLSAgICAgICAg ICAgICAgICByZXR1cm4gLTE7CisgICAgICAgICAgICAgICAgcmV0dXJuIFRFWFRCSU5fQklO QVJZOwogICAgICAgICAgICAgfQogICAgICAgICB9CiAgICAgfQogCi0gIHJldHVybiAwOwor ICByZXR1cm4gVEVYVEJJTl9VTktOT1dOOwogfQogCiAvKiBDb252ZXJ0IFNUUiB0byBhIG5v bm5lZ2F0aXZlIGludGVnZXIsIHN0b3JpbmcgdGhlIHJlc3VsdCBpbiAqT1VULgpAQCAtMTEy OSw3ICsxMTQ5LDcgQEAgc3RhdGljIGludG1heF90CiBncmVwIChpbnQgZmQsIHN0cnVjdCBz 
dGF0IGNvbnN0ICpzdCkKIHsKICAgaW50bWF4X3QgbmxpbmVzLCBpOwotICBpbnQgdGV4dGJp bjsKKyAgZW51bSB0ZXh0YmluIHRleHRiaW47CiAgIHNpemVfdCByZXNpZHVlLCBzYXZlOwog ICBjaGFyIG9sZGM7CiAgIGNoYXIgKmJlZzsKQEAgLTExNTksMTEgKzExNzksMTEgQEAgZ3Jl cCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAgICAgfQogCiAgIGlmIChiaW5h cnlfZmlsZXMgPT0gVEVYVF9CSU5BUllfRklMRVMpCi0gICAgdGV4dGJpbiA9IDE7CisgICAg dGV4dGJpbiA9IFRFWFRCSU5fVEVYVDsKICAgZWxzZQogICAgIHsKICAgICAgIHRleHRiaW4g PSBmaWxlX3RleHRiaW4gKGJ1ZmJlZywgYnVmbGltIC0gYnVmYmVnLCBmZCwgc3QpOwotICAg ICAgaWYgKHRleHRiaW4gPCAwKQorICAgICAgaWYgKHRleHRiaW5faXNfYmluYXJ5ICh0ZXh0 YmluKSkKICAgICAgICAgewogICAgICAgICAgIGlmIChiaW5hcnlfZmlsZXMgPT0gV0lUSE9V VF9NQVRDSF9CSU5BUllfRklMRVMpCiAgICAgICAgICAgICByZXR1cm4gMDsKQEAgLTEyMjMs OCArMTI0Myw4IEBAIGdyZXAgKGludCBmZCwgc3RydWN0IHN0YXQgY29uc3QgKnN0KQogICAg ICAgLyogRGV0ZWN0IHdoZXRoZXIgbGVhZGluZyBjb250ZXh0IGlzIGFkamFjZW50IHRvIHBy ZXZpb3VzIG91dHB1dC4gICovCiAgICAgICBpZiAobGFzdG91dCkKICAgICAgICAgewotICAg ICAgICAgIGlmICghdGV4dGJpbikKLSAgICAgICAgICAgIHRleHRiaW4gPSAxOworICAgICAg ICAgIGlmICh0ZXh0YmluID09IFRFWFRCSU5fVU5LTk9XTikKKyAgICAgICAgICAgIHRleHRi aW4gPSBURVhUQklOX1RFWFQ7CiAgICAgICAgICAgaWYgKGJlZyAhPSBsYXN0b3V0KQogICAg ICAgICAgICAgbGFzdG91dCA9IDA7CiAgICAgICAgIH0KQEAgLTEyNDMsMTIgKzEyNjMsMTYg QEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAKICAgICAgIC8qIElm IHRoZSBmaWxlJ3MgdGV4dGJpbiBoYXMgbm90IGJlZW4gZGV0ZXJtaW5lZCB5ZXQsIGFzc3Vt ZQogICAgICAgICAgaXQncyBiaW5hcnkgaWYgdGhlIG5leHQgaW5wdXQgYnVmZmVyIHN1Z2dl c3RzIHNvLiAgKi8KLSAgICAgIGlmICghIHRleHRiaW4gJiYgYnVmZmVyX3RleHRiaW4gKGJ1 ZmJlZywgYnVmbGltIC0gYnVmYmVnKSA8IDApCisgICAgICBpZiAodGV4dGJpbiA9PSBURVhU QklOX1VOS05PV04pCiAgICAgICAgIHsKLSAgICAgICAgICB0ZXh0YmluID0gLTE7Ci0gICAg ICAgICAgaWYgKGJpbmFyeV9maWxlcyA9PSBXSVRIT1VUX01BVENIX0JJTkFSWV9GSUxFUykK LSAgICAgICAgICAgIHJldHVybiAwOwotICAgICAgICAgIGRvbmVfb25fbWF0Y2ggPSBvdXRf cXVpZXQgPSB0cnVlOworICAgICAgICAgIGVudW0gdGV4dGJpbiB0YiA9IGJ1ZmZlcl90ZXh0 YmluIChidWZiZWcsIGJ1ZmxpbSAtIGJ1ZmJlZyk7CisgICAgICAgICAgaWYgKHRleHRiaW5f 
aXNfYmluYXJ5ICh0YikpCisgICAgICAgICAgICB7CisgICAgICAgICAgICAgIGlmIChiaW5h cnlfZmlsZXMgPT0gV0lUSE9VVF9NQVRDSF9CSU5BUllfRklMRVMpCisgICAgICAgICAgICAg ICAgcmV0dXJuIDA7CisgICAgICAgICAgICAgIHRleHRiaW4gPSB0YjsKKyAgICAgICAgICAg ICAgZG9uZV9vbl9tYXRjaCA9IG91dF9xdWlldCA9IHRydWU7CisgICAgICAgICAgICB9CiAg ICAgICAgIH0KICAgICB9CiAgIGlmIChyZXNpZHVlKQpAQCAtMTI2Myw3ICsxMjg3LDcgQEAg Z3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAgZmluaXNoX2dyZXA6CiAg IGRvbmVfb25fbWF0Y2ggPSBkb25lX29uX21hdGNoXzA7CiAgIG91dF9xdWlldCA9IG91dF9x dWlldF8wOwotICBpZiAodGV4dGJpbiA8IDAgJiYgIW91dF9xdWlldCAmJiBubGluZXMgIT0g MCkKKyAgaWYgKHRleHRiaW5faXNfYmluYXJ5ICh0ZXh0YmluKSAmJiAhb3V0X3F1aWV0ICYm IG5saW5lcyAhPSAwKQogICAgIHByaW50ZiAoXygiQmluYXJ5IGZpbGUgJXMgbWF0Y2hlc1xu IiksIGZpbGVuYW1lKTsKICAgcmV0dXJuIG5saW5lczsKIH0KLS0gCjEuOS4zCgo= --------------090004040002010007040709 Content-Type: text/plain; charset=UTF-8; name="0002-grep-z-no-longer-considers-200-to-be-binary-data.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename*0="0002-grep-z-no-longer-considers-200-to-be-binary-data.patch" RnJvbSAyNDFhNzYyM2NhNTE5YTdmZjcyNjAxZGU4MDlmYjUyMzkzNDk0MzU0IE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE2OjE4OjAwIC0wNzAwClN1YmplY3Q6IFtQQVRD SCAyLzZdIGdyZXA6IC16IG5vIGxvbmdlciBjb25zaWRlcnMgJ1wyMDAnIHRvIGJlIGJpbmFy eSBkYXRhCgpUaGlzIGF2b2lkcyBhIHByb2JsZW0gd2hlbiB1c2luZyBncmVwIC16IGluIGEg V2luZG93cy0xMjUyIGxvY2FsZS4KUGx1cywgaXQgbGV0cyAnZ3JlcCAteicgcnVuIGEgYml0 IGZhc3Rlci4KKiBORVdTOiBEb2N1bWVudCB0aGlzLgoqIHNyYy9ncmVwLmMgKGJ1ZmZlcl90 ZXh0YmluKTogRG9uJ3QgbG9vayBmb3IgJ1wyMDAnIGlmIC16LgoqIHRlc3RzL3BjcmUtejog VGVzdCBmb3IgbmV3IGJlaGF2aW9yLgotLS0KIE5FV1MgICAgICAgICB8ICAyICsrCiBzcmMv Z3JlcC5jICAgfCAxMiArKystLS0tLS0tLS0KIHRlc3RzL3BjcmUteiB8ICA0ICsrKysKIDMg ZmlsZXMgY2hhbmdlZCwgOSBpbnNlcnRpb25zKCspLCA5IGRlbGV0aW9ucygtKQoKZGlmZiAt LWdpdCBhL05FV1MgYi9ORVdTCmluZGV4IDkzNzdkN2QuLjUxYjYzZmIgMTAwNjQ0Ci0tLSBh 
L05FV1MKKysrIGIvTkVXUwpAQCAtMjYsNiArMjYsOCBAQCBHTlUgZ3JlcCBORVdTICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgLSotIG91dGxpbmUgLSotCiAgIEluIGxv Y2FsZXMgd2l0aCBtdWx0aWJ5dGUgY2hhcmFjdGVyIGVuY29kaW5ncyBvdGhlciB0aGFuIFVU Ri04LAogICBncmVwIC1QIG5vdyByZXBvcnRzIGFuIGVycm9yIGFuZCBleGl0cyBpbnN0ZWFk IG9mIG1pc2JlaGF2aW5nLgogCisgIGdyZXAgLXogbm8gbG9uZ2VyIGF1dG9tYXRpY2FsbHkg dHJlYXRzIHRoZSBieXRlICdcMjAwJyBhcyBiaW5hcnkgZGF0YS4KKwogKiBOb3Rld29ydGh5 IGNoYW5nZXMgaW4gcmVsZWFzZSAyLjIwICgyMDE0LTA2LTAzKSBbc3RhYmxlXQogCiAqKiBC dWcgZml4ZXMKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IDFh YTY0ZGIuLjFjNmZlZTggMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAu YwpAQCAtNDYyLDE0ICs0NjIsMTAgQEAgdGV4dGJpbl9pc19iaW5hcnkgKGVudW0gdGV4dGJp biB0ZXh0YmluKQogc3RhdGljIGVudW0gdGV4dGJpbgogYnVmZmVyX3RleHRiaW4gKGNoYXIg Y29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiB7Ci0gIGNoYXIgYmFkYnl0ZSA9IGVvbGJ5dGUg PyAnXDAnIDogJ1wyMDAnOworICBpZiAoZW9sYnl0ZSAmJiBtZW1jaHIgKGJ1ZiwgJ1wwJywg c2l6ZSkpCisgICAgcmV0dXJuIFRFWFRCSU5fQklOQVJZOwogCi0gIGlmIChNQl9DVVJfTUFY IDw9IDEpCi0gICAgewotICAgICAgaWYgKG1lbWNociAoYnVmLCBiYWRieXRlLCBzaXplKSkK LSAgICAgICAgcmV0dXJuIFRFWFRCSU5fQklOQVJZOwotICAgIH0KLSAgZWxzZQorICBpZiAo MSA8IE1CX0NVUl9NQVgpCiAgICAgewogICAgICAgbWJzdGF0ZV90IG1icyA9IHsgMCB9Owog ICAgICAgc2l6ZV90IGNsZW47CkBAIC00NzcsOCArNDczLDYgQEAgYnVmZmVyX3RleHRiaW4g KGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAKICAgICAgIGZvciAocCA9IGJ1Zjsg cCA8IGJ1ZiArIHNpemU7IHAgKz0gY2xlbikKICAgICAgICAgewotICAgICAgICAgIGlmICgq cCA9PSBiYWRieXRlKQotICAgICAgICAgICAgcmV0dXJuIFRFWFRCSU5fQklOQVJZOwogICAg ICAgICAgIGNsZW4gPSBtYl9jbGVuIChwLCBidWYgKyBzaXplIC0gcCwgJm1icyk7CiAgICAg ICAgICAgaWYgKChzaXplX3QpIC0yIDw9IGNsZW4pCiAgICAgICAgICAgICByZXR1cm4gY2xl biA9PSAoc2l6ZV90KSAtMiA/IFRFWFRCSU5fVU5LTk9XTiA6IFRFWFRCSU5fQklOQVJZOwpk aWZmIC0tZ2l0IGEvdGVzdHMvcGNyZS16IGIvdGVzdHMvcGNyZS16CmluZGV4IDk5ZWJjNDMu LjZiYmRlOTQgMTAwNzU1Ci0tLSBhL3Rlc3RzL3BjcmUtegorKysgYi90ZXN0cy9wY3JlLXoK QEAgLTIwLDQgKzIwLDggQEAgZ3JlcCAtUHogIiRSRUdFWCIgaW4gPiBvdXQgMj5lcnIgfHwg 
ZmFpbD0xCiBjb21wYXJlIGV4cCBvdXQgfHwgZmFpbD0xCiBjb21wYXJlIC9kZXYvbnVsbCBl cnIgfHwgZmFpbD0xCiAKK3ByaW50ZiAnXDIwMFwwJyA+aW4wCitMQ19BTEw9QyBncmVwIC16 IC4gaW4wID5vdXQgfHwgZmFpbD0xCitjb21wYXJlIGluMCBvdXQgfHwgZmFpbD0xCisKIEV4 aXQgJGZhaWwKLS0gCjEuOS4zCgo= --------------090004040002010007040709 Content-Type: text/plain; charset=UTF-8; name="0003-grep-non-text-bytes-in-binary-data-may-be-treated-as.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename*0="0003-grep-non-text-bytes-in-binary-data-may-be-treated-as.pa"; filename*1="tch" RnJvbSBjMGM2OTBiZTE1MGQyNjA5N2MzNTgxMDM5YzdlODgyYjJhM2UxOWQ4IE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE3OjE1OjA2IC0wNzAwClN1YmplY3Q6IFtQQVRD SCAzLzZdIGdyZXA6IG5vbi10ZXh0IGJ5dGVzIGluIGJpbmFyeSBkYXRhIG1heSBiZSB0cmVh dGVkIGFzCiBsaW5lIGVuZHMKCiogTkVXUywgZG9jL2dyZXAudGV4aSAoRmlsZSBhbmQgRGly ZWN0b3J5IFNlbGVjdGlvbik6CkRvY3VtZW50IHRoaXMgY2hhbmdlLgoqIHNyYy9ncmVwLmMg KHphcF9udWxzKTogTmV3IGZ1bmN0aW9uLgooZ3JlcCk6IFVzZSBpdC4KKiB0ZXN0cy9udWxs LWJ5dGU6IFJlbGF4IHRvIGFsbG93IG5ldyBiZWhhdmlvci4KLS0tCiBORVdTICAgICAgICAg ICAgfCAgMyArKysKIGRvYy9ncmVwLnRleGkgICB8ICAyICsrCiBzcmMvZ3JlcC5jICAgICAg fCAyOCArKysrKysrKysrKysrKysrKysrKysrKysrKystCiB0ZXN0cy9udWxsLWJ5dGUgfCAg NCArKy0tCiA0IGZpbGVzIGNoYW5nZWQsIDM0IGluc2VydGlvbnMoKyksIDMgZGVsZXRpb25z KC0pCgpkaWZmIC0tZ2l0IGEvTkVXUyBiL05FV1MKaW5kZXggNTFiNjNmYi4uNzMzMzE4ZCAx MDA2NDQKLS0tIGEvTkVXUworKysgYi9ORVdTCkBAIC0yNiw2ICsyNiw5IEBAIEdOVSBncmVw IE5FV1MgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAtKi0gb3V0bGluZSAt Ki0KICAgSW4gbG9jYWxlcyB3aXRoIG11bHRpYnl0ZSBjaGFyYWN0ZXIgZW5jb2RpbmdzIG90 aGVyIHRoYW4gVVRGLTgsCiAgIGdyZXAgLVAgbm93IHJlcG9ydHMgYW4gZXJyb3IgYW5kIGV4 aXRzIGluc3RlYWQgb2YgbWlzYmVoYXZpbmcuCiAKKyAgV2hlbiBzZWFyY2hpbmcgYmluYXJ5 IGRhdGEsIGdyZXAgbm93IG1heSB0cmVhdCBub24tdGV4dCBieXRlcyBhcworICBsaW5lIHRl cm1pbmF0b3JzLiAgVGhpcyBjYW4gYm9vc3QgcGVyZm9ybWFuY2Ugc2lnbmlmaWNhbnRseS4K 
KwogICBncmVwIC16IG5vIGxvbmdlciBhdXRvbWF0aWNhbGx5IHRyZWF0cyB0aGUgYnl0ZSAn XDIwMCcgYXMgYmluYXJ5IGRhdGEuCiAKICogTm90ZXdvcnRoeSBjaGFuZ2VzIGluIHJlbGVh c2UgMi4yMCAoMjAxNC0wNi0wMykgW3N0YWJsZV0KZGlmZiAtLWdpdCBhL2RvYy9ncmVwLnRl eGkgYi9kb2MvZ3JlcC50ZXhpCmluZGV4IDE0YmQ2OWUuLmQ3YWRjYWQgMTAwNjQ0Ci0tLSBh L2RvYy9ncmVwLnRleGkKKysrIGIvZG9jL2dyZXAudGV4aQpAQCAtNjAwLDYgKzYwMCw4IEBA IEJ5IGRlZmF1bHQsIEB2YXJ7dHlwZX0gaXMgQHNhbXB7YmluYXJ5fSwKIGFuZCBAY29tbWFu ZHtncmVwfSBub3JtYWxseSBvdXRwdXRzIGVpdGhlcgogYSBvbmUtbGluZSBtZXNzYWdlIHNh eWluZyB0aGF0IGEgYmluYXJ5IGZpbGUgbWF0Y2hlcywKIG9yIG5vIG1lc3NhZ2UgaWYgdGhl cmUgaXMgbm8gbWF0Y2guCitXaGVuIG1hdGNoaW5nIGJpbmFyeSBkYXRhLCBAY29tbWFuZHtn cmVwfSBtYXkgdHJlYXQgbm9uLXRleHQKK2J5dGVzIGFzIGxpbmUgdGVybWluYXRvcnMuCiAK IElmIEB2YXJ7dHlwZX0gaXMgQHNhbXB7d2l0aG91dC1tYXRjaH0sCiBAY29tbWFuZHtncmVw fSBhc3N1bWVzIHRoYXQgYSBiaW5hcnkgZmlsZSBkb2VzIG5vdCBtYXRjaDsKZGlmZiAtLWdp dCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IDFjNmZlZTguLjgzNTU5ZTIgMTAw NjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAuYwpAQCAtMTA5Myw5ICsxMDkz LDMwIEBAIHBydGV4dCAoY2hhciBjb25zdCAqYmVnLCBjaGFyIGNvbnN0ICpsaW0pCiAgIG91 dGxlZnQgLT0gbjsKIH0KIAorLyogUmVwbGFjZSBhbGwgTlVMIGJ5dGVzIGluIGJ1ZmZlciBQ ICh3aGljaCBlbmRzIGF0IExJTSkgd2l0aCBFT0wuCisgICBUaGlzIGF2b2lkcyBydW5uaW5n IG91dCBvZiBtZW1vcnkgd2hlbiBiaW5hcnkgaW5wdXQgY29udGFpbnMgYSBsb25nCisgICBz ZXF1ZW5jZSBvZiB6ZXJvcywgd2hpY2ggd291bGQgb3RoZXJ3aXNlIGJlIGNvbnNpZGVyZWQg dG8gYmUgcGFydAorICAgb2YgYSBsb25nIGxpbmUuICBQW0xJTV0gc2hvdWxkIGJlIEVPTC4g ICovCitzdGF0aWMgdm9pZAoremFwX251bHMgKGNoYXIgKnAsIGNoYXIgKmxpbSwgY2hhciBl b2wpCit7CisgIGlmIChlb2wpCisgICAgd2hpbGUgKHRydWUpCisgICAgICB7CisgICAgICAg ICpsaW0gPSAnXDAnOworICAgICAgICBwICs9IHN0cmxlbiAocCk7CisgICAgICAgICpsaW0g PSBlb2w7CisgICAgICAgIGlmIChwID09IGxpbSkKKyAgICAgICAgICBicmVhazsKKyAgICAg ICAgZG8KKyAgICAgICAgICAqcCsrID0gZW9sOworICAgICAgICB3aGlsZSAoISpwKTsKKyAg ICAgIH0KK30KKwogLyogU2NhbiB0aGUgc3BlY2lmaWVkIHBvcnRpb24gb2YgdGhlIGJ1ZmZl ciwgbWF0Y2hpbmcgbGluZXMgKG9yCiAgICBiZXR3ZWVuIG1hdGNoaW5nIGxpbmVzIGlmIE9V 
VF9JTlZFUlQgaXMgdHJ1ZSkuICBSZXR1cm4gYSBjb3VudCBvZgotICAgbGluZXMgcHJpbnRl ZC4gKi8KKyAgIGxpbmVzIHByaW50ZWQuICBSZXBsYWNlIGFsbCBOVUwgYnl0ZXMgd2l0aCBO VUxfWkFQUEVSIGFzIHdlIGdvLiAgKi8KIHN0YXRpYyBpbnRtYXhfdAogZ3JlcGJ1ZiAoY2hh ciBjb25zdCAqYmVnLCBjaGFyIGNvbnN0ICpsaW0pCiB7CkBAIC0xMTQ5LDYgKzExNzAsNyBA QCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgY2hhciAqYmVnOwog ICBjaGFyICpsaW07CiAgIGNoYXIgZW9sID0gZW9sYnl0ZTsKKyAgY2hhciBudWxfemFwcGVy ID0gJ1wwJzsKICAgYm9vbCBkb25lX29uX21hdGNoXzAgPSBkb25lX29uX21hdGNoOwogICBi b29sIG91dF9xdWlldF8wID0gb3V0X3F1aWV0OwogCkBAIC0xMTgyLDYgKzEyMDQsNyBAQCBn cmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgICAgICAgICBpZiAoYmlu YXJ5X2ZpbGVzID09IFdJVEhPVVRfTUFUQ0hfQklOQVJZX0ZJTEVTKQogICAgICAgICAgICAg cmV0dXJuIDA7CiAgICAgICAgICAgZG9uZV9vbl9tYXRjaCA9IG91dF9xdWlldCA9IHRydWU7 CisgICAgICAgICAgbnVsX3phcHBlciA9IGVvbDsKICAgICAgICAgfQogICAgIH0KIApAQCAt MTE5Nyw2ICsxMjIwLDggQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3Qp CiAgICAgICBpZiAoYmVnID09IGJ1ZmxpbSkKICAgICAgICAgYnJlYWs7CiAKKyAgICAgIHph cF9udWxzIChiZWcsIGJ1ZmxpbSwgbnVsX3phcHBlcik7CisKICAgICAgIC8qIERldGVybWlu ZSBuZXcgcmVzaWR1ZSAodGhlIGxlbmd0aCBvZiBhbiBpbmNvbXBsZXRlIGxpbmUgYXQgdGhl IGVuZCBvZgogICAgICAgICAgdGhlIGJ1ZmZlciwgMCBtZWFucyB0aGVyZSBpcyBubyBpbmNv bXBsZXRlIGxhc3QgbGluZSkuICAqLwogICAgICAgb2xkYyA9IGJlZ1stMV07CkBAIC0xMjY2 LDYgKzEyOTEsNyBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAg ICAgICAgICAgICAgICByZXR1cm4gMDsKICAgICAgICAgICAgICAgdGV4dGJpbiA9IHRiOwog ICAgICAgICAgICAgICBkb25lX29uX21hdGNoID0gb3V0X3F1aWV0ID0gdHJ1ZTsKKyAgICAg ICAgICAgICAgbnVsX3phcHBlciA9IGVvbDsKICAgICAgICAgICAgIH0KICAgICAgICAgfQog ICAgIH0KZGlmZiAtLWdpdCBhL3Rlc3RzL251bGwtYnl0ZSBiL3Rlc3RzL251bGwtYnl0ZQpp bmRleCBjOTY3ZGJjLi4xZDgwYmZlIDEwMDc1NQotLS0gYS90ZXN0cy9udWxsLWJ5dGUKKysr IGIvdGVzdHMvbnVsbC1ieXRlCkBAIC0zOCw4ICszOCw4IEBAIGZvciBsZWZ0IGluICcnIGEg JyMnICdcMCc7IGRvCiAgICAgICAgICAgcGF0PSIkaGF0JGZvcmNlX3JlZ2V4JGRhdGEkZG9s bGFyIgogICAgICAgICAgIHByaW50ZiAiJHBhdFxcbiIgPnBhdCB8fCBmcmFtZXdvcmtfZmFp 
bHVyZV8KICAgICAgICAgICBmb3IgbG9jYWxlIGluICRsb2NhbGVzOyBkbwotICAgICAgICAg ICAgTENfQUxMPSRsb2NhbGUgZ3JlcCAtZiBwYXQgaW4gfHwKLSAgICAgICAgICAgICAgZmFp bF8gIickcGF0JyBkb2VzIG5vdCBtYXRjaCAnJGRhdGEnIgorICAgICAgICAgICAgTENfQUxM PSRsb2NhbGUgZ3JlcCAtZiBwYXQgaW4KKyAgICAgICAgICAgIHRlc3QgJD8gLWVxIDAgfHwg dGVzdCAkPyAtZXEgMSB8fCBmYWlsXyAiJyRwYXQnIGNhdXNlZCBhbiBlcnJvciIKICAgICAg ICAgICAgIExDX0FMTD0kbG9jYWxlIGdyZXAgLWEgLWYgcGF0IGluIHwgY21wIC1zIC0gaW4g fHwKICAgICAgICAgICAgICAgZmFpbF8gIi1hICckcGF0JyBkb2VzIG5vdCBtYXRjaCAnJGRh dGEnIgogICAgICAgICAgIGRvbmUKLS0gCjEuOS4zCgo= --------------090004040002010007040709 Content-Type: text/plain; charset=UTF-8; name="0004-grep-minor-P-speedup-with-jit_stack.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="0004-grep-minor-P-speedup-with-jit_stack.patch" RnJvbSBmMTYzMTM4NzNjYzg2ODZjMTVkOWI3MWY5YWZlYTdiOWI0YzRkMTRiIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE5OjU2OjU1IC0wNzAwClN1YmplY3Q6IFtQQVRD SCA0LzZdIGdyZXA6IG1pbm9yIC1QIHNwZWVkdXAgd2l0aCBqaXRfc3RhY2sKCiogc3JjL3Bj cmVzZWFyY2guYyAoaml0X3N0YWNrKTogTm8gbG9uZ2VyIHN0YXRpYy4KLS0tCiBzcmMvcGNy ZXNlYXJjaC5jIHwgNiArKy0tLS0KIDEgZmlsZSBjaGFuZ2VkLCAyIGluc2VydGlvbnMoKyks IDQgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvc3JjL3BjcmVzZWFyY2guYyBiL3NyYy9w Y3Jlc2VhcmNoLmMKaW5kZXggYzQxZjdlZi4uMWIxNWU1MyAxMDA2NDQKLS0tIGEvc3JjL3Bj cmVzZWFyY2guYworKysgYi9zcmMvcGNyZXNlYXJjaC5jCkBAIC0zMyw5ICszMyw3IEBAIHN0 YXRpYyBwY3JlICpjcmU7CiAvKiBBZGRpdGlvbmFsIGluZm9ybWF0aW9uIGFib3V0IHRoZSBw YXR0ZXJuLiAgKi8KIHN0YXRpYyBwY3JlX2V4dHJhICpleHRyYTsKIAotIyBpZmRlZiBQQ1JF X1NUVURZX0pJVF9DT01QSUxFCi1zdGF0aWMgcGNyZV9qaXRfc3RhY2sgKmppdF9zdGFjazsK LSMgZWxzZQorIyBpZm5kZWYgUENSRV9TVFVEWV9KSVRfQ09NUElMRQogIyAgZGVmaW5lIFBD UkVfU1RVRFlfSklUX0NPTVBJTEUgMAogIyBlbmRpZgogI2VuZGlmCkBAIC0xMjYsNyArMTI0 LDcgQEAgUGNvbXBpbGUgKGNoYXIgY29uc3QgKnBhdHRlcm4sIHNpemVfdCBzaXplKQogICAg ICAgLyogQSAzMksgc3RhY2sgaXMgYWxsb2NhdGVkIGZvciB0aGUgbWFjaGluZSBjb2RlIGJ5 
IGRlZmF1bHQsIHdoaWNoCiAgICAgICAgICBjYW4gZ3JvdyB0byA1MTJLIGlmIG5lY2Vzc2Fy eS4gU2luY2UgSklUIHVzZXMgZmFyIGxlc3MgbWVtb3J5CiAgICAgICAgICB0aGFuIHRoZSBp bnRlcnByZXRlciwgdGhpcyBzaG91bGQgYmUgZW5vdWdoIGluIHByYWN0aWNlLiAgKi8KLSAg ICAgIGppdF9zdGFjayA9IHBjcmVfaml0X3N0YWNrX2FsbG9jICgzMiAqIDEwMjQsIDUxMiAq IDEwMjQpOworICAgICAgcGNyZV9qaXRfc3RhY2sgKmppdF9zdGFjayA9IHBjcmVfaml0X3N0 YWNrX2FsbG9jICgzMiAqIDEwMjQsIDUxMiAqIDEwMjQpOwogICAgICAgaWYgKCFqaXRfc3Rh Y2spCiAgICAgICAgIGVycm9yIChFWElUX1RST1VCTEUsIDAsCiAgICAgICAgICAgICAgICBf KCJmYWlsZWQgdG8gYWxsb2NhdGUgbWVtb3J5IGZvciB0aGUgUENSRSBKSVQgc3RhY2siKSk7 Ci0tIAoxLjkuMwoK --------------090004040002010007040709 Content-Type: text/plain; charset=UTF-8; name="0005-grep-improve-P-performance-in-typical-cases.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="0005-grep-improve-P-performance-in-typical-cases.patch" RnJvbSBlN2NhMjUyMjAyYTM0ODUwYjhiODVlODUwN2I0YjllZDc1ZWY4Y2RmIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBUdWUsIDE2IFNlcCAyMDE0IDE1OjQ4OjQ0IC0wNzAwClN1YmplY3Q6IFtQQVRD SCA1LzZdIGdyZXA6IGltcHJvdmUgLVAgcGVyZm9ybWFuY2UgaW4gdHlwaWNhbCBjYXNlcwoK KiBzcmMvZ3JlcC5jLCBzcmMvZ3JlcC5oIChlbnVtIHRleHRiaW4pOiBNb3ZlIHRvIGdyZXAu aC4KKGlucHV0X3RleHRiaW4sIHZhbGlkYXRlZF9ib3VuZGFyeSk6IE5ldyB2YXJzLgoqIHNy Yy9ncmVwLmMgKGdyZXBidWYsIGdyZXApOiBJbml0aWFsaXplIHRoZW0uCiogc3JjL3BjcmVz ZWFyY2guYyAoUGV4ZWN1dGUpOiBEbyBhIG11bHRpbGluZSBzZWFyY2gKd2hlbiB0aGUgaW5w dXQgaXMga25vd24gdG8gYmUgZnJlZSBvZiBlbmNvZGluZyBlcnJvcnMuClF1aWNrbHkgZGlz Y2FyZCBieXRlcyB0aGF0IGFyZSBvYnZpb3VzbHkgZW5jb2RpbmcgZXJyb3JzLgpRdWlja2x5 IG1hdGNoIGVtcHR5IHN0cmluZ3MuCi0tLQogc3JjL2dyZXAuYyAgICAgICB8ICAxOSArKy0t LS0tLS0KIHNyYy9ncmVwLmggICAgICAgfCAgMjIgKysrKysrKysrKwogc3JjL3BjcmVzZWFy Y2guYyB8IDEyMCArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysr KysrKy0tLS0tLS0tCiAzIGZpbGVzIGNoYW5nZWQsIDEzMCBpbnNlcnRpb25zKCspLCAzMSBk ZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5jIGIvc3JjL2dyZXAuYwppbmRl 
eCA4MzU1OWUyLi5lM2M0OTI1IDEwMDY0NAotLS0gYS9zcmMvZ3JlcC5jCisrKyBiL3NyYy9n cmVwLmMKQEAgLTM1MSw2ICszNTEsOCBAQCBib29sIG1hdGNoX2ljYXNlOwogYm9vbCBtYXRj aF93b3JkczsKIGJvb2wgbWF0Y2hfbGluZXM7CiB1bnNpZ25lZCBjaGFyIGVvbGJ5dGU7Citl bnVtIHRleHRiaW4gaW5wdXRfdGV4dGJpbjsKK2NoYXIgY29uc3QgKnZhbGlkYXRlZF9ib3Vu ZGFyeTsKIAogc3RhdGljIGNoYXIgY29uc3QgKm1hdGNoZXI7CiAKQEAgLTQzNywyMSArNDM5 LDYgQEAgY2xlYW5fdXBfc3Rkb3V0ICh2b2lkKQogICAgIGNsb3NlX3N0ZG91dCAoKTsKIH0K IAotLyogQW4gZW51bSB0ZXh0YmluIGRlc2NyaWJlcyB0aGUgZmlsZSdzIHR5cGUsIGluZmVy cmVkIGZyb20gZGF0YSByZWFkCi0gICBiZWZvcmUgdGhlIGZpcnN0IGxpbmUgaXMgc2VsZWN0 ZWQgZm9yIG91dHB1dC4gICovCi1lbnVtIHRleHRiaW4KLSAgewotICAgIC8qIEJpbmFyeSwg YXMgaXQgY29udGFpbnMgbnVsbCBieXRlcyBhbmQgdGhlIC16IG9wdGlvbiBpcyBub3QgaW4g ZWZmZWN0LAotICAgICAgIG9yIGl0IGNvbnRhaW5zIGVuY29kaW5nIGVycm9ycy4gICovCi0g ICAgVEVYVEJJTl9CSU5BUlkgPSAtMSwKLQotICAgIC8qIE5vdCBrbm93biB5ZXQuICBPbmx5 IHRleHQgaGFzIGJlZW4gc2VlbiBzbyBmYXIuICAqLwotICAgIFRFWFRCSU5fVU5LTk9XTiA9 IDAsCi0KLSAgICAvKiBUZXh0LiAgKi8KLSAgICBURVhUQklOX1RFWFQgPSAxCi0gIH07Ci0K IHN0YXRpYyBib29sCiB0ZXh0YmluX2lzX2JpbmFyeSAoZW51bSB0ZXh0YmluIHRleHRiaW4p CiB7CkBAIC0xMTIzLDYgKzExMTAsNyBAQCBncmVwYnVmIChjaGFyIGNvbnN0ICpiZWcsIGNo YXIgY29uc3QgKmxpbSkKICAgaW50bWF4X3Qgb3V0bGVmdDAgPSBvdXRsZWZ0OwogICBjaGFy IGNvbnN0ICpwOwogICBjaGFyIGNvbnN0ICplbmRwOworICB2YWxpZGF0ZWRfYm91bmRhcnkg PSBiZWc7CiAKICAgZm9yIChwID0gYmVnOyBwIDwgbGltOyBwID0gZW5kcCkKICAgICB7CkBA IC0xMjEwLDYgKzExOTgsNyBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpz dCkKIAogICBmb3IgKDs7KQogICAgIHsKKyAgICAgIGlucHV0X3RleHRiaW4gPSB0ZXh0Ymlu OwogICAgICAgbGFzdG5sID0gYnVmYmVnOwogICAgICAgaWYgKGxhc3RvdXQpCiAgICAgICAg IGxhc3RvdXQgPSBidWZiZWc7CmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5oIGIvc3JjL2dyZXAu aAppbmRleCA1NDk2ZWIyLi4yM2Q0ZTk1IDEwMDY0NAotLS0gYS9zcmMvZ3JlcC5oCisrKyBi L3NyYy9ncmVwLmgKQEAgLTI5LDQgKzI5LDI2IEBAIGV4dGVybiBib29sIG1hdGNoX3dvcmRz OwkvKiAtdyAqLwogZXh0ZXJuIGJvb2wgbWF0Y2hfbGluZXM7CS8qIC14ICovCiBleHRlcm4g dW5zaWduZWQgY2hhciBlb2xieXRlOwkvKiAteiAqLwogCisvKiBBbiBlbnVtIHRleHRiaW4g 
ZGVzY3JpYmVzIHRoZSBmaWxlJ3MgdHlwZSwgaW5mZXJyZWQgZnJvbSBkYXRhIHJlYWQKKyAg IGJlZm9yZSB0aGUgZmlyc3QgbGluZSBpcyBzZWxlY3RlZCBmb3Igb3V0cHV0LiAgKi8KK2Vu dW0gdGV4dGJpbgorICB7CisgICAgLyogQmluYXJ5LCBhcyBpdCBjb250YWlucyBudWxsIGJ5 dGVzIGFuZCB0aGUgLXogb3B0aW9uIGlzIG5vdCBpbiBlZmZlY3QsCisgICAgICAgb3IgaXQg Y29udGFpbnMgZW5jb2RpbmcgZXJyb3JzLiAgKi8KKyAgICBURVhUQklOX0JJTkFSWSA9IC0x LAorCisgICAgLyogTm90IGtub3duIHlldC4gIE9ubHkgdGV4dCBoYXMgYmVlbiBzZWVuIHNv IGZhci4gICovCisgICAgVEVYVEJJTl9VTktOT1dOID0gMCwKKworICAgIC8qIFRleHQuICAq LworICAgIFRFWFRCSU5fVEVYVCA9IDEKKyAgfTsKKworLyogSW5wdXQgZmlsZSB0eXBlLiAg Ki8KK2V4dGVybiBlbnVtIHRleHRiaW4gaW5wdXRfdGV4dGJpbjsKKworLyogVmFsaWRhdGlv biBib3VuZGFyeS4gIEVhcmxpZXIgYnl0ZXMgaGF2ZSBhbHJlYWR5IGJlZW4gdmFsaWRhdGVk IGJ5CisgICB0aGUgUENSRSBtYXRjaGVyLCB3aGljaCBjYXJlcyBhYm91dCB0aGlzIHNvcnQg b2YgdGhpbmcuICAqLworZXh0ZXJuIGNoYXIgY29uc3QgKnZhbGlkYXRlZF9ib3VuZGFyeTsK KwogI2VuZGlmCmRpZmYgLS1naXQgYS9zcmMvcGNyZXNlYXJjaC5jIGIvc3JjL3BjcmVzZWFy Y2guYwppbmRleCAxYjE1ZTUzLi42ZjAxNmI2IDEwMDY0NAotLS0gYS9zcmMvcGNyZXNlYXJj aC5jCisrKyBiL3NyYy9wY3Jlc2VhcmNoLmMKQEAgLTE1NiwyOCArMTU2LDkxIEBAIFBleGVj dXRlIChjaGFyIGNvbnN0ICpidWYsIHNpemVfdCBzaXplLCBzaXplX3QgKm1hdGNoX3NpemUs CiAgIGNoYXIgY29uc3QgKmxpbmVfc3RhcnQgPSBidWY7CiAgIGludCBlID0gUENSRV9FUlJP Ul9OT01BVENIOwogICBjaGFyIGNvbnN0ICpsaW5lX2VuZDsKKyAgY2hhciBjb25zdCAqdmFs aWRhdGVkID0gdmFsaWRhdGVkX2JvdW5kYXJ5OworCisgIC8qIElmIHRoZSBpbnB1dCB0eXBl IGlzIHVua25vd24sIHRoZSBjYWxsZXIgaXMgc3RpbGwgdGVzdGluZyB0aGUKKyAgICAgaW5w dXQsIHdoaWNoIG1lYW5zIHRoZSBjdXJyZW50IGJ1ZmZlciBjYW5ub3QgY29udGFpbiBlbmNv ZGluZworICAgICBlcnJvcnMgYW5kIGEgbXVsdGlsaW5lIHNlYXJjaCBpcyB0eXBpY2FsbHkg bW9yZSBlZmZpY2llbnQuCisgICAgIE90aGVyd2lzZSwgYSBzaW5nbGUtbGluZSBzZWFyY2gg aXMgdHlwaWNhbGx5IGZhc3Rlciwgc28gdGhhdAorICAgICBwY3JlX2V4ZWMgZG9lc24ndCB3 YXN0ZSB0aW1lIHZhbGlkYXRpbmcgdGhlIGVudGlyZSBpbnB1dAorICAgICBidWZmZXIuICAq LworICBib29sIG11bHRpbGluZSA9IGlucHV0X3RleHRiaW4gPT0gVEVYVEJJTl9VTktOT1dO OwogCi0gIC8qIHBjcmVfZXhlYyBtaXNoYW5kbGVzIG1hdGNoZXMgdGhhdCBjcm9zcyBsaW5l 
IGJvdW5kYXJpZXMuCi0gICAgIFBDUkVfTVVMVElMSU5FIGlzbid0IGEgd2luLCBwYXJ0bHkg YmVjYXVzZSBpdCdzIGluY29tcGF0aWJsZSB3aXRoCi0gICAgIC16LCBhbmQgcGFydGx5IGJl Y2F1c2UgaXQgY2hlY2tzIHRoZSBlbnRpcmUgaW5wdXQgYnVmZmVyIGFuZCBpcwotICAgICB0 aGVyZWZvcmUgc2xvdyBvbiBhIGxhcmdlIGJ1ZmZlciBjb250YWluaW5nIG1hbnkgbWF0Y2hl cy4KLSAgICAgQXZvaWQgdGhlc2UgcHJvYmxlbXMgYnkgbWF0Y2hpbmcgbGluZS1ieS1saW5l LiAgKi8KICAgZm9yICg7IHAgPCBidWYgKyBzaXplOyBwID0gbGluZV9zdGFydCA9IGxpbmVf ZW5kICsgMSkKICAgICB7Ci0gICAgICBsaW5lX2VuZCA9IG1lbWNociAocCwgZW9sYnl0ZSwg YnVmICsgc2l6ZSAtIHApOworICAgICAgYm9vbCB0b29fYmlnOwogCi0gICAgICBpZiAoSU5U X01BWCA8IGxpbmVfZW5kIC0gcCkKKyAgICAgIGlmIChtdWx0aWxpbmUpCisgICAgICAgIHsK KyAgICAgICAgICBzaXplX3QgcGNyZV9zaXplX21heCA9IE1JTiAoSU5UX01BWCwgU0laRV9N QVggLSAxKTsKKyAgICAgICAgICBzaXplX3Qgc2Nhbl9zaXplID0gTUlOIChwY3JlX3NpemVf bWF4ICsgMSwgYnVmICsgc2l6ZSAtIHApOworICAgICAgICAgIGxpbmVfZW5kID0gbWVtcmNo ciAocCwgZW9sYnl0ZSwgc2Nhbl9zaXplKTsKKyAgICAgICAgICB0b29fYmlnID0gISBsaW5l X2VuZDsKKyAgICAgICAgfQorICAgICAgZWxzZQorICAgICAgICB7CisgICAgICAgICAgbGlu ZV9lbmQgPSBtZW1jaHIgKHAsIGVvbGJ5dGUsIGJ1ZiArIHNpemUgLSBwKTsKKyAgICAgICAg ICB0b29fYmlnID0gSU5UX01BWCA8IGxpbmVfZW5kIC0gcDsKKyAgICAgICAgfQorCisgICAg ICBpZiAodG9vX2JpZykKICAgICAgICAgZXJyb3IgKEVYSVRfVFJPVUJMRSwgMCwgXygiZXhj ZWVkZWQgUENSRSdzIGxpbmUgbGVuZ3RoIGxpbWl0IikpOwogCi0gICAgICAvKiBUcmVhdCBl bmNvZGluZy1lcnJvciBieXRlcyBhcyBkYXRhIHRoYXQgY2Fubm90IG1hdGNoLiAgKi8KICAg ICAgIGZvciAoOzspCiAgICAgICAgIHsKLSAgICAgICAgICBpbnQgb3B0aW9ucyA9IGJvbCA/ IDAgOiBQQ1JFX05PVEJPTDsKLSAgICAgICAgICBpbnQgdmFsaWRfYnl0ZXM7Ci0gICAgICAg ICAgZSA9IHBjcmVfZXhlYyAoY3JlLCBleHRyYSwgcCwgbGluZV9lbmQgLSBwLCAwLCBvcHRp b25zLCBzdWIsIE5TVUIpOwotICAgICAgICAgIGlmIChlICE9IFBDUkVfRVJST1JfQkFEVVRG OCkKLSAgICAgICAgICAgIGJyZWFrOwotICAgICAgICAgIHZhbGlkX2J5dGVzID0gc3ViWzBd OworICAgICAgICAgIC8qIFNraXAgcGFzdCBieXRlcyB0aGF0IGFyZSBlYXNpbHkgZGV0ZXJt aW5lZCB0byBiZSBlbmNvZGluZworICAgICAgICAgICAgIGVycm9ycywgdHJlYXRpbmcgdGhl bSBhcyBkYXRhIHRoYXQgY2Fubm90IG1hdGNoLiAgVGhpcyBpcworICAgICAgICAgICAgIGZh 
c3RlciB0aGFuIGhhdmluZyBwY3JlX2V4ZWMgY2hlY2sgdGhlbS4gICovCisgICAgICAgICAg d2hpbGUgKG1iY2xlbl9jYWNoZVt0b191Y2hhciAoKnApXSA9PSAoc2l6ZV90KSAtMSkKKyAg ICAgICAgICAgIHsKKyAgICAgICAgICAgICAgcCsrOworICAgICAgICAgICAgICBib2wgPSBm YWxzZTsKKyAgICAgICAgICAgIH0KKworICAgICAgICAgIC8qIENoZWNrIGZvciBhbiBlbXB0 eSBtYXRjaDsgdGhpcyBpcyBmYXN0ZXIgdGhhbiBsZXR0aW5nCisgICAgICAgICAgICAgcGNy ZV9leGVjIGRvIGl0LiAgKi8KKyAgICAgICAgICBpbnQgc2VhcmNoX2J5dGVzID0gbGluZV9l bmQgLSBwOworICAgICAgICAgIGlmIChzZWFyY2hfYnl0ZXMgPT0gMCkKKyAgICAgICAgICAg IHsKKyAgICAgICAgICAgICAgc3ViWzBdID0gc3ViWzFdID0gMDsKKyAgICAgICAgICAgICAg ZSA9IGVtcHR5X21hdGNoW2JvbF07CisgICAgICAgICAgICAgIGJyZWFrOworICAgICAgICAg ICAgfQorCisgICAgICAgICAgaW50IG9wdGlvbnMgPSAwOworICAgICAgICAgIGlmICghYm9s KQorICAgICAgICAgICAgb3B0aW9ucyB8PSBQQ1JFX05PVEJPTDsKKyAgICAgICAgICBpZiAo bXVsdGlsaW5lIHx8IHAgKyBzZWFyY2hfYnl0ZXMgPD0gdmFsaWRhdGVkKQorICAgICAgICAg ICAgb3B0aW9ucyB8PSBQQ1JFX05PX1VURjhfQ0hFQ0s7CisKKyAgICAgICAgICBpbnQgdmFs aWRfYnl0ZXMgPSB2YWxpZGF0ZWQgLSBwOworICAgICAgICAgIGlmICh2YWxpZF9ieXRlcyA8 IDApCisgICAgICAgICAgICB7CisgICAgICAgICAgICAgIGUgPSBwY3JlX2V4ZWMgKGNyZSwg ZXh0cmEsIHAsIHNlYXJjaF9ieXRlcywgMCwKKyAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgb3B0aW9ucywgc3ViLCBOU1VCKTsKKyAgICAgICAgICAgICAgaWYgKGUgIT0gUENSRV9F UlJPUl9CQURVVEY4KQorICAgICAgICAgICAgICAgIHsKKyAgICAgICAgICAgICAgICAgIHZh bGlkYXRlZCA9IHAgKyBzZWFyY2hfYnl0ZXM7CisgICAgICAgICAgICAgICAgICBpZiAoMCA8 IGUgJiYgbXVsdGlsaW5lICYmIHN1YlsxXSAtIHN1YlswXSAhPSAwKQorICAgICAgICAgICAg ICAgICAgICB7CisgICAgICAgICAgICAgICAgICAgICAgY2hhciBjb25zdCAqbmwgPSBtZW1j aHIgKHAgKyBzdWJbMF0sIGVvbGJ5dGUsCisgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgIHN1YlsxXSAtIHN1YlswXSk7CisgICAgICAgICAgICAgICAg ICAgICAgaWYgKG5sKQorICAgICAgICAgICAgICAgICAgICAgICAgeworICAgICAgICAgICAg ICAgICAgICAgICAgICAvKiBUaGlzIG1hdGNoIGNyb3NzZXMgYSBsaW5lIGJvdW5kYXJ5OyBy ZWplY3QgaXQuICAqLworICAgICAgICAgICAgICAgICAgICAgICAgICBwICs9IHN1YlswXTsK KyAgICAgICAgICAgICAgICAgICAgICAgICAgbGluZV9lbmQgPSBubDsKKyAgICAgICAgICAg 
ICAgICAgICAgICAgICAgY29udGludWU7CisgICAgICAgICAgICAgICAgICAgICAgICB9Cisg ICAgICAgICAgICAgICAgICAgIH0KKyAgICAgICAgICAgICAgICAgIGJyZWFrOworICAgICAg ICAgICAgICAgIH0KKyAgICAgICAgICAgICAgdmFsaWRfYnl0ZXMgPSBzdWJbMF07CisgICAg ICAgICAgICAgIHZhbGlkYXRlZCA9IHAgKyB2YWxpZF9ieXRlczsKKyAgICAgICAgICAgIH0K KworICAgICAgICAgIC8qIFRyeSB0byBtYXRjaCB0aGUgc3RyaW5nIGJlZm9yZSB0aGUgZW5j b2RpbmcgZXJyb3IuCisgICAgICAgICAgICAgQWdhaW4sIGhhbmRsZSB0aGUgZW1wdHktbWF0 Y2ggY2FzZSBzcGVjaWFsbHksIGZvciBzcGVlZC4gICovCiAgICAgICAgICAgaWYgKHZhbGlk X2J5dGVzID09IDApCiAgICAgICAgICAgICB7CiAgICAgICAgICAgICAgIHN1YlsxXSA9IDA7 CkBAIC0xODksNiArMjUyLDggQEAgUGV4ZWN1dGUgKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90 IHNpemUsIHNpemVfdCAqbWF0Y2hfc2l6ZSwKICAgICAgICAgICAgICAgICAgICAgICAgICAg IHN1YiwgTlNVQik7CiAgICAgICAgICAgaWYgKGUgIT0gUENSRV9FUlJPUl9OT01BVENIKQog ICAgICAgICAgICAgYnJlYWs7CisKKyAgICAgICAgICAvKiBUcmVhdCB0aGUgZW5jb2Rpbmcg ZXJyb3IgYXMgZGF0YSB0aGF0IGNhbm5vdCBtYXRjaC4gICovCiAgICAgICAgICAgcCArPSB2 YWxpZF9ieXRlcyArIDE7CiAgICAgICAgICAgYm9sID0gZmFsc2U7CiAgICAgICAgIH0KQEAg LTE5OCw2ICsyNjMsOCBAQCBQZXhlY3V0ZSAoY2hhciBjb25zdCAqYnVmLCBzaXplX3Qgc2l6 ZSwgc2l6ZV90ICptYXRjaF9zaXplLAogICAgICAgYm9sID0gdHJ1ZTsKICAgICB9CiAKKyAg dmFsaWRhdGVkX2JvdW5kYXJ5ID0gdmFsaWRhdGVkOworCiAgIGlmIChlIDw9IDApCiAgICAg ewogICAgICAgc3dpdGNoIChlKQpAQCAtMjI0LDggKzI5MSwyOSBAQCBQZXhlY3V0ZSAoY2hh ciBjb25zdCAqYnVmLCBzaXplX3Qgc2l6ZSwgc2l6ZV90ICptYXRjaF9zaXplLAogICAgIH0K ICAgZWxzZQogICAgIHsKLSAgICAgIGNoYXIgY29uc3QgKmJlZyA9IHN0YXJ0X3B0ciA/IHAg KyBzdWJbMF0gOiBsaW5lX3N0YXJ0OwotICAgICAgY2hhciBjb25zdCAqZW5kID0gc3RhcnRf cHRyID8gcCArIHN1YlsxXSA6IGxpbmVfZW5kICsgMTsKKyAgICAgIGNoYXIgY29uc3QgKm1h dGNoYmVnID0gcCArIHN1YlswXTsKKyAgICAgIGNoYXIgY29uc3QgKm1hdGNoZW5kID0gcCAr IHN1YlsxXTsKKyAgICAgIGNoYXIgY29uc3QgKmJlZzsKKyAgICAgIGNoYXIgY29uc3QgKmVu ZDsKKyAgICAgIGlmIChzdGFydF9wdHIpCisgICAgICAgIHsKKyAgICAgICAgICBiZWcgPSBt YXRjaGJlZzsKKyAgICAgICAgICBlbmQgPSBtYXRjaGVuZDsKKyAgICAgICAgfQorICAgICAg ZWxzZSBpZiAobXVsdGlsaW5lKQorICAgICAgICB7CisgICAgICAgICAgY2hhciBjb25zdCAq 
cHJldl9ubCA9IG1lbXJjaHIgKGxpbmVfc3RhcnQgLSAxLCBlb2xieXRlLAorICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBtYXRjaGJlZyAtIChsaW5lX3N0YXJ0 IC0gMSkpOworICAgICAgICAgIGNoYXIgY29uc3QgKm5leHRfbmwgPSBtZW1jaHIgKG1hdGNo ZW5kLCBlb2xieXRlLAorICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg IGxpbmVfZW5kICsgMSAtIG1hdGNoZW5kKTsKKyAgICAgICAgICBiZWcgPSBwcmV2X25sICsg MTsKKyAgICAgICAgICBlbmQgPSBuZXh0X25sICsgMTsKKyAgICAgICAgfQorICAgICAgZWxz ZQorICAgICAgICB7CisgICAgICAgICAgYmVnID0gbGluZV9zdGFydDsKKyAgICAgICAgICBl bmQgPSBsaW5lX2VuZCArIDE7CisgICAgICAgIH0KICAgICAgICptYXRjaF9zaXplID0gZW5k IC0gYmVnOwogICAgICAgcmV0dXJuIGJlZyAtIGJ1ZjsKICAgICB9Ci0tIAoxLjkuMwoK --------------090004040002010007040709 Content-Type: text/plain; charset=UTF-8; name="0006-grep-skip-past-holes-efficiently.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="0006-grep-skip-past-holes-efficiently.patch" RnJvbSAxZjFhNzRlMTQyMzk1NGE2ODdiZmY0Yjc5NDUxMDE0NDVkZDU3MDYxIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBUdWUsIDE2IFNlcCAyMDE0IDE4OjE5OjUzIC0wNzAwClN1YmplY3Q6IFtQQVRD SCA2LzZdIGdyZXA6IHNraXAgcGFzdCBob2xlcyBlZmZpY2llbnRseQoKVGFrZSBhZHZhbnRh Z2Ugb2YgdGhlIHJlbGF4ZWQgcnVsZXMgZm9yIHRyZWF0aW5nIG5vbi10ZXh0IGJ5dGVzIGlu CmJpbmFyeSBkYXRhLCBieSBlZmZpY2llbnRseSBza2lwcGluZyBwYXN0IGhvbGVzIG9uIHBs YXRmb3JtcwpzdXBwb3J0aW5nIGxzZWVrJ3MgU0VFS19EQVRBIGZsYWcuCk9uIG9uZSB0ZXN0 IG9uIGEgY2lyY2EtMjAwOCBTdW4gRmlyZSBWNDB6IHJ1bm5pbmcgU29sYXJpcyAxMS4yLAon Z3JlcCB4JyB0b29rIDAuMDA5IHJlYWwtdGltZSBzZWNvbmRzIHRvIHNjYW4gYSBob2xleSBm aWxlIG9mIHNpemUKOSwyMjMsMzcyLDAzNiw4NTQsNzc1LDgwMiBieXRlcywgZm9yIGEgbm9t aW5hbCBzY2FuIHJhdGUgb2YgMSBaQi9zLgpncmVwIDIuMjAncyBzY2FuIHJhdGUgb24gdGhp cyBwbGF0Zm9ybSB3YXMgODQzIE1CL3MsIHNvIHRoaXMgaXMgYQpzcGVlZHVwIGJ5IGEgZmFj dG9yIG9mIDEuMiB0cmlsbGlvbi4gIFRoZSBzcGVlZHVwIGZhY3RvciBpcyBub3QKYXMgZ3Jl YXQgb24gR05VL0xpbnV4IGhvc3RzLCBkdWUgdG8gd2hhdCBhcHBlYXIgdG8gYmUgU0VFS19E 
QVRBCmluZWZmaWNpZW5jaWVzLCBidXQgcHJlc3VtYWJseSB0aGlzIHdpbGwgYmUgY2xlYXJl ZCB1cCBpbiB0aW1lLgoqIE5FV1M6IERvY3VtZW50IHRoaXMuCiogc3JjL2dyZXAuYywgc3Jj L2dyZXAuaCAoZW9sYnl0ZSk6IE5vdyBjaGFyLCBub3QgdW5zaWduZWQgY2hhci4KVGhpcyBp cyBmb3IgY29tcGF0aWJpbGl0eSB3aXRoIHRoZSByZXN0IG9mIHRoZSBjb2RlLgpUaGUgb2xk IChwZXJmb3JtYW5jZT8pIHJlYXNvbnMgZm9yICd1bnNpZ25lZCBjaGFyJyBhcmUgbm90IG1v b3QuCiogc3JjL2dyZXAuYyAoc2tpcF9udWxzLCBza2lwX2VtcHR5X2xpbmVzLCBzZWVrX2Rh dGFfZmFpbGVkKToKTmV3IHN0YXRpYyB2YXJzLgoodG90YWxubCk6IE1vdmUgdXAsIHNpbmNl IGl0J3MgYWJvdXQgaW5wdXQsIG5vdCBvdXRwdXQsIGFuZApmaWxsYnVmIG5vdyB1c2VzIGl0 LgooYWRkX2NvdW50KTogTW92ZSB1cCwgc2luY2UgZmlsbGJ1ZiBub3cgdXNlcyBpdC4KKGFs bF96ZXJvcyk6IE5ldyBmdW5jdGlvbi4KKGZpbGxidWYpOiBVc2UgU0VFS19EQVRBIHRvIHNr aXAgcGFzdCBob2xlcyBlZmZpY2llbnRseSwKb24gc3lzdGVtcyB0aGF0IHN1cHBvcnQgdGhp cy4KKGdyZXAsIG1haW4pOiBTZXQgdGhlIG5ldyBzdGF0aWMgdmFycy4KLS0tCiBORVdTICAg ICAgIHwgIDMgKysrCiBzcmMvZ3JlcC5jIHwgNzYgKysrKysrKysrKysrKysrKysrKysrKysr KysrKysrKysrKysrKysrKysrKysrKystLS0tLS0tLS0tLS0tLS0KIHNyYy9ncmVwLmggfCAg MiArLQogMyBmaWxlcyBjaGFuZ2VkLCA2MiBpbnNlcnRpb25zKCspLCAxOSBkZWxldGlvbnMo LSkKCmRpZmYgLS1naXQgYS9ORVdTIGIvTkVXUwppbmRleCA3MzMzMThkLi40ZTAxOTVjIDEw MDY0NAotLS0gYS9ORVdTCisrKyBiL05FV1MKQEAgLTQsNiArNCw5IEBAIEdOVSBncmVwIE5F V1MgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAtKi0gb3V0bGluZSAtKi0K IAogKiogSW1wcm92ZW1lbnRzCiAKKyAgUGVyZm9ybWFuY2UgaGFzIGJlZW4gZ3JlYXRseSBp bXByb3ZlZCBmb3Igc2VhcmNoaW5nIGZpbGVzIGNvbnRhaW5pbmcKKyAgaG9sZXMsIG9uIHBs YXRmb3JtcyB3aGVyZSBsc2VlaydzIFNFRUtfSE9MRSBmbGFnIHdvcmtzIGVmZmljaWVudGx5 LgorCiAgIFBlcmZvcm1hbmNlIGhhcyBpbXByb3ZlZCBmb3IgdmVyeSBsb25nIHN0cmluZ3Mg aW4gcGF0dGVybnMuCiAKICAgSWYgYSBmaWxlIGNvbnRhaW5zIGRhdGEgaW1wcm9wZXJseSBl bmNvZGVkIGZvciB0aGUgY3VycmVudCBsb2NhbGUsCmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5j IGIvc3JjL2dyZXAuYwppbmRleCBlM2M0OTI1Li4zZTk0ODA0IDEwMDY0NAotLS0gYS9zcmMv Z3JlcC5jCisrKyBiL3NyYy9ncmVwLmMKQEAgLTM1MCw3ICszNTAsNyBAQCBzdGF0aWMgc3Ry dWN0IG9wdGlvbiBjb25zdCBsb25nX29wdGlvbnNbXSA9CiBib29sIG1hdGNoX2ljYXNlOwog 
Ym9vbCBtYXRjaF93b3JkczsKIGJvb2wgbWF0Y2hfbGluZXM7Ci11bnNpZ25lZCBjaGFyIGVv bGJ5dGU7CitjaGFyIGVvbGJ5dGU7CiBlbnVtIHRleHRiaW4gaW5wdXRfdGV4dGJpbjsKIGNo YXIgY29uc3QgKnZhbGlkYXRlZF9ib3VuZGFyeTsKIApAQCAtNTYzLDYgKzU2MywxMCBAQCBz dGF0aWMgb2ZmX3QgYnVmb2Zmc2V0OwkJLyogUmVhZCBvZmZzZXQ7IGRlZmluZWQgb24gcmVn dWxhciBmaWxlcy4gICovCiBzdGF0aWMgb2ZmX3QgYWZ0ZXJfbGFzdF9tYXRjaDsJLyogUG9p bnRlciBhZnRlciBsYXN0IG1hdGNoaW5nIGxpbmUgdGhhdAogICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICB3b3VsZCBoYXZlIGJlZW4gb3V0cHV0IGlmIHdlIHdlcmUKICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgb3V0cHV0dGluZyBjaGFyYWN0ZXJz LiAqLworc3RhdGljIGJvb2wgc2tpcF9udWxzOwkJLyogU2tpcCAnXDAnIGluIGRhdGEuICAq Lworc3RhdGljIGJvb2wgc2tpcF9lbXB0eV9saW5lczsJLyogU2tpcCBlbXB0eSBsaW5lcyBp biBkYXRhLiAgKi8KK3N0YXRpYyBib29sIHNlZWtfZGF0YV9mYWlsZWQ7CS8qIGxzZWVrIHdp dGggU0VFS19EQVRBIGZhaWxlZC4gICovCitzdGF0aWMgdWludG1heF90IHRvdGFsbmw7CS8q IFRvdGFsIG5ld2xpbmUgY291bnQgYmVmb3JlIGxhc3RubC4gKi8KIAogLyogUmV0dXJuIFZB TCBhbGlnbmVkIHRvIHRoZSBuZXh0IG11bHRpcGxlIG9mIEFMSUdOTUVOVC4gIFZBTCBjYW4g YmUKICAgIGFuIGludGVnZXIgb3IgYSBwb2ludGVyLiAgQm90aCBhcmdzIG11c3QgYmUgZnJl ZSBvZiBzaWRlIGVmZmVjdHMuICAqLwpAQCAtNTcxLDYgKzU3NSwyNyBAQCBzdGF0aWMgb2Zm X3QgYWZ0ZXJfbGFzdF9tYXRjaDsJLyogUG9pbnRlciBhZnRlciBsYXN0IG1hdGNoaW5nIGxp bmUgdGhhdAogICAgPyAodmFsKSBcCiAgICA6ICh2YWwpICsgKChhbGlnbm1lbnQpIC0gKHNp emVfdCkgKHZhbCkgJSAoYWxpZ25tZW50KSkpCiAKKy8qIEFkZCB0d28gbnVtYmVycyB0aGF0 IGNvdW50IGlucHV0IGJ5dGVzIG9yIGxpbmVzLCBhbmQgcmVwb3J0IGFuCisgICBlcnJvciBp ZiB0aGUgYWRkaXRpb24gb3ZlcmZsb3dzLiAgKi8KK3N0YXRpYyB1aW50bWF4X3QKK2FkZF9j b3VudCAodWludG1heF90IGEsIHVpbnRtYXhfdCBiKQoreworICB1aW50bWF4X3Qgc3VtID0g YSArIGI7CisgIGlmIChzdW0gPCBhKQorICAgIGVycm9yIChFWElUX1RST1VCTEUsIDAsIF8o ImlucHV0IGlzIHRvbyBsYXJnZSB0byBjb3VudCIpKTsKKyAgcmV0dXJuIHN1bTsKK30KKwor LyogUmV0dXJuIHRydWUgaWYgQlVGIChvZiBzaXplIFNJWkUpIGlzIGFsbCB6ZXJvcy4gICov CitzdGF0aWMgYm9vbAorYWxsX3plcm9zIChjaGFyIGNvbnN0ICpidWYsIHNpemVfdCBzaXpl KQoreworICBmb3IgKGNoYXIgY29uc3QgKnAgPSBidWY7IHAgPCBidWYgKyBzaXplOyBwKysp 
CisgICAgaWYgKCpwKQorICAgICAgcmV0dXJuIGZhbHNlOworICByZXR1cm4gdHJ1ZTsKK30K KwogLyogUmVzZXQgdGhlIGJ1ZmZlciBmb3IgYSBuZXcgZmlsZSwgcmV0dXJuaW5nIGZhbHNl IGlmIHdlIHNob3VsZCBza2lwIGl0LgogICAgSW5pdGlhbGl6ZSBvbiB0aGUgZmlyc3QgdGlt ZSB0aHJvdWdoLiAqLwogc3RhdGljIGJvb2wKQEAgLTY3NCwxMyArNjk5LDMzIEBAIGZpbGxi dWYgKHNpemVfdCBzYXZlLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAgIHJlYWRzaXplID0g YnVmZmVyICsgYnVmYWxsb2MgLSByZWFkYnVmOwogICByZWFkc2l6ZSAtPSByZWFkc2l6ZSAl IHBhZ2VzaXplOwogCi0gIGZpbGxzaXplID0gc2FmZV9yZWFkIChidWZkZXNjLCByZWFkYnVm LCByZWFkc2l6ZSk7Ci0gIGlmIChmaWxsc2l6ZSA9PSBTQUZFX1JFQURfRVJST1IpCisgIHdo aWxlICh0cnVlKQogICAgIHsKLSAgICAgIGZpbGxzaXplID0gMDsKLSAgICAgIGNjID0gZmFs c2U7CisgICAgICBmaWxsc2l6ZSA9IHNhZmVfcmVhZCAoYnVmZGVzYywgcmVhZGJ1ZiwgcmVh ZHNpemUpOworICAgICAgaWYgKGZpbGxzaXplID09IFNBRkVfUkVBRF9FUlJPUikKKyAgICAg ICAgeworICAgICAgICAgIGZpbGxzaXplID0gMDsKKyAgICAgICAgICBjYyA9IGZhbHNlOwor ICAgICAgICB9CisgICAgICBidWZvZmZzZXQgKz0gZmlsbHNpemU7CisKKyAgICAgIGlmIChm aWxsc2l6ZSA9PSAwIHx8ICFza2lwX251bHMgfHwgIWFsbF96ZXJvcyAocmVhZGJ1ZiwgZmls bHNpemUpKQorICAgICAgICBicmVhazsKKyAgICAgIHRvdGFsbmwgPSBhZGRfY291bnQgKHRv dGFsbmwsIGZpbGxzaXplKTsKKworICAgICAgaWYgKCFzZWVrX2RhdGFfZmFpbGVkKQorICAg ICAgICB7CisgICAgICAgICAgb2ZmX3QgZGF0YV9zdGFydCA9IGxzZWVrIChidWZkZXNjLCBi dWZvZmZzZXQsIFNFRUtfREFUQSk7CisgICAgICAgICAgaWYgKGRhdGFfc3RhcnQgPCAwKQor ICAgICAgICAgICAgc2Vla19kYXRhX2ZhaWxlZCA9IHRydWU7CisgICAgICAgICAgZWxzZQor ICAgICAgICAgICAgeworICAgICAgICAgICAgICB0b3RhbG5sID0gYWRkX2NvdW50ICh0b3Rh bG5sLCBkYXRhX3N0YXJ0IC0gYnVmb2Zmc2V0KTsKKyAgICAgICAgICAgICAgYnVmb2Zmc2V0 ID0gZGF0YV9zdGFydDsKKyAgICAgICAgICAgIH0KKyAgICAgICAgfQogICAgIH0KLSAgYnVm b2Zmc2V0ICs9IGZpbGxzaXplOworCiAgIGZpbGxzaXplID0gdW5kb3NzaWZ5X2lucHV0IChy ZWFkYnVmLCBmaWxsc2l6ZSk7CiAgIGJ1ZmxpbSA9IHJlYWRidWYgKyBmaWxsc2l6ZTsKICAg cmV0dXJuIGNjOwpAQCAtNzE3LDcgKzc2Miw2IEBAIHN0YXRpYyBjaGFyIGNvbnN0ICpsYXN0 bmw7CS8qIFBvaW50ZXIgYWZ0ZXIgbGFzdCBuZXdsaW5lIGNvdW50ZWQuICovCiBzdGF0aWMg Y2hhciBjb25zdCAqbGFzdG91dDsJLyogUG9pbnRlciBhZnRlciBsYXN0IGNoYXJhY3RlciBv 
dXRwdXQ7CiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIE5VTEwgaWYgbm8g Y2hhcmFjdGVyIGhhcyBiZWVuIG91dHB1dAogICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICBvciBpZiBpdCdzIGNvbmNlcHR1YWxseSBiZWZvcmUgYnVmYmVnLiAqLwotc3Rh dGljIHVpbnRtYXhfdCB0b3RhbG5sOwkvKiBUb3RhbCBuZXdsaW5lIGNvdW50IGJlZm9yZSBs YXN0bmwuICovCiBzdGF0aWMgaW50bWF4X3Qgb3V0bGVmdDsJLyogTWF4aW11bSBudW1iZXIg b2YgbGluZXMgdG8gYmUgb3V0cHV0LiAgKi8KIHN0YXRpYyBpbnRtYXhfdCBwZW5kaW5nOwkv KiBQZW5kaW5nIGxpbmVzIG9mIG91dHB1dC4KICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgQWx3YXlzIGtlcHQgMCBpZiBvdXRfcXVpZXQgaXMgdHJ1ZS4gICovCkBAIC03 MjYsMTcgKzc3MCw2IEBAIHN0YXRpYyBib29sIGV4aXRfb25fbWF0Y2g7CS8qIEV4aXQgb24g Zmlyc3QgbWF0Y2guICAqLwogCiAjaW5jbHVkZSAiZG9zYnVmLmMiCiAKLS8qIEFkZCB0d28g bnVtYmVycyB0aGF0IGNvdW50IGlucHV0IGJ5dGVzIG9yIGxpbmVzLCBhbmQgcmVwb3J0IGFu Ci0gICBlcnJvciBpZiB0aGUgYWRkaXRpb24gb3ZlcmZsb3dzLiAgKi8KLXN0YXRpYyB1aW50 bWF4X3QKLWFkZF9jb3VudCAodWludG1heF90IGEsIHVpbnRtYXhfdCBiKQotewotICB1aW50 bWF4X3Qgc3VtID0gYSArIGI7Ci0gIGlmIChzdW0gPCBhKQotICAgIGVycm9yIChFWElUX1RS T1VCTEUsIDAsIF8oImlucHV0IGlzIHRvbyBsYXJnZSB0byBjb3VudCIpKTsKLSAgcmV0dXJu IHN1bTsKLX0KLQogc3RhdGljIHZvaWQKIG5sc2NhbiAoY2hhciBjb25zdCAqbGltKQogewpA QCAtMTE3MSw2ICsxMjA0LDggQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAq c3QpCiAgIG91dGxlZnQgPSBtYXhfY291bnQ7CiAgIGFmdGVyX2xhc3RfbWF0Y2ggPSAwOwog ICBwZW5kaW5nID0gMDsKKyAgc2tpcF9udWxzID0gc2tpcF9lbXB0eV9saW5lcyAmJiAhZW9s OworICBzZWVrX2RhdGFfZmFpbGVkID0gZmFsc2U7CiAKICAgbmxpbmVzID0gMDsKICAgcmVz aWR1ZSA9IDA7CkBAIC0xMTkzLDYgKzEyMjgsNyBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBz dGF0IGNvbnN0ICpzdCkKICAgICAgICAgICAgIHJldHVybiAwOwogICAgICAgICAgIGRvbmVf b25fbWF0Y2ggPSBvdXRfcXVpZXQgPSB0cnVlOwogICAgICAgICAgIG51bF96YXBwZXIgPSBl b2w7CisgICAgICAgICAgc2tpcF9udWxzID0gc2tpcF9lbXB0eV9saW5lczsKICAgICAgICAg fQogICAgIH0KIApAQCAtMTI4MSw2ICsxMzE3LDcgQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qg c3RhdCBjb25zdCAqc3QpCiAgICAgICAgICAgICAgIHRleHRiaW4gPSB0YjsKICAgICAgICAg ICAgICAgZG9uZV9vbl9tYXRjaCA9IG91dF9xdWlldCA9IHRydWU7CiAgICAgICAgICAgICAg 
IG51bF96YXBwZXIgPSBlb2w7CisgICAgICAgICAgICAgIHNraXBfbnVscyA9IHNraXBfZW1w dHlfbGluZXM7CiAgICAgICAgICAgICB9CiAgICAgICAgIH0KICAgICB9CkBAIC0yMzkwLDYg KzI0MjcsOSBAQCBtYWluIChpbnQgYXJnYywgY2hhciAqKmFyZ3YpCiAKICAgY29tcGlsZSAo a2V5cywga2V5Y2MpOwogICBmcmVlIChrZXlzKTsKKyAgc2l6ZV90IG1hdGNoX3NpemU7Cisg IHNraXBfZW1wdHlfbGluZXMgPSAoKGV4ZWN1dGUgKCZlb2xieXRlLCAxLCAmbWF0Y2hfc2l6 ZSwgTlVMTCkgPT0gMCkKKyAgICAgICAgICAgICAgICAgICAgICA9PSBvdXRfaW52ZXJ0KTsK IAogICBpZiAoKGFyZ2MgLSBvcHRpbmQgPiAxICYmICFub19maWxlbmFtZXMpIHx8IHdpdGhf ZmlsZW5hbWVzKQogICAgIG91dF9maWxlID0gMTsKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmgg Yi9zcmMvZ3JlcC5oCmluZGV4IDIzZDRlOTUuLjg2MjU5ZmIgMTAwNjQ0Ci0tLSBhL3NyYy9n cmVwLmgKKysrIGIvc3JjL2dyZXAuaApAQCAtMjcsNyArMjcsNyBAQAogZXh0ZXJuIGJvb2wg bWF0Y2hfaWNhc2U7CS8qIC1pICovCiBleHRlcm4gYm9vbCBtYXRjaF93b3JkczsJLyogLXcg Ki8KIGV4dGVybiBib29sIG1hdGNoX2xpbmVzOwkvKiAteCAqLwotZXh0ZXJuIHVuc2lnbmVk IGNoYXIgZW9sYnl0ZTsJLyogLXogKi8KK2V4dGVybiBjaGFyIGVvbGJ5dGU7CQkvKiAteiAq LwogCiAvKiBBbiBlbnVtIHRleHRiaW4gZGVzY3JpYmVzIHRoZSBmaWxlJ3MgdHlwZSwgaW5m ZXJyZWQgZnJvbSBkYXRhIHJlYWQKICAgIGJlZm9yZSB0aGUgZmlyc3QgbGluZSBpcyBzZWxl Y3RlZCBmb3Igb3V0cHV0LiAgKi8KLS0gCjEuOS4zCgo= --------------090004040002010007040709-- From debbugs-submit-bounces@debbugs.gnu.org Tue Sep 16 21:44:57 2014 Received: (at control) by debbugs.gnu.org; 17 Sep 2014 01:44:57 +0000 Received: from localhost ([127.0.0.1]:42811 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU4Ii-0001pG-Ay for submit@debbugs.gnu.org; Tue, 16 Sep 2014 21:44:56 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:58037) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU4If-0001p8-Qz for control@debbugs.gnu.org; Tue, 16 Sep 2014 21:44:54 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 5DFE239E8014 for ; Tue, 16 Sep 2014 18:44:53 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost 
(smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id loqLtkGaDwBy for ; Tue, 16 Sep 2014 18:44:44 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id BCEA8A60005 for ; Tue, 16 Sep 2014 18:44:44 -0700 (PDT) Message-ID: <5418E78C.2000808@cs.ucla.edu> Date: Tue, 16 Sep 2014 18:44:44 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 To: control@debbugs.gnu.org Subject: 18454 has patches now Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -3.0 (---) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) tags 18454 + patch From debbugs-submit-bounces@debbugs.gnu.org Tue Sep 16 21:48:40 2014 Received: (at control) by debbugs.gnu.org; 17 Sep 2014 01:48:40 +0000 Received: from localhost ([127.0.0.1]:42817 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU4MK-0001vh-Bu for submit@debbugs.gnu.org; Tue, 16 Sep 2014 21:48:40 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:58137) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU4MH-0001vZ-Tj for control@debbugs.gnu.org; Tue, 16 Sep 2014 21:48:38 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 1DEB5A60005 for ; Tue, 16 Sep 2014 18:48:37 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id erksaaDWmIda for ; Tue, 16 Sep 2014 18:48:28 -0700 
(PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 7BBD539E8014 for ; Tue, 16 Sep 2014 18:48:28 -0700 (PDT) Message-ID: <5418E86C.9030203@cs.ucla.edu> Date: Tue, 16 Sep 2014 18:48:28 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 To: control@debbugs.gnu.org Subject: 18454 performance fix is more than wishlist Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -3.0 (---) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) severity 18454 normal From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 17 Sep 2014 04:59:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: Vincent Lefevre , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141092989930261 (code B ref 18454); Wed, 17 Sep 2014 04:59:02 +0000 Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 04:58:19 +0000 Received: from localhost ([127.0.0.1]:42882 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU7Jq-0007s1-Fy for submit@debbugs.gnu.org; Wed, 17 Sep 2014 00:58:18 -0400 Received: from mail-wg0-f44.google.com ([74.125.82.44]:46490) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU7Jm-0007rq-KM for 
18454@debbugs.gnu.org; Wed, 17 Sep 2014 00:58:15 -0400 Received: by mail-wg0-f44.google.com with SMTP id y10so781829wgg.27 for <18454@debbugs.gnu.org>; Tue, 16 Sep 2014 21:58:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=GV6E0EpkpTuCf2wPRFIEFWLkKDY70w9p7nLGCyKmYVM=; b=oPszO+gLZMllHQoJEPPwFRJdaAbA6tEOyMfQv7K+zfBA7hTQqyGRgoM44v6Ez/WEZN QCbc2w9RLpTbkr6FHEE3o8kxyRpAnfb4s9Q0ngzeBb1E04jzLPfeX8Na/0o7SA5fFoxu mnIts1AyypOJKtGhvhfzya9xq1lz1rpdBPavXpNVcSZl/vycfTThfMsDqj5snuCTpfYi 2N4DlwRXQfZnE7neZEeunvi1ZCJ151MyjyVEkU527XbF9AwYjtqzm5EFV1m3F2/Zq3Tb Zb8GNHA8MdXeD5HvPDJct5XCy6w0XNTsqynbfxGdkGvzeFdp2Z4suaDW+QM+GXq4CYAO ngiw== X-Received: by 10.194.108.73 with SMTP id hi9mr21906578wjb.88.1410929893327; Tue, 16 Sep 2014 21:58:13 -0700 (PDT) MIME-Version: 1.0 Received: by 10.194.41.202 with HTTP; Tue, 16 Sep 2014 21:57:53 -0700 (PDT) In-Reply-To: <5418E73E.2050002@cs.ucla.edu> References: <20140912012449.GB18162@xvii.vinc17.org> <5418E73E.2050002@cs.ucla.edu> From: Jim Meyering Date: Tue, 16 Sep 2014 21:57:53 -0700 X-Google-Sender-Auth: 6GSyT1-p4nX5kjMZxDil4qE7M8A Message-ID: Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Tue, Sep 16, 2014 at 6:43 PM, Paul Eggert wrote: > I worked on this some more, and came up with the attached patches proposed > against the current grep Savannah master (commit > 9ea9254ea58456b84ed2f0c1481ca91cdd325bf7). > > For years I've been wanting to write that last patch and I finally got > around to it. It improves grep -P's performance by a factor of 1.2 trillion > on one (admittedly artificial) benchmark. 
I hope its 1 ZB/s scan rate is > some kind of record. The last patch probably won't help your test cases, > though I hope the other patches do help somewhat. Awesome :-) I found time to look through all but the 5th. Slightly surprised that 4/6 makes a measurable performance difference (didn't check), but moving away from file-scoped is an improvement in any case. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 17 Sep 2014 05:19:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Jim Meyering Cc: Vincent Lefevre , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141093111932513 (code B ref 18454); Wed, 17 Sep 2014 05:19:01 +0000 Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 05:18:39 +0000 Received: from localhost ([127.0.0.1]:42891 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU7dW-0008SJ-8y for submit@debbugs.gnu.org; Wed, 17 Sep 2014 01:18:38 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:35997) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XU7dT-0008S9-FA for 18454@debbugs.gnu.org; Wed, 17 Sep 2014 01:18:36 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 9C23739E8015; Tue, 16 Sep 2014 22:18:34 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id afEVcAYIBKKK; Tue, 16 Sep 2014 22:18:29 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id C403B39E8012; Tue, 16 Sep 2014 
22:18:29 -0700 (PDT) Message-ID: <541919A5.5030604@cs.ucla.edu> Date: Tue, 16 Sep 2014 22:18:29 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> <5418E73E.2050002@cs.ucla.edu> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) Jim Meyering wrote: > Slightly surprised that 4/6 makes a measurable performance > difference (didn't check), It's not measurable. I should have written it up as a cleanup more than as a speedup. I did like breaking the nominal ZB/s barrier, though, in patch 6/6. That's waaaay more than the total throughput of the Internet. It tops even the hypothetical throughput if one recruited the entire US freight industry to move nothing but MicroSD cards, which xkcd estimates would get only 0.06 ZB/s or so. 
http://what-if.xkcd.com/31/ From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 17 Sep 2014 14:19:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: Vincent Lefevre , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141096348225105 (code B ref 18454); Wed, 17 Sep 2014 14:19:02 +0000 Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 14:18:02 +0000 Received: from localhost ([127.0.0.1]:43537 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUG3V-0006Wr-UG for submit@debbugs.gnu.org; Wed, 17 Sep 2014 10:18:02 -0400 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:50240) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUG3R-0006WP-VI for 18454@debbugs.gnu.org; Wed, 17 Sep 2014 10:17:59 -0400 Received: from imp01 (mailgw5.kcn.ne.jp [61.86.15.231]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 82E1CE80019 for <18454@debbugs.gnu.org>; Wed, 17 Sep 2014 23:17:55 +0900 (JST) Received: from mail09.kcn.ne.jp ([61.86.6.188]) by imp01 with bizsmtp id sEHv1o00B43QJrh01EHvBg; Wed, 17 Sep 2014 23:17:55 +0900 X-OrgRCPT: 18454@debbugs.gnu.org Received: from [10.120.1.54] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail09.kcn.ne.jp (Postfix) with ESMTPA id B4CFF1BD00C6; Wed, 17 Sep 2014 23:17:54 +0900 (JST) Date: Wed, 17 Sep 2014 23:17:50 +0900 From: Norihiro Tanaka In-Reply-To: <5418E73E.2050002@cs.ucla.edu> References: <20140912012449.GB18162@xvii.vinc17.org> <5418E73E.2050002@cs.ucla.edu> Message-Id: <20140917231749.B8AD.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) Thanks for the many improvements. I applied the six patches to grep and tried to compile it, but after the sixth patch, I received a 'SEEK_DATA' undeclared error. I looked for it on CentOS 5.10, but I couldn't find it in the standard header files (glibc 2.5.1) or in the gnulib files. == gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I.. -I../lib -I../lib -g -O2 -MT searchutils.o -MD -MP -MF $depbase.Tpo -c -o searchutils.o searchutils.c && \ mv -f $depbase.Tpo $depbase.Po grep.c: In function 'fillbuf': grep.c:718: error: 'SEEK_DATA' undeclared (first use in this function) grep.c:718: error: (Each undeclared identifier is reported only once grep.c:718: error: for each function it appears in.) Makefile:1309: recipe for target 'grep.o' failed make[2]: *** [grep.o] Error 1 make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/b/grep-2.20/src' Makefile:1238: recipe for target 'all-recursive' failed make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory '/b/grep-2.20' Makefile:1179: recipe for target 'all' failed make: *** [all] Error 2 From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Eric Blake Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 17 Sep 2014 14:26:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Norihiro Tanaka , Paul Eggert Cc: Vincent Lefevre , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141096393325789 (code B ref 18454); Wed, 17 Sep 2014 14:26:02 +0000 Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 14:25:33 +0000 Received: from localhost ([127.0.0.1]:43544 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUGAl-0006hs-UZ for submit@debbugs.gnu.org; Wed, 17 Sep 2014 10:25:32 -0400 Received: from mx1.redhat.com ([209.132.183.28]:13725) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUGAi-0006hi-Qv for 18454@debbugs.gnu.org; Wed, 17 Sep 2014 10:25:29 -0400 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s8HEPPEa009147 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Wed, 17 Sep 2014 10:25:26 -0400 Received: from [10.3.113.56] (ovpn-113-56.phx2.redhat.com [10.3.113.56]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id s8HEPOs7019319; Wed, 17 Sep 2014 10:25:25 -0400 Message-ID: <541999D4.6070802@redhat.com> Date: Wed, 17 Sep 2014 08:25:24 -0600 From: Eric Blake Organization: Red Hat, Inc. 
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> <5418E73E.2050002@cs.ucla.edu> <20140917231749.B8AD.27F6AC2D@kcn.ne.jp> In-Reply-To: <20140917231749.B8AD.27F6AC2D@kcn.ne.jp> OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22 X-Spam-Score: -5.7 (-----) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.7 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable On 09/17/2014 08:17 AM, Norihiro Tanaka wrote: > Thanks for many improvements. I applied six patches to grep and tried = to > compile it, but after the sixth patch, I recevied 'SEEK_DATA' undeclare= d > error. I looked for it on CentOS 5.10, but I couldn't find it in stand= ard > header files (glibc 2.5.1) and gnulib files. >=20 It should be fairly easy for gnulib to fake SEEK_DATA/SEEK_HOLE (by treating all files as non-sparse). I guess we haven't needed to do that before now, because other GNU clients (such as coreutils and tar) of this have been doing conditional compilation. 
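For illustration only, here is a minimal sketch of what "treating all files as non-sparse" could look like. The helper name seek_data_compat is hypothetical; it is not an existing gnulib or grep function, and the real replacement, if one were written, might well differ. Where SEEK_DATA is visible at compile time the sketch simply calls lseek; otherwise it reports every offset before end of file as data and fails with ENXIO at or past end of file, which I believe mirrors what a real SEEK_DATA does on a file without holes. Note that glibc appears to expose SEEK_DATA only when _GNU_SOURCE is defined, so a plain build may take the fallback branch even on a recent system.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (an assumption for this sketch, not gnulib API).
   Return the offset of the first data byte at or after OFFSET in FD,
   leaving the file position there.  Return -1 and set errno on failure.  */
static off_t
seek_data_compat (int fd, off_t offset)
{
#ifdef SEEK_DATA
  /* The platform advertises SEEK_DATA: let the kernel skip holes.  */
  return lseek (fd, offset, SEEK_DATA);
#else
  /* No SEEK_DATA: pretend the file has no holes, so any offset before
     end of file already points at data.  */
  off_t size = lseek (fd, 0, SEEK_END);
  if (size < 0)
    return -1;
  if (size <= offset)
    {
      /* Same error a real SEEK_DATA reports when no data follows.  */
      errno = ENXIO;
      return -1;
    }
  return lseek (fd, offset, SEEK_SET);
#endif
}
```

A caller written against such a shim would compile everywhere, at the cost of never skipping anything on platforms that lack SEEK_DATA; the conditional-compilation approach used by other clients avoids even the extra lseek calls.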
--=20 Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg iQEcBAEBCAAGBQJUGZnUAAoJEKeha0olJ0NqegEH+wSkvcyVkCo2ydhzSfO6C+JT bzzbGHod4eXVibvgc6JKKN7SiuPRa8/xgzcyMUqFOgTo3EMpsq6SW1Ilz+OWBU9P 7dCRSvg7zIR0S21BMvZF+7OGYR2lsRdIcF1dL3Oi+VT/0IIc6Nwi/ksri8eYljDl yoUpJFf8uhxIT463d1txZ8SRGGJbEnYPlI1qTBPmD446ePPzbprFyV2APJxDFOSI 8wGKtkmp5EC96K6zSBzP7c1u0DAjSE3+l3XCRKoeurT9cea5tXEaUrsi3MAvG5+o dJcbE+FmAOCc/hGnPs29nfDhdPkvVnurnq+Axg+nlrFG67dEdyHcqlXcEa5T+7k= =qr+M -----END PGP SIGNATURE----- --2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q-- From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Wed, 17 Sep 2014 19:56:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Norihiro Tanaka Cc: Vincent Lefevre , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141098376125231 (code B ref 18454); Wed, 17 Sep 2014 19:56:02 +0000 Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 19:56:01 +0000 Received: from localhost ([127.0.0.1]:43624 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XULKa-0006Ys-Vw for submit@debbugs.gnu.org; Wed, 17 Sep 2014 15:56:01 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:42594) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XULKX-0006Yg-Ur for 18454@debbugs.gnu.org; Wed, 17 Sep 2014 15:55:59 -0400 Received: from localhost 
(localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id A9D2DA60001; Wed, 17 Sep 2014 12:55:54 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id SZ-Mflp5SCq5; Wed, 17 Sep 2014 12:55:50 -0700 (PDT) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 6076D39E8015; Wed, 17 Sep 2014 12:55:50 -0700 (PDT) Message-ID: <5419E746.7070604@cs.ucla.edu> Date: Wed, 17 Sep 2014 12:55:50 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.0 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> <5418E73E.2050002@cs.ucla.edu> <20140917231749.B8AD.27F6AC2D@kcn.ne.jp> In-Reply-To: <20140917231749.B8AD.27F6AC2D@kcn.ne.jp> Content-Type: multipart/mixed; boundary="------------060303070004010300080405" X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) This is a multi-part message in MIME format. --------------060303070004010300080405 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Thanks for reporting that, I forgot that the code defaulted SEEK_HOLE but not SEEK_DATA. The first attached patch should fix it. The second one should improve performance further on Solaris for files that end in holes. 
--------------060303070004010300080405 Content-Type: text/x-patch; name="0001-grep-port-to-platforms-lacking-SEEK_DATA.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-grep-port-to-platforms-lacking-SEEK_DATA.patch" >From 78497a2aaaaeae439f9546223b45b3b553146f36 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Wed, 17 Sep 2014 12:33:55 -0700 Subject: [PATCH 1/2] grep: port to platforms lacking SEEK_DATA Reported by Norihiro Tanaka in: http://bugs.gnu.org/18454#38 * src/grep.c (SEEK_DATA): Default to SEEK_SET if not defined. (SEEK_HOLE): Move to top level, and default it to SEEK_SET. (file_textbin): Adjust to new default. (fillbuf): Don't bother with SEEK_DATA if it defaults to SEEK_SET. --- src/grep.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/src/grep.c b/src/grep.c index 3e94804..a08fa41 100644 --- a/src/grep.c +++ b/src/grep.c @@ -415,6 +415,15 @@ usable_st_size (struct stat const *st) return S_ISREG (st->st_mode) || S_TYPEISSHM (st) || S_TYPEISTMO (st); } +/* Lame substitutes for SEEK_DATA and SEEK_HOLE on platforms lacking them. + Do not rely on these finding data or holes if they equal SEEK_SET. */ +#ifndef SEEK_DATA +enum { SEEK_DATA = SEEK_SET }; +#endif +#ifndef SEEK_HOLE +enum { SEEK_HOLE = SEEK_SET }; +#endif + /* Functions we'll use to search. */ typedef void (*compile_fp_t) (char const *, size_t); typedef size_t (*execute_fp_t) (char const *, size_t, size_t *, char const *); @@ -474,10 +483,6 @@ buffer_textbin (char const *buf, size_t size) static enum textbin file_textbin (char const *buf, size_t bufsize, int fd, struct stat const *st) { - #ifndef SEEK_HOLE - enum { SEEK_HOLE = SEEK_END }; - #endif - enum textbin textbin = buffer_textbin (buf, bufsize); if (textbin_is_binary (textbin)) return textbin; @@ -488,7 +493,7 @@ file_textbin (char const *buf, size_t bufsize, int fd, struct stat const *st) return textbin == TEXTBIN_UNKNOWN ? 
TEXTBIN_BINARY : textbin; /* If the file has holes, it must contain a null byte somewhere. */ - if (SEEK_HOLE != SEEK_END && eolbyte) + if (SEEK_HOLE != SEEK_SET && eolbyte) { off_t cur = bufsize; if (O_BINARY || fd == STDIN_FILENO) @@ -713,7 +718,7 @@ fillbuf (size_t save, struct stat const *st) break; totalnl = add_count (totalnl, fillsize); - if (!seek_data_failed) + if (SEEK_DATA != SEEK_SET && !seek_data_failed) { off_t data_start = lseek (bufdesc, bufoffset, SEEK_DATA); if (data_start < 0) -- 1.9.3 --------------060303070004010300080405 Content-Type: text/x-patch; name="0002-grep-speed-up-processing-of-holes-before-EOF-on-Sola.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename*0="0002-grep-speed-up-processing-of-holes-before-EOF-on-Sola.pa"; filename*1="tch" >From 0d6febac38c03391d7eecb5335620a0ec5ba8278 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Wed, 17 Sep 2014 12:53:17 -0700 Subject: [PATCH 2/2] grep: speed up processing of holes before EOF on Solaris * src/grep.c (fillbuf): If SEEK_DATA fails with errno == ENXIO, skip over the hole at EOF. --- src/grep.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/src/grep.c b/src/grep.c index a08fa41..35d3358 100644 --- a/src/grep.c +++ b/src/grep.c @@ -720,7 +720,12 @@ fillbuf (size_t save, struct stat const *st) if (SEEK_DATA != SEEK_SET && !seek_data_failed) { + /* Solaris SEEK_DATA fails with errno == ENXIO in a hole at EOF. 
*/ off_t data_start = lseek (bufdesc, bufoffset, SEEK_DATA); + if (data_start < 0 && errno == ENXIO + && usable_st_size (st) && bufoffset < st->st_size) + data_start = lseek (bufdesc, 0, SEEK_END); + if (data_start < 0) seek_data_failed = true; else -- 1.9.3 --------------060303070004010300080405-- From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> In-Reply-To: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Thu, 18 Sep 2014 06:02:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141102007219608 (code B ref 18454); Thu, 18 Sep 2014 06:02:01 +0000 Received: (at 18454) by debbugs.gnu.org; 18 Sep 2014 06:01:12 +0000 Received: from localhost ([127.0.0.1]:43842 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUUmF-00056C-HI for submit@debbugs.gnu.org; Thu, 18 Sep 2014 02:01:11 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:36146) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUUmE-000564-1s for 18454@debbugs.gnu.org; Thu, 18 Sep 2014 02:01:10 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id D4A9339E801C for <18454@debbugs.gnu.org>; Wed, 17 Sep 2014 23:01:08 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id zYupyWTfHc+2 for <18454@debbugs.gnu.org>; Wed, 17 Sep 2014 23:00:56 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by 
smtp.cs.ucla.edu (Postfix) with ESMTPSA id 402F4A60003 for <18454@debbugs.gnu.org>; Wed, 17 Sep 2014 23:00:49 -0700 (PDT) Message-ID: <541A750E.2050606@cs.ucla.edu> Date: Wed, 17 Sep 2014 23:00:46 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) I've installed all the patches mentioned so far. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Thu, 18 Sep 2014 08:34:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.14110292171838 (code B ref 18454); Thu, 18 Sep 2014 08:34:02 +0000 Received: (at 18454) by debbugs.gnu.org; 18 Sep 2014 08:33:37 +0000 Received: from localhost ([127.0.0.1]:43906 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUX9k-0000Ta-9l for submit@debbugs.gnu.org; Thu, 18 Sep 2014 04:33:36 -0400 Received: from mx1.riseup.net ([198.252.153.129]:41158) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUX9h-0000TR-8O for 18454@debbugs.gnu.org; Thu, 18 Sep 2014 04:33:34 -0400 Received: from berryeater.riseup.net (berryeater-pn.riseup.net [10.0.1.120]) (using TLSv1 with cipher 
ECDHE-RSA-AES256-SHA (256/256 bits)) (Client CN "*.riseup.net", Issuer "Gandi Standard SSL CA" (not verified)) by mx1.riseup.net (Postfix) with ESMTPS id D9D9C51A50; Thu, 18 Sep 2014 01:33:31 -0700 (PDT) Received: from [127.0.0.1] (localhost [127.0.0.1]) (Authenticated sender: santiagorr) with ESMTPSA id EC5EA4202F Received: by nomada (sSMTP sendmail emulation); Thu, 18 Sep 2014 10:33:27 +0200 Date: Thu, 18 Sep 2014 10:33:27 +0200 From: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Message-ID: <20140918083327.GA16324@nomada> References: <20140912012449.GB18162@xvii.vinc17.org> <541A750E.2050606@cs.ucla.edu> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <541A750E.2050606@cs.ucla.edu> User-Agent: Mutt/1.5.23 (2014-03-12) X-Virus-Scanned: clamav-milter 0.98.4 at mx1 X-Virus-Status: Clean X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) El 17/09/14 a las 23:00, Paul Eggert escribió: > I've installed all the patches mentioned so far. > I've successfully build the latest commit (f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any performance boost. Rather the opposite. 
Comparing with debian's grep 2.20-3, that includes your first patch to solve this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch: grep -P asdf /usr/bin/* 12,42s user 0,12s system 99% cpu 12,545 total src/grep -P asdf /usr/bin/* 14,37s user 0,12s system 99% cpu 14,492 total Note that basic grep also slowdowns: grep asdf /usr/bin/* 0,22s user 0,16s system 99% cpu 0,382 total src/grep asdf /usr/bin/* 1,26s user 0,12s system 99% cpu 1,384 total Cheers, and thanks for your work, Santiago From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Thu, 18 Sep 2014 19:38:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Cc: Paul Eggert , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141106904118707 (code B ref 18454); Thu, 18 Sep 2014 19:38:01 +0000 Received: (at 18454) by debbugs.gnu.org; 18 Sep 2014 19:37:21 +0000 Received: from localhost ([127.0.0.1]:44785 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUhW3-0004re-T7 for submit@debbugs.gnu.org; Thu, 18 Sep 2014 15:37:20 -0400 Received: from mail-we0-f180.google.com ([74.125.82.180]:54556) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XUhW1-0004rW-Ry for 18454@debbugs.gnu.org; Thu, 18 Sep 2014 15:37:18 -0400 Received: by mail-we0-f180.google.com with SMTP id q59so1448701wes.25 for <18454@debbugs.gnu.org>; Thu, 18 Sep 2014 12:37:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type:content-transfer-encoding; bh=cTyiBHYsccPZ4mHMDRlQHyfGgZlMHthxCLurLkdYxik=; 
b=CAIhLB4nIUKmR1ugY6xAxrs1m0daW2vNEZVyovS8NJ5NJMmqWIEWQlJt8OAat2bwIP siyWAQaY5RhbiaUcu33BPRSbpxZZoXWAc0HRdqg1y8Py0m3neVdGJn4lOPvHOByzGSal Z98QGZyZCEu1U+LA0LjqDv1PL9gAXJD4hVoviO2PvPjM7rChyIFJZtDFKWBLHepQf2YN i+aoU+z4m4ZkJ1srArQcQ6UgsYzyQNfYCJpHdPLPIigeim4Jv3oasjED/NDNeSXlU3sc uPdAxSSBhOIEY/CVSMtXojTf9agwgFAFoxKF+t29xINIlkBjb635hNPd73VxrYl6Za49 2oNg== X-Received: by 10.194.8.232 with SMTP id u8mr7508156wja.64.1411069037110; Thu, 18 Sep 2014 12:37:17 -0700 (PDT) MIME-Version: 1.0 Received: by 10.194.86.131 with HTTP; Thu, 18 Sep 2014 12:36:57 -0700 (PDT) In-Reply-To: <20140918083327.GA16324@nomada> References: <20140912012449.GB18162@xvii.vinc17.org> <541A750E.2050606@cs.ucla.edu> <20140918083327.GA16324@nomada> From: Jim Meyering Date: Thu, 18 Sep 2014 12:36:57 -0700 X-Google-Sender-Auth: cafI-Ke3C0iV4ZGt9nrph0sbAwY Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Thu, Sep 18, 2014 at 1:33 AM, Santiago Ruano Rinc=F3n wrote: > El 17/09/14 a las 23:00, Paul Eggert escribi=F3: >> I've installed all the patches mentioned so far. >> > > I've successfully build the latest commit > (f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any > performance boost. Rather the opposite. 
>
> Comparing with debian's grep 2.20-3, that includes your first patch to solve
> this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:
>
> grep -P asdf /usr/bin/* 12,42s user 0,12s system 99% cpu 12,545 total
> src/grep -P asdf /usr/bin/* 14,37s user 0,12s system 99% cpu 14,492 total
>
> Note that basic grep also slowdowns:
>
> grep asdf /usr/bin/* 0,22s user 0,16s system 99% cpu 0,382 total
> src/grep asdf /usr/bin/* 1,26s user 0,12s system 99% cpu 1,384 total

Thank you for running timing comparisons.
Once I verified that I had no large, sparse files in my grep working directory,
I ran the same test there (du -sh . reports 176M, du --app -sh . reports 139M).

The following shows a performance regression when searching files like
those in my grep working directory. The new grep (v2.20-46-gf6de00f) takes
2.5x longer than 2.20.14. This is with a hot cache (best of several runs)
on an Intel(R) Xeon(R) CPU E5-2660, compiled with gcc-5.x

$ diff -u <(env time grep -r asdf . 2>&1) <(PATH=src:$PATH env time grep -r asdf . 2>&1)
--- /proc/self/fd/11 2014-09-18 12:07:43.169721947 -0700
+++ /proc/self/fd/12 2014-09-18 12:07:43.169721947 -0700
@@ -1,3 +1,3 @@
 ./src/grep.c: printf 'asdfqwerzxcv\rASDF\tZXCV\n'
-0.08user 0.10system 0:00.18elapsed 100%CPU (0avgtext+0avgdata 6256maxresident)k
-0inputs+0outputs (0major+670minor)pagefaults 0swaps
+0.40user 0.11system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 5328maxresident)k
+0inputs+0outputs (0major+634minor)pagefaults 0swaps

It looks like most of the difference is the result of
commit cd36abd46c5e0768606979ea75a51732062f5624,
"grep: treat a file as binary if its prefix contains encoding errors",
with its new, locale-sensitive "is_binary" test.
I saw the above timing difference even with LC_ALL=C, so one quick
fix would be to skip the use of mbrlen when possible.
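
For illustration only (this is not code from the thread or from grep): a minimal sketch of what "skip the use of mbrlen when possible" could look like. The helper name looks_binary is hypothetical, and the sketch assumes that a single-byte locale can be treated as having no encoding errors, so only the NUL-byte test remains there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not grep's actual code: return true if BUF
   (SIZE bytes) contains a NUL byte or, in a multibyte locale, an
   invalid multibyte sequence.  */
static bool
looks_binary (char const *buf, size_t size)
{
  assert (buf || size == 0);
  if (size == 0)
    return false;

  if (memchr (buf, '\0', size))
    return true;

  /* Assumption of this sketch: in a single-byte locale there are no
     encoding errors to look for, so the per-character mbrlen scan
     can be skipped entirely.  */
  if (MB_CUR_MAX == 1)
    return false;

  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  for (size_t i = 0; i < size; )
    {
      size_t clen = mbrlen (buf + i, size - i, &mbs);
      if (clen == (size_t) -1)
        return true;            /* invalid sequence */
      if (clen == (size_t) -2)
        break;                  /* incomplete character at the end */
      i += clen ? clen : 1;     /* clen is 0 only for a NUL, rejected above */
    }
  return false;
}
```

With this shape, the cost in a unibyte locale is a single memchr pass over the buffer; the mbrlen loop runs only when the locale is multibyte.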
From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 19 Sep 2014 16:08:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Cc: Paul Eggert , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.14111428325821 (code B ref 18454); Fri, 19 Sep 2014 16:08:02 +0000 Received: (at 18454) by debbugs.gnu.org; 19 Sep 2014 16:07:12 +0000 Received: from localhost ([127.0.0.1]:45966 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XV0iF-0001Vo-A8 for submit@debbugs.gnu.org; Fri, 19 Sep 2014 12:07:11 -0400 Received: from mail-wg0-f48.google.com ([74.125.82.48]:61038) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XV0iC-0001Vf-JO for 18454@debbugs.gnu.org; Fri, 19 Sep 2014 12:07:09 -0400 Received: by mail-wg0-f48.google.com with SMTP id m15so2692552wgh.7 for <18454@debbugs.gnu.org>; Fri, 19 Sep 2014 09:07:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=hm2NjWOybrqFQ9m2Hjr65mqjutZgSBiVs8hhzn8ZeDo=; b=WsJ4i9MIacK+OvSkNCmdk4TPbXeNgnx/bsybcRGRCYJeG1MIeYt2kd2KS73BwpRAk1 8T7L6hImPESsO30M/b8RKPEKb7dUzTzwErpM2KwwRZ5A2T8wviYnt4mZHVW9pcPSXJ2d zBOkSv5bE0Bvpuc0GsO7xlx9kMlQ5oW5GILjWwMzLK696Rr8tKGXPJDRCQbyCsVAW//q Qm1ayd7DJRdQTBTA0ZZ6/uegC595h8ssnUYpuMZ8Ts0EySQqzz5vPEIzkfB/83yxAddZ dLuuLhnVXd42qMBXb198b5HC1iNSTDa4mLEZP2eujwCpDe8kCPWM1urQzOvelGJrR3L9 RqIw== X-Received: by 10.180.78.226 with SMTP id e2mr7077703wix.68.1411142827829; Fri, 19 Sep 2014 09:07:07 -0700 (PDT) MIME-Version: 1.0 Received: by 10.194.86.131 with HTTP; Fri, 19 Sep 2014 
09:06:47 -0700 (PDT) In-Reply-To: References: <20140912012449.GB18162@xvii.vinc17.org> <541A750E.2050606@cs.ucla.edu> <20140918083327.GA16324@nomada> From: Jim Meyering Date: Fri, 19 Sep 2014 09:06:47 -0700 X-Google-Sender-Auth: gfBoNdOkl51BZat2ai718EeTNoM Message-ID: Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) On Thu, Sep 18, 2014 at 12:36 PM, Jim Meyering wrote: > It looks like most of the difference is the result of > commit cd36abd46c5e0768606979ea75a51732062f5624, > "grep: treat a file as binary if its prefix contains encoding errors", Hi Paul, I found that the above commit induces a large performance hit. Over 50x in this example: seq 99999999 > k LC_ALL=C diff -u \ <(PATH=.bin/2.20-31:$PATH env time -f %e grep asdf k 2>&1) \ <(PATH=.bin/2.20-32:$PATH env time -f %e grep asdf k 2>&1) ... -0.21 +11.47 The problem is that the new function is processing all of the input, not just a prefix. 
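
As an illustration of the "prefix only" behaviour described above, here is a minimal sketch in which the classifier examines just the first buffer of a file and later buffers reuse the stored verdict. The struct and function names are hypothetical and the binary test is reduced to a NUL-byte check; it is not the code that was committed to grep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file state; these names are illustrative only.  */
struct file_scan
{
  bool classified;   /* true once the first buffer has been examined */
  bool binary;       /* verdict taken from that first buffer */
  size_t checked;    /* number of bytes the classifier has looked at */
};

/* Feed one input buffer.  Only the first buffer of a file is
   examined; every later call returns the stored verdict without
   touching the data again.  */
static bool
scan_buffer (struct file_scan *fs, char const *buf, size_t size)
{
  assert (fs && (buf || size == 0));
  if (!fs->classified)
    {
      fs->binary = size != 0 && memchr (buf, '\0', size) != NULL;
      fs->checked += size;
      fs->classified = true;
    }
  return fs->binary;
}
```

The checked counter is there only to make the cost visible: for a large file it stays at the size of the first buffer instead of growing with the whole input.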
From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> In-Reply-To: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 21 Sep 2014 21:24:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: 18454@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.14113346095209 (code B ref -1); Sun, 21 Sep 2014 21:24:01 +0000 Received: (at submit) by debbugs.gnu.org; 21 Sep 2014 21:23:29 +0000 Received: from localhost ([127.0.0.1]:47720 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XVobP-0001Lw-IL for submit@debbugs.gnu.org; Sun, 21 Sep 2014 17:23:28 -0400 Received: from eggs.gnu.org ([208.118.235.92]:50129) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XVavU-0004TY-QP for submit@debbugs.gnu.org; Sun, 21 Sep 2014 02:47:17 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XVavM-0007eg-7Z for submit@debbugs.gnu.org; Sun, 21 Sep 2014 02:47:16 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, UNPARSEABLE_RELAY autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([208.118.235.17]:56581) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XVavM-0007eB-4G for submit@debbugs.gnu.org; Sun, 21 Sep 2014 02:47:08 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:51806) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XVav9-0000ml-P1 for bug-grep@gnu.org; Sun, 21 Sep 2014 02:47:02 -0400 Received: from Debian-exim by eggs.gnu.org with 
spam-scanned (Exim 4.71) (envelope-from ) id 1XVav2-0007Vm-Gs for bug-grep@gnu.org; Sun, 21 Sep 2014 02:46:55 -0400 Received: from iwiw03d.mail.t-online.hu ([84.2.42.68]:15254) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XVav2-0007VY-9a for bug-grep@gnu.org; Sun, 21 Sep 2014 02:46:48 -0400 Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74]) by iwiw03d.mail.t-online.hu (Postfix) with SMTP id F1AEC4E8B7A for ; Sun, 21 Sep 2014 08:46:35 +0200 (CEST) Received: (qmail 38945 invoked by uid 151); 21 Sep 2014 08:46:39 +0200 Received: from fm-haproxy01.freemail.hu (HELO fmxmldata04.freemail.hu) (195.228.245.211) by fmx24.freemail.hu with SMTP; 21 Sep 2014 08:46:39 +0200 Received: from webmail by smtp gw id s8L6kda2057330; Sun, 21 Sep 2014 08:46:39 +0200 (CEST) Date: Sun, 21 Sep 2014 08:46:39 +0200 (CEST) From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Message-ID: X-Originating-IP: [91.83.38.253] X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 X-Original-User: hzmester MIME-Version: 1.0 Content-Type: TEXT/plain; CHARSET=UTF-8 X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 208.118.235.17 X-Spam-Score: -5.0 (-----) X-Mailman-Approved-At: Sun, 21 Sep 2014 17:23:25 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) Hi, I am the developer of the JIT compiler in PCRE. I am frequently checking the discussions about PCRE and found this comment here on bug-grep@gnu.org: > There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening > the data. That's the fundamental problem here. This requires too much effort with no benefit. 
Reasons:

- what should you do if you encounter an invalid UTF-8 opcode: ignore it? decode it to some random value? For example, what should happen if you find a stray 0xe9? Does it match \xe9? Everybody has different opinion about handling invalid UTF opcodes, and this would lead to never-ending arguments on pcre-dev.

- the bigger problem is performance. Handling invalid UTF codes requires a lot of extra checks and kills many optimizations. For example, when we encounter a 0xc5, we know that the input buffer has at least one more byte. We do not check the input buffer size. We also assume that the highest 2 bits of the second byte are 10, and do not check this when we decode that character. It would also kill other optimizations, such as the Boyer-Moore-like search in JIT. The major problem is, everybody would suffer this performance regression, including those, who pass valid UTF strings.

For these reasons, such a change will never happen. But there are alternatives.

* The best solution is multi-threaded grepping: one thread reads file data and replaces invalid UTF-8 opcodes with something valid, or removes them. The other thread runs PCRE on the filtered data. Alternatively, you can convert everything to UTF32, and use pcre32.

* The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed, i.e. if you search /ab/, and the invalid UTF sequence contains ab, this might not be found (or might be found with the interpreter, but not with JIT or vice versa). If you use pcre32, there is no need for any extra byte extension.
Regards, Zoltan From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 26 Sep 2014 00:24:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Jim Meyering , Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141169099231334 (code B ref 18454); Fri, 26 Sep 2014 00:24:01 +0000 Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 00:23:12 +0000 Received: from localhost ([127.0.0.1]:52412 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXJJX-00089J-P7 for submit@debbugs.gnu.org; Thu, 25 Sep 2014 20:23:12 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:38256) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXJJU-000898-6w for 18454@debbugs.gnu.org; Thu, 25 Sep 2014 20:23:09 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 4A3EEA60036; Thu, 25 Sep 2014 17:23:07 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id MBSAwqAjtzLB; Thu, 25 Sep 2014 17:23:03 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id D807FA6000A; Thu, 25 Sep 2014 17:23:02 -0700 (PDT) Message-ID: <5424B1E6.8090502@cs.ucla.edu> Date: Thu, 25 Sep 2014 17:23:02 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> 
<541A750E.2050606@cs.ucla.edu> <20140918083327.GA16324@nomada> In-Reply-To: Content-Type: multipart/mixed; boundary="------------070505040701010404060607" X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) This is a multi-part message in MIME format. --------------070505040701010404060607 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Thanks for looking into that. The attached patches solve those performance problems for me. --------------070505040701010404060607 Content-Type: text/plain; charset=UTF-8; name="0001-grep-scan-for-valid-multibyte-strings-more-quickly.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename*0="0001-grep-scan-for-valid-multibyte-strings-more-quickly.patc"; filename*1="h" RnJvbSA0ZWY2N2EyNzJhZjg1YjQ2ZjQ3NjlkMWM1OTMxNzhhNjNmNjIwNWRhIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBUaHUsIDI1IFNlcCAyMDE0IDE3OjA0OjQ5IC0wNzAwClN1YmplY3Q6IFtQQVRD SCAxLzJdIGdyZXA6IHNjYW4gZm9yIHZhbGlkIG11bHRpYnl0ZSBzdHJpbmdzIG1vcmUgcXVp Y2tseQoKU2NhbiB2YWxpZCBtdWx0aWJ5dGUgc3RyaW5ncyBtb3JlIHF1aWNrbHkgaW4gdGhl IGNvbW1vbiBjYXNlIG9mCmVuY29kaW5ncyB0aGF0IGFyZSB1cHdhcmQgY29tcGF0aWJsZSB3 aXRoIEFTQ0lJLCBzdWNoIGFzIFVURi04LgpZb3UnZCB0aGluayB0aGVyZSdkIGJlIGEgZmFz dCBzdGFuZGFyZCB3YXkgdG8gZG8gdGhpcyBub3dhZGF5cywKYnV0IG5vb29vby4uLi4KUHJv YmxlbSByZXBvcnRlZCBieSBKaW0gTWV5ZXJpbmcgaW46IGh0dHA6Ly9idWdzLmdudS5vcmcv MTg0NTQjNTYKKiBzcmMvZ3JlcC5jIChISUJZVEUpOiBOZXcgY29uc3RhbnQuCihlYXN5X2Vu Y29kaW5nKTogTmV3IHN0YXRpYyB2YXIuCihpbml0X2Vhc3lfZW5jb2RpbmcsIHNraXBfZWFz eV9ieXRlcyk6IE5ldyBmdW5jdGlvbnMuCihidWZmZXJfdGV4dGJpbik6IFNraXAgZWFzeSBi eXRlcyBxdWlja2x5LgpEb24ndCBib3RoZXIgd2l0aCBtYl9jbGVuIGhlcmUsIHNpbmNlIHNr 
aXBfZWFzeV9ieXRlcyB0eXBpY2FsbHkKY2FwdHVyZXMgdGhlIGVhc3kgY2FzZXM7IGp1c3Qg dXNlIG1icmxlbiBkaXJlY3RseS4KKGJ1ZmZlcl90ZXh0YmluLCBmaWxlX3RleHRiaW4pOiBG aXJzdCBhcmcgaXMgbm8gbG9uZ2VyIGEgY29uc3QKcG9pbnRlciwgc2luY2UgdGhlIGJ5dGUg cGFzdCB0aGUgZW5kIGlzIG5vdyBhbiBvdmVyd3JpdHRlbiBzZW50aW5lbC4KKG1haW4pOiBD YWxsIGluaXRfZWFzeV9lbmNvZGluZy4KLS0tCiBzcmMvZ3JlcC5jIHwgNTcgKysrKysrKysr KysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKystLS0tCiAxIGZp bGUgY2hhbmdlZCwgNTMgaW5zZXJ0aW9ucygrKSwgNCBkZWxldGlvbnMoLSkKCmRpZmYgLS1n aXQgYS9zcmMvZ3JlcC5jIGIvc3JjL2dyZXAuYwppbmRleCAzNWQzMzU4Li45NDhlNDI3IDEw MDY0NAotLS0gYS9zcmMvZ3JlcC5jCisrKyBiL3NyYy9ncmVwLmMKQEAgLTQ1NCw5ICs0NTQs NTYgQEAgdGV4dGJpbl9pc19iaW5hcnkgKGVudW0gdGV4dGJpbiB0ZXh0YmluKQogICByZXR1 cm4gdGV4dGJpbiA8IFRFWFRCSU5fVU5LTk9XTjsKIH0KIAorLyogVGhlIGhpZ2gtb3JkZXIg Yml0IG9mIGEgYnl0ZS4gICovCitlbnVtIHsgSElCWVRFID0gMHg4MCB9OworCisvKiBUcnVl IGlmIGV2ZXJ5IGJ5dGUgd2l0aCBISUJZVEUgb2ZmIGlzIGEgc2luZ2xlLWJ5dGUgY2hhcmFj dGVyLgorICAgVVRGLTggaGFzIHRoaXMgcHJvcGVydHkuICAqLworc3RhdGljIGJvb2wgZWFz eV9lbmNvZGluZzsKKworc3RhdGljIHZvaWQKK2luaXRfZWFzeV9lbmNvZGluZyAodm9pZCkK K3sKKyAgZWFzeV9lbmNvZGluZyA9IHRydWU7CisgIGZvciAoaW50IGkgPSAwOyBpIDwgSElC WVRFOyBpKyspCisgICAgZWFzeV9lbmNvZGluZyAmPSBtYmNsZW5fY2FjaGVbaV0gPT0gMTsK K30KKworLyogU2tpcCB0aGUgZWFzeSBieXRlcyBpbiBhIGJ1ZmZlciB0aGF0IGlzIGd1YXJh bnRlZWQgdG8gaGF2ZSBhIHNlbnRpbmVsCisgICB0aGF0IGlzIG5vdCBlYXN5LCBhbmQgcmV0 dXJuIGEgcG9pbnRlciB0byB0aGUgZmlyc3Qgbm9uLWVhc3kgYnl0ZS4KKyAgIEluIGVhc3kg ZW5jb2RpbmdzLCB0aGUgZWFzeSBieXRlcyBhbGwgaGF2ZSBISUJZVEUgb2ZmLgorICAgSW4g b3RoZXIgZW5jb2RpbmdzLCBubyBieXRlIGlzIGVhc3kuICAqLworc3RhdGljIGNoYXIgY29u c3QgKiBfR0xfQVRUUklCVVRFX1BVUkUKK3NraXBfZWFzeV9ieXRlcyAoY2hhciBjb25zdCAq YnVmKQoreworICBpZiAoIWVhc3lfZW5jb2RpbmcpCisgICAgcmV0dXJuIGJ1ZjsKKworICAv KiBBbiB1bnNpZ25lZCB0eXBlIHN1aXRhYmxlIGZvciBmYXN0IG1hdGNoaW5nLiAgKi8KKyAg dHlwZWRlZiB1aW50bWF4X3QgdXdvcmQ7CisKKyAgLyogMHg4MDgwLi4uLCBleHRlbmRlZCB0 byBiZSB3aWRlIGVub3VnaCBmb3IgdXdvcmQuICAqLworICB1d29yZCBoaWJ5dGVfbWFzayA9 
ICh1d29yZCkgLTEgLyBVQ0hBUl9NQVggKiBISUJZVEU7CisKKyAgLyogU2VhcmNoIGEgYnl0 ZSBhdCBhIHRpbWUgdW50aWwgdGhlIHBvaW50ZXIgaXMgYWxpZ25lZCwgdGhlbiBhCisgICAg IHV3b3JkIGF0IGEgdGltZSB1bnRpbCBhIG1hdGNoIGlzIGZvdW5kLCB0aGVuIGEgYnl0ZSBh dCBhIHRpbWUgdG8KKyAgICAgaWRlbnRpZnkgdGhlIGV4YWN0IGJ5dGUuICBUaGUgdXdvcmQg c2VhcmNoIG1heSBnbyBzbGlnaHRseSBwYXN0CisgICAgIHRoZSBidWZmZXIgZW5kLCBidXQg dGhhdCdzIGJlbmlnbi4gICovCisgIGNoYXIgY29uc3QgKnA7CisgIHV3b3JkIGNvbnN0ICpz OworICBmb3IgKHAgPSBidWY7ICh1aW50cHRyX3QpIHAgJSBzaXplb2YgKHV3b3JkKSAhPSAw OyBwKyspCisgICAgaWYgKCpwICYgSElCWVRFKQorICAgICAgcmV0dXJuIHA7CisgIGZvciAo cyA9ICh1d29yZCBjb25zdCAqKSBwOyAhICgqcyAmIGhpYnl0ZV9tYXNrKTsgcysrKQorICAg IGNvbnRpbnVlOworICBmb3IgKHAgPSAoY2hhciBjb25zdCAqKSBzOyAhICgqcCAmIEhJQllU RSk7IHArKykKKyAgICBjb250aW51ZTsKKyAgcmV0dXJuIHA7Cit9CisKIC8qIFJldHVybiB0 aGUgdGV4dCB0eXBlIG9mIGRhdGEgaW4gQlVGLCBvZiBzaXplIFNJWkUuICAqLwogc3RhdGlj IGVudW0gdGV4dGJpbgotYnVmZmVyX3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90 IHNpemUpCitidWZmZXJfdGV4dGJpbiAoY2hhciAqYnVmLCBzaXplX3Qgc2l6ZSkKIHsKICAg aWYgKGVvbGJ5dGUgJiYgbWVtY2hyIChidWYsICdcMCcsIHNpemUpKQogICAgIHJldHVybiBU RVhUQklOX0JJTkFSWTsKQEAgLTQ2Nyw5ICs1MTQsMTAgQEAgYnVmZmVyX3RleHRiaW4gKGNo YXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAgICAgICBzaXplX3QgY2xlbjsKICAgICAg IGNoYXIgY29uc3QgKnA7CiAKLSAgICAgIGZvciAocCA9IGJ1ZjsgcCA8IGJ1ZiArIHNpemU7 IHAgKz0gY2xlbikKKyAgICAgIGJ1ZltzaXplXSA9IC0xOworICAgICAgZm9yIChwID0gYnVm OyAocCA9IHNraXBfZWFzeV9ieXRlcyAocCkpIDwgYnVmICsgc2l6ZTsgcCArPSBjbGVuKQog ICAgICAgICB7Ci0gICAgICAgICAgY2xlbiA9IG1iX2NsZW4gKHAsIGJ1ZiArIHNpemUgLSBw LCAmbWJzKTsKKyAgICAgICAgICBjbGVuID0gbWJybGVuIChwLCBidWYgKyBzaXplIC0gcCwg Jm1icyk7CiAgICAgICAgICAgaWYgKChzaXplX3QpIC0yIDw9IGNsZW4pCiAgICAgICAgICAg ICByZXR1cm4gY2xlbiA9PSAoc2l6ZV90KSAtMiA/IFRFWFRCSU5fVU5LTk9XTiA6IFRFWFRC SU5fQklOQVJZOwogICAgICAgICB9CkBAIC00ODEsNyArNTI5LDcgQEAgYnVmZmVyX3RleHRi aW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAvKiBSZXR1cm4gdGhlIHRleHQg dHlwZSBvZiBhIGZpbGUuICBCVUYsIG9mIHNpemUgQlVGU0laRSwgaXMgdGhlIGluaXRpYWwK 
ICAgIGJ1ZmZlciByZWFkIGZyb20gdGhlIGZpbGUgd2l0aCBkZXNjcmlwdG9yIEZEIGFuZCBz dGF0dXMgU1QuICAqLwogc3RhdGljIGVudW0gdGV4dGJpbgotZmlsZV90ZXh0YmluIChjaGFy IGNvbnN0ICpidWYsIHNpemVfdCBidWZzaXplLCBpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0 ICpzdCkKK2ZpbGVfdGV4dGJpbiAoY2hhciAqYnVmLCBzaXplX3QgYnVmc2l6ZSwgaW50IGZk LCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiB7CiAgIGVudW0gdGV4dGJpbiB0ZXh0YmluID0g YnVmZmVyX3RleHRiaW4gKGJ1ZiwgYnVmc2l6ZSk7CiAgIGlmICh0ZXh0YmluX2lzX2JpbmFy eSAodGV4dGJpbikpCkBAIC0yNDE3LDYgKzI0NjUsNyBAQCBtYWluIChpbnQgYXJnYywgY2hh ciAqKmFyZ3YpCiAgICAgdXNhZ2UgKEVYSVRfVFJPVUJMRSk7CiAKICAgYnVpbGRfbWJjbGVu X2NhY2hlICgpOworICBpbml0X2Vhc3lfZW5jb2RpbmcgKCk7CiAKICAgLyogSWYgZmdyZXAg aW4gYSBtdWx0aWJ5dGUgbG9jYWxlLCB0aGVuIHVzZSBncmVwIGlmIGVpdGhlcgogICAgICAo MSkgY2FzZSBpcyBpZ25vcmVkICh3aGVyZSBncmVwIGlzIHR5cGljYWxseSBmYXN0ZXIpLCBv cgotLSAKMS45LjMKCg== --------------070505040701010404060607 Content-Type: text/plain; charset=UTF-8; name="0002-grep-don-t-check-extensively-for-invalid-prefix-byte.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename*0="0002-grep-don-t-check-extensively-for-invalid-prefix-byte.pa"; filename*1="tch" RnJvbSA0NjZjYTQ0YjBiNTA5MDdlNDdlZjZmMWI0ZTEyODNlMzJkNjY3ZjM3IE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBUaHUsIDI1IFNlcCAyMDE0IDE3OjE0OjU2IC0wNzAwClN1YmplY3Q6IFtQQVRD SCAyLzJdIGdyZXA6IGRvbid0IGNoZWNrIGV4dGVuc2l2ZWx5IGZvciBpbnZhbGlkIHByZWZp eCBieXRlcwogdW5sZXNzIC1QCgpQcm9ibGVtIHJlcG9ydGVkIGJ5IEppbSBNZXllcmluZyBp bjogaHR0cDovL2J1Z3MuZ251Lm9yZy8xODQ1NCM1NgoqIHNyYy9ncmVwLmMgKGdyZXApOiBB ZnRlciB0aGUgZmlyc3QgYnVmZmVyIGlzIGNoZWNrZWQsIGxlYXZlIHRoZQpmaWxlLXR5cGUg Y2hlY2tlciBpbiBURVhUQklOX1VOS05PV04gc3RhdGUgb25seSB3aGVuIC1QIGlzIHVzZWQu Ck9ubHkgdGhlIC1QIG1hdGNoZXIgaGFzIHBlcmZvcm1hbmNlIHByb2JsZW1zIHdpdGggY2hl Y2tpbmcgYmluYXJ5CmRhdGEgdGhhdCBtYWtlIGl0IHdvcnRod2hpbGUgdG8gY2hlY2sgZXZl cnkgcHJlZml4IGlucHV0IGJ5dGUgc28KdGhlIC1QIG1hdGNoZXIncyBURVhUQklOX1VOS05P 
V04gb3B0aW1pemF0aW9ucyBjYW4gY29tZSBpbnRvIHBsYXkuCk90aGVyIG1hdGNoZXJzIGNh biBzaW1wbHkgY2hlY2sgdGhlIGRhdGEgZGlyZWN0bHksIGFuZCB1c2luZwpURVhUQklOX1VO S05PV04gd2l0aCB0aGVtIHNsb3dzICdncmVwJyBkb3duIGZvciBubyBiZW5lZml0LgotLS0K IHNyYy9ncmVwLmMgfCAyICsrCiAxIGZpbGUgY2hhbmdlZCwgMiBpbnNlcnRpb25zKCspCgpk aWZmIC0tZ2l0IGEvc3JjL2dyZXAuYyBiL3NyYy9ncmVwLmMKaW5kZXggOTQ4ZTQyNy4uM2E4 ZDlmNSAxMDA2NDQKLS0tIGEvc3JjL2dyZXAuYworKysgYi9zcmMvZ3JlcC5jCkBAIC0xMjg4 LDYgKzEyODgsOCBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAg ICAgICAgICBudWxfemFwcGVyID0gZW9sOwogICAgICAgICAgIHNraXBfbnVscyA9IHNraXBf ZW1wdHlfbGluZXM7CiAgICAgICAgIH0KKyAgICAgIGVsc2UgaWYgKGV4ZWN1dGUgIT0gUGV4 ZWN1dGUpCisgICAgICAgIHRleHRiaW4gPSBURVhUQklOX1RFWFQ7CiAgICAgfQogCiAgIGZv ciAoOzspCi0tIAoxLjkuMwoK --------------070505040701010404060607-- From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 26 Sep 2014 01:20:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.14116943704327 (code B ref 18454); Fri, 26 Sep 2014 01:20:02 +0000 Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 01:19:30 +0000 Received: from localhost ([127.0.0.1]:52442 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXKC1-00017i-LN for submit@debbugs.gnu.org; Thu, 25 Sep 2014 21:19:30 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:40072) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXKBz-00017Z-3P for 18454@debbugs.gnu.org; Thu, 25 Sep 2014 21:19:28 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 0547FA6000C; Thu, 25 Sep 
2014 18:19:26 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id STXuGlK7GRi6; Thu, 25 Sep 2014 18:19:21 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 203A4A6000A; Thu, 25 Sep 2014 18:19:21 -0700 (PDT) Message-ID: <5424BF18.7030809@cs.ucla.edu> Date: Thu, 25 Sep 2014 18:19:20 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) Zoltán, thanks for your comments on this subject. Some thoughts and suggestions: > - what should you do if you encounter an invalid UTF-8 opcode Do whatever plain 'grep' does, which is what the glibc regular expression matcher does. If I recall correctly, an encoding error in the pattern matches the same encoding error in the string. It shouldn't be that complicated. > Everybody has different opinion about handling invalid UTF opcodes I doubt whether users would care all that much, so long as the default is reasonable. We don't get complaints about it with 'grep', anyway. But if it's a real problem in the PCRE world, you could provide compile-time or run-time options to satisfy the different opinions. > everybody would suffer this performance regression, including those, who pass valid UTF strings. I don't see why. 
libpcre can continue with its current implementation, for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; that's not a problem. The problem is the case where users pass possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK. libpcre has a slow implementation for this case, and this slow implementation's performance should be improvable without affecting the performance for the PCRE_NO_UTF8_CHECK case. > * The best solution is multi-threaded grepping That would chew up CPU resources unnecessarily, by requiring two passes over the input, one for checking UTF-8, the other for doing the actual match. Granted, it might be faster in real-time than what we have now, but overall it'd probably be more expensive (e.g., more energy consumption) than what we have now, and this doesn't sound promising. > * The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed That doesn't sound like a win, I'm afraid. The use case that prompted this bug report is someone using 'grep -r' to search for strings like 'foobar' in binary data, and this use case would not work with this suggested solution. I'm hoping that the recent set of changes to 'grep' lessens the urgency of improving libpcre. On my platform (Fedora 20 x86-64) Jim Meyering's benchmark says that with grep 2.18, grep -P is 6.4x slower than plain grep, and that with the latest experimental grep (including the patches I just posted in ), grep -P is 5.6x slower than plain grep. So it's plausible that the latest set of fixes is good enough, in the sense that, sure, PCRE is slower, but it's always been slower and if that used to be good enough then it should still be good enough. 
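To make the cost of the two-pass approach concrete, here is a minimal sketch of a standalone UTF-8 validation prepass in C. It is only an illustration, not libpcre's or grep's actual code: the function name utf8_valid_prefix is hypothetical, and the sketch assumes the strict definition of UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). A prepass of this kind has to walk the buffer up to the first error before the matcher examines a single byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (an assumption of this sketch, not a libpcre or
   grep function): return the length of the longest prefix of
   BUF[0..LEN-1] that is well-formed UTF-8.  */
size_t
utf8_valid_prefix (unsigned char const *buf, size_t len)
{
  size_t i = 0;

  assert (buf != NULL || len == 0);
  while (i < len)
    {
      unsigned char c = buf[i];
      unsigned long cp, min;  /* decoded value; smallest value allowed */
      size_t n, k;            /* continuation bytes expected; loop index */

      if (c < 0x80)
        {
          i++;                /* ASCII byte */
          continue;
        }
      if ((c & 0xe0) == 0xc0)
        { n = 1; min = 0x80; cp = c & 0x1f; }
      else if ((c & 0xf0) == 0xe0)
        { n = 2; min = 0x800; cp = c & 0x0f; }
      else if ((c & 0xf8) == 0xf0)
        { n = 3; min = 0x10000; cp = c & 0x07; }
      else
        break;                /* stray continuation or invalid lead byte */

      if (len - i <= n)
        break;                /* sequence truncated by end of buffer */
      for (k = 1; k <= n; k++)
        {
          if ((buf[i + k] & 0xc0) != 0x80)
            break;            /* not a continuation byte */
          cp = (cp << 6) | (buf[i + k] & 0x3f);
        }
      if (k <= n)
        break;
      /* Reject overlong forms, out-of-range values, and surrogates.  */
      if (cp < min || cp > 0x10ffff || (0xd800 <= cp && cp <= 0xdfff))
        break;
      i += n + 1;
    }
  return i;
}
```

A caller would compare the return value with the buffer length: equality means the whole buffer is valid UTF-8, and anything less gives the offset of the first encoding error.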
From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 26 Sep 2014 06:38:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141171342314997 (code B ref 18454); Fri, 26 Sep 2014 06:38:02 +0000 Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 06:37:03 +0000 Received: from localhost ([127.0.0.1]:52509 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXP9K-0003ta-Qj for submit@debbugs.gnu.org; Fri, 26 Sep 2014 02:37:03 -0400 Received: from iwiw01d.mail.t-online.hu ([84.2.42.53]:51868 helo=fmxout01.freemail.hu) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXP9H-0003sn-KV for 18454@debbugs.gnu.org; Fri, 26 Sep 2014 02:37:01 -0400 Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74]) by fmxout01.freemail.hu (Postfix) with SMTP id B16901293E for <18454@debbugs.gnu.org>; Fri, 26 Sep 2014 08:36:40 +0200 (CEST) Received: (qmail 98182 invoked by uid 151); 26 Sep 2014 08:36:40 +0200 Received: from 195.228.245.211 (HELO fmxmldata02.freemail.hu) (160.114.36.201) by fmx24.freemail.hu with SMTP; 26 Sep 2014 08:36:40 +0200 Received: from webmail by smtp gw id s8Q6aerH006726; Fri, 26 Sep 2014 08:36:40 +0200 (CEST) Date: Fri, 26 Sep 2014 08:36:40 +0200 (CEST) From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg In-Reply-To: <5424BF18.7030809@cs.ucla.edu> Message-ID: X-Originating-IP: [160.114.36.201] X-HTTP-User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:32.0) Gecko/20100101 Firefox/32.0 
X-Original-User: hzmester MIME-Version: 1.0 Content-Type: TEXT/plain; CHARSET=UTF-8 X-Spam-Score: 1.5 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Hi Paul, thank you for the feedback. >I doubt whether users would care all that much, so long as the default >is reasonable. We don't get complaints about it with 'grep', anyway. >But if it's a real problem in the PCRE world, you could provide >compile-time or run-time options to satisfy the different opinions. [...] Content analysis details: (1.5 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (hzmester[at]freemail.hu) -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [84.2.42.53 listed in list.dnswl.org] 1.5 MALFORMED_FREEMAIL Bad headers on message from free email service 0.0 UNPARSEABLE_RELAY Informational: message has unparseable relay lines X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Hi Paul, thank you for the feedback. >I doubt whether users would care all that much, so long as the default >is reasonable. We don't get complaints about it with 'grep', anyway. >But if it's a real problem in the PCRE world, you could provide >compile-time or run-time options to satisfy the different opinions. The situation is worse :( Reasonable has a different meaning for everybody. 
Just consider these two examples, where \x9c is an incorrectly encoded unicode codepoint:

/(?<=\x9c)#/

Does it match \xd5\x9c# starting from #? Noticing errors during a backward scan is complicated.

/[\x9c-\x{ffff}]/

What does this range defines exactly? What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?

Caseless matching is also another question: does /\xe9/ matches to \xc3\x89 or \xc9 invalid UTF byte sequence? In general, UTF defines several character properties. What unicode properties does an invalid codepoint have?

Believe me, depending on their needs, everybody has different answers to these questions. We don't want to force the view of one group on others, since PCRE is a library. You have much more freedom to define any behavior, since grep is an end-user program.

>I don't see why. libpcre can continue with its current implementation,
>for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK;
>that's not a problem. The problem is the case where users pass
>possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK. libpcre has
>a slow implementation for this case, and this slow implementation's
>performance should be improvable without affecting the performance for
>the PCRE_NO_UTF8_CHECK case.
Regarding performance, this comes from the interpreter:

#define GETUTF8(c, eptr) \
    { \
    if ((c & 0x20) == 0) \
      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
    else if ((c & 0x10) == 0) \
      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
    else if ((c & 0x08) == 0) \
      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
          ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
    else if ((c & 0x04) == 0) \
      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
          (eptr[4] & 0x3f); \
    else \
      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
    }

Imagine if you would need to add buffer end and other bit checks. Furthermore, Unicode expects every character to be encoded with the fewest possible bytes. More checks. You also need to check the current mode. Of course we have several macros similar to this one (for performance reasons), and there are code paths that assume valid UTF strings. This would increase complexity a lot: we would need a lot of extra regression tests, a correct JIT implementation, and so on. This would also kill optimizations. For example, if you define a character range where all characters are two bytes long, the JIT cleverly detects this and uses a fast path to discard any UTF codepoint that is not two bytes long.

The question is, who would be willing to do this work.

>That would chew up CPU resources unnecessarily, by requiring two passes
>over the input, one for checking UTF-8, the other for doing the actual
>match. Granted, it might be faster in real-time than what we have now,
>but overall it'd probably be more expensive (e.g., more energy
>consumption) than what we have now, and this doesn't sound promising.

Yeah but you could add a flag to enable this :) I feel this would be much less work than the former.

>That doesn't sound like a win, I'm afraid.
The use case that prompted >this bug report is someone using 'grep -r' to search for strings like >'foobar' in binary data, and this use case would not work with this >suggested solution. In this case, I would simply disable UTF-8 decoding. You could search UTF character codes encoded in ascii (i.e. searching \xe9 as \xc3\xa9) Regards, Zoltan From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 26 Sep 2014 08:49:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by submit@debbugs.gnu.org id=B.141172132727941 (code B ref -1); Fri, 26 Sep 2014 08:49:02 +0000 Received: (at submit) by debbugs.gnu.org; 26 Sep 2014 08:48:47 +0000 Received: from localhost ([127.0.0.1]:52722 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXRCo-0007Ga-Aq for submit@debbugs.gnu.org; Fri, 26 Sep 2014 04:48:46 -0400 Received: from eggs.gnu.org ([208.118.235.92]:45379) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXRCl-0007GR-CV for submit@debbugs.gnu.org; Fri, 26 Sep 2014 04:48:44 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XXRCe-0005U7-0v for submit@debbugs.gnu.org; Fri, 26 Sep 2014 04:48:42 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:50128) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XXRCd-0005TI-Sy for submit@debbugs.gnu.org; Fri, 26 Sep 2014 
04:48:35 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:46999) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XXRCS-0005p4-OP for bug-grep@gnu.org; Fri, 26 Sep 2014 04:48:30 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XXRCM-0005Pu-Pd for bug-grep@gnu.org; Fri, 26 Sep 2014 04:48:24 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:43139) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XXRCM-0005LZ-IM for bug-grep@gnu.org; Fri, 26 Sep 2014 04:48:18 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 70FE639E8012; Fri, 26 Sep 2014 01:48:05 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 4eoyP0fY7-XT; Fri, 26 Sep 2014 01:48:00 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 8E0AC39E801B; Fri, 26 Sep 2014 01:48:00 -0700 (PDT) Message-ID: <54252840.2020409@cs.ucla.edu> Date: Fri, 26 Sep 2014 01:48:00 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 References: In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). 
X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -4.0 (----)

Zoltán Herczeg wrote:
> Just consider these two examples, where \x9c is an incorrectly encoded unicode codepoint:
>
> /(?<=\x9c)#/
>
> Does it match \xd5\x9c# starting from #?

No, because the input does not contain a \x9c encoding error. Encoding errors match only themselves, not parts of other characters. That is how the glibc matchers behave, and it's what users expect.

> Noticing errors during a backward scan is complicated.

It's doable, and it's the right thing to do.

> /[\x9c-\x{ffff}]/
>
> What does this range defines exactly?

Range expressions have implementation-defined semantics in POSIX. For PCRE you can do what you like. I suggest mapping encoding-error bytes into characters outside the Unicode range; that's what Emacs does, I think, and it's simple and easy to explain to users. It's not a big deal either way.

> What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?

Just treat encoding-error bytes like everything else. In effect, extend the encoding to allow any byte sequence, and add a few "characters" outside the Unicode range, one for each invalid UTF-8 byte.

> Caseless matching is also another question: does /\xe9/ matches to \xc3\x89 or \xc9 invalid UTF byte sequence?

Sorry, I don't quite follow, but encoding errors aren't letters and don't have case. They match only themselves.

> What unicode properties does an invalid codepoint have?

The minimal ones.

> depending on their needs, everybody has different answers to these questions.

That's fine.
Just implement reasonable defaults, and provide options if people have needs that differ from the defaults. That's easier for libpcre than for grep, since libpcre users (who are programmers) can reasonably be expected to be more sophisticated about this sort of thing than grep users (who are not necessarily programmers).

> Imagine if you would need to add buffer end and other bit checks.

Of course it will be more expensive to check for UTF-8 as you go, than to assume the input is valid UTF-8. But again, we're not talking about the PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the input is valid UTF-8, and currently does so inefficiently. In the non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go, than to have a prepass that checks for UTF-8. This is because the prepass must be stupid (it must check the entire input buffer) whereas the matcher can be smart (it often can do its work without checking the entire input buffer). This is one reason libpcre is slower than the glibc matchers.

Obviously it would be some work to build a libpcre that runs faster in the non-PCRE_NO_UTF8_CHECK case, without hurting performance in the PCRE_NO_UTF8_CHECK case. But it could be done, if someone had the time to do it.

> The question is, who would be willing to do this work.

Not me. :-)

>> That would chew up CPU resources unnecessarily
> Yeah but you could add a flag to enable this :)

I'm not sure it'd be popular to add a --drain-battery option to grep. :)

>> The use case that prompted
>> this bug report is someone using 'grep -r' to search for strings like
>> 'foobar' in binary data, and this use case would not work with this
>> suggested solution.
>
> In this case, I would simply disable UTF-8 decoding.
I suggested that already, but the user (e.g., see the last paragraph of ) says he wants to check for more-complicated UTF-8 patterns in binary data. For example, I expect the user wants the pattern 'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he can't simply use unibyte processing.

From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 26 Sep 2014 18:05:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141175467923785 (code B ref 18454); Fri, 26 Sep 2014 18:05:02 +0000 Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 18:04:39 +0000 Received: from localhost ([127.0.0.1]:53262 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXZsk-0006BY-9N for submit@debbugs.gnu.org; Fri, 26 Sep 2014 14:04:38 -0400 Received: from iwiw03d.mail.t-online.hu ([84.2.42.68]:15763) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXZsh-0006BO-AW for 18454@debbugs.gnu.org; Fri, 26 Sep 2014 14:04:36 -0400 Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74]) by iwiw03d.mail.t-online.hu (Postfix) with SMTP id 171D24EA6A8 for <18454@debbugs.gnu.org>; Fri, 26 Sep 2014 20:04:26 +0200 (CEST) Received: (qmail 27675 invoked by uid 151); 26 Sep 2014 20:04:33 +0200 Received: from 195.228.245.211 (HELO fmxmldata06.freemail.hu) (79.120.253.101) by fmx24.freemail.hu with SMTP; 26 Sep 2014 20:04:33 +0200 Received: from webmail by smtp gw id s8QI4X7j055427; Fri, 26 Sep
2014 20:04:33 +0200 (CEST) Date: Fri, 26 Sep 2014 20:04:33 +0200 (CEST) From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg In-Reply-To: <54252840.2020409@cs.ucla.edu> Message-ID: X-Originating-IP: [79.120.253.101] X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 X-Original-User: hzmester MIME-Version: 1.0 Content-Type: TEXT/plain; CHARSET=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE X-Spam-Score: 0.6 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/)

Hi,

this is a very interesting discussion.

>> /(?<=\x9c)#/
>>
>> Does it match \xd5\x9c# starting from #?
>
>No, because the input does not contain a \x9c encoding error. Encoding errors
>match only themselves, not parts of other characters. That is how the glibc
>matchers behave, and it's what users expect.

Why \xc9 is part of another character? It depends how you interpret \xd5. And this was just a simple example.

>> Noticing errors during a backward scan is complicated.
>
>It's doable, and it's the right thing to do.

The problem is, you do it some way, and others need something else. Just think about the example above.

>Range expressions have implementation-defined semantics in POSIX. For PCRE you
>can do what you like. I suggest mapping encoding-error bytes into characters
>outside the Unicode range; that's what Emacs does, I think, and it's simple and
>easy to explain to users. It's not a big deal either way.

This mapping idea is clever. Basically invalid codepoints are converted to something valid.

>> What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?
>
>Just treat encoding-error bytes like everything else. In effect, extend the
>encoding to allow any byte sequence, and add a few "characters" outside the
>Unicode range, one for each invalid UTF-8 byte.

In other words, \xc9 really is an encoding error (since it is an invalid UTF-8 byte, following another invalid UTF-8 byte). This is what I said from the beginning, depending on the context, people choose different interpretations of handling UTF fragments. Usually they choose what is more convenient from that viewpoint. But if you put all pieces together, the result is full of contradictions.

>Sorry, I don't quite follow, but encoding errors aren't letters and don't have
>case. They match only themselves.

Not necessarily. It depends on your mapping: if more than one invalid UTF fragment is mapped to the same codepoint, they will match. Especially when you define range of characters.

> > What unicode properties does an invalid codepoint have?
>
>The minimal ones.

We could use the same flags as for characters between \x{d800}–\x{dfff}

>> The question is, who would be willing to do this work.
>
>Not me. :-)

I know this would be a lot of work. And I have doubts that slowing down PCRE would increase grep performance. Regardless, if somebody is willing to work on this, I can help. Please keep in mind that PCRE1 is considered done, and our efforts are limited to bugfixing. We are currently busy with PCRE2, and such a big change could only go there.

>I'm not sure it'd be popular to add a --drain-battery option to grep. :)

I don't think on performance hungry desktop or server environments this really matters. On phone, you likely don't need this feature.

>I suggested that already, but the user (e.g., see the last paragraph of
>) says he wants to check for more-complicated
>UTF-8 patterns in binary data. For example, I expect the user wants the pattern
>'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file. So he can't
>simply use unibyte processing.

This is exactly the use case where filtering is needed. His input is a combination of binary and UTF data, and he needs matches only in the UTF part.

Regards,
Zoltan

From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 26 Sep 2014 19:21:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141175925931538 (code B ref 18454); Fri, 26 Sep 2014 19:21:01 +0000 Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 19:20:59 +0000 Received: from localhost ([127.0.0.1]:53299 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXb4c-0008Cb-UK for submit@debbugs.gnu.org; Fri, 26 Sep 2014 15:20:59 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:48734) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXb4a-0008CS-EP for 18454@debbugs.gnu.org; Fri, 26 Sep 2014 15:20:57 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 199E6A60001; Fri, 26 Sep 2014 12:20:55 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id SiFITjrpXlTK; Fri, 26 Sep 2014 12:20:46 -0700 (PDT) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA
id 4CEB439E8011; Fri, 26 Sep 2014 12:20:46 -0700 (PDT) Message-ID: <5425BC8D.9040305@cs.ucla.edu> Date: Fri, 26 Sep 2014 12:20:45 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 References: In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) On 09/26/2014 11:04 AM, Zoltán Herczeg wrote: > this is a very interesting discussion. Yes, I have a lot of other things I'm *supposed* to be doing, but this thread is more fun.... >>> /(?<=\x9c)#/ >>> >>> Does it match \xd5\x9c# starting from #? >> No, because the input does not contain a \x9c encoding error. Encoding errors >> match only themselves, not parts of other characters. That is how the glibc >> matchers behave, and it's what users expect. > Why \xc9 is part of another character? It depends how you interpret \xd5. Sorry, I assume you meant \x9c here? Anyway, the point is that conceptually you walk through the input byte sequence left-to-right, converting it to characters as you go, and if you encounter an encoding error in the process you convert the error to the corresponding "character" outside the Unicode range. You then do all matching against the converted sequence. So there is no question about interpretation: it's the left-to-right interpretation. This simple and easy-to-explain approach is used by grep's other matchers, by Emacs, etc. Obviously you don't want to *implement* it the way I described; instead, you want to convert on-the-fly, lazily. But whatever optimizations you do, you do consistently with the conceptual model. 
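The left-to-right model described above can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not code from grep, glibc, Emacs, or libpcre: the names decode_or_map and ERROR_BASE are hypothetical, and the choice of 0x110000 plus the byte value as the out-of-range "character" for an encoding-error byte is one possible convention among several.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed convention for this sketch: an encoding-error byte B becomes
   the out-of-range "character" ERROR_BASE + B.  */
#define ERROR_BASE 0x110000UL

/* Hypothetical helper: decode one item from P[0..LEN-1], LEN > 0.
   Store a Unicode code point, or ERROR_BASE + byte for an encoding
   error, in *OUT.  Return the number of bytes consumed.  */
size_t
decode_or_map (unsigned char const *p, size_t len, unsigned long *out)
{
  unsigned char c;
  unsigned long cp, min;  /* decoded value; smallest value allowed */
  size_t n, k;            /* continuation bytes expected; loop index */

  assert (p != NULL && len > 0 && out != NULL);
  c = p[0];
  if (c < 0x80)
    {
      *out = c;           /* ASCII byte */
      return 1;
    }
  if ((c & 0xe0) == 0xc0)
    { n = 1; min = 0x80; cp = c & 0x1f; }
  else if ((c & 0xf0) == 0xe0)
    { n = 2; min = 0x800; cp = c & 0x0f; }
  else if ((c & 0xf8) == 0xf0)
    { n = 3; min = 0x10000; cp = c & 0x07; }
  else
    goto error;           /* stray continuation or invalid lead byte */

  if (len <= n)
    goto error;           /* sequence truncated by end of buffer */
  for (k = 1; k <= n; k++)
    {
      if ((p[k] & 0xc0) != 0x80)
        goto error;       /* not a continuation byte */
      cp = (cp << 6) | (p[k] & 0x3f);
    }
  /* Overlong forms, out-of-range values, and surrogates are errors.  */
  if (cp < min || cp > 0x10ffff || (0xd800 <= cp && cp <= 0xdfff))
    goto error;
  *out = cp;
  return n + 1;

 error:
  *out = ERROR_BASE + c;  /* the error byte matches only itself */
  return 1;
}
```

Walking the bytes \xd5\x9c# with this function yields U+055C followed by '#', so under this model the input contains no \x9c error character. The bytes \x9c# instead yield the out-of-range value 0x11009C followed by '#'.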
> The problem is, you do it some way, and others need something else. In practice, the simple approach explained above works well enough to satisfy the vast majority of users. It's conceivable some special cases in the PCRE world would have trouble fitting into this model, but to be honest I expect this won't be a problem, and that there won't be any serious conceptual issues here, though admittedly there will be some nontrivial programming effort. . > I have doubts that slowing down PCRE would increase grep performance. Again, the proposed change should not slow down libpcre. It should speed it up. That's the point. In the PCRE_NO_UTF8_CHECK case, libpcre could use exactly the same code it has now, so performance would be unaffected. And in the non-PCRE_NO_UTF8_CHECK case, libpcre should typically be faster than it is now, because it would avoid unnecessary UTF-8 validation for the parts of the input string that it does not examine. > This is exactly the use case where filtering is needed. His input is a > combination of binary and UTF data, and he needs matches only in the > UTF part. Regards, Zoltan Filtering would not be needed if libpcre were like grep's other matchers and simply worked with arbitrary binary data. 
From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 27 Sep 2014 17:53:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141184036122739 (code B ref 18454); Sat, 27 Sep 2014 17:53:01 +0000 Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 17:52:41 +0000 Received: from localhost ([127.0.0.1]:53877 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXwAi-0005ug-3d for submit@debbugs.gnu.org; Sat, 27 Sep 2014 13:52:40 -0400 Received: from mail-lb0-f169.google.com ([209.85.217.169]:49791) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXwAf-0005uR-1w for 18454@debbugs.gnu.org; Sat, 27 Sep 2014 13:52:38 -0400 Received: by mail-lb0-f169.google.com with SMTP id u10so2603011lbd.14 for <18454@debbugs.gnu.org>; Sat, 27 Sep 2014 10:52:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=RGLCmnsLsHk8uv6nuL4NnEuS+GB66112nKROZVrIAWw=; b=AQ9mqXdsz6t909OH5GtrbuQnIxNPP+OXexqXT7ZQcS8B5wBcJDbyp0q/++m0Xlz4tb hVCFiZV55KFHP5efxCrQaOuGXruaQWQdSWcTgSwHSRNwJ7JH0CSnaAYgLGbUvfXmHriz aCoO96VkzqXfeH5DsPIDMnRf7ysKM6OKau8SBFMlwLW+UkO9w3GK8kyR7qp09XadLeYH /IzN86F0GYS2hsNtZ2g9fgrX8ypJFT2VjhGepJV49QJ02Bp+OId797Jq0iJ9/gnwlGfG X1LrjP9jR1Lu3NVnkgO+oq6fcAJYH4nYdQXuoXKyOFLjDaEt8V5Ckcyx/sve3JoNZlgJ VBnQ== X-Received: by 10.152.87.193 with SMTP id ba1mr4895393lab.83.1411840355593; Sat, 27 Sep 2014 10:52:35 -0700 (PDT) MIME-Version: 1.0 Received: by 10.25.23.89 with HTTP; Sat, 27 Sep 
2014 10:52:15 -0700 (PDT) In-Reply-To: <5424B1E6.8090502@cs.ucla.edu> References: <20140912012449.GB18162@xvii.vinc17.org> <541A750E.2050606@cs.ucla.edu> <20140918083327.GA16324@nomada> <5424B1E6.8090502@cs.ucla.edu> From: Jim Meyering Date: Sat, 27 Sep 2014 10:52:15 -0700 X-Google-Sender-Auth: dU-j5j3A9z-jgSEzlbx10klNN5A Message-ID: Content-Type: multipart/mixed; boundary=001a11c345c2e00fbe05040fb2e8 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) --001a11c345c2e00fbe05040fb2e8 Content-Type: text/plain; charset=ISO-8859-1 On Thu, Sep 25, 2014 at 5:23 PM, Paul Eggert wrote: > Thanks for looking into that. The attached patches solve those performance > problems for me. I've pushed this follow-up patch to suppress a new warning: --001a11c345c2e00fbe05040fb2e8 Content-Type: application/octet-stream; name="0001-maint-suppress-a-false-positive-Wcast-align-warning.patch" Content-Disposition: attachment; filename="0001-maint-suppress-a-false-positive-Wcast-align-warning.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_i0l99gju2 RnJvbSAzZjk5ZmYwMWVhYmQwN2Q0NjBjZDBjMDg2MDJhZWIzNGUzOGUwZDJiIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBKaW0gTWV5ZXJpbmcgPG1leWVyaW5nQGZiLmNvbT4KRGF0ZTog U2F0LCAyNyBTZXAgMjAxNCAwOTo0NDo0NyAtMDcwMApTdWJqZWN0OiBbUEFUQ0hdIG1haW50OiBz dXBwcmVzcyBhIGZhbHNlLXBvc2l0aXZlIC1XY2FzdC1hbGlnbiB3YXJuaW5nCgpCdWlsZGluZyB3 aXRoIC0tZW5hYmxlLWdjYy13YXJuaW5ncyBhbmQgZ2NjLTQuOS4xIHdvdWxkIHByb3Zva2UgdGhp czoKICBncmVwLmM6NDk5OjEyOiBlcnJvcjogY2FzdCBmcm9tICdjb25zdCBjaGFyIConIHRvICdj b25zdCB1d29yZCAqJ1wKICAgICAgKGFrYSAnY29uc3QgdW5zaWduZWQgbG9uZyAqJykgaW5jcmVh c2VzIHJlcXVpcmVkIGFsaWdubWVudCBmcm9tXAogICAgICAxIHRvIDggWy1XZXJyb3IsLVdjYXN0 LWFsaWduXQogICAgZm9yIChzID0gKHV3b3JkIGNvbnN0ICopIHA7ICEgKCpzICYgaGlieXRlX21h 
c2spOyBzKyspCgkgICAgIF5+fn5+fn5+fn5+fn5+fn5+Ciogc3JjL2dyZXAuYyAoc2tpcF9lYXN5 X2J5dGVzKTogVXNlIGEgcHJhZ21hIHRvIHN1cHByZXNzCmdjYydzIGZhbHNlLXBvc2l0aXZlIGNh c3QtYWxpZ25tZW50IHdhcm5pbmcuCi0tLQogc3JjL2dyZXAuYyB8IDcgKysrKysrKwogMSBmaWxl IGNoYW5nZWQsIDcgaW5zZXJ0aW9ucygrKQoKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMv Z3JlcC5jCmluZGV4IDA0NmYxN2YuLjIwN2JkZWEgMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysr IGIvc3JjL2dyZXAuYwpAQCAtNDk2LDggKzQ5NiwxNSBAQCBza2lwX2Vhc3lfYnl0ZXMgKGNoYXIg Y29uc3QgKmJ1ZikKICAgZm9yIChwID0gYnVmOyAodWludHB0cl90KSBwICUgc2l6ZW9mICh1d29y ZCkgIT0gMDsgcCsrKQogICAgIGlmICgqcCAmIEhJQllURSkKICAgICAgIHJldHVybiBwOworCisj cHJhZ21hIEdDQyBkaWFnbm9zdGljIHB1c2gKKyNwcmFnbWEgR0NDIGRpYWdub3N0aWMgaWdub3Jl ZCAiLVdjYXN0LWFsaWduIgorICAvKiBXZSBoYXZlIGFsaWduZWQgUCB0byBhIHV3b3JkIGJvdW5k YXJ5LCBzbyB3ZSBjYW4gc2FmZWx5CisgICAgIHRlbGwgZ2NjIHRvIHN1cHByZXNzIGl0cyBjYXN0 LWFsaWdubWVudCB3YXJuaW5nLiAgKi8KICAgZm9yIChzID0gKHV3b3JkIGNvbnN0ICopIHA7ICEg KCpzICYgaGlieXRlX21hc2spOyBzKyspCiAgICAgY29udGludWU7CisjcHJhZ21hIEdDQyBkaWFn bm9zdGljIHBvcAorCiAgIGZvciAocCA9IChjaGFyIGNvbnN0ICopIHM7ICEgKCpwICYgSElCWVRF KTsgcCsrKQogICAgIGNvbnRpbnVlOwogICByZXR1cm4gcDsKLS0gCjIuMS4wCgo= --001a11c345c2e00fbe05040fb2e8-- From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 27 Sep 2014 18:17:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141184181325081 (code B ref 18454); Sat, 27 Sep 2014 18:17:02 +0000 Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 18:16:53 +0000 Received: from localhost 
([127.0.0.1]:53890 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXwY7-0006WR-W6 for submit@debbugs.gnu.org; Sat, 27 Sep 2014 14:16:52 -0400 Received: from iwiw01d.mail.t-online.hu ([84.2.42.53]:36244 helo=fmxout01.freemail.hu) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXwY3-0006WG-N4 for 18454@debbugs.gnu.org; Sat, 27 Sep 2014 14:16:49 -0400 Received: from fmx25.freemail.hu (fmx25.freemail.hu [195.228.245.75]) by fmxout01.freemail.hu (Postfix) with SMTP id 420AE361C for <18454@debbugs.gnu.org>; Sat, 27 Sep 2014 20:16:45 +0200 (CEST) Received: (qmail 38138 invoked by uid 151); 27 Sep 2014 20:16:45 +0200 Received: from 195.228.245.211 (HELO fmxmldata01.freemail.hu) (91.82.212.146) by fmx25.freemail.hu with SMTP; 27 Sep 2014 20:16:45 +0200 Received: from webmail by smtp gw id s8RIGjrK067745; Sat, 27 Sep 2014 20:16:45 +0200 (CEST) Date: Sat, 27 Sep 2014 20:16:45 +0200 (CEST) From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg In-Reply-To: <5425BC8D.9040305@cs.ucla.edu> Message-ID: X-Originating-IP: [91.82.212.146] X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 X-Original-User: hzmester MIME-Version: 1.0 Content-Type: TEXT/plain; CHARSET=UTF-8 X-Spam-Score: 0.6 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Hi, >Sorry, I assume you meant \x9c here? Anyway, the point is that >conceptually you walk through the input byte sequence left-to-right, >converting it to characters as you go, and if you encounter an encoding >error in the process you convert the error to the corresponding >"character" outside the Unicode range. You then do all matching against >the converted sequence. So there is no question about interpretation: >it's the left-to-right interpretation. 
This simple and easy-to-explain
>approach is used by grep's other matchers, by Emacs, etc.

This was one of my proposals: we need a converter before we run PCRE. To be more precise, we likely need several converters, and users can select the appropriate one for their use case.

>Obviously you don't want to *implement* it the way I described; instead,
>you want to convert on-the-fly, lazily. But whatever optimizations you
>do, you do consistently with the conceptual model.

I would implement it exactly as you described. PCRE is a complex backtracking engine: we need to decode the input in both the forward and backward directions from any starting position, and several characters are parsed multiple times depending on the pattern and input. We also employ many optimizations to make this as fast as possible, especially in the JIT. The overhead of checking for invalid characters accumulates for every character, regardless of whether it is valid or not.

>In practice, the simple approach explained above works well enough to
>satisfy the vast majority of users. It's conceivable some special cases
>in the PCRE world would have trouble fitting into this model, but to be
>honest I expect this won't be a problem, and that there won't be any
>serious conceptual issues here, though admittedly there will be some
>nontrivial programming effort.

The approach might sound simple, but its side effects are non-trivial. For example, if we implemented it the way suggested before, the person you quoted would not be satisfied. He said 'I still want "." to match a single (valid) UTF-8 character.' The dot character matches anything but newline. According to you, the invalid code points should have a "minimal" type, so they would match.

In the regex world, matching performance is the key aspect of an engine, and this is our focus in PCRE. But every advantage has a trade-off. In PCRE, this is source code complexity. A "simple" change like this would require a major redesign of the engine.
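For concreteness, the left-to-right model quoted in this message could be sketched roughly as below. This is only an illustration under stated assumptions, not code from grep or libpcre: the names `decode_one`, `is_encoding_error` and `ERROR_BASE` are hypothetical, and mapping each offending byte to `0x110000 + byte` is just one possible way to place errors "outside the Unicode range". Whether "." should accept such a value is then a separate policy decision in the matcher.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: first value past the last Unicode code point U+10FFFF.  */
enum { ERROR_BASE = 0x110000 };

/* Decode one unit from the N > 0 bytes at S, left to right.  Store a
   code point in *CP and return the number of bytes consumed.  A byte
   that does not start a well-formed UTF-8 sequence is consumed alone
   and reported as ERROR_BASE + byte.  */
size_t
decode_one (unsigned char const *s, size_t n, uint32_t *cp)
{
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of byte 2 */
  size_t len = 0;
  uint32_t c = 0;

  assert (n > 0);
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  if (0xC2 <= s[0] && s[0] <= 0xDF)
    {
      len = 2;
      c = s[0] & 0x1F;
    }
  else if (0xE0 <= s[0] && s[0] <= 0xEF)
    {
      len = 3;
      c = s[0] & 0x0F;
      if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
      if (s[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    }
  else if (0xF0 <= s[0] && s[0] <= 0xF4)
    {
      len = 4;
      c = s[0] & 0x07;
      if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
      if (s[0] == 0xF4) hi = 0x8F;      /* reject values > U+10FFFF */
    }

  if (len != 0 && len <= n && lo <= s[1] && s[1] <= hi)
    {
      size_t i;
      c = (c << 6) | (s[1] & 0x3F);
      for (i = 2; i < len && (s[i] & 0xC0) == 0x80; i++)
        c = (c << 6) | (s[i] & 0x3F);
      if (i == len)
        {
          *cp = c;
          return len;
        }
    }

  /* Encoding error: consume exactly one byte.  */
  *cp = ERROR_BASE + s[0];
  return 1;
}

/* True if CP came from an invalid byte rather than a real character.  */
bool
is_encoding_error (uint32_t cp)
{
  return cp > 0x10FFFF;
}
```

A matcher built on this could test `is_encoding_error` wherever it decides what "." or a character class accepts, which is exactly the policy question raised above.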
>Again, the proposed change should not slow down libpcre. It should >speed it up. That's the point. In the PCRE_NO_UTF8_CHECK case, libpcre >could use exactly the same code it has now, so performance would be >unaffected. And in the non-PCRE_NO_UTF8_CHECK case, libpcre should >typically be faster than it is now, because it would avoid unnecessary >UTF-8 validation for the parts of the input string that it does not examine. Partial matching was invented exactly for this purpose. You can divide the input into small chunks, filter them, and perform matches. Btw, how partial matching is affected by your proposed solution? What should happen, if the starting offset is inside an otherwise valid UTF character? >Filtering would not be needed if libpcre were like grep's other matchers >and simply worked with arbitrary binary data. This might be efficient for engines which scans the input only forward direction and read every character once. This is not true for PCRE. Regards, Zoltan From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 27 Sep 2014 20:55:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141185128712826 (code B ref 18454); Sat, 27 Sep 2014 20:55:02 +0000 Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 20:54:47 +0000 Received: from localhost ([127.0.0.1]:53965 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXz0w-0003Kn-Hc for submit@debbugs.gnu.org; Sat, 27 Sep 2014 16:54:47 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:35712) by 
debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XXz0p-0003KW-MY for 18454@debbugs.gnu.org; Sat, 27 Sep 2014 16:54:41 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 7355639E8014; Sat, 27 Sep 2014 13:54:38 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id XCXEgSYvvXcj; Sat, 27 Sep 2014 13:54:29 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id A3C6739E8011; Sat, 27 Sep 2014 13:54:29 -0700 (PDT) Message-ID: <54272400.1020704@cs.ucla.edu> Date: Sat, 27 Sep 2014 13:54:24 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 References: In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -3.2 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.2 (---) Zoltán Herczeg wrote: > He said 'I still want "." to match a single (valid) UTF-8 character.' That's what the GNU matchers do, yes. '.' does not match an invalid byte. It's a reasonable default. If you have some users who want '.' to match an invalid byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users who want '.' to match newline. That being said, I doubt whether users will care enough to need such a flag. (After all, they're evidently not caring *now*, as libpcre can't search such data at *all*.) > In the regex world, matching performance is the key aspect of an engine Absolutely. 
That's why we're having this discussion: libpcre is slow when matching binary data. > A "simple" change like this would require a major redesign of the engine. It'd be nontrivial, yes. But it's clearly doable. (Not that I'm volunteering....) > What should happen, if the starting offset is inside an otherwise valid UTF character? The same thing that would happen if an input file started with the tail end of a UTF-8 sequence. The leading bytes are invalid. 'grep' deals with this already; it's not a problem. >> Filtering would not be needed if libpcre were like grep's other matchers >> and simply worked with arbitrary binary data. > > This might be efficient for engines which scans the input only forward direction > and read every character once. It can also be efficient for matchers, like grep's, that don't necessarily do that. It just takes more implementation work, that's all. It's not rocket science to go backwards through a UTF-8 string and to catch decoding errors as you go. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 27 Sep 2014 22:37:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Jim Meyering Cc: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141185740822141 (code B ref 18454); Sat, 27 Sep 2014 22:37:01 +0000 Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 22:36:48 +0000 Received: from localhost ([127.0.0.1]:53995 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XY0bf-0005l2-5U for submit@debbugs.gnu.org; Sat, 27 Sep 2014 18:36:47 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:38366) by debbugs.gnu.org with esmtp (Exim 4.80) 
(envelope-from ) id 1XY0bc-0005ks-5C for 18454@debbugs.gnu.org; Sat, 27 Sep 2014 18:36:45 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 2C70A39E8015; Sat, 27 Sep 2014 15:36:43 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id AXgGeRZCknfA; Sat, 27 Sep 2014 15:36:38 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 13C6039E8011; Sat, 27 Sep 2014 15:36:38 -0700 (PDT) Message-ID: <54273BF5.2060605@cs.ucla.edu> Date: Sat, 27 Sep 2014 15:36:37 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> <541A750E.2050606@cs.ucla.edu> <20140918083327.GA16324@nomada> <5424B1E6.8090502@cs.ucla.edu> In-Reply-To: Content-Type: multipart/mixed; boundary="------------070206070402090608040206" X-Spam-Score: -3.2 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.2 (---) This is a multi-part message in MIME format. --------------070206070402090608040206 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Jim Meyering wrote: > I've pushed this follow-up patch to suppress a new warning: Thanks, I expect I didn't get that warning because I built on x86-64, which allows unaligned accesses so GCC doesn't complain. 
(Incidentally, I had already tried modifying the code to exploit the fact that unaligned accesses are OK on x86ish platforms, but that made the word-by-word loop go slower, so no dice.) Too bad GCC isn't smart enough to notice that the pointer must be aligned. It strikes me that this problem must come up elsewhere, and that it's worth writing a macro to encapsulate the situation. I pushed the attached follow-up patch, which is an attempt to move in that direction. --------------070206070402090608040206 Content-Type: text/plain; charset=UTF-8; name="0001-maint-generalize-the-Wcast-align-fix.patch" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename="0001-maint-generalize-the-Wcast-align-fix.patch" RnJvbSA2MTMzYjhlMDBhODg3NmFhYTY5Y2U1MWQyZmUyNWExNzA0MGFmZTFhIE1vbiBTZXAg MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1 PgpEYXRlOiBTYXQsIDI3IFNlcCAyMDE0IDE1OjMxOjEyIC0wNzAwClN1YmplY3Q6IFtQQVRD SF0gbWFpbnQ6IGdlbmVyYWxpemUgdGhlIC1XY2FzdC1hbGlnbiBmaXgKCiogc3JjL2dyZXAu YyAoQ0FTVF9BTElHTkVEKTogTmV3IG1hY3JvLgooc2tpcF9lYXN5X2J5dGVzKTogVXNlIGl0 LgotLS0KIHNyYy9ncmVwLmMgfCAyNCArKysrKysrKysrKysrKysrLS0tLS0tLS0KIDEgZmls ZSBjaGFuZ2VkLCAxNiBpbnNlcnRpb25zKCspLCA4IGRlbGV0aW9ucygtKQoKZGlmZiAtLWdp dCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IDIwN2JkZWEuLmJiNWJhMWMgMTAw NjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAuYwpAQCAtNDY5LDYgKzQ2OSwy MSBAQCBpbml0X2Vhc3lfZW5jb2RpbmcgKHZvaWQpCiAgICAgZWFzeV9lbmNvZGluZyAmPSBt YmNsZW5fY2FjaGVbaV0gPT0gMTsKIH0KIAorLyogQSBjYXN0IHRvIFRZUEUgb2YgVkFMLiAg VXNlIHRoaXMgd2hlbiBUWVBFIGlzIGEgcG9pbnRlciB0eXBlLCBWQUwKKyAgIGlzIHByb3Bl cmx5IGFsaWduZWQgZm9yIFRZUEUsIGFuZCAnZ2NjIC1XY2FzdC1hbGlnbicgY2Fubm90IGlu ZmVyCisgICB0aGUgYWxpZ25tZW50IGFuZCB3b3VsZCBvdGhlcndpc2UgY29tcGxhaW4gYWJv dXQgdGhlIGNhc3QuICAqLworI2lmIDQgPCBfX0dOVUNfXyArICg2IDw9IF9fR05VQ19NSU5P Ul9fKQorIyBkZWZpbmUgQ0FTVF9BTElHTkVEKHR5cGUsIHZhbCkgICAgICAgICAgICAgICAg ICAgICAgICAgICBcCisgICAgKHsgX190eXBlb2ZfXyAodmFsKSB2YWxfID0gdmFsOyAgICAg 
ICAgICAgICAgICAgICAgICAgIFwKKyAgICAgICBfUHJhZ21hICgiR0NDIGRpYWdub3N0aWMg cHVzaCIpICAgICAgICAgICAgICAgICAgICAgXAorICAgICAgIF9QcmFnbWEgKCJHQ0MgZGlh Z25vc3RpYyBpZ25vcmVkIFwiLVdjYXN0LWFsaWduXCIiKSBcCisgICAgICAgKHR5cGUpIHZh bF87ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIFwKKyAgICAgICBf UHJhZ21hICgiR0NDIGRpYWdub3N0aWMgcG9wIikgICAgICAgICAgICAgICAgICAgICAgXAor ICAgIH0pCisjZWxzZQorIyBkZWZpbmUgQ0FTVF9BTElHTkVEKHR5cGUsIHZhbCkgKCh0eXBl KSAodmFsKSkKKyNlbmRpZgorCiAvKiBBbiB1bnNpZ25lZCB0eXBlIHN1aXRhYmxlIGZvciBm YXN0IG1hdGNoaW5nLiAgKi8KIHR5cGVkZWYgdWludG1heF90IHV3b3JkOwogCkBAIC00OTYs MTUgKzUxMSw4IEBAIHNraXBfZWFzeV9ieXRlcyAoY2hhciBjb25zdCAqYnVmKQogICBmb3Ig KHAgPSBidWY7ICh1aW50cHRyX3QpIHAgJSBzaXplb2YgKHV3b3JkKSAhPSAwOyBwKyspCiAg ICAgaWYgKCpwICYgSElCWVRFKQogICAgICAgcmV0dXJuIHA7Ci0KLSNwcmFnbWEgR0NDIGRp YWdub3N0aWMgcHVzaAotI3ByYWdtYSBHQ0MgZGlhZ25vc3RpYyBpZ25vcmVkICItV2Nhc3Qt YWxpZ24iCi0gIC8qIFdlIGhhdmUgYWxpZ25lZCBQIHRvIGEgdXdvcmQgYm91bmRhcnksIHNv IHdlIGNhbiBzYWZlbHkKLSAgICAgdGVsbCBnY2MgdG8gc3VwcHJlc3MgaXRzIGNhc3QtYWxp Z25tZW50IHdhcm5pbmcuICAqLwotICBmb3IgKHMgPSAodXdvcmQgY29uc3QgKikgcDsgISAo KnMgJiBoaWJ5dGVfbWFzayk7IHMrKykKKyAgZm9yIChzID0gQ0FTVF9BTElHTkVEICh1d29y ZCBjb25zdCAqLCBwKTsgISAoKnMgJiBoaWJ5dGVfbWFzayk7IHMrKykKICAgICBjb250aW51 ZTsKLSNwcmFnbWEgR0NDIGRpYWdub3N0aWMgcG9wCi0KICAgZm9yIChwID0gKGNoYXIgY29u c3QgKikgczsgISAoKnAgJiBISUJZVEUpOyBwKyspCiAgICAgY29udGludWU7CiAgIHJldHVy biBwOwotLSAKMS45LjMKCg== --------------070206070402090608040206-- From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 28 Sep 2014 10:12:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 18454@debbugs.gnu.org 
X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by submit@debbugs.gnu.org id=B.141189911921344 (code B ref -1); Sun, 28 Sep 2014 10:12:01 +0000 Received: (at submit) by debbugs.gnu.org; 28 Sep 2014 10:11:59 +0000 Received: from localhost ([127.0.0.1]:54105 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XYBSQ-0005YB-NW for submit@debbugs.gnu.org; Sun, 28 Sep 2014 06:11:59 -0400 Received: from eggs.gnu.org ([208.118.235.92]:39126) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XYBSN-0005Y2-Vi for submit@debbugs.gnu.org; Sun, 28 Sep 2014 06:11:56 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XYBSD-000432-C3 for submit@debbugs.gnu.org; Sun, 28 Sep 2014 06:11:55 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: ** X-Spam-Status: No, score=2.4 required=5.0 tests=BAYES_50,FREEMAIL_FROM, MALFORMED_FREEMAIL,UNPARSEABLE_RELAY autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:55316) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XYBSD-00042r-8k for submit@debbugs.gnu.org; Sun, 28 Sep 2014 06:11:45 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:40841) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XYBRz-0004Xv-T8 for bug-grep@gnu.org; Sun, 28 Sep 2014 06:11:40 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XYBRr-00041T-JE for bug-grep@gnu.org; Sun, 28 Sep 2014 06:11:31 -0400 Received: from iwiw02d.mail.t-online.hu ([84.2.42.67]:46040) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XYBRr-000417-9u for bug-grep@gnu.org; Sun, 28 Sep 2014 06:11:23 -0400 Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74]) by iwiw02d.mail.t-online.hu (Postfix) with SMTP id 86BB94875F2 for ; Sun, 28 Sep 2014 12:11:26 +0200 (CEST) Received: (qmail 78673 invoked by 
uid 151); 28 Sep 2014 12:11:16 +0200 Received: from 195.228.245.211 (HELO fmxmldata04.freemail.hu) (193.226.212.27) by fmx24.freemail.hu with SMTP; 28 Sep 2014 12:11:16 +0200 Received: from webmail by smtp gw id s8SABGei083174; Sun, 28 Sep 2014 12:11:16 +0200 (CEST) Date: Sun, 28 Sep 2014 12:11:16 +0200 (CEST) From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg In-Reply-To: <54272400.1020704@cs.ucla.edu> Message-ID: X-Originating-IP: [193.226.212.27] X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 X-Original-User: hzmester MIME-Version: 1.0 Content-Type: TEXT/plain; CHARSET=UTF-8 X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.4 (----) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) >> In the regex world, matching performance is the key aspect of an engine > >Absolutely. That's why we're having this discussion: libpcre is slow when >matching binary data. For me the question is whether binary search needs to supported on PCRE level. There are other questions like this. People ask string replacement support in PCRE from time to time. It can be implemented of course, we could add a whole string handling API to PCRE. But we feel this is outside the scope of PCRE. Any string management library can use PCRE for string replacement, we don't need another one. Binary matching is similar thing. PCRE is used by some (I think closed source) projects for network data filtering, which is obviously binary data. They use some kind of pre-filtering, data arranging and partial matching to efficiently check TCP stream data (without waiting for the whole stream to arrive). 
>> A "simple" change like this would require a major redesign of the engine.
>
>It'd be nontrivial, yes. But it's clearly doable. (Not that I'm volunteering....)

Anything that has an algorithmic solution can be done; this was never a question. The question is whether it is worth doing at the PCRE level or at a higher level. Perl/PCRE is all about text processing: characters which have meanings, types, sub-types, other case(s). Unicode is also about defining characters, not binary data. UTF is an encoding format, not a mapping of random bytes to characters.

If this task were trivial, I wouldn't mind doing it myself. But it seems this task is about destroying what we built so far. A lot of extra checks to process invalid bytes, a large code size increase, and removing a lot of optimizations. The result might be much slower than using clever pre-filtering.

>> What should happen, if the starting offset is inside an otherwise valid UTF character?
>
>The same thing that would happen if an input file started with the tail end of a
>UTF-8 sequence. The leading bytes are invalid. 'grep' deals with this already;
>it's not a problem.

The question was about intermediate start offsets. You have a 100 byte long buffer, and you start matching from byte 50. That byte is part of a valid UTF character. Your pattern starts with an invalid character, which matches that UTF fragment. You said invalid UTF characters match only themselves, not parts of other characters. That means a lot of extra checks again, preparing for the worst case.

>> This might be efficient for engines which scans the input only forward direction
> > and read every character once.
>
>It can also be efficient for matchers, like grep's, that don't necessarily do
>that. It just takes more implementation work, that's all. It's not rocket
>science to go backwards through a UTF-8 string and to catch decoding errors as
>you go.

My problem is the large number of "ifs" you need to execute. Let's compare the current and the proposed solution.
  char* c_ptr; /* Current string position. */

Current:

  if (c_ptr == input_start) return FAIL;
  c_ptr--;
  while ((*c_ptr & 0xc0) == 0x80) c_ptr--;

Proposed solution:

  if (c_ptr == input_start) return FAIL;
  c_ptr--;
  char* saved_c_ptr = c_ptr; /* We need to save the starting position, losing a CPU register for that. */
  while ((*c_ptr & 0xc0) == 0x80) {
    if (c_ptr == input_start) return FAIL;
    c_ptr--;
  }
  /* We moved back a lot, we don't know where we are. Check character length. */
  int length = utf_length[*c_ptr]; /* Another lost register. Compiler life is difficult. */
  if (c_ptr + length != saved_c_ptr + 1)
    c_ptr = saved_c_ptr;
  else {
    /* We need to check whether the character is encoded in the minimum number of bytes. */
    if (length == 1) {
      /* Great, nothing to do. */
    } else if (length == 2) {
      if (*c_ptr < 0xc2) /* Character is <= 127, can be encoded in a single byte. */
        c_ptr = saved_c_ptr;
    } else if (length == 3) {
      if (*c_ptr == 0xe0 && c_ptr[1] < 0xa0) /* Character is < 0x800, can be encoded in fewer bytes. */
        c_ptr = saved_c_ptr;
    } else ....
  }

For me this is way too many checks, and it affects compiler optimizations too much.
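As a point of comparison, here is a minimal, self-contained sketch of a validating backward step that never looks at more than four bytes. It is not taken from libpcre or grep; the function names `utf8_valid_len` and `utf8_step_back` are made up for illustration, and it assumes the convention that every byte which is not part of a well-formed sequence counts as a one-byte "character" of its own.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length (1..4) of the well-formed UTF-8 sequence starting
   at S, examining at most AVAIL bytes, or 0 if there is none.  */
static size_t
utf8_valid_len (unsigned char const *s, size_t avail)
{
  unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of byte 2 */
  size_t len;

  if (avail == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;
  else if (0xC2 <= s[0] && s[0] <= 0xDF)
    len = 2;
  else if (0xE0 <= s[0] && s[0] <= 0xEF)
    {
      len = 3;
      if (s[0] == 0xE0) lo = 0xA0;      /* reject overlong forms */
      if (s[0] == 0xED) hi = 0x9F;      /* reject surrogates */
    }
  else if (0xF0 <= s[0] && s[0] <= 0xF4)
    {
      len = 4;
      if (s[0] == 0xF0) lo = 0x90;      /* reject overlong forms */
      if (s[0] == 0xF4) hi = 0x8F;      /* reject values > U+10FFFF */
    }
  else
    return 0;   /* continuation byte, or 0xC0, 0xC1, 0xF5..0xFF */

  if (avail < len || s[1] < lo || hi < s[1])
    return 0;
  for (size_t i = 2; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;
  return len;
}

/* Given START < P, return the start of the character that ends just
   before P.  If the bytes before P do not end in a well-formed
   sequence, the single byte at P - 1 is treated as the character.  */
unsigned char const *
utf8_step_back (unsigned char const *start, unsigned char const *p)
{
  assert (start < p);
  for (size_t k = 1; k <= 4 && k <= (size_t) (p - start); k++)
    {
      unsigned char const *q = p - k;
      if ((*q & 0xC0) == 0x80)
        continue;       /* continuation byte: keep looking for a lead */
      if (utf8_valid_len (q, k) == k)
        return q;       /* well-formed sequence ending exactly at P */
      break;            /* lead byte found, but sequence is malformed */
    }
  return p - 1;
}
```

The loop runs at most four times regardless of input, so a long run of stray continuation bytes costs no more than a valid four-byte character. Whether the extra branches are acceptable in a backtracking engine's inner loop is the open question in this exchange; the sketch does not settle it.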
Regards, Zoltan From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 28 Sep 2014 15:10:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141191698622227 (code B ref 18454); Sun, 28 Sep 2014 15:10:03 +0000 Received: (at 18454) by debbugs.gnu.org; 28 Sep 2014 15:09:46 +0000 Received: from localhost ([127.0.0.1]:54465 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XYG6c-0005mQ-8K for submit@debbugs.gnu.org; Sun, 28 Sep 2014 11:09:46 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:33496) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XYG6a-0005mG-HS for 18454@debbugs.gnu.org; Sun, 28 Sep 2014 11:09:45 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id EC78E39E8014; Sun, 28 Sep 2014 08:09:42 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id oB9og1yzqvxt; Sun, 28 Sep 2014 08:09:33 -0700 (PDT) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id CFFF439E8012; Sun, 28 Sep 2014 08:09:33 -0700 (PDT) Message-ID: <542824AD.8090501@cs.ucla.edu> Date: Sun, 28 Sep 2014 08:09:33 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.2 MIME-Version: 1.0 References: 
In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -2.9 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.9 (--) Zoltán Herczeg wrote: > For me the question is whether binary search needs to supported on PCRE level. It's purely a performance question. GNU grep already uses libpcre to search binary data, and it works now. It's just slow, that all. I'm willing to live with this, and tell users "Sorry, but libpcre is not designed to search binary data quickly; if you want speed then don't use grep's -P option." If you're willing to live with this too, we're done. > removing a lot of optimizations. You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case. Keep them all. It should be just as fast before. The idea is to have one matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it goes). The former matcher will be just as fast as now, and the latter matcher will be faster than what libpcre has now. I readily concede that this will require some nontrivial coding, but I don't concede that it will remove optimizations or make libpcre slower. It should make libpcre faster; that's the point. > You have a 100 byte long buffer, and you start matching from byte 50. Grep already does that sort of thing. And it's smart enough to start matching only at character boundaries. It's not libpcre's job to worry about this; the caller can worry about it. > For me this is way too much checks, and affects compiler optimizations too much. The code you posted could be made faster than that; among other things there should not be an unbounded backward scan. 
And even the code you posted would often be faster than what's in libpcre now. That early UTF-8 validity prepass is a killer. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Jim Meyering Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 28 Sep 2014 22:08:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: Santiago Ruano =?UTF-8?Q?Rinc=C3=B3n?= , 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141194202129746 (code B ref 18454); Sun, 28 Sep 2014 22:08:01 +0000 Received: (at 18454) by debbugs.gnu.org; 28 Sep 2014 22:07:01 +0000 Received: from localhost ([127.0.0.1]:54612 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XYMcP-0007jd-4v for submit@debbugs.gnu.org; Sun, 28 Sep 2014 18:07:01 -0400 Received: from mail-la0-f49.google.com ([209.85.215.49]:45807) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XYMcM-0007jT-6H for 18454@debbugs.gnu.org; Sun, 28 Sep 2014 18:06:59 -0400 Received: by mail-la0-f49.google.com with SMTP id ge10so2414070lab.36 for <18454@debbugs.gnu.org>; Sun, 28 Sep 2014 15:06:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc:content-type; bh=8nZi/ItILd6TGncGD8eRXVZqVxmAHSboHPWuZ67/mQc=; b=s1NHATUPfEovvYZL7dngSYHthmLO+YZYUoWlivcX/Ll1QJOz/CN9EjSmY9RlyPcdu4 R9GEayfVKgnX6R9NuToGvgLc5KKNtMjk1FVC9qyptlS2XwBJlEEHzrHdqTFdBnDM4m4l FpGFcNtPuDm5PBqpuKQZFT6TgA0wD1e4qt5/4BFHJqlg7j5grhgOX8u2QEIAjJbVssNN F9tZt56SMLEP83JcT6853ipkwtXdYy5BsIHclKuju5gSi6k7nL1BEQMoSjGLHGKnhxWQ xAAU1psOIZ91HfMcl9AS81sWBlTmNBge+sCgnoeu5E6/UYGe154KW81CvY3t9IIcskkz cADQ== X-Received: by 10.152.27.66 with SMTP id 
r2mr5239439lag.84.1411942016891; Sun, 28 Sep 2014 15:06:56 -0700 (PDT) MIME-Version: 1.0 Received: by 10.25.23.89 with HTTP; Sun, 28 Sep 2014 15:06:36 -0700 (PDT) In-Reply-To: <54273BF5.2060605@cs.ucla.edu> References: <20140912012449.GB18162@xvii.vinc17.org> <541A750E.2050606@cs.ucla.edu> <20140918083327.GA16324@nomada> <5424B1E6.8090502@cs.ucla.edu> <54273BF5.2060605@cs.ucla.edu> From: Jim Meyering Date: Sun, 28 Sep 2014 15:06:36 -0700 X-Google-Sender-Auth: --ULz_nY4WxVw-w0nvEVbPZk7v8 Message-ID: Content-Type: text/plain; charset=ISO-8859-1 X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) Nice! I didn't know about _Pragma. It's much better to encapsulate that, keeping the #pragma directives out of function bodies. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales References: <20140912012449.GB18162@xvii.vinc17.org> Resent-From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Tue, 30 Sep 2014 18:12:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Paul Eggert Cc: 18454@debbugs.gnu.org X-Debbugs-Original-Cc: bug-grep@gnu.org, 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141210066411631 (code B ref 18454); Tue, 30 Sep 2014 18:12:02 +0000 Received: (at 18454) by debbugs.gnu.org; 30 Sep 2014 18:11:04 +0000 Received: from localhost ([127.0.0.1]:56606 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XZ1t9-00031W-Or for submit@debbugs.gnu.org; Tue, 30 Sep 2014 14:11:04 -0400 Received: from 
iwiw03d.mail.t-online.hu ([84.2.42.68]:38068) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XZ1t6-00030y-6A for 18454@debbugs.gnu.org; Tue, 30 Sep 2014 14:11:01 -0400 Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74]) by iwiw03d.mail.t-online.hu (Postfix) with SMTP id EDE184E824E for <18454@debbugs.gnu.org>; Tue, 30 Sep 2014 20:10:48 +0200 (CEST) Received: (qmail 32285 invoked by uid 151); 30 Sep 2014 20:10:58 +0200 Received: from 195.228.245.211 (HELO fmxmldata07.freemail.hu) (91.83.55.54) by fmx24.freemail.hu with SMTP; 30 Sep 2014 20:10:58 +0200 Received: from webmail by smtp gw id s8UIAwBf051116; Tue, 30 Sep 2014 20:10:58 +0200 (CEST) Date: Tue, 30 Sep 2014 20:10:58 +0200 (CEST) From: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg In-Reply-To: <542824AD.8090501@cs.ucla.edu> Message-ID: X-Originating-IP: [91.83.55.54] X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0 X-Original-User: hzmester MIME-Version: 1.0 Content-Type: TEXT/plain; CHARSET=UTF-8 X-Spam-Score: 1.9 (+) X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org", has identified this incoming email as possible spam. The original message has been attached to this so you can view it (if it isn't spam) or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Hi, >It's purely a performance question. GNU grep already uses libpcre to search >binary data, and it works now. It's just slow, that all. I'm willing to live >with this, and tell users "Sorry, but libpcre is not designed to search binary >data quickly; if you want speed then don't use grep's -P option." If you're >willing to live with this too, we're done. [...] 
Content analysis details: (1.9 points, 10.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider (hzmester[at]freemail.hu) -0.0 RCVD_IN_DNSWL_NONE RBL: Sender listed at http://www.dnswl.org/, no trust [84.2.42.68 listed in list.dnswl.org] 1.9 MALFORMED_FREEMAIL Bad headers on message from free email service 0.0 UNPARSEABLE_RELAY Informational: message has unparseable relay lines X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Hi, >It's purely a performance question. GNU grep already uses libpcre to search >binary data, and it works now. It's just slow, that all. I'm willing to live >with this, and tell users "Sorry, but libpcre is not designed to search binary >data quickly; if you want speed then don't use grep's -P option." If you're >willing to live with this too, we're done. Yes, PCRE is not designed for matching binary data as UTF. Too much complexity for too little gain. Normal search can be used on binary data without limitations. >Grep already does that sort of thing. And it's smart enough to start matching >only at character boundaries. It's not libpcre's job to worry about this; the >caller can worry about it. Thank you for bringing this up. I don't see any point of reimplementing what is already there. However, if PCRE says it supports UTF matching in binary data, it should. Because the "what is there" depends on the environment. This clearly the best answer why the environment is responsible for handling the binary part of the data. Most environment needs some kind of validating, and we would just duplicate code. 
It is good to hear that everything is in grep, perhaps a few more lines are needed to do it in a thread. >The code you posted could be made faster than that; among other things there >should not be an unbounded backward scan. And even the code you posted would >often be faster than what's in libpcre now. That early UTF-8 validity prepass >is a killer. I would recommend to disable it. It's only purpose is returning early for invalid buffers. I am sure grep already knows that a buffer is invalid, since it scans the buffer. Regards, Zoltan From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Tue, 30 Sep 2014 19:40:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: =?UTF-8?Q?Zolt=C3=A1n?= Herczeg Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141210596520144 (code B ref 18454); Tue, 30 Sep 2014 19:40:01 +0000 Received: (at 18454) by debbugs.gnu.org; 30 Sep 2014 19:39:25 +0000 Received: from localhost ([127.0.0.1]:56681 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XZ3Ge-0005Eq-J7 for submit@debbugs.gnu.org; Tue, 30 Sep 2014 15:39:24 -0400 Received: from smtp.cs.ucla.edu ([131.179.128.62]:60820) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XZ3Gc-0005Ei-Lw for 18454@debbugs.gnu.org; Tue, 30 Sep 2014 15:39:23 -0400 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 5604B39E801B; Tue, 30 Sep 2014 12:39:21 -0700 (PDT) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id c1Gh7cps-811; Tue, 30 Sep 2014 12:39:18 -0700 
(PDT) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 2DE1439E8018; Tue, 30 Sep 2014 12:39:18 -0700 (PDT) Message-ID: <542B06E5.8040501@cs.ucla.edu> Date: Tue, 30 Sep 2014 12:39:17 -0700 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.1.1 MIME-Version: 1.0 References: <20140912012449.GB18162@xvii.vinc17.org> In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Spam-Score: -3.0 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.0 (---) On 09/30/2014 11:10 AM, Zoltán Herczeg wrote: > >> Grep already does that sort of thing. And it's smart enough to start matching >> only at character boundaries. It's not libpcre's job to worry about this; the >> caller can worry about it. > Thank you for bringing this up. I don't see any point of reimplementing what is already there. Sorry, it sounds like my earlier comment was unclear. GNU grep is smart enough to start matching at character boundaries without checking the validity of the input data. This helps it run faster. However, because libpcre requires a validity prepass, grep -P must slow down and do the validity check one way or another. Grep does this only when libpcre is used, and that's one reason grep -P is slower than plain grep. It's not a question of duplicating code: grep already has code to validate binary data. It's a question of performance. Requiring a prepass for validity checking is typically slower (or takes more energy, or whatever) than checking validity on the fly. And in many cases going multithreaded would just make matters worse. 
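As an illustration of starting at a character boundary without validating the input, here is a minimal C sketch. It is not grep's actual code, and the name next_char_boundary is hypothetical. It relies only on the fact that UTF-8 continuation bytes lie in the range 0x80..0xBF and that a well-formed character has at most three of them, so the scan stays bounded even on binary data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from grep: return the first offset
   >= pos in buf[0..len) that is not a UTF-8 continuation byte
   (0x80..0xBF).  At most 3 bytes are skipped, because a well-formed
   character has at most 3 continuation bytes; this keeps the scan
   bounded on binary data.  No validity check is performed.  */
static size_t
next_char_boundary (const unsigned char *buf, size_t len, size_t pos)
{
  assert (pos <= len);
  size_t limit = len - pos < 3 ? len : pos + 3;
  while (pos < limit && (buf[pos] & 0xC0) == 0x80)
    pos++;
  return pos;
}
```

A matcher built this way never inspects the bytes it skips, which is why it needs no separate validity pass over the buffer.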
I can understand that you don't want to take on the burden of making a nontrivial libpcre performance improvement. Also, I hope 'grep -P' performance, though not great, is good enough now to satisfy most users. So perhaps we should just give the topic a rest. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Vincent Lefevre Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 28 Nov 2014 03:00:03 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141714356424875 (code B ref 18454); Fri, 28 Nov 2014 03:00:03 +0000 Received: (at 18454) by debbugs.gnu.org; 28 Nov 2014 02:59:24 +0000 Received: from localhost ([127.0.0.1]:48012 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuBmF-0006T9-RK for submit@debbugs.gnu.org; Thu, 27 Nov 2014 21:59:24 -0500 Received: from ioooi.vinc17.net ([92.243.22.117]:60142) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuBmC-0006Sz-Fk for 18454@debbugs.gnu.org; Thu, 27 Nov 2014 21:59:21 -0500 Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128]) by ioooi.vinc17.net (Postfix) with ESMTPSA id 5C6191E1; Fri, 28 Nov 2014 03:59:19 +0100 (CET) Received: by xvii.vinc17.org (Postfix, from userid 1000) id 6D1B821A07A; Fri, 28 Nov 2014 03:59:18 +0100 (CET) Date: Fri, 28 Nov 2014 03:59:18 +0100 From: Vincent Lefevre Message-ID: <20141128025918.GA26989@xvii.vinc17.org> References: <20140912012449.GB18162@xvii.vinc17.org> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="zYM0uCDKw75PZbzx" Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20140912012449.GB18162@xvii.vinc17.org> X-Mailer-Info: http://www.vinc17.net/mutt/ User-Agent: 
Mutt/1.5.23-6365-vl-r59709 (2014-09-07) X-Spam-Score: -0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) --zYM0uCDKw75PZbzx Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit On binary files, it seems that testing the UTF-8 sequences in pcresearch.c is faster than asking pcre_exec to do that (because of the retry I assume); see attached patch. It actually checks UTF-8 only if an invalid sequence was already found by pcre_exec, assuming that pcre_exec can check the validity of a valid text file in a faster way. On some file similar to PDF (test 1): Before: 1.77s After: 1.38s But now, the main problem is the many pcre_exec. Indeed, if I replace the non-ASCII bytes by \n with: LC_ALL=C tr \\200-\\377 \\n (now, one has a valid file but with many short lines), the grep -P time is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes with: LC_ALL=C tr \\200-\\377 \\000 the grep -P time is 0.30s (test 3), thus it is much faster. Note also that libpcre is much slower than normal grep on simple words, but on "a[0-9]b", it can be faster: grep PCRE PCRE+patch test 1 4.31 1.90 1.53 test 2 0.18 1.61 1.63 test 3 3.28 0.39 0.39 With grep, I wonder why test 2 is much faster. 
-- Vincent Lefèvre - Web: 100% accessible validated (X)HTML - Blog: Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon) --zYM0uCDKw75PZbzx Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename="grep221-pcresearch.patch" diff --git a/src/pcresearch.c b/src/pcresearch.c index 5451029..6bff1e4 100644 --- a/src/pcresearch.c +++ b/src/pcresearch.c @@ -38,6 +38,8 @@ static pcre_extra *extra; # endif #endif +#define INVALID(C) (to_uchar (C) < 0x80 || to_uchar (C) > 0xbf) + /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty string matches when that flag is used. */ static int empty_match[2]; @@ -156,6 +158,7 @@ Pexecute (char const *buf, size_t size, size_t *match_size, char const *line_start = buf; int e = PCRE_ERROR_NOMATCH; char const *line_end; + int invalid = 0; /* If the input type is unknown, the caller is still testing the input, which means the current buffer cannot contain encoding @@ -212,25 +215,54 @@ Pexecute (char const *buf, size_t size, size_t *match_size, if (multiline) options |= PCRE_NO_UTF8_CHECK; - e = pcre_exec (cre, extra, p, search_bytes, 0, - options, sub, NSUB); - if (e != PCRE_ERROR_BADUTF8) + int valid_bytes = search_bytes; + if (invalid) { - if (0 < e && multiline && sub[1] - sub[0] != 0) + /* At least an encoding error was found. Other such errors + are likely to occur, and detecting them here is faster + in average than relying on pcre. */ + options |= PCRE_NO_UTF8_CHECK; + char const *p2 = p; + while (p2 != line_end) { - char const *nl = memchr (p + sub[0], eolbyte, - sub[1] - sub[0]); - if (nl) + unsigned char c = p2[0]; + size_t len = + c < 0x80 ? 1 : + c < 0xc2 || c > 0xf7 || INVALID(p2[1]) ? 0 : + c < 0xe0 ? 2 : INVALID(p2[2]) ? 0 : + c < 0xf0 ? 3 : INVALID(p2[3]) ? 0 : 4; + if (len == 0) { - /* This match crosses a line boundary; reject it. 
*/ - p += sub[0]; - line_end = nl; - continue; + valid_bytes = p2 - p; + break; } + p2 += len; } - break; } - int valid_bytes = sub[0]; + + if (valid_bytes == search_bytes) + { + e = pcre_exec (cre, extra, p, search_bytes, 0, + options, sub, NSUB); + if (e != PCRE_ERROR_BADUTF8) + { + if (0 < e && multiline && sub[1] - sub[0] != 0) + { + char const *nl = memchr (p + sub[0], eolbyte, + sub[1] - sub[0]); + if (nl) + { + /* This match crosses a line boundary; reject it. */ + p += sub[0]; + line_end = nl; + continue; + } + } + break; + } + invalid = 1; + valid_bytes = sub[0]; + } /* Try to match the string before the encoding error. Again, handle the empty-match case specially, for speed. */ --zYM0uCDKw75PZbzx-- From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 28 Nov 2014 14:33:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Vincent Lefevre Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141718512514508 (code B ref 18454); Fri, 28 Nov 2014 14:33:02 +0000 Received: (at 18454) by debbugs.gnu.org; 28 Nov 2014 14:32:05 +0000 Received: from localhost ([127.0.0.1]:48164 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuMaZ-0003lv-S0 for submit@debbugs.gnu.org; Fri, 28 Nov 2014 09:32:04 -0500 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:49356) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuMaN-0003lQ-8L for 18454@debbugs.gnu.org; Fri, 28 Nov 2014 09:31:55 -0500 Received: from imp02 (mailgw6.kcn.ne.jp [61.86.15.232]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 116B5C800C for <18454@debbugs.gnu.org>; Fri, 28 Nov 2014 23:31:49 +0900 (JST) Received: from mail06.kcn.ne.jp 
([61.86.6.185]) by imp02 with bizsmtp id M2Xp1p0033zXHqt012XpHh; Fri, 28 Nov 2014 23:31:49 +0900 X-OrgRCPT: 18454@debbugs.gnu.org Received: from [10.120.1.56] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id A435A1BF0091; Fri, 28 Nov 2014 23:31:48 +0900 (JST) Date: Fri, 28 Nov 2014 23:31:49 +0900 From: Norihiro Tanaka In-Reply-To: <20141128025918.GA26989@xvii.vinc17.org> References: <20140912012449.GB18162@xvii.vinc17.org> <20141128025918.GA26989@xvii.vinc17.org> Message-Id: <20141128233148.7418.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-Spam-Score: -0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On Fri, 28 Nov 2014 03:59:18 +0100 Vincent Lefevre wrote: > On binary files, it seems that testing the UTF-8 sequences in > pcresearch.c is faster than asking pcre_exec to do that (because > of the retry I assume); see attached patch. It actually checks > UTF-8 only if an invalid sequence was already found by pcre_exec, > assuming that pcre_exec can check the validity of a valid text > file in a faster way. > > On some file similar to PDF (test 1): > > Before: 1.77s > After: 1.38s > > But now, the main problem is the many pcre_exec. Indeed, if I replace > the non-ASCII bytes by \n with: > > LC_ALL=C tr \\200-\\377 \\n > > (now, one has a valid file but with many short lines), the grep -P time > is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes > with: > > LC_ALL=C tr \\200-\\377 \\000 > > the grep -P time is 0.30s (test 3), thus it is much faster. 
>
> Note also that libpcre is much slower than normal grep on simple words,
> but on "a[0-9]b", it can be faster:
>
>          grep    PCRE    PCRE+patch
> test 1   4.31    1.90    1.53
> test 2   0.18    1.61    1.63
> test 3   3.28    0.39    0.39
>
> With grep, I wonder why test 2 is much faster.
>
> --
> Vincent Lefevre - Web:
> 100% accessible validated (X)HTML - Blog:
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Thanks for the patch. However, it seems that valid_utf() in PCRE also
considers 5- and 6-byte characters.

IMHO, we should assume that grep does not know how valid_utf() checks an
input text, although we know that PCRE checks whether an input text is
valid UTF-8 or not, so that grep keeps working even if PCRE changes the
behaviour of valid_utf().

If we do not check for invalid UTF-8 characters with valid_utf8() in
advance, grep may dump core with PCRE_NO_UTF8_CHECK.
See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586

So we cannot avoid checking for invalid UTF-8 characters with
valid_utf8(). Furthermore, we must perform the check as PCRE expects,
but grep does not know how PCRE checks for invalid UTF-8 characters,
because of the assumption above.
From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Vincent Lefevre Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 28 Nov 2014 15:51:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Norihiro Tanaka Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141718983522686 (code B ref 18454); Fri, 28 Nov 2014 15:51:01 +0000 Received: (at 18454) by debbugs.gnu.org; 28 Nov 2014 15:50:35 +0000 Received: from localhost ([127.0.0.1]:48574 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuNoZ-0005tp-5V for submit@debbugs.gnu.org; Fri, 28 Nov 2014 10:50:35 -0500 Received: from ioooi.vinc17.net ([92.243.22.117]:60239) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuNoU-0005te-Q1 for 18454@debbugs.gnu.org; Fri, 28 Nov 2014 10:50:31 -0500 Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128]) by ioooi.vinc17.net (Postfix) with ESMTPSA id AD5791E1; Fri, 28 Nov 2014 16:50:29 +0100 (CET) Received: by xvii.vinc17.org (Postfix, from userid 1000) id 300C121A07A; Fri, 28 Nov 2014 16:50:29 +0100 (CET) Date: Fri, 28 Nov 2014 16:50:29 +0100 From: Vincent Lefevre Message-ID: <20141128155029.GB8207@xvii.vinc17.org> References: <20140912012449.GB18162@xvii.vinc17.org> <20141128025918.GA26989@xvii.vinc17.org> <20141128233148.7418.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20141128233148.7418.27F6AC2D@kcn.ne.jp> X-Mailer-Info: http://www.vinc17.net/mutt/ User-Agent: Mutt/1.5.23-6365-vl-r59709 (2014-09-07) X-Spam-Score: -0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On 2014-11-28 23:31:49 +0900, Norihiro Tanaka wrote: > Thanks for the patch. However, I seem that valid_utf() in PCRE also > considers 5 and 6 bytes characters in PCRE. In any case, even if PCRE considers these sequences as valid UTF-8, they shouldn't match because they are not part of Unicode (if they can match, this would be a bug in libpcre). My patch considers that these sequences do not match, which is consistent with the expected behavior. > IMHO, We assume that grep doesn't know how to check for an input text in > valid_utf(), althouth we know PCRE checks whether an input text is valid > utf8 or not, so that even when PCRE changes behaviour of valid_utf(), > grep should run. > > If we do not check invalid utf8 characters with valid_utf8() in advance, > grep may cause core dump with PCRE_NO_UTF8_CHECK. > See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586 > > So we can not avoid for checking invalid utf8 characters with valid_utf8(). > Further more, we must perform to check as PCRE expects, but grep does > not know how to PCRE to check invalid_utf8 characters due to an above > assumption. What matters is whether a sequence corresponds to a valid UTF-8 encoded Unicode character. My patch ensures that pcre_exec is called on a string with only such characters, which implies that this is also valid UTF-8 for PCRE (whether Unicode validity is also considered in valid_utf8() or not). So, there's no valid reason why grep would crash under such a condition. 
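To make it easier to see which byte sequences the attached patch accepts, the following C sketch wraps the patch's length expression in a standalone function. The wrapper name patch_valid_prefix and the explicit unsigned char cast (standing in for grep's to_uchar) are assumptions made for illustration; the conditional expression itself is copied from the patch. Note that this expression accepts any lead byte up to 0xF7 and any continuation byte in 0x80..0xBF, so sequences such as ED A0 BF (surrogate range) and F4 BF BF BF (above U+10FFFF), which RFC 3629 excludes, pass the test.

```c
#include <assert.h>
#include <stddef.h>

/* Same test as the INVALID macro in the patch, with the cast spelled
   out instead of grep's to_uchar.  */
#define INVALID(C) ((unsigned char) (C) < 0x80 || (unsigned char) (C) > 0xbf)

/* Hypothetical wrapper: return the number of leading bytes of p[0..n)
   accepted by the patch's length expression.  The last byte of the
   range must be < 0x80 (for example '\n'), because the lookahead at
   p2[1..3] relies on such a byte to stop before the end.  */
static size_t
patch_valid_prefix (const char *p, size_t n)
{
  assert (n > 0 && (unsigned char) p[n - 1] < 0x80);
  const char *p2 = p;
  const char *end = p + n;
  while (p2 != end)
    {
      unsigned char c = p2[0];
      size_t len =
        c < 0x80 ? 1 :
        c < 0xc2 || c > 0xf7 || INVALID (p2[1]) ? 0 :
        c < 0xe0 ? 2 : INVALID (p2[2]) ? 0 :
        c < 0xf0 ? 3 : INVALID (p2[3]) ? 0 : 4;
      if (len == 0)
        break;                  /* first byte the patch treats as invalid */
      p2 += len;
    }
  return (size_t) (p2 - p);
}
```

For the line "\xED\xA0\xBF\n" the function returns 4, that is, the whole line is treated as valid, whereas for "\xE0\xC2\xFF\n" it returns 0.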
-- Vincent Lefèvre - Web: 100% accessible validated (X)HTML - Blog: Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon) From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 29 Nov 2014 02:59:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Vincent Lefevre Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141722993530919 (code B ref 18454); Sat, 29 Nov 2014 02:59:02 +0000 Received: (at 18454) by debbugs.gnu.org; 29 Nov 2014 02:58:55 +0000 Received: from localhost ([127.0.0.1]:48803 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuYFK-00082c-Nf for submit@debbugs.gnu.org; Fri, 28 Nov 2014 21:58:54 -0500 Received: from mailgw05.kcn.ne.jp ([61.86.7.212]:55810) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XuYFH-00082O-Ds for 18454@debbugs.gnu.org; Fri, 28 Nov 2014 21:58:52 -0500 Received: from imp01 (mailgw5.kcn.ne.jp [61.86.15.231]) by mailgw05.kcn.ne.jp (Postfix) with ESMTP id EB3FB67C18 for <18454@debbugs.gnu.org>; Sat, 29 Nov 2014 11:58:48 +0900 (JST) Received: from mail09.kcn.ne.jp ([61.86.6.188]) by imp01 with bizsmtp id MEyo1p00V43QJrh01EyonM; Sat, 29 Nov 2014 11:58:48 +0900 X-OrgRCPT: 18454@debbugs.gnu.org Received: from [10.120.1.56] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail09.kcn.ne.jp (Postfix) with ESMTPA id CCCD81BD0097; Sat, 29 Nov 2014 11:58:48 +0900 (JST) Date: Sat, 29 Nov 2014 11:58:48 +0900 From: Norihiro Tanaka In-Reply-To: <20141128155029.GB8207@xvii.vinc17.org> References: <20141128233148.7418.27F6AC2D@kcn.ne.jp> <20141128155029.GB8207@xvii.vinc17.org> Message-Id: <20141129115848.6DF7.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 
Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-Spam-Score: -0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On Fri, 28 Nov 2014 16:50:29 +0100 Vincent Lefevre wrote: > What matters is whether a sequence corresponds to a valid UTF-8 > encoded Unicode character. My patch ensures that pcre_exec is called > on a string with only such characters, which implies that this is > also valid UTF-8 for PCRE (whether Unicode validity is also considered > in valid_utf8() or not). So, there's no valid reason why grep would > crash under such a condition. It seems that PCRE treats e.g. following character as invalid. It means we should not these characters into pcre_exec with PCRE_NO_UTF8_CHECK option. 0xE0 0xC2 0xFF 0xED 0xA0 0xFF 0xF0 0xBF 0xFF 0xFF 0xF4 0xBF 0xBF 0xBF From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Vincent Lefevre Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Thu, 18 Dec 2014 13:47:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Norihiro Tanaka Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.141891036328844 (code B ref 18454); Thu, 18 Dec 2014 13:47:01 +0000 Received: (at 18454) by debbugs.gnu.org; 18 Dec 2014 13:46:03 +0000 Received: from localhost ([127.0.0.1]:49553 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1bP1-0007VA-4G for submit@debbugs.gnu.org; Thu, 18 Dec 2014 08:46:03 -0500 Received: from ypig.lip.ens-lyon.fr ([140.77.13.48]:54810) by 
debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1bOy-0007Ud-BA for 18454@debbugs.gnu.org; Thu, 18 Dec 2014 08:46:01 -0500 Received: from vlefevre by ypig.lip.ens-lyon.fr with local (Exim 4.84) (envelope-from ) id 1Y1bOw-0008O3-7E; Thu, 18 Dec 2014 14:45:58 +0100 Date: Thu, 18 Dec 2014 14:45:58 +0100 From: Vincent Lefevre Message-ID: <20141218134558.GQ3818@ypig.lip.ens-lyon.fr> References: <20141128233148.7418.27F6AC2D@kcn.ne.jp> <20141128155029.GB8207@xvii.vinc17.org> <20141129115848.6DF7.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20141129115848.6DF7.27F6AC2D@kcn.ne.jp> X-Mailer-Info: http://www.vinc17.net/mutt/ User-Agent: Mutt/1.5.23-6371-vl-r75100 (2014-11-04) X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Sorry for the late reply. On 2014-11-29 11:58:48 +0900, Norihiro Tanaka wrote: > On Fri, 28 Nov 2014 16:50:29 +0100 > Vincent Lefevre wrote: > > What matters is whether a sequence corresponds to a valid UTF-8 > > encoded Unicode character. My patch ensures that pcre_exec is called > > on a string with only such characters, which implies that this is > > also valid UTF-8 for PCRE (whether Unicode validity is also considered > > in valid_utf8() or not). So, there's no valid reason why grep would > > crash under such a condition. > > It seems that PCRE treats e.g. following character as invalid. It means > we should not these characters into pcre_exec with PCRE_NO_UTF8_CHECK > option. > > 0xE0 0xC2 0xFF > 0xED 0xA0 0xFF > 0xF0 0xBF 0xFF 0xFF If I'm not mistaken, these first three are also treated as invalid by my patch (and should be treated as invalid by any tool). 
> 0xF4 0xBF 0xBF 0xBF (corresponding to U+0013ffff). Well, I followed some comment in the grep source, which is currently incorrect. pcreunicode(3) specifies that it follows RFC 3629, and that only values in the range U+0 to U+10FFFF, excluding the surrogate area, are allowed. I'll try to update my patch. But IMHO, it would be better to get PCRE improved, and I had opened a bug: http://bugs.exim.org/show_bug.cgi?id=1554 BTW, printf "\xF4\xBF\xBF\xBF\n" | grep . finds a match, and this appears to be a bug (grep should follow the current standard). -- Vincent Lefèvre - Web: 100% accessible validated (X)HTML - Blog: Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon) From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Norihiro Tanaka Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 19 Dec 2014 14:01:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Vincent Lefevre Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.14189976467448 (code B ref 18454); Fri, 19 Dec 2014 14:01:02 +0000 Received: (at 18454) by debbugs.gnu.org; 19 Dec 2014 14:00:46 +0000 Received: from localhost ([127.0.0.1]:50945 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1y6n-0001w4-FV for submit@debbugs.gnu.org; Fri, 19 Dec 2014 09:00:46 -0500 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:40695) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1y6i-0001vs-8I for 18454@debbugs.gnu.org; Fri, 19 Dec 2014 09:00:41 -0500 Received: from imp02 (mailgw6.kcn.ne.jp [61.86.15.232]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id E0820C8004 for <18454@debbugs.gnu.org>; Fri, 19 Dec 2014 23:00:37 +0900 (JST) Received: from mail06.kcn.ne.jp ([61.86.6.185]) by imp02 with bizsmtp 
id VS0d1p00g3zXHqt01S0d3R; Fri, 19 Dec 2014 23:00:37 +0900 X-OrgRCPT: 18454@debbugs.gnu.org Received: from [10.120.1.68] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id B37C51BF0021; Fri, 19 Dec 2014 23:00:37 +0900 (JST) Date: Fri, 19 Dec 2014 23:00:38 +0900 From: Norihiro Tanaka In-Reply-To: <20141218134558.GQ3818@ypig.lip.ens-lyon.fr> References: <20141129115848.6DF7.27F6AC2D@kcn.ne.jp> <20141218134558.GQ3818@ypig.lip.ens-lyon.fr> Message-Id: <20141219230038.CE8D.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-Spam-Score: -0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On Thu, 18 Dec 2014 14:45:58 +0100 Vincent Lefevre wrote: > > > > 0xE0 0xC2 0xFF > > 0xED 0xA0 0xFF > > 0xF0 0xBF 0xFF 0xFF > > If I'm not mistaken, these first three are also treated as invalid by > my patch (and should be treated as invalid by any tool). I got them from pcre_valid_utf8(), but I made some mistakes. They are as following. 0xE0 0xAF 0xBF 0xED 0xA0 0xBF 0xF0 0x8F 0xBF 0xBF By the way, they are correspond with following codes in pcre_valid_utf8(). if (c == 0xe0 && (d & 0x20) == 0) { *erroroffset = (int)(p - string) - 2; return PCRE_UTF8_ERR16; } if (c == 0xed && d >= 0xa0) { *erroroffset = (int)(p - string) - 2; return PCRE_UTF8_ERR14; } ........ if (c == 0xf0 && (d & 0x30) == 0) { *erroroffset = (int)(p - string) - 3; return PCRE_UTF8_ERR17; } if (c > 0xf4 || (c == 0xf4 && d > 0x8f)) { *erroroffset = (int)(p - string) - 3; return PCRE_UTF8_ERR13; } > BTW, > > printf "\xF4\xBF\xBF\xBF\n" | grep . > > finds a match, and this appears to be a bug (grep should follow > the current standard). 
I also see it is a bug as you say. mbrlen() in glibc returns (size_t) -1 for the sequence. From unknown Sun Jun 22 08:03:45 2025 X-Loop: help-debbugs@gnu.org Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales Resent-From: Vincent Lefevre Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sat, 20 Dec 2014 00:14:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 18454 X-GNU-PR-Package: grep X-GNU-PR-Keywords: patch To: Norihiro Tanaka Cc: 18454@debbugs.gnu.org Received: via spool by 18454-submit@debbugs.gnu.org id=B18454.1419034424803 (code B ref 18454); Sat, 20 Dec 2014 00:14:02 +0000 Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 00:13:44 +0000 Received: from localhost ([127.0.0.1]:51991 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y27fz-0000Cs-NW for submit@debbugs.gnu.org; Fri, 19 Dec 2014 19:13:43 -0500 Received: from ioooi.vinc17.net ([92.243.22.117]:35531) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y27fw-0000Ci-Su for 18454@debbugs.gnu.org; Fri, 19 Dec 2014 19:13:41 -0500 Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128]) by ioooi.vinc17.net (Postfix) with ESMTPSA id 83E7E444; Sat, 20 Dec 2014 01:13:39 +0100 (CET) Received: by xvii.vinc17.org (Postfix, from userid 1000) id 3FC6921A07A; Sat, 20 Dec 2014 01:13:39 +0100 (CET) Date: Sat, 20 Dec 2014 01:13:39 +0100 From: Vincent Lefevre Message-ID: <20141220001339.GJ32684@xvii.vinc17.org> References: <20141129115848.6DF7.27F6AC2D@kcn.ne.jp> <20141218134558.GQ3818@ypig.lip.ens-lyon.fr> <20141219230038.CE8D.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20141219230038.CE8D.27F6AC2D@kcn.ne.jp> X-Mailer-Info: http://www.vinc17.net/mutt/ User-Agent: Mutt/1.5.23-6371-vl-r75100 (2014-11-04) X-Spam-Score: -0.0 (/) X-BeenThere: 
debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/)

On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote:
> I got them from pcre_valid_utf8(), but I made some mistakes. They are
> as following.
>
> 0xE0 0xAF 0xBF

This one is valid UTF-8 and corresponds to the code point U+0BFF,
and the following matches:

  $ printf "\xE0\xAF\xBF\n" | grep -P .
  ௿

> 0xED 0xA0 0xBF

OK, this is in the surrogate area, and it doesn't match with PCRE.

> 0xF0 0x8F 0xBF 0xBF

This is an overlong encoding of U+FFFF (which fits in three bytes),
so it is invalid.

> > BTW,
> >
> >   printf "\xF4\xBF\xBF\xBF\n" | grep .
> >
> > finds a match, and this appears to be a bug (grep should follow
> > the current standard).
>
> I also see it is a bug as you say. mbrlen() in glibc returns (size_t) -1
> for the sequence.

Ditto with:

  printf "\xED\xA0\xBF\n" | grep .

(surrogate area).
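The sequences discussed above can be checked against the RFC 3629 rules with a small C sketch. This is neither grep's nor PCRE's code, and the name utf8_seq_len is hypothetical; it only encodes the rules the thread refers to: no overlong forms, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of a strict check following RFC 3629 (hypothetical helper).
   Return the length (1..4) of the well-formed UTF-8 sequence starting
   at s, or 0 if s[0..n) does not start with one.  */
static size_t
utf8_seq_len (const unsigned char *s, size_t n)
{
  assert (s != NULL);
  if (n == 0)
    return 0;
  unsigned char c = s[0];
  if (c < 0x80)
    return 1;
  /* 0x80..0xBF: stray continuation byte; 0xC0, 0xC1: overlong 2-byte
     lead; above 0xF4: beyond U+10FFFF or a 5/6-byte form.  */
  if (c < 0xC2 || c > 0xF4)
    return 0;
  size_t len = c < 0xE0 ? 2 : c < 0xF0 ? 3 : 4;
  if (n < len)
    return 0;
  /* The allowed range of the second byte depends on the lead byte.  */
  unsigned char lo = 0x80, hi = 0xBF;
  if (c == 0xE0)
    lo = 0xA0;                  /* reject overlong 3-byte forms */
  else if (c == 0xED)
    hi = 0x9F;                  /* reject surrogates U+D800..U+DFFF */
  else if (c == 0xF0)
    lo = 0x90;                  /* reject overlong 4-byte forms */
  else if (c == 0xF4)
    hi = 0x8F;                  /* reject code points above U+10FFFF */
  if (s[1] < lo || s[1] > hi)
    return 0;
  for (size_t i = 2; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;
  return len;
}
```

With this function, E0 AF BF yields 3 (valid, U+0BFF), while ED A0 BF, F0 8F BF BF and F4 BF BF BF all yield 0.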
-- 
Vincent Lefèvre
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

From unknown Sun Jun 22 08:03:45 2025
From: Vincent Lefevre
To: 18454@debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 02:23:27 +0100
Message-ID: <20141220012326.GA2678@xvii.vinc17.org>
In-Reply-To: <20140912012449.GB18162@xvii.vinc17.org>

On 2014-09-12 03:24:49 +0200, Vincent Lefevre wrote:
> Timings with the Debian packages on my personal svn working copy
> (binary + text files):
>
> 2.18-2   0.9s with -P, 0.4s without -P
> 2.20-3  11.6s with -P, 0.4s without -P

I've done another test on a large PDF file. Let's forget grep 2.18,
which is indeed too buggy (I could reproduce a buffer overflow).
But let's compare with pcregrep, using the "zzz" pattern:

Debian grep 2.20-3     6.64s (with -P)
Upstream grep 2.21     5.39s (with -P)
Debian pcregrep 8.35   0.71s

In all cases, PCRE is used, but pcregrep is much faster than grep -P.

(Note: on this example, "grep" alone is much faster than pcregrep,
but this is not related to the invalid encoding, and depending on
the pattern, either grep or PCRE can be significantly faster.)

So perhaps the right method would be to do what pcregrep does,
even though "grep -P" can currently be a bit faster than pcregrep
in some cases.
-- 
Vincent Lefèvre
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

From unknown Sun Jun 22 08:03:45 2025
From: Norihiro Tanaka
To: 18454@debbugs.gnu.org
Cc: Vincent Lefevre
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 10:31:46 +0900
Message-Id: <20141220103146.F2C0.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20141219230038.CE8D.27F6AC2D@kcn.ne.jp>

On Fri, 19 Dec 2014 23:00:38 +0900
Norihiro Tanaka wrote:
> I also see it is a bug as you say. mbrlen() in glibc returns (size_t) -1
> for the sequence.

$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
Binary file (standard input) matches
$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
$

The regex library behaves the same as grep -G; for example, sed, which
uses only regex, returns the line. Therefore, I think that a character
in the surrogate area matching a period with grep -G is not a bug,
although the behavior might not obey a standard.

$ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'

By the way, mbrlen() returns (size_t) -1 for the character.

OTOH, if a character in the surrogate area does not match a period in
PCRE, I think that the character should not match a period with
grep -P either.
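For reference, the byte sequences discussed in the messages above can be
classified with a small strict decoder. The following is only a sketch
under the UTF-8 rules of RFC 3629 (no overlong forms, no surrogates,
nothing above U+10FFFF); utf8_decode_strict and the utf8_status names
are hypothetical and are not taken from grep, glibc, or PCRE.

```c
#include <stddef.h>
#include <stdint.h>

/* Result of decoding one UTF-8 sequence under the RFC 3629 rules.
   All names here are illustrative, not from any real library.  */
enum utf8_status
{
  UTF8_OK,         /* well-formed; *cp holds the code point */
  UTF8_TRUNCATED,  /* sequence runs past the end of the buffer */
  UTF8_BAD_BYTE,   /* invalid lead byte or missing continuation byte */
  UTF8_OVERLONG,   /* code point encoded with more bytes than needed */
  UTF8_SURROGATE,  /* U+D800..U+DFFF, reserved for UTF-16 */
  UTF8_TOO_LARGE   /* above U+10FFFF */
};

/* Decode the sequence starting at s, with n bytes available.
   On UTF8_OK, store the code point in *cp.  */
enum utf8_status
utf8_decode_strict (const unsigned char *s, size_t n, uint32_t *cp)
{
  /* Smallest code point that legitimately needs a given length.  */
  static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
  size_t len, i;
  uint32_t c;

  if (n == 0)
    return UTF8_TRUNCATED;

  /* The lead byte gives the sequence length and the high payload bits.  */
  if (s[0] < 0x80)
    { len = 1; c = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0)
    { len = 2; c = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0)
    { len = 3; c = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0)
    { len = 4; c = s[0] & 0x07; }
  else
    return UTF8_BAD_BYTE;  /* stray continuation byte or 0xF8..0xFF */

  if (n < len)
    return UTF8_TRUNCATED;

  /* Each continuation byte must look like 10xxxxxx.  */
  for (i = 1; i < len; i++)
    {
      if ((s[i] & 0xC0) != 0x80)
        return UTF8_BAD_BYTE;
      c = (c << 6) | (s[i] & 0x3F);
    }

  if (c < min_cp[len])
    return UTF8_OVERLONG;
  if (c >= 0xD800 && c <= 0xDFFF)
    return UTF8_SURROGATE;
  if (c > 0x10FFFF)
    return UTF8_TOO_LARGE;

  *cp = c;
  return UTF8_OK;
}
```

With such a decoder, E0 AF BF decodes to U+0BFF, ED A0 BF is rejected as
a surrogate, F0 8F BF BF is rejected as an overlong encoding, and
F4 BF BF BF is rejected as being above U+10FFFF.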
From unknown Sun Jun 22 08:03:45 2025
From: Paul Eggert
To: 18454@debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 19 Dec 2014 17:35:13 -0800
Message-ID: <5494D251.5050403@cs.ucla.edu>
In-Reply-To: <20141220012326.GA2678@xvii.vinc17.org>

On 12/19/2014 05:23 PM, Vincent Lefevre wrote:
> So perhaps the right method would be to do what pcregrep does,

What does pcregrep do?
From unknown Sun Jun 22 08:03:45 2025
From: Vincent Lefevre
To: Norihiro Tanaka
Cc: 18454@debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 03:13:39 +0100
Message-ID: <20141220021339.GN32684@xvii.vinc17.org>
In-Reply-To: <20141220103146.F2C0.27F6AC2D@kcn.ne.jp>

On 2014-12-20 10:31:46 +0900, Norihiro Tanaka wrote:
> On Fri, 19 Dec 2014 23:00:38 +0900
> Norihiro Tanaka wrote:
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
> Binary file (standard input) matches
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
> $
>
> The regex library behaves the same as grep -G; for example, sed, which
> uses only regex, returns the line. Therefore, I think that a character
> in the surrogate area matching a period with grep -G is not a bug,
> although the behavior might not obey a standard.
>
> $ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'
>
> By the way, mbrlen() returns (size_t) -1 for the character.

IMHO, both grep and sed should be fixed to obey RFC 3629, which
specifies UTF-8. And other tools too (iconv...).

> OTOH, if a character in the surrogate area does not match a period in
> PCRE, I think that the character should not match a period with
> grep -P either.

I agree.
-- 
Vincent Lefèvre
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

From unknown Sun Jun 22 08:03:45 2025
From: Paul Eggert
To: Vincent Lefevre, Norihiro Tanaka
Cc: 18454@debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 19 Dec 2014 18:31:05 -0800
Message-ID: <5494DF69.8010509@cs.ucla.edu>
In-Reply-To: <20141220021339.GN32684@xvii.vinc17.org>

On 12/19/2014 06:13 PM, Vincent Lefevre wrote:
> both grep and sed should be fixed to obey RFC 3629

Shouldn't this be done in the C library code? If mbrlen does the right
thing, grep and sed should do the right thing.

From unknown Sun Jun 22 08:03:45 2025
From: Norihiro Tanaka
To: Vincent Lefevre
Cc: 18454@debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 11:45:15 +0900
Message-Id: <20141220114515.F2D4.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20141220012326.GA2678@xvii.vinc17.org>

On Sat, 20 Dec 2014 02:23:27 +0100
Vincent Lefevre wrote:
> Debian grep 2.20-3     6.64s (with -P)
> Upstream grep 2.21     5.39s (with -P)
> Debian pcregrep 8.35   0.71s

Did you use pcregrep --utf-8? You should use pcregrep --utf-8 for the
comparison.

By the way, pcregrep --utf-8 does not support binary files. Once
pcregrep has found 20 errors, it exits without reading the rest of the
input.

$ yes src/grep | head -1000 | xargs cat > big_grep
$ ls -l big_grep
-rw-r--r--. 1 staff users 611453000 Dec 20 11:30 big_grep
$ time -p env LC_ALL=en_US.utf8 src/grep -P test big_grep
real 10.16
user 10.09
sys 0.07
$ time -p pcregrep --buffer-size=65536 test big_grep
real 1.50
user 1.41
sys 0.09
$ time -p pcregrep --buffer-size=65536 --utf-8 test big_grep 2>&1 | tail -1
pcregrep: Too many errors - abandoned.
real 0.00
user 0.00
sys 0.00
$ pcregrep --version
pcregrep version 8.36 2014-09-26

From unknown Sun Jun 22 08:03:45 2025
From: Norihiro Tanaka
To: Paul Eggert
Cc: Vincent Lefevre, 18454@debbugs.gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 11:57:39 +0900
Message-Id: <20141220115738.F2DC.27F6AC2D@kcn.ne.jp>
In-Reply-To: <5494DF69.8010509@cs.ucla.edu>
Content-Transfer-Encoding: 7bit

On Fri, 19 Dec 2014 18:31:05 -0800
Paul Eggert wrote:
> If mbrlen does the right thing, grep and sed should do the right thing.

mbrlen() already does the right thing. So perhaps they depend on the
behavior of regex. Even if so, I think that it should also be fixed in
the C library.

cat <<EOF |
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

int
main (void)
{
  /* Use the locale from the environment; a UTF-8 locale is expected.  */
  setlocale (LC_ALL, "");
  mbstate_t mbs = { 0 };
  /* Bytes ED A0 BF: a UTF-8-style encoding in the surrogate area.  */
  char s[] = { '\xED', '\xA0', '\xBF' };
  size_t len = mbrlen (s, 3, &mbs);
  /* mbrlen returns (size_t) -1 on an invalid sequence; print it signed.  */
  printf ("mbrlen = %ld\n", (long) len);
  exit (EXIT_SUCCESS);
}
EOF
gcc -xc - && ./a.out

From debbugs-submit-bounces@debbugs.gnu.org Mon Aug 15 18:53:22 2016
From: Paul Eggert
To: control@debbugs.gnu.org
Subject: 18454's patches have been applied
Date: Mon, 15 Aug 2016 15:53:13 -0700

tags 18454 - patch
thanks

There is still a performance issue, so the bug report remains open.

From unknown Sun Jun 22 08:03:45 2025
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Vincent Lefevre
Subject: bug#18454: closed (Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales)
Reply-To: 18454@debbugs.gnu.org
Date: Wed, 24 Nov 2021 03:37:04 +0000

Your bug report

#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

which was filed against the grep package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 18454@debbugs.gnu.org.

-- 
18454: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=18454
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

From: Paul Eggert
To: 18454-done@debbugs.gnu.org
Cc: Santiago Ruano Rincón, Norihiro Tanaka, Jim Meyering, Vincent Lefevre, Zoltán Herczeg, Eric Blake
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Tue, 23 Nov 2021 19:36:11 -0800
Message-ID: <964e27cb-0484-0daf-ffab-39382de271f0@cs.ucla.edu>
In-Reply-To: <542B06E5.8040501@cs.ucla.edu>

On 9/30/14 12:39, Paul Eggert wrote:
> GNU grep is smart
> enough to start matching at character boundaries without checking the
> validity of the input data.  This helps it run faster.  However, because
> libpcre requires a validity prepass, grep -P must slow down and do the
> validity check one way or another.  Grep does this only when libpcre is
> used, and that's one reason grep -P is slower than plain grep.

Now that Grep master on Savannah has been changed to use PCRE2 instead
of PCRE, the 'grep -P' performance problem seems to have been fixed, in
that the following commands now take about the same amount of time:

grep -P zzzyyyxxx 10840.pdf
pcre2grep -U zzzyyyxxx 10840.pdf

where the file is from .
Formerly, 'grep -P' was about 10x slower on this test.
My guess is that the grep -P performance boost comes from bleeding-edge
grep using PCRE2's PCRE2_MATCH_INVALID_UTF option.

I'm closing this old bug report. We can
always reopen it if there are still performance issues that I've missed.
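The third suggestion in the original report (transforming invalid bytes
to null bytes in place before the PCRE call) could be sketched roughly
as follows. This is an illustration only, not what grep does or did:
utf8_seq_len and sanitize_utf8 are hypothetical names, validity is
judged by the RFC 3629 rules, and a sequence cut off by the end of the
buffer is simply treated as invalid here.

```c
#include <stddef.h>

/* Return the length (1..4) of the well-formed UTF-8 sequence that
   starts at p, or 0 if the bytes at p do not start one.
   Hypothetical helper for illustration.  */
size_t
utf8_seq_len (const unsigned char *p, size_t n)
{
  unsigned char b0, b1;
  size_t len, i;

  if (n == 0)
    return 0;
  b0 = p[0];
  if (b0 < 0x80)
    return 1;
  if (b0 >= 0xC2 && b0 <= 0xDF)
    len = 2;
  else if (b0 >= 0xE0 && b0 <= 0xEF)
    len = 3;
  else if (b0 >= 0xF0 && b0 <= 0xF4)
    len = 4;
  else
    return 0;  /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

  if (n < len)
    return 0;  /* truncated by the end of the buffer */
  for (i = 1; i < len; i++)
    if ((p[i] & 0xC0) != 0x80)
      return 0;

  /* The second byte decides the remaining range checks.  */
  b1 = p[1];
  if (b0 == 0xE0 && b1 < 0xA0)
    return 0;  /* overlong 3-byte form */
  if (b0 == 0xED && b1 > 0x9F)
    return 0;  /* surrogate, U+D800..U+DFFF */
  if (b0 == 0xF0 && b1 < 0x90)
    return 0;  /* overlong 4-byte form */
  if (b0 == 0xF4 && b1 > 0x8F)
    return 0;  /* above U+10FFFF */
  return len;
}

/* Overwrite every byte of buf that is not part of a well-formed
   sequence with a null byte.  Return the number of bytes replaced.  */
size_t
sanitize_utf8 (unsigned char *buf, size_t n)
{
  size_t i = 0, replaced = 0;

  while (i < n)
    {
      size_t len = utf8_seq_len (buf + i, n - i);
      if (len == 0)
        {
          buf[i++] = '\0';
          replaced++;
        }
      else
        i += len;
    }
  return replaced;
}
```

After such a pass the buffer contains only valid UTF-8, so a matcher
could in principle skip its own validity check. The cost is the change
of semantics noted in the report: invalid bytes become
indistinguishable from null bytes.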