From unknown Fri Jun 20 20:11:14 2025 X-Loop: help-debbugs@gnu.org Subject: bug#56079: Missing performance optimisation: Word start/end tests Resent-From: sur-behoffski Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 19 Jun 2022 05:40:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 56079 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 56079@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.165561716329629 (code B ref -1); Sun, 19 Jun 2022 05:40:02 +0000 Received: (at submit) by debbugs.gnu.org; 19 Jun 2022 05:39:23 +0000 Received: from localhost ([127.0.0.1]:50443 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o2ned-0007hp-8S for submit@debbugs.gnu.org; Sun, 19 Jun 2022 01:39:23 -0400 Received: from lists.gnu.org ([209.51.188.17]:34950) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o2nea-0007hg-FY for submit@debbugs.gnu.org; Sun, 19 Jun 2022 01:39:22 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:47608) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1o2neY-0004RX-Cd for bug-grep@gnu.org; Sun, 19 Jun 2022 01:39:20 -0400 Received: from ipmail06.adl3.internode.on.net ([150.101.137.16]:29271) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1o2neV-0000WC-Ls for bug-grep@gnu.org; Sun, 19 Jun 2022 01:39:18 -0400 X-SMTP-MATCH: 0 Received: from 114-30-111-149.tpgi.com.au (HELO [192.168.178.210]) ([114.30.111.149]) by ipmail06.adl3.internode.on.net with ESMTP; 19 Jun 2022 15:08:46 +0930 Message-ID: Date: Sun, 19 Jun 2022 15:08:46 +0930 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0 Content-Language: en-AU From: sur-behoffski Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Received-SPF: none client-ip=150.101.137.16; envelope-from=sur_behoffski@grouse.com.au; helo=ipmail06.adl3.internode.on.net X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) G'day, I'm in the throes of a massive rewrite of "hstbm", which combined a very-quick-and-dirty lexer, no parser, and an optimised Boyer-Moore-style search where I had made some incremental improvements. The only release is at savannah.non-gnu.org: https://savannah.nongnu.org/projects/hstbm It's been over six years since the first and only release; Lua fans will note that I have had another project that has been active over that time, intended to help people use a scientific/technical toolkit on a range of GNU/Linux machines. -- I'm now in the process of trialling an all-singing, all-dancing lexer, with the philosophy that it tries to capture the pattern syntax and semantics, without resorting to parser constructs such as an AST. [I'm currently at a hairy point where the meaning of characters such as "^" can vary, based on constructs such as "(" (start-of-group) and/or "|" (alternation)... where does lexing stop and parsing start?!] One thing that is captured is predicates e.g. relating to IS_WORD: "IS_WORD_YES" (0x01), "IS_WORD_NO" (0x10), and "IS_WORD_MAYBE" (0x11). I've found some patterns containing word start/end boundary checks that are impossible to match in practice, e.g.: a\[def] Grep does not recognise these cases, and so spends time ploughing through the text for a match that can never occur. My lexing code, in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or, equivalently, pairs of "IS_WORD_NO"), and arranges the lexical token stream such that the very first token is (effectively) MATCH_FAILED -- without any effort to inspect the haystack buffer. This can reduce runtimes for large haystack input from seconds to milliseconds. While this is not a terribly common case, it's an easy item to check for; it's possible that, in the future, patterns may become less hand-crafted and more machine-crafted, and so this case may become more relevant. cheers, sur-behoffski (Brenton Hoff) programmer, Grouse Software ["sur-" means "meta-", it's a commentary on a peculiar Australian event: See "Tony Abbott" + "Captain's Pick" + "Prince Philip". Absolutely no disrespect is intended to Honour-receivers at any level; I am grateful for your service, and how you have enriched society.] From unknown Fri Jun 20 20:11:14 2025 X-Loop: help-debbugs@gnu.org Subject: bug#56079: Missing performance optimisation: Word start/end tests Resent-From: Paul Eggert Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 19 Jun 2022 15:27:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 56079 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: sur-behoffski Cc: 56079@debbugs.gnu.org Received: via spool by 56079-submit@debbugs.gnu.org id=B56079.16556524134217 (code B ref 56079); Sun, 19 Jun 2022 15:27:02 +0000 Received: (at 56079) by debbugs.gnu.org; 19 Jun 2022 15:26:53 +0000 Received: from localhost ([127.0.0.1]:52869 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o2wpB-00015x-JO for submit@debbugs.gnu.org; Sun, 19 Jun 2022 11:26:53 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:57754) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o2wp9-00015h-8y for 56079@debbugs.gnu.org; Sun, 19 Jun 2022 11:26:51 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id B067A16012F; Sun, 19 Jun 2022 08:26:44 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id FvfqwimWPR6Z; Sun, 19 Jun 2022 08:26:44 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 11500160130; Sun, 19 Jun 2022 08:26:44 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id O8SdL2n_R9pV; Sun, 19 Jun 2022 08:26:43 -0700 (PDT) Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id BBEEA16012F; Sun, 19 Jun 2022 08:26:43 -0700 (PDT) Message-ID: Date: Sun, 19 Jun 2022 10:26:43 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Content-Language: en-US References: From: Paul Eggert In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) On 6/19/22 00:38, sur-behoffski wrote: > a\ [abc]\>[def] > Yes, and this comes up even in strict POSIX without GNU extensions, as "a^" is a valid extended regular expression that cannot match anything. I don't know whether dfa.c, regex.c, etc. optimize for this, but if not it would be nice if they did. For example, 'grep -E "a^" FOO' can silently exit with status 1 without even opening FOO. From debbugs-submit-bounces@debbugs.gnu.org Sat Jul 02 18:26:34 2022 Received: (at control) by debbugs.gnu.org; 2 Jul 2022 22:26:34 +0000 Received: from localhost ([127.0.0.1]:43016 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7lZR-0002bO-R4 for submit@debbugs.gnu.org; Sat, 02 Jul 2022 18:26:33 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35928) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1o7lZP-0002bA-Es for control@debbugs.gnu.org; Sat, 02 Jul 2022 18:26:31 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 12DBB160143 for ; Sat, 2 Jul 2022 15:26:26 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 9dzAa2cUdWHo for ; Sat, 2 Jul 2022 15:26:25 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 567EE160145 for ; Sat, 2 Jul 2022 15:26:25 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id wzbBght_QXna for ; Sat, 2 Jul 2022 15:26:25 -0700 (PDT) Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id F1E18160143 for ; Sat, 2 Jul 2022 15:26:24 -0700 (PDT) Message-ID: <32055ca4-b721-1d5b-bc48-01d001b6283e@cs.ucla.edu> Date: Sat, 2 Jul 2022 17:26:24 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Content-Language: en-US To: control@debbugs.gnu.org From: Paul Eggert Subject: grep bug report maintenance Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) severity 56079 wishlist close 26205 close 20990