From unknown Tue Sep 23 14:39:20 2025
From: Peng Yu
Date: Wed, 3 Jun 2020 09:27:52 -0500
Subject: bug#41687: regex search for indexed files
To: 41687@debbugs.gnu.org

Hi,

grep can do regex search but it needs to scan each file. When the
number of files is large, it can be slow.

Is there an alternative tool that can do regex search in the indexed
files (including .docx .pdf and other commonly used file formats that
can be converted to text) so that the search can be fast?

I found this, but it is too old and doesn't support formats like pdf and docx.
https://github.com/google/codesearch

--
Regards,
Peng

From unknown Tue Sep 23 14:39:20 2025
From: Assaf Gordon
Date: Sat, 6 Jun 2020 22:45:03 -0600
Subject: bug#41687: regex search for indexed files
To: Peng Yu, 41687@debbugs.gnu.org
Cc: control@debbugs.gnu.org

tag 41687 notabug
close 41687
stop

Hello,

On 2020-06-03 8:27 a.m., Peng Yu wrote:
> grep can do regex search but it needs to scan each file. When the
> number of files is large, it can be slow.
>
> Is there an alternative tool that can do regex search in the indexed
> files (including .docx .pdf and other commonly used file formats that
> can be converted to text) so that the search can be fast?

It seems you are mixing several questions together.

1. If you want "grep" to search only a specific set of files, use the
"--include" or "--exclude" options. Or better yet, use find+xargs+grep.

2. If you want to search in non-text files, use appropriate programs
that understand the file format (e.g. "pdfgrep") or programs that can
convert the custom format to text (e.g. "antiword" and "wv").

3. You've mentioned "indexed files" - if you're looking for a program
that scans files and indexes them, and then allows you to search the
index, look for "Desktop search" programs, e.g.
https://en.wikipedia.org/wiki/List_of_search_engines#Desktop_search_engines
https://en.wikipedia.org/wiki/Recoll
https://en.wikipedia.org/wiki/Tracker_(search_software)

---

Lastly,
For all of these topics, a simple internet search would have given you
the above results.
PLEASE respect everyone's time by first searching for answers
yourself, before posting questions on a public mailing list.

---

Since this is not a bug in grep, I'm marking this as "closed".

regards,
 - assaf
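
For reference, suggestion 1 in the reply above can be sketched roughly as follows. This is only an illustration, assuming GNU grep (for "-r" and "--include") and a find/xargs pair that supports "-print0"/"-0"; the temporary directory, file names, and pattern are hypothetical and do not come from the thread.

```shell
#!/bin/sh
# Hypothetical demo tree; names and contents are made up for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/docs"
printf 'alpha regex beta\n' > "$demo/docs/notes.txt"
printf 'regex in a log\n'   > "$demo/docs/build.log"
printf 'nothing here\n'     > "$demo/docs/other.txt"

# Variant A: grep recurses by itself, but only reads files matching the glob.
by_include=$(grep -rl --include='*.txt' 'reg[a-z]*x' "$demo")

# Variant B: find selects the files and xargs passes them to grep.
# -print0 / -0 keep file names with spaces or newlines intact.
by_find=$(find "$demo" -type f -name '*.txt' -print0 | xargs -0 grep -l 'reg[a-z]*x')

# Both variants list only the .txt file that contains a match.
printf '%s\n' "$by_include"
printf '%s\n' "$by_find"

# Remove the demo tree again.
rm -rf "$demo"
```

The find+xargs form is the more flexible of the two, since find can also select files by size, modification time, or directory depth before grep reads anything. Neither variant builds an index, so every selected file is still scanned on each run.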