From unknown Sat Jun 14 19:09:25 2025 X-Loop: help-debbugs@gnu.org Subject: bug#38690: grep -P quadratic on long lines Resent-From: Martin Raszyk Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Fri, 20 Dec 2019 15:33:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 38690 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 38690@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.15768559771509 (code B ref -1); Fri, 20 Dec 2019 15:33:02 +0000 Received: (at submit) by debbugs.gnu.org; 20 Dec 2019 15:32:57 +0000 Received: from localhost ([127.0.0.1]:47172 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1iiKGv-0000OH-Jc for submit@debbugs.gnu.org; Fri, 20 Dec 2019 10:32:57 -0500 Received: from lists.gnu.org ([209.51.188.17]:37021) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1iiEzM-0005yF-QW for submit@debbugs.gnu.org; Fri, 20 Dec 2019 04:54:29 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:39513) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1iiEzL-00040e-KI for bug-grep@gnu.org; Fri, 20 Dec 2019 04:54:28 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM autolearn=disabled version=3.3.2 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1iiEzK-0005yV-Cw for bug-grep@gnu.org; Fri, 20 Dec 2019 04:54:27 -0500 Received: from mail-qt1-x834.google.com ([2607:f8b0:4864:20::834]:45932) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1iiEzK-0005uo-40 for bug-grep@gnu.org; Fri, 20 Dec 2019 04:54:26 -0500 Received: by mail-qt1-x834.google.com with SMTP id l12so7681053qtq.12 for ; Fri, 20 Dec 2019 01:54:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=Yh920Pc6WbqOxKVMvkz3I7LNlTTAzBAQUgPirvyNRjc=; b=bIpT9+1xINNUMCLvpldtrAcvqO7cVr5c2CheF2CZEqUKdHMcvxCt2pXUky6frpt5xK 6uZZdrmiT6LfZ6aoXTE/7c1BlxBdbu7RzVZ0Qj1PLh0eZiaD4WeUG2/pbqhkqpwp5mgu 19ZRfhH6+KXdK/IAcxTCZ1jQWqptwEYMqQO5HgDUSSQKwocFyBjHroR9tcgPYIbmhvzq vj2mz/gdmOUfLSQ8u33yW1POTAotJHBT6EMb7TAkeQ5kYerlK45ael23a/CLep608Hq6 3DLZ8qsTdyy2wdZW5wZ13l+96tPTyka79L0CZxCRqE4VFZhJxZUbC135ddr4rWgnwmLT 4BxQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=Yh920Pc6WbqOxKVMvkz3I7LNlTTAzBAQUgPirvyNRjc=; b=J7+EmO78KgU2FOWP3VK9MpwIJoR+Q4mlq+gpYwgWxcg1XC2WUyyiUEKX59Ig10nIOw xHmXYlpzDApRn3N/xtYRmVWrJi3rnVcGKE6f+W/5X/ddd6aVaH/rT6eRfFoEkgQxpLYE Y6SzJUAarD8O5FtZpev1lvUGK+2UKecMY69GCnVZUc1qk0/W8PzGMS3SEIlTL7RL4peL jGEcYvxzUut7p0O0f1451eVOP2SmMa+yo/NpF8e4dxJrw9Vpw3b6aONYWhTpgRwx9eHR 3C0bmBj46gacUAbhklbwgUHb6wCL2L/jgpnrQNi+/y4qRor41S1lSzsFzw5Dn5CS08P8 vUEQ== X-Gm-Message-State: APjAAAXjy3k/Vs+UHbNaPlDvN0DiZp3lnQdxT6SJYbGqdJzGTWtVFLSc oy2NzkoAPjKdsbnVDB/Sbc+cilxJN69V3a7lB5aa4eovufpCzQ== X-Google-Smtp-Source: APXvYqztmFHaYZvpYlNUQTIkAYyNNXeL06UiNB3loAyExo7ixUVPvglD93dX3gJvsr0JNUl3PE9iWfKhQd0zgXS/CG8= X-Received: by 2002:ac8:23af:: with SMTP id q44mr10879833qtq.175.1576835664894; Fri, 20 Dec 2019 01:54:24 -0800 (PST) MIME-Version: 1.0 From: Martin Raszyk Date: Fri, 20 Dec 2019 10:54:14 +0100 Message-ID: Content-Type: text/plain; charset="UTF-8" X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4864:20::834 X-Spam-Score: 0.7 (/) X-Mailman-Approved-At: Fri, 20 Dec 2019 10:32:56 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) Dear grep maintainers, I've realized that grep -P takes quadratic time on long lines with many short matches. For instance, executing './src/grep -P -o "a" a.txt > a.out' on a file 'a.txt' consisting of N characters 'a' takes time quadratic in N. I've used grep-3.3 and pcre-8.43 for the benchmarks. The root causes for this behavior are as follows: 1. in src/pcresearch.c on l. 222 (at commit cf09252295c554dd3eba4cdb8eb53911fb495f40), the end of the line is searched each time a new match is searched; this already results in quadratic runtime in the above mentioned case 2. the function 'pcre_exec' from pcre-8.43 called in src/pcresearch.c on l. 71 for each match checks if the provided string is valid UTF-8 (code implemented in pcre_valid_utf8.c); this also results in quadratic runtime On your side, it is possible to fix the first root cause. I'll post an e-mail to PCRE mailing list about the second root cause. Best regards, Martin