From: bug#60697 <60697@debbugs.gnu.org>
To: bug#60697 <60697@debbugs.gnu.org>
Subject: Status: GNU grep mishandles \b near encoding errors
Date: Thu, 14 Aug 2025 19:20:43 +0000

retitle 60697 GNU grep mishandles \b near encoding errors
reassign 60697 grep
submitter 60697 Paul Eggert
severity 60697 normal
thanks


From: Paul Eggert
To: bug-grep@gnu.org
Subject: GNU grep mishandles \b near encoding errors
Organization: UCLA Computer Science Department
Date: Mon, 9 Jan 2023 15:00:15 -0800

Here's a shell session illustrating the problem on Fedora 37, which has
GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.

$ export LC_ALL=en_US.utf8
$ printf '\300\n' | grep '\b'
grep: (standard input): binary file matches
$ printf '\300\n' | grep -P '\b'
$

Plain grep finds a word boundary in the input even though the input
contains no words (just an encoding error). 'grep -P' does the right thing.

The underlying issue is in the glibc regex code so the fix should be in
glibc / Gnulib, but I thought I'd report it here before I forgot it.


From: Jim Meyering
To: Paul Eggert
Cc: 60697@debbugs.gnu.org
Subject: Re: bug#60697: GNU grep mishandles \b near encoding errors
Date: Wed, 11 Jan 2023 22:03:52 -0800

On Mon, Jan 9, 2023 at 10:16 PM Paul Eggert wrote:
> Here's a shell session illustrating the problem on Fedora 37, which has
> GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.
>
> $ export LC_ALL=en_US.utf8
> $ printf '\300\n' | grep '\b'
> grep: (standard input): binary file matches
> $ printf '\300\n' | grep -P '\b'
> $
>
> Plain grep finds a word boundary in the input even though the input
> contains no words (just an encoding error). 'grep -P' does the right thing.
>
> The underlying issue is in the glibc regex code so the fix should be in
> glibc / Gnulib, but I thought I'd report it here before I forgot it.

Thanks! While this would definitely be nice to fix before the release
(in the next week or so), it's enough of a corner case that I wouldn't
feel bad releasing without a fix.
For the record, this problem first arose in grep-2.19.
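
For anyone who wants to check a local build, the session from the original report can be wrapped in a small script. This is only a sketch: probe_boundary is a hypothetical helper name (not part of grep), and it assumes the en_US.utf8 locale is installed. If that locale is missing, grep may fall back to the C locale and the result may not reflect the reported behavior.

```shell
#!/bin/sh
# Sketch: classify how grep treats '\b' against a lone 0xC0 byte,
# which is an encoding error in a UTF-8 locale.
# probe_boundary is a hypothetical helper, not part of grep.
# Usage: probe_boundary [grep-option]    e.g. probe_boundary -P
probe_boundary() {
    # ${1:+"$1"} passes the extra option only when one was given.
    # -q suppresses the "binary file matches" message; stderr is
    # discarded so an unsupported option does not clutter the output.
    printf '\300\n' | LC_ALL=en_US.utf8 grep -q ${1:+"$1"} '\b' 2>/dev/null
    # $? is the exit status of grep, the last command in the pipeline.
    case $? in
        0) echo match ;;       # grep reported a word boundary
        1) echo no-match ;;    # no word boundary found
        *) echo error ;;       # e.g. -P not supported by this build
    esac
}

probe_boundary       # plain grep (regex matcher path)
probe_boundary -P    # Perl-compatible matcher path
```

Going by the report, an affected grep would be expected to answer "match" for the first call and "no-match" for the second. A result of "error" for the -P call most likely means the build lacks Perl-compatible regex support.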