From unknown Fri Sep 05 08:56:12 2025 X-Loop: help-debbugs@gnu.org Subject: bug#36094: Possible sed bug Resent-From: Roel Van de Paar Original-Sender: "Debbugs-submit" Resent-CC: bug-sed@gnu.org Resent-Date: Wed, 05 Jun 2019 02:16:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 36094 X-GNU-PR-Package: sed X-GNU-PR-Keywords: To: 36094@debbugs.gnu.org X-Debbugs-Original-To: bug-sed@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.155970094526717 (code B ref -1); Wed, 05 Jun 2019 02:16:02 +0000 Received: (at submit) by debbugs.gnu.org; 5 Jun 2019 02:15:45 +0000 Received: from localhost ([127.0.0.1]:46179 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hYLSq-0006wq-Kk for submit@debbugs.gnu.org; Tue, 04 Jun 2019 22:15:44 -0400 Received: from eggs.gnu.org ([209.51.188.92]:52209) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hYJxT-0002ZK-Cg for submit@debbugs.gnu.org; Tue, 04 Jun 2019 20:39:15 -0400 Received: from lists.gnu.org ([209.51.188.17]:49707) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1hYJxO-0005Ny-6N for submit@debbugs.gnu.org; Tue, 04 Jun 2019 20:39:10 -0400 Received: from eggs.gnu.org ([209.51.188.92]:38332) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1hYJxN-0004A2-7q for bug-sed@gnu.org; Tue, 04 Jun 2019 20:39:10 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, HTML_MESSAGE autolearn=disabled version=3.3.2 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1hYJxM-0005JF-C4 for bug-sed@gnu.org; Tue, 04 Jun 2019 20:39:09 -0400 Received: from mail-lf1-x12e.google.com ([2a00:1450:4864:20::12e]:40288) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1hYJxM-0005E3-0H for bug-sed@gnu.org; Tue, 04 
Jun 2019 20:39:08 -0400 Received: by mail-lf1-x12e.google.com with SMTP id a9so16380968lff.7 for ; Tue, 04 Jun 2019 17:39:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=0lN6G//VAkGAgzR3BRESbBJfXl5LsCo/DcKg86mDvLk=; b=ULaYfuXYL9ujbrILCSHajd/UzChPoaLR3R7eL62AgPH0ByDuSBeYIyM2/7wDGVUXaj Qm4gVzn8PrJedtqnMxYc0Nx6qHNsa2ZhGHwRlHfS0m9E2gwKS3Vzqo8joxyEVbS2DA+7 9vBqVZKZYkI35m5zKx3FqOeUUEY9I/guYq6CY2U5vjfuwl1Kvu1Vi1Wo/VDXAECBB+2L JEXd+JIqNDLfXoyh/Io/kmxQZ6TxmQs8Mgv4uLBnNw9hdu+cz//e6bdE7NCfIy3iIyLM +adW0Mr7kTDwFRIfAtdJEpgs8s9/XgtnmBL5fiSZke96UI5O8wz+lRK+0fzvL+oxdB3+ Ix3Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=0lN6G//VAkGAgzR3BRESbBJfXl5LsCo/DcKg86mDvLk=; b=OxEHiXKFsqgsnIrkjyDWyno9FYwLqYNLZ6lTcy4sX/k+8sT0ToFzlXkVfYiwe2qTAD /UYTqZ9Z84zKldGZTkDXf7XB3gLFW7u34/HdR53COGbUQ3Bz6rjfi1BMx6mr+Wo+Dml7 7wFT/N/UbhLmkT/C5rvGAcl8bgzp8cC0UQ0kty91tt/wc3Sz4cKgG4VaNQGUe5wO/mEi xbc6qdnptR2q10hMjGkm8HiJb8AQXE/gGFaJLClB2ipLGXAQl5+57vJDr6dYGCk7PvVr 26+F8QQPsAM8MyxQCqrBjpsI5gAVZf+c0gglpfAFAFNghxNlHzXd18NIWrnMY/Pnj93d mR0w== X-Gm-Message-State: APjAAAURM6pLTnakDu9Oe2fw3apy8qVWfBh9CVdO6423mAW6Zp+7PLXw 70v+19gIjwtfbc8iG6q6QAi696/o8qN9W2e8cl165xpv0XA= X-Google-Smtp-Source: APXvYqz4mAGZhn5TrufFjxq9ZQyues2MJLwiD3J46ACAtEhVPdX/thaVnB7qZQmiHKUNpWhQ82pBjY1lU1dJx+ufo2c= X-Received: by 2002:ac2:596c:: with SMTP id h12mr740378lfp.101.1559695145303; Tue, 04 Jun 2019 17:39:05 -0700 (PDT) MIME-Version: 1.0 From: Roel Van de Paar Date: Wed, 5 Jun 2019 10:38:53 +1000 Message-ID: Content-Type: multipart/mixed; boundary="000000000000182351058a88d3d7" X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. 
X-Received-From: 2a00:1450:4864:20::12e X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Spam-Score: -1.3 (-) X-Mailman-Approved-At: Tue, 04 Jun 2019 22:15:42 -0400 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) --000000000000182351058a88d3d7 Content-Type: multipart/alternative; boundary="00000000000018234d058a88d3d5" --00000000000018234d058a88d3d5 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable See attached 'test' file with some non-ASCII chars. $ cat test a=EF=BF=BD-=EF=BF=BD- Then; $ sed -i "s|.*|allgone|" test && cat test allgone=EF=BF=BD-=EF=BF=BD- Or (using fresh copy of 'test'); $ sed -i "s|.*|allgone|gi" test && cat test allgone=EF=BF=BDallgone=EF=BF=BDallgone Expected output in both cases would seem to be "allgone" on the line and nothing else? God Bless, Roel --00000000000018234d058a88d3d5 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
--00000000000018234d058a88d3d5-- --000000000000182351058a88d3d7 Content-Type: application/octet-stream; name=test Content-Disposition: attachment; filename=test Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_jwii0in60 Ya0trS0K --000000000000182351058a88d3d7-- From unknown Fri Sep 05 08:56:12 2025 X-Loop: help-debbugs@gnu.org Subject: bug#36094: Possible sed bug Resent-From: Assaf Gordon Original-Sender: "Debbugs-submit" Resent-CC: bug-sed@gnu.org Resent-Date: Wed, 05 Jun 2019 14:18:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 36094 X-GNU-PR-Package: sed X-GNU-PR-Keywords: To: Roel Van de Paar Cc: 36094@debbugs.gnu.org Received: via spool by 36094-submit@debbugs.gnu.org id=B36094.155974426223422 (code B ref 36094); Wed, 05 Jun 2019 14:18:02 +0000 Received: (at 36094) by debbugs.gnu.org; 5 Jun 2019 14:17:42 +0000 Received: from localhost ([127.0.0.1]:47757 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hYWjW-00065h-IE for submit@debbugs.gnu.org; Wed, 05 Jun 2019 10:17:42 -0400 Received: from mail-pf1-f171.google.com ([209.85.210.171]:33959) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1hYWjU-00065Q-Tq; Wed, 05 Jun 2019 10:17:41 -0400 Received: by mail-pf1-f171.google.com with SMTP id c85so6130258pfc.1; Wed, 05 Jun 2019 07:17:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=Bjh70231qlN/ski7NyDJETFRI/cIW1gg7CKfMsHRT8w=; b=KAFh0EdMzybUqJUTPsGwo0oMXXbIiKOCwc8YUv2h8FkXuGj37KgOoa2j8T9NU0Bcq6 9QyS5+Zd2dHCAG2UK4prrWvFWQnS3V743E82DCdfBM6uNlQEQpuG3394qeZZywropeHy u5Up5WuCTXn/ohvhvH458Zv9E9wXc5uXbNxEqul8SO8CLY6DfYOGHKSxMB0QsLFwuFwC 2BipZOugnSmfA4FxuEZAteGwOdrD02FeSHeSusT50r5ivYDkKyK9Nfr+W28hKriP9ZcG UI8FlnwyzPEYp7f5LAxxWzBhz6bo/NCf1rjAK76hqFJORORx2BdvQEWytqbvZAwPl5sJ RXQQ== 
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=Bjh70231qlN/ski7NyDJETFRI/cIW1gg7CKfMsHRT8w=; b=ko8vHwd7qEMfYFjHhauR6DNxtejp9a4ScGB6rd+pqRxAdijwdg/p3yQH1YHXgdueSv 2bWS8fHBZ/OQ++rBRjvda7beV/u7WQmwrrhE87CF4lplxk5D6/EbgknWSpN161+1ZbJ5 LKTJPqPwZ8w8zOBXzOOliPXImkOjYVSrg3pr/qenpZnt+44DT5FlA5huzLZ8WVccDmUL f7BJKRMyppqlFTQ1WjskZIBYpUmTPwg8r9UGA/PY0gwW2lRXqvjkFWoA3lGkQzAZai86 urpYem5g1S/dCs4wxazSfY7WqDSqDcgSYibqzKzGfZyAz8EeyAUJfvPySLBPNxiTJsWe +Q4w== X-Gm-Message-State: APjAAAWjPzl6+XRfsod4R1k/rBOX07NmZMsS6L1zkK7T8GqI0DGL51lM oIbbkxxggG8I03MB5o0fuC7xyxhO X-Google-Smtp-Source: APXvYqxxUQkUazTr3QXijW1yYgcS7Z37wcgEdoupA9wBzhuZLkOoe0K384iCytkT4QuEAXsfGU3Mjg== X-Received: by 2002:a65:56cb:: with SMTP id w11mr4761936pgs.236.1559744254499; Wed, 05 Jun 2019 07:17:34 -0700 (PDT) Received: from tomato (moose.housegordon.com. 
[184.68.105.38]) by smtp.gmail.com with ESMTPSA id c142sm23162291pfb.171.2019.06.05.07.17.33 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Wed, 05 Jun 2019 07:17:33 -0700 (PDT) Received: by tomato (Postfix, from userid 1000) id 4812F680D80; Wed, 5 Jun 2019 08:17:32 -0600 (MDT) Date: Wed, 5 Jun 2019 08:17:32 -0600 From: Assaf Gordon Message-ID: <20190605141732.GA18519@tomato.moose.housegordon.com> References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.11.4 (2019-03-13) X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

tag 36094 notabug
close 36094
stop

Hello,

On Wed, Jun 05, 2019 at 10:38:53AM +1000, Roel Van de Paar wrote:
> $ cat test
> a�-�-
>
> $ sed -i "s|.*|allgone|gi" test && cat test
> allgone�allgone�allgone
>
> Expected output in both cases would seem to be "allgone" on the line and
> nothing else?

This is not a bug, but a side effect of having invalid UTF-8 characters in the input file while working in a UTF-8 locale.

POSIX requires that the '.*' regular expression not match invalid characters. The 'test' input file contains two bytes of octal 255 (\xAD). These are invalid under a UTF-8 locale, and the regex matching stops at these bytes. The remaining characters in the file are matched as three separate matches (because of the "g" flag).

The simplest solution when working with such files is to force the C locale, where all bytes are considered valid (but then you lose UTF-8 capabilities).

Compare:

  $ LC_ALL=en_CA.utf8 sed "s|.*|allgone|g" test | od -An -c
  a l l g o n e 255 a l l g o n e 255 a l l g o n e \n

  $ LC_ALL=C sed "s|.*|allgone|g" test | od -An -c
  a l l g o n e \n

But then multi-byte UTF-8 characters are processed as individual bytes:

  $ printf "\U1011\n" | LC_ALL=en_CA.utf8 sed 's/./A/g'
  A

  $ printf "\U1011\n" | LC_ALL=C sed 's/./A/g'
  AAA

As a side note, this is the reason GNU sed has the non-standard 'z' command to clear the pattern space: the more intuitive 's/.*//' command fails to clear a pattern space containing invalid characters.

  $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 's/.*//g' | od -An -tx1
  ff 0a

  $ printf "FOO \xFF\n" | LC_ALL=en_CA.utf8 sed 'z' | od -An -tx1
  0a

I'm closing this as "not a bug", but discussion can continue by replying to this thread.

regards,
 - assaf
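
For anyone who wants to reproduce the C-locale workaround without the attachment, here is a minimal sketch. It assumes the 'test' file is exactly the six bytes obtained by decoding the base64 payload in the original report (Ya0trS0K), and the helper names make_test_file and replace_all_c_locale are made up for illustration. It only exercises the C locale, since a UTF-8 locale such as en_CA.utf8 may not be installed on every system.

```shell
# Recreate the attachment from the report: bytes 61 ad 2d ad 2d 0a.
# Octal 255 in the printf format is the byte 0xAD, which is what
# `od -c` displays as "255" in the output quoted above.
make_test_file() {
    printf 'a\255-\255-\n'
}

# In the C locale every byte is a valid single-byte character,
# so '.*' can span the whole line, including the 0xAD bytes.
replace_all_c_locale() {
    LC_ALL=C sed 's|.*|allgone|'
}

# Show the raw bytes, then the substitution result.
make_test_file | od -An -tx1
make_test_file | replace_all_c_locale    # prints: allgone
```

The trade-off is the one described above: under LC_ALL=C, sed treats multi-byte UTF-8 characters as individual bytes, so this is only appropriate when byte-wise matching is acceptable.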