From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 25 01:36:11 2024
Date: Sat, 24 Feb 2024 21:44:24 +0100
From: Aearil
To: bug-coreutils@gnu.org
Subject: wc -w ignores breaking space over UCHAR_MAX
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

Hi,

wc -w doesn't seem to recognize whitespace characters with a codepoint
over UCHAR_MAX (255) as word separators. For example, using the
character EM SPACE U+2003:

    $ printf "foo\u2003bar" | ./wc -w
    1

I should get a word count of 2, but instead the space is ignored while
counting words.
Meanwhile, wc v9.4 gives the correct answer:

    $ printf "foo\u2003bar" | wc -w
    2

It looks like the regression has been introduced by [f40c6b5] and
would be fixed by something like the following change:

diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                           if (width > 0)
                             linepos += width;
                         }
-                      in_word2 = !iswnbspace (wide_char);
+                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
                     }
 
                   /* Count words by counting word starts, i.e., each

Cheers,

-- 
Aearil

From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 25 07:25:18 2024
Date: Sun, 25 Feb 2024 12:16:48 +0000
From: Pádraig Brady
To: Aearil, 69369-done@debbugs.gnu.org
Subject: Re: bug#69369: wc -w ignores breaking space over UCHAR_MAX
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Content-Language: en-US
Content-Type: multipart/mixed; boundary="------------bUG7w50TCX0FYEXPPfri7gDW"

This is a multi-part message in MIME format.
--------------bUG7w50TCX0FYEXPPfri7gDW
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 24/02/2024 20:44, Aearil via GNU coreutils Bug Reports wrote:
> Hi,
>
> wc -w doesn't seem to recognize whitespace characters with a codepoint
> over UCHAR_MAX (255) as word separators. For example, using the
> character EM SPACE U+2003:
>
>     $ printf "foo\u2003bar" | ./wc -w
>     1
>
> I should get a word count of 2, but instead the space is ignored while
> counting words.
> Meanwhile, wc v9.4 gives the correct answer:
>
>     $ printf "foo\u2003bar" | wc -w
>     2
>
> It looks like the regression has been introduced by [f40c6b5] and
> would be fixed by something like the following change:
>
> diff --git a/src/wc.c b/src/wc.c
> index f5a921534..9d456f8c0 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                            if (width > 0)
>                              linepos += width;
>                          }
> -                      in_word2 = !iswnbspace (wide_char);
> +                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
>                      }
>
>                    /* Count words by counting word starts, i.e., each

Nice one. Great to catch this before release.
I've augmented your patch with a test, and will push the attached later.

Marking this as done.

thanks!
Pádraig
--------------bUG7w50TCX0FYEXPPfri7gDW
Content-Type: text/x-patch; charset=UTF-8; name="wc-wide-space.patch"
Content-Disposition: attachment; filename="wc-wide-space.patch"

From ced8c64c986b79c0bfa74028a9581e07d5df1974 Mon Sep 17 00:00:00 2001
From: Aearil <aearil@paranoici.org>
Date: Sat, 24 Feb 2024 21:44:24 +0100
Subject: [PATCH] wc: fix -w with breaking space over UCHAR_MAX

* src/wc.c (wc): Fix regression introduced in commit v9.4-48-gf40c6b5cf.
* tests/wc/wc-nbsh.sh: Add test cases for "standard" spaces.
Fixes https://bugs.gnu.org/69369
---
 src/wc.c            | 2 +-
 tests/wc/wc-nbsp.sh | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/src/wc.c b/src/wc.c
index f5a921534..9d456f8c0 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -528,7 +528,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                           if (width > 0)
                             linepos += width;
                         }
-                      in_word2 = !iswnbspace (wide_char);
+                      in_word2 = !iswspace (wide_char) && !iswnbspace (wide_char);
                     }
 
                   /* Count words by counting word starts, i.e., each
diff --git a/tests/wc/wc-nbsp.sh b/tests/wc/wc-nbsp.sh
index 371cc8b5b..39a8baccc 100755
--- a/tests/wc/wc-nbsp.sh
+++ b/tests/wc/wc-nbsp.sh
@@ -38,10 +38,15 @@ fi
 
 export LC_ALL=en_US.UTF-8
 if test "$(locale charmap 2>/dev/null)" = UTF-8; then
+  #non breaking space class
   check_word_sep '\u00A0'
   check_word_sep '\u2007'
   check_word_sep '\u202F'
   check_word_sep '\u2060'
+
+  #sampling of "standard" space class
+  check_word_sep '\u0020'
+  check_word_sep '\u2003'
 fi
 
 export LC_ALL=ru_RU.KOI8-R
-- 
2.43.0

--------------bUG7w50TCX0FYEXPPfri7gDW--

From unknown Sun Jun 15 09:01:32 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Wed, 27 Mar 2024 11:24:17 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator