From unknown Mon Jun 23 04:09:38 2025
From: bug#34524 <34524@debbugs.gnu.org>
To: bug#34524 <34524@debbugs.gnu.org>
Subject: Status: wc: word count incorrect when words separated only by no-break space
Reply-To: bug#34524 <34524@debbugs.gnu.org>
Date: Mon, 23 Jun 2025 11:09:38 +0000

retitle 34524 wc: word count incorrect when words separated only by no-break space
reassign 34524 coreutils
submitter 34524 vampyrebat@gmail.com
severity 34524 normal
thanks

From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 18 03:12:55 2019
Message-ID: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com>
Date: Mon, 18 Feb 2019 02:12:15 -0600
To: bug-coreutils@gnu.org
From: vampyrebat@gmail.com
Subject: wc: word count incorrect when words separated only by no-break space

$ wc --version
wc (GNU coreutils) 8.29
Packaged by Gentoo (8.29-r1 (p1.0))

The man page for wc states: "A word is a... sequence of characters delimited by white space."

But its concept of white space only seems to include ASCII white space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.

If your terminal displays UTF-8 encoding:

printf 'how are\xC2\xA0you\n'

or if your terminal displays ISO 8859-1 encoding:

printf 'how are\xA0you\n'

the visible output of this printf is "how are you".
In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:

$ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
2
$ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
2

From debbugs-submit-bounces@debbugs.gnu.org Fri Feb 22 18:34:13 2019
Date: Fri, 22 Feb 2019 16:34:04 -0700
From: Bob Proulx
To: vampyrebat@gmail.com
Cc: 34524@debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space
Message-ID: <20190222162553369112526@bob.proulx.com>
In-Reply-To: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com>

vampyrebat@gmail.com wrote:
> The man page for wc states: "A word is a... sequence of characters delimited by white space."
>
> But its concept of white space only seems to include ASCII white
> space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.

Indeed this is because wc and other coreutils programs, and other
programs, use the libc locale definition.

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 od -tx1 -c
  0000000  c2  a0  0a
          302 240  \n
  0000003

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
  0
  $ printf '\xC2\xA0 \n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
  1

This shows that grep does not recognize \xC2\xA0 as a character in the
class of space characters either.

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 tr '[[:space:]]' x | od -tx1 -c
  0000000  c2  a0  78
          302 240   x
  0000003

And while the newline, which is a space character, matches and is
translated, the no-break space is not.

Since character classes are defined as part of the locale table there
isn't really anything we can do about it on the coreutils wc side of
things. It would need to be redefined upstream there.
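
The same locale dependence can be probed from C instead of through grep and tr. What follows is only a minimal sketch, assuming a C99 libc that provides <wctype.h>. The helper name `space_class` is made up for illustration and is not part of coreutils. The answer for non-ASCII code points depends on the libc and on the locale selected with setlocale().

```c
#include <wctype.h>

/* Hypothetical helper (not from coreutils): classify a wide character
   using the LC_CTYPE tables of the current locale.
   Returns 0 if WC is not in the "space" class,
           1 if it is in "space" but not in "blank",
           2 if it is in both "space" and "blank".  */
int
space_class (wint_t wc)
{
  if (!iswspace (wc))
    return 0;
  return iswblank (wc) ? 2 : 1;
}
```

Calling setlocale (LC_ALL, "") first and then space_class (0x00A0) reports how the environment's locale classifies NO-BREAK SPACE. Going by the grep result above, a glibc UTF-8 locale would be expected to answer 0.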
Bob

From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 24 00:22:58 2019
Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space
To: vampyrebat@gmail.com, 34524@debbugs.gnu.org
Cc: Bruno Haible
From: Pádraig Brady
Message-ID: <39f9dd85-aeb2-6eac-7857-1d1910f0476f@draigBrady.com>
Date: Sat, 23 Feb 2019 21:22:51 -0800
In-Reply-To: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com>

On 18/02/19 00:12, vampyrebat@gmail.com wrote:
> $ wc --version
> wc (GNU coreutils) 8.29
> Packaged by Gentoo (8.29-r1 (p1.0))
>
> The man page for wc states: "A word is a... sequence of characters delimited by white space."
>
> But its concept of white space only seems to include ASCII white space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.
>
> If your terminal displays UTF-8 encoding:
>
> printf 'how are\xC2\xA0you\n'
>
> or if your terminal displays ISO 8859-1 encoding:
>
> printf 'how are\xA0you\n'
>
> the visible output of this printf is "how are you". In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:
>
> $ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
> 2
> $ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
> 2

wc does support multi-byte locales well and we use iswspace()
to test whether it's a separator or not.
Though on glibc, NBSP is not considered a space.
I wrote a little prog to output what is considered
a space on glibc locales:

  0009 HORIZONTAL TAB
  000A NEW LINE (not blank)
  000B VERTICAL TAB (not blank)
  000C FORM FEED (not blank)
  000D CARRIAGE RETURN (not blank)
  0020 SPACE
  1680 OGHAM SPACE MARK
  2000 EN QUAD
  2001 EM QUAD
  2002 EN SPACE
  2003 EM SPACE
  2004 THREE-PER-EM SPACE
  2005 FOUR-PER-EM SPACE
  2006 SIX-PER-EM SPACE
  2008 PUNCTUATION SPACE
  2009 THIN SPACE
  200A HAIR SPACE
  2028 LINE SEPARATOR (not blank)
  2029 PARAGRAPH SEPARATOR (not blank)
  205F MEDIUM MATHEMATICAL SPACE
  3000 IDEOGRAPHIC SPACE

In the non breaking space class we have:

  00A0 NON BREAKING SPACE
  2007 FIGURE SPACE
  202F NARROW NO-BREAK SPACE
  2060 WORD JOINER

Maybe we should consider these as word separators?
I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'`
into libreoffice writer and it treated all but the last
as a word separator in its word count tool.

There is some discussion of POSIX and unicode classes at:
http://unicode.org/L2/L2003/03139-posix-classes.htm

I guess POSIX is defining lower level functionality
and has to be compat with all uses of iswspace()
which might be used for line reformatting etc.
but wc(1) being higher level, perhaps should consider
the non breaking variants as word separators?
The following change would do that:

diff --git a/src/wc.c b/src/wc.c
index 179abbe..ca990b4 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
   exit (status);
 }

+static int _GL_ATTRIBUTE_PURE
+iswnbspace (wint_t wc)
+{
+  return wc == L'\u00A0' || wc == L'\u2007' \
+         || wc == L'\u202F' || wc == L'\u2060';
+}
+
 /* FILE is the name of the file (or NULL for standard input)
    associated with the specified counters.  */
 static void
@@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                       if (width > 0)
                         linepos += width;
                     }
-                  if (iswspace (wide_char))
+                  if (iswspace (wide_char) || iswnbspace (wide_char))
                     goto mb_word_separator;
                   in_word = true;
                 }

Note general word boundary handling is complicated:
https://www.unicode.org/reports/tr29/#Word_Boundaries
Consider this number with figure space:

  $ printf "1\u2007234,56\n"
  1 234,56

That would be considered as one word rather than two.
For more sophisticated contextual processing we would need
to use some of the word break functionality from libunistring.
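
To experiment with the effect of this change outside of wc.c, the same separator test can be applied to a wide string in memory. This is only a sketch under stated assumptions: `count_words` and `is_nb_space` are hypothetical helpers, not code from coreutils, and the sketch leaves out everything else the real loop does (byte decoding, iswprint() checks, line length tracking). It uses numeric code points rather than L'\u00A0' literals so that it does not depend on how the compiler handles universal character names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Same four code points as iswnbspace() in the patch above.  */
static bool
is_nb_space (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Hypothetical helper: count words in the NUL-terminated wide string S.
   A word is a maximal run of non-separator characters.  Characters for
   which iswspace() is true always separate; the no-break variants
   separate only when NBSP_SEPARATES is true.  */
size_t
count_words (const wchar_t *s, bool nbsp_separates)
{
  size_t words = 0;
  bool in_word = false;

  for (; *s != L'\0'; s++)
    {
      wint_t wc = (wint_t) *s;
      bool separator = iswspace (wc) || (nbsp_separates && is_nb_space (wc));

      if (separator)
        in_word = false;
      else if (!in_word)
        {
          /* First character of a new word.  */
          in_word = true;
          words++;
        }
    }
  return words;
}
```

With nbsp_separates set, the string from the original report (how are, U+00A0, you) counts as three words. With it cleared, the result depends on whether the libc's iswspace() accepts U+00A0, which per the list above glibc does not.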
cheers,
Pádraig

From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 24 08:58:09 2019
From: Bruno Haible
To: Pádraig Brady, bug-libunistring@gnu.org
Cc: vampyrebat@gmail.com, 34524@debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space
Date: Sun, 24 Feb 2019 14:58:02 +0100
Message-ID: <6462920.csScTDZgAZ@omega>
In-Reply-To: <39f9dd85-aeb2-6eac-7857-1d1910f0476f@draigBrady.com>

[Ccing bug-libunistring, because this is about Unicode handling in GNU. The
original thread is in .]

> > The man page for wc states: "A word is a... sequence of characters delimited by white space."
> >
> > But its concept of white space only seems to include ASCII white space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.
> >
> > If your terminal displays UTF-8 encoding:
> >
> > printf 'how are\xC2\xA0you\n'
> >
> > or if your terminal displays ISO 8859-1 encoding:
> >
> > printf 'how are\xA0you\n'
> >
> > the visible output of this printf is "how are you". In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:

It is a complicated issue.

I) Relax. Don't be religious about it.
II) POSIX char classes
III) User expectations
IV) The Unicode standard
V) Implementation issues


I) Relax. Don't be religious about it.
======================================

Unicode is an effort to make programs work *reasonably well* with as many
kinds of text as possible.

For example, Unicode 23.2 page 859 says:
  "The effect of layout controls is specific to particular text processes.
   As much as possible, layout controls are transparent to those text processes
   for which they were not intended."

Or, Unicode TR 29 says:
  "The precise determination of text elements may vary according to
   orthographic conventions for a given script or language. The goal of
   matching user perceptions cannot always be met exactly because the text
   alone does not always contain enough information to unambiguously decide
   boundaries. For example, the period (U+002E FULL STOP) is used ambiguously,
   sometimes for end-of-sentence purposes, sometimes for abbreviations, and
   sometimes for numbers. In most cases, however, programmatic text boundaries
   can match user perceptions quite closely, although sometimes the best that
   can be done is not to surprise the user."

Or, there is criticism:

Therefore, this is a reminder that sometimes no optimal solution can be found.
Relax.


II) POSIX char classes
======================

> There is some discussion of POSIX and unicode classes at:
> http://unicode.org/L2/L2003/03139-posix-classes.htm
>
> I guess POSIX is defining lower level functionality
> and has to be compat with all uses of iswspace()
> which might be used for line reformatting etc.
> but wc(1) being higher level, perhaps should consider
> the non breaking variants as word separators?

Exactly, that's the right approach. The POSIX char classes are defined in
glibc/localedata/unicode-gen/unicode_utils.py; in this case what matters is
the is_space function, and it has a comment:
  # Don’t make U+00A0 a space. Non-breaking space means that all programs
  # should treat it like a punctuation character, not like a space.
If U+00A0 was made a space, most programs would treat NO-BREAK SPACE like
SPACE, which is against the purpose of NO-BREAK SPACE. So, in general,
users should be aware that NO-BREAK SPACE is not a space. (And likewise,
the SOFT HYPHEN is not to be treated like HYPHEN, because that would be
against the purpose of the SOFT HYPHEN.)

But 'wc' is a specific program, with a specific purpose, and that might
warrant exceptions.

> I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'`
> into libreoffice writer and it treated all but the last
> as a word separator in its word count tool.

This is a good approach, because text processors usually deal with Unicode
in more detail and with more thought than we usually do in the command-line
/ monospaced world.


III) User expectations
======================

On one hand, user expectation that a no-break space separates words is
justified: In "Dr.\u00A0Pinkwart" a user sees two words.

On the other hand, the opposite user expectation is justified as well.
The English sentence "Look: here he is" is translated into French as
"Regarde\u00A0: le voilà". (It is customary to put a space before colon,
question mark, and exclamation mark in French. And to avoid line breaking
at these points, it must be a NO-BREAK space.) When a translator counts
the words they have translated, "Regarde : le voilà" should count as
3 words, not 4 words. OTOH, it could be argued that in this case, the
problem is that a word (":") consisting only of punctuation characters
should not be counted as a word.

But again: relax. Translators are being paid according to word counts,
but a word count that is 1 too high or 1 too low is not dramatic.


IV) The Unicode standard
========================

On one hand, the Unicode standard makes it clear in several places that
  1) NO-BREAK SPACE prohibits line breaking,
  2) line breaking and words are related.

See for example, the Unicode standard section 5.12 page 219:
  "Line breaking algorithms generally use state machines for determining
   word breaks."

Or the Unicode standard section 23.2 page 859
  "Word Joiner. U+2060 word joiner behaves like U+00A0 no-break space
   in that it indicates the absence of line breaks; ..."

On the other hand, in the same section 23.2 it says
  "Line breaking and word breaking are distinct text processes.
   Although a candidate position for a line break in text often coincides
   with a candidate position for a word break, there are also many
   situations where candidate break positions of different types do not
   coincide."

And in the Unicode TR 29 section 4 "Word boundaries"
it treats NO-BREAK SPACE as a word boundary by default - this can be
verified through the program below - but also says that SPACE and
NO-BREAK SPACE "may be tailored to be in MidNum, depending on the environment".

Here's an example program, that uses GNU libunistring:
==============================================================
#include <stdio.h>
#include <uniwbrk.h>
int main ()
{
  /* Word break property of U+00A0 NO-BREAK SPACE.  */
  printf ("%d\n", uc_wordbreak_property (0x00A0));
  {
    /* Ordinary SPACE before the colon: 19 bytes in UTF-8.  */
    uint8_t string[] = "Regarde : le voilà";
    char p[19];
    u8_wordbreaks (string, 19, p);
    puts ((char *) string);
    for (int i = 0; i < 19; i++)
      if (p[i])
        printf ("word break at position %d\n", i);
  }
  {
    /* NO-BREAK SPACE before the colon: 20 bytes in UTF-8.  */
    uint8_t string[] = "Regarde\u00A0: le voilà";
    char p[20];
    u8_wordbreaks (string, 20, p);
    puts ((char *) string);
    for (int i = 0; i < 20; i++)
      if (p[i])
        printf ("word break at position %d\n", i);
  }
}
==============================================================
and its output:

0 (means: WBP_OTHER)
Regarde : le voilà
word break at position 7
word break at position 8
word break at position 9
word break at position 10
word break at position 12
word break at position 13
Regarde : le voilà
word break at position 7
word break at position 9
word break at position 10
word break at position 11
word break at position 13
word break at position 14


V) Implementation issues
========================

> The following change would do that:
>
> diff --git a/src/wc.c b/src/wc.c
> index 179abbe..ca990b4 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
>    exit (status);
>  }
>
> +static int _GL_ATTRIBUTE_PURE
> +iswnbspace (wint_t wc)
> +{
> +  return wc == L'\u00A0' || wc == L'\u2007' \
> +         || wc == L'\u202F' || wc == L'\u2060';
> +}
> +
>  /* FILE is the name of the file (or NULL for standard input)
>     associated with the specified counters.  */
>  static void
> @@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                        if (width > 0)
>                          linepos += width;
>                      }
> -                  if (iswspace (wide_char))
> +                  if (iswspace (wide_char) || iswnbspace (wide_char))
>                      goto mb_word_separator;
>                    in_word = true;
>                  }
>
> ...
> For more sophisticated contextual processing we would need
> to use some of the word break functionality from libunistring.

I don't think you will be able to satisfactorily blend POSIX behaviour
with Unicode behaviour without introducing a command-line option.

On the POSIX side:
  POSIX says
    "The wc utility shall consider a word to be a non-zero-length string
     of characters delimited by white space."
  and
    "space  Define characters to be classified as white-space characters."
  So, when wc operates according to POSIX expectations, it MUST use
    iswspace (wide_char)
  not
    iswspace (wide_char) || iswnbspace (wide_char)

On the Unicode side:
  It is reasonable to see two "words" in "Regardez\u00A0:", and the
  GNU libunistring library implements it like this.
  It is also reasonable to expect that 'wc' counts words in the Thai
  language, which does not use spaces to delimit words. GNU libunistring
  may implement this in the future as well.

For this reason, I would find it best to introduce an option '--unicode'
to 'wc', that would produce Unicode compliant results, at the cost of
  - not following POSIX to the letter,
  - being slower.

Bruno

From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 24 12:47:11 2019
Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space
To: Bruno Haible, Pádraig Brady, bug-libunistring@gnu.org
Cc: vampyrebat@gmail.com, 34524@debbugs.gnu.org
From: Paul Eggert
Organization: UCLA Computer Science Department
Date: Sun, 24 Feb 2019 09:47:02 -0800
In-Reply-To: <6462920.csScTDZgAZ@omega>

Bruno Haible wrote:
> I would find it best to introduce an option '--unicode'
> to 'wc', that would produce Unicode compliant results, at the cost of
>   - not following POSIX to the letter,

It'd make sense to have an option. How about a more-general option --words, that
would let the user define what a word is? This option's operand could use ERE
syntax, or a shorthand beginning with '+' for common combinations. For example,
the command:

   wc --words='[[:alnum:]]+'

would say that a word consists of the longest contiguous sequence of
alphanumeric characters. And

   wc --words='+unicode'

would use the Unicode definition of word, whatever it is.

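
The ERE variant of this proposal can be approximated with the POSIX regcomp()/regexec() interface. The following is a rough sketch only: wc has no --words option, `count_ere_words` is a hypothetical name, and the matching is done on bytes in whatever locale the caller has set, so it says nothing about how a real implementation would treat multi-byte input.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical sketch of the proposed --words=ERE semantics: count the
   non-overlapping, non-empty matches of the POSIX extended regular
   expression ERE in the NUL-terminated string TEXT.
   Returns -1 if ERE does not compile.  */
long
count_ere_words (const char *ere, const char *text)
{
  regex_t re;
  regmatch_t m;
  long words = 0;
  int eflags = 0;

  if (regcomp (&re, ere, REG_EXTENDED) != 0)
    return -1;

  while (*text != '\0' && regexec (&re, text, 1, &m, eflags) == 0)
    {
      if (m.rm_eo > m.rm_so)
        words++;                /* count only non-empty matches */

      /* Continue after the match.  If the match ends at offset 0 (an
         empty match at the start), step one byte so the loop ends.  */
      text += (m.rm_eo > 0) ? m.rm_eo : 1;
      eflags = REG_NOTBOL;      /* later offsets are not line starts */
    }

  regfree (&re);
  return words;
}
```

With the pattern '[[:alnum:]]+', a lone ":" is not counted, which sidesteps the punctuation-only word discussed earlier in the thread. Whether a no-break space ends a word then depends only on the pattern, not on iswspace().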
From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 24 20:07:28 2019
Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space
To: Bruno Haible, bug-libunistring@gnu.org
Cc: vampyrebat@gmail.com, Paul Eggert, 34524@debbugs.gnu.org
From: Pádraig Brady
Message-ID: <6904da61-a2a7-8aac-16e2-2587e636c298@draigBrady.com>
Date: Sun, 24 Feb 2019 17:07:18 -0800
In-Reply-To: <6462920.csScTDZgAZ@omega>

On 24/02/19 05:58, Bruno Haible wrote:
> [Ccing bug-libunistring, because this is about Unicode handling in GNU. The
> original thread is in .]
>
>>> The man page for wc states: "A word is a... sequence of characters delimited by white space."
>>>
>>> But its concept of white space only seems to include ASCII white space. U+00A0 NO-BREAK SPACE, for instance, is not recognized.
>>>
>>> If your terminal displays UTF-8 encoding:
>>>
>>> printf 'how are\xC2\xA0you\n'
>>>
>>> or if your terminal displays ISO 8859-1 encoding:
>>>
>>> printf 'how are\xA0you\n'
>>>
>>> the visible output of this printf is "how are you". In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:
>
> It is a complicated issue.
>
> I) Relax. Don't be religious about it.
> II) POSIX char classes
> III) User expectations
> IV) The Unicode standard
> V) Implementation issues
>
>
> I) Relax. Don't be religious about it.
> ======================================
>
> Unicode is an effort to make programs work *reasonably well* with as many
> kinds of text as possible.
>
> For example, Unicode 23.2
> page 859
> says:
>   "The effect of layout controls is specific to particular text processes.
>    As much as possible, layout controls are transparent to those text processes
>    for which they were not intended."
>
> Or, Unicode TR 29 says:
>   "The precise determination of text elements may vary according to
>    orthographic conventions for a given script or language. The goal of
>    matching user perceptions cannot always be met exactly because the text
>    alone does not always contain enough information to unambiguously decide
>    boundaries. For example, the period (U+002E FULL STOP) is used ambiguously,
>    sometimes for end-of-sentence purposes, sometimes for abbreviations, and
>    sometimes for numbers. In most cases, however, programmatic text boundaries
>    can match user perceptions quite closely, although sometimes the best that
>    can be done is not to surprise the user."
>
> Or, there is criticism:
>
> Therefore, this is a reminder that sometimes no optimal solution can be found.
> Relax.
> > > II) POSIX char classes > ====================== > >> There is some discussion of POSIX and unicode classes at: >> http://unicode.org/L2/L2003/03139-posix-classes.htm >> >> I guess POSIX is defining lower level functionality >> and has to be compat with all uses of iswspace() >> which might be used for line reformatting etc. >> but wc(1) being higher level, perhaps should consider >> the non breaking variants as word separators? > > Exactly, that's the right approach. The POSIX char classes are defined in > glibc/localedata/unicode-gen/unicode_utils.py; in this case what matters is > the is_space function, and it has a comment: > # Don’t make U+00A0 a space. Non-breaking space means that all programs > # should treat it like a punctuation character, not like a space. > If U+00A0 was made a space, most programs would treat NO-BREAK SPACE like > SPACE, which is against the purpose of NO-BREAK SPACE. So, in general, > users should be aware that NO-BREAK SPACE is not a space. (And likewise, > the SOFT HYPHEN is not to be treated like HYPHEN, because that would be > against the purpose of the SOFT HYPHEN.) > > But 'wc' is a specific program, with a specific purpose, and that might > warrant exceptions. > >> I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'` >> into libreoffice writer and it treated all but the last >> as a word separator in its word count tool. > > This is a good approach, because text processors usually deal with Unicode > in more detail and with more thought than we usually do in the command-line > / monospaced world. > > > III) User expectations > ====================== > > On one hand, user expectation that a no-break space separates words is > justified: In "Dr.\u00A0Pinkwart" a user sees two words. > > On the other hand, the opposite user expectation is justified as well. > The English sentence "Look: here he is" is translated into French as > "Regarde\u00A0: le voilà". 
(It is customary to put a space before colon, > question mark, and exclamation mark in French. And to avoid line breaking > at these points, it must be a NO-BREAK space.) When a translator counts > the words they have translated, "Regarde : le voilà" should count as > 3 words, not 4 words. OTOH, it could be argued that in this case, the > problem is that a word (":") consisting only of punctuation characters > should not be counted as a word. > > But again: relax. Translators are being paid according to word counts, > but a word count that is 1 too high or 1 too low is not dramatic. > > > IV) The Unicode standard > ======================== > > On one hand, the Unicode standard makes it clear in several places that > 1) NO-BREAK SPACE prohibits line breaking, > 2) line breaking and words are related. > > See for example, the Unicode standard section 5.12 > > page 219: > "Line breaking algorithms generally use state machines for determining > word breaks." > > Or the Unicode standard section 23.2 > page 859 > "Word Joiner. U+2060 word joiner behaves like U+00A0 no-break space > in that it indicates the absence of line breaks; ..." > > On the other hand, in the same section 23.2 it says > "Line breaking and word breaking are distinct text processes. > Although a candidate position for a line break in text often coincides > with a candidate position for a word break, there are also many > situations where candidate break positions of different types do not > coincide." > > And in the Unicode TR 29 section 4 "Word boundaries" > > it treats NO-BREAK SPACE as a word boundary by default - this can be > verified through the program below - but also says that SPACE and > NO-BREAK SPACE "may be tailored to be in MidNum, depending on the environment". 
>
> Here's an example program that uses GNU libunistring:
> ==============================================================
> #include <stdio.h>
> #include <uniwbrk.h>
>
> int main ()
> {
>   printf ("%d\n", uc_wordbreak_property (0x00A0));
>   {
>     uint8_t string[] = "Regarde : le voilà";
>     char p[19];
>     u8_wordbreaks (string, 19, p);
>     puts ((char *) string);
>     for (int i = 0; i < 19; i++)
>       if (p[i])
>         printf ("word break at position %d\n", i);
>   }
>   {
>     uint8_t string[] = "Regarde\u00A0: le voilà";
>     char p[20];
>     u8_wordbreaks (string, 20, p);
>     puts ((char *) string);
>     for (int i = 0; i < 20; i++)
>       if (p[i])
>         printf ("word break at position %d\n", i);
>   }
> }
> ==============================================================
> and its output:
> 0 (means: WBP_OTHER)
> Regarde : le voilà
> word break at position 7
> word break at position 8
> word break at position 9
> word break at position 10
> word break at position 12
> word break at position 13
> Regarde : le voilà
> word break at position 7
> word break at position 9
> word break at position 10
> word break at position 11
> word break at position 13
> word break at position 14
>
>
> V) Implementation issues
> ========================
>
>> The following change would do that:
>>
>> diff --git a/src/wc.c b/src/wc.c
>> index 179abbe..ca990b4 100644
>> --- a/src/wc.c
>> +++ b/src/wc.c
>> @@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
>>    exit (status);
>>  }
>>
>> +static int _GL_ATTRIBUTE_PURE
>> +iswnbspace (wint_t wc)
>> +{
>> +  return wc == L'\u00A0' || wc == L'\u2007' \
>> +         || wc == L'\u202F' || wc == L'\u2060';
>> +}
>> +
>>  /* FILE is the name of the file (or NULL for standard input)
>>     associated with the specified counters.  */
>>  static void
>> @@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>>                if (width > 0)
>>                  linepos += width;
>>              }
>> -          if (iswspace (wide_char))
>> +          if (iswspace (wide_char) || iswnbspace (wide_char))
>>              goto mb_word_separator;
>>            in_word = true;
>>          }
>>
>> ...
>> For more sophisticated contextual processing we would need
>> to use some of the word break functionality from libunistring.
>
> I don't think you will be able to satisfactorily blend POSIX behaviour
> with Unicode behaviour without introducing a command-line option.
>
> On the POSIX side: POSIX says
>
>   "The wc utility shall consider a word to be a non-zero-length string
>    of characters delimited by white space."
> and
>
>   "space
>    Define characters to be classified as white-space characters."
> So, when wc operates according to POSIX expectations, it MUST use
>   iswspace (wide_char)
> not
>   iswspace (wide_char) || iswnbspace (wide_char)
>
> On the Unicode side: It is reasonable to see two "words" in
> "Regardez\u00A0:", and the GNU libunistring library implements it like
> this. It is also reasonable to expect that 'wc' counts words in the Thai
> language, which does not use spaces to delimit words. GNU libunistring
> may implement this in the future as well.
>
> For this reason, I would find it best to introduce an option '--unicode'
> to 'wc', that would produce Unicode compliant results, at the cost of
>   - not following POSIX to the letter,
>   - being slower.

Wow thanks for all that deep info.

So non break space is generally considered a word delimiter,
though there are complications you detail from unicode.

In regard to options for enabling various behaviors for wc(1),
I'm thinking we might keep the strict POSIX isspace() behavior
with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
by default, since that's the most common operation one would want,
and is consistent with libreoffice for example.
I'll adjust the patch along those lines.
I like the --words=unicode idea to give us control over various more contextual behaviors in future. thank you! Pádraig From debbugs-submit-bounces@debbugs.gnu.org Sun Feb 24 22:55:55 2019 Received: (at 34524) by debbugs.gnu.org; 25 Feb 2019 03:55:55 +0000 Received: from localhost ([127.0.0.1]:50751 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gy7Mn-00033r-Kp for submit@debbugs.gnu.org; Sun, 24 Feb 2019 22:55:55 -0500 Received: from mail.magicbluesmoke.com ([82.195.144.49]:35930) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gy7Ml-00033i-JW for 34524@debbugs.gnu.org; Sun, 24 Feb 2019 22:55:44 -0500 Received: from localhost.localdomain (unknown [76.21.115.186]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.magicbluesmoke.com (Postfix) with ESMTPSA id 58D489C5D; Mon, 25 Feb 2019 03:55:41 +0000 (GMT) Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space To: Bruno Haible References: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com> <39f9dd85-aeb2-6eac-7857-1d1910f0476f@draigBrady.com> <6462920.csScTDZgAZ@omega> <6904da61-a2a7-8aac-16e2-2587e636c298@draigBrady.com> From: =?UTF-8?Q?P=c3=a1draig_Brady?= Message-ID: <810d6133-6e7b-1b7d-98f5-538314aca2db@draigBrady.com> Date: Sun, 24 Feb 2019 19:55:39 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <6904da61-a2a7-8aac-16e2-2587e636c298@draigBrady.com> Content-Type: multipart/mixed; boundary="------------C5F6FE559B7267F4F18385E4" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 34524 Cc: vampyrebat@gmail.com, Paul Eggert , 34524@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" 
X-Spam-Score: -1.0 (-) This is a multi-part message in MIME format. --------------C5F6FE559B7267F4F18385E4 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit On 24/02/19 17:07, Pádraig Brady wrote: > So non break space is generally considered a word delimiter, > though there are complications you detail from unicode. > > In regard to options for enabling various behaviors for wc(1), > I'm thinking we might keep the strict POSIX isspace() behavior > with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace() > by default, since that's the most common operation one would want, > and is consistent with libreoffice for example. > I'll adjust the patch along those lines. Full patch attached. cheers, Pádraig --------------C5F6FE559B7267F4F18385E4 Content-Type: text/x-patch; name="wc-nbsp.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="wc-nbsp.patch" >From 079ab206ee2444d9888f29c788df33f6d5902705 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= Date: Sat, 23 Feb 2019 21:23:47 -0800 Subject: [PATCH] wc: treat non break space as a word separator * src/wc.c (iswnbspace): A new function to match characters in this class. (main): Initialize posixly_correct from the environment, to allow disabling honoring NBSP in non C locales. (wc): Call is[w]nbspace() as well as is[w]space. * tests/misc/wc-nbsp.sh: A new test. * tests/local.mk: Reference the new test. * NEWS: Mention the change in behavior. --- NEWS | 3 +++ src/wc.c | 29 +++++++++++++++++++++++++++-- tests/local.mk | 1 + tests/misc/wc-nbsp.sh | 42 ++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 73 insertions(+), 2 deletions(-) create mode 100755 tests/misc/wc-nbsp.sh diff --git a/NEWS b/NEWS index e400554..9bfa3c3 100644 --- a/NEWS +++ b/NEWS @@ -53,6 +53,9 @@ GNU coreutils NEWS -*- outline -*- operator, so POSIX changed this to 'test -e FILE'. Scripts using it were already broken and non-portable; the -a unary operator was never documented. 
+ wc now treats non breaking space characters as word delimiters + unless the POSIXLY_CORRECT environment variable is set. + ** New features id now supports specifying multiple users. diff --git a/src/wc.c b/src/wc.c index 179abbe..b682a1e 100644 --- a/src/wc.c +++ b/src/wc.c @@ -74,6 +74,9 @@ static bool have_read_stdin; /* Used to determine if file size can be determined without reading. */ static size_t page_size; +/* Enable to _not_ treat non breaking space as a word separator. */ +static bool posixly_correct; + /* The result of calling fstat or stat on a file descriptor or file. */ struct fstatus { @@ -147,6 +150,25 @@ the following order: newline, word, character, byte, maximum line length.\n\ exit (status); } +/* Return non zero if a non breaking space. */ +static int _GL_ATTRIBUTE_PURE +iswnbspace (wint_t wc) +{ + return ! posixly_correct + && (wc == 0x00A0 || wc == 0x2007 + || wc == 0x202F || wc == 0x2060); +} + +static int _GL_ATTRIBUTE_PURE +isnbspace (int c) +{ +#if HAVE_BTOWC + return iswnbspace (btowc (c)); +#else + return 0; +#endif +} + /* FILE is the name of the file (or NULL for standard input) associated with the specified counters. */ static void @@ -455,7 +477,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) if (width > 0) linepos += width; } - if (iswspace (wide_char)) + if (iswspace (wide_char) || iswnbspace (wide_char)) goto mb_word_separator; in_word = true; } @@ -538,7 +560,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) if (isprint (to_uchar (p[-1]))) { linepos++; - if (isspace (to_uchar (p[-1]))) + if (isspace (to_uchar (p[-1])) + || isnbspace (to_uchar (p[-1]))) goto word_separator; in_word = true; } @@ -681,6 +704,8 @@ main (int argc, char **argv) so that processes running in parallel do not intersperse their output. 
*/ setvbuf (stdout, NULL, _IOLBF, 0); + posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL); + print_lines = print_words = print_chars = print_bytes = false; print_linelength = false; total_lines = total_words = total_chars = total_bytes = max_line_length = 0; diff --git a/tests/local.mk b/tests/local.mk index 4751886..bacc5d2 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -272,6 +272,7 @@ all_tests = \ tests/misc/wc.pl \ tests/misc/wc-files0-from.pl \ tests/misc/wc-files0.sh \ + tests/misc/wc-nbsp.sh \ tests/misc/wc-parallel.sh \ tests/misc/wc-proc.sh \ tests/misc/cat-proc.sh \ diff --git a/tests/misc/wc-nbsp.sh b/tests/misc/wc-nbsp.sh new file mode 100755 index 0000000..587aab5 --- /dev/null +++ b/tests/misc/wc-nbsp.sh @@ -0,0 +1,42 @@ +#!/bin/sh +# Test non breaking space handling + +# Copyright (C) 2019 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . + +. 
"${srcdir=.}/tests/init.sh"; path_prepend_ ./src +print_ver_ wc + +# Before coreutils 8.31 nbsp was treated as part of a word, +# rather than a word delimiter + +export LC_ALL=en_US.ISO-8859-1 +if test "$(locale charmap 2>/dev/null)" = ISO-8859-1; then + test $(printf '=\xA0=' | wc -w) = 2 || fail=1 + test $(printf '=\xA0=' | POSIXLY_CORRECT=1 wc -w) = 1 || fail=1 +fi +export LC_ALL=en_US.UTF-8 +if test "$(locale charmap 2>/dev/null)" = UTF-8; then + test $(printf '=\u00A0=' | wc -w) = 2 || fail=1 + test $(printf '=\u2007=' | wc -w) = 2 || fail=1 + test $(printf '=\u202F=' | wc -w) = 2 || fail=1 + test $(printf '=\u2060=' | wc -w) = 2 || fail=1 +fi +export LC_ALL=ru_RU.KOI8-R +if test "$(locale charmap 2>/dev/null)" = KOI8-R; then + test $(printf '=\x9A=' | wc -w) = 2 || fail=1 +fi + +Exit $fail -- 2.9.3 --------------C5F6FE559B7267F4F18385E4-- From debbugs-submit-bounces@debbugs.gnu.org Mon Feb 25 23:27:12 2019 Received: (at 34524-done) by debbugs.gnu.org; 26 Feb 2019 04:27:12 +0000 Received: from localhost ([127.0.0.1]:52138 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gyUKc-0002Go-Nu for submit@debbugs.gnu.org; Mon, 25 Feb 2019 23:27:12 -0500 Received: from mail.magicbluesmoke.com ([82.195.144.49]:37936) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1gyUKZ-0002GM-Bz for 34524-done@debbugs.gnu.org; Mon, 25 Feb 2019 23:27:00 -0500 Received: from localhost.localdomain (unknown [76.21.115.186]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.magicbluesmoke.com (Postfix) with ESMTPSA id 6176F9C74; Tue, 26 Feb 2019 04:26:57 +0000 (GMT) Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space To: Bruno Haible References: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com> <39f9dd85-aeb2-6eac-7857-1d1910f0476f@draigBrady.com> <6462920.csScTDZgAZ@omega> <6904da61-a2a7-8aac-16e2-2587e636c298@draigBrady.com> 
<810d6133-6e7b-1b7d-98f5-538314aca2db@draigBrady.com> From: =?UTF-8?Q?P=c3=a1draig_Brady?= Message-ID: <944c5643-0007-bf9b-43eb-a51c003ba1ec@draigBrady.com> Date: Mon, 25 Feb 2019 20:26:55 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <810d6133-6e7b-1b7d-98f5-538314aca2db@draigBrady.com> Content-Type: multipart/mixed; boundary="------------53C7D8BAF26A4083C08F01B1" X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 34524-done Cc: vampyrebat@gmail.com, 34524-done@debbugs.gnu.org, Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) This is a multi-part message in MIME format. --------------53C7D8BAF26A4083C08F01B1 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit On 24/02/19 19:55, Pádraig Brady wrote: > On 24/02/19 17:07, Pádraig Brady wrote: >> So non break space is generally considered a word delimiter, >> though there are complications you detail from unicode. >> >> In regard to options for enabling various behaviors for wc(1), >> I'm thinking we might keep the strict POSIX isspace() behavior >> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace() >> by default, since that's the most common operation one would want, >> and is consistent with libreoffice for example. >> I'll adjust the patch along those lines. > > Full patch attached. Updated patch attached. I'll push in a few hours. Marking this bug as done. cheers, Pádraig. 
--------------53C7D8BAF26A4083C08F01B1 Content-Type: text/x-patch; name="wc-nbsp.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="wc-nbsp.patch" >From c04ff0df5dfe788a38162cb2609b38495e765383 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?P=C3=A1draig=20Brady?= Date: Sat, 23 Feb 2019 21:23:47 -0800 Subject: [PATCH] wc: treat non break space as a word separator * src/wc.c (iswnbspace): A new function to match characters in this class. (main): Initialize posixly_correct from the environment, to allow disabling honoring NBSP in non C locales. (wc): Call is[w]nbspace() as well as is[w]space. * bootstrap.conf: Ensure btowc is available. * tests/misc/wc-nbsp.sh: A new test. * tests/local.mk: Reference the new test. * NEWS: Mention the change in behavior. --- NEWS | 3 +++ bootstrap.conf | 1 + src/wc.c | 25 +++++++++++++++++++++++-- tests/local.mk | 1 + tests/misc/wc-nbsp.sh | 42 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 70 insertions(+), 2 deletions(-) create mode 100755 tests/misc/wc-nbsp.sh diff --git a/NEWS b/NEWS index e400554..9bfa3c3 100644 --- a/NEWS +++ b/NEWS @@ -53,6 +53,9 @@ GNU coreutils NEWS -*- outline -*- operator, so POSIX changed this to 'test -e FILE'. Scripts using it were already broken and non-portable; the -a unary operator was never documented. + wc now treats non breaking space characters as word delimiters + unless the POSIXLY_CORRECT environment variable is set. + ** New features id now supports specifying multiple users. diff --git a/bootstrap.conf b/bootstrap.conf index a525ef4..4926152 100644 --- a/bootstrap.conf +++ b/bootstrap.conf @@ -38,6 +38,7 @@ gnulib_modules=" backup-rename base32 base64 + btowc buffer-lcm c-strcase cl-strtod diff --git a/src/wc.c b/src/wc.c index 179abbe..2381804 100644 --- a/src/wc.c +++ b/src/wc.c @@ -74,6 +74,9 @@ static bool have_read_stdin; /* Used to determine if file size can be determined without reading. 
*/ static size_t page_size; +/* Enable to _not_ treat non breaking space as a word separator. */ +static bool posixly_correct; + /* The result of calling fstat or stat on a file descriptor or file. */ struct fstatus { @@ -147,6 +150,21 @@ the following order: newline, word, character, byte, maximum line length.\n\ exit (status); } +/* Return non zero if a non breaking space. */ +static int _GL_ATTRIBUTE_PURE +iswnbspace (wint_t wc) +{ + return ! posixly_correct + && (wc == 0x00A0 || wc == 0x2007 + || wc == 0x202F || wc == 0x2060); +} + +static int +isnbspace (int c) +{ + return iswnbspace (btowc (c)); +} + /* FILE is the name of the file (or NULL for standard input) associated with the specified counters. */ static void @@ -455,7 +473,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) if (width > 0) linepos += width; } - if (iswspace (wide_char)) + if (iswspace (wide_char) || iswnbspace (wide_char)) goto mb_word_separator; in_word = true; } @@ -538,7 +556,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos) if (isprint (to_uchar (p[-1]))) { linepos++; - if (isspace (to_uchar (p[-1]))) + if (isspace (to_uchar (p[-1])) + || isnbspace (to_uchar (p[-1]))) goto word_separator; in_word = true; } @@ -681,6 +700,8 @@ main (int argc, char **argv) so that processes running in parallel do not intersperse their output. 
*/ setvbuf (stdout, NULL, _IOLBF, 0); + posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL); + print_lines = print_words = print_chars = print_bytes = false; print_linelength = false; total_lines = total_words = total_chars = total_bytes = max_line_length = 0; diff --git a/tests/local.mk b/tests/local.mk index 4751886..bacc5d2 100644 --- a/tests/local.mk +++ b/tests/local.mk @@ -272,6 +272,7 @@ all_tests = \ tests/misc/wc.pl \ tests/misc/wc-files0-from.pl \ tests/misc/wc-files0.sh \ + tests/misc/wc-nbsp.sh \ tests/misc/wc-parallel.sh \ tests/misc/wc-proc.sh \ tests/misc/cat-proc.sh \ diff --git a/tests/misc/wc-nbsp.sh b/tests/misc/wc-nbsp.sh new file mode 100755 index 0000000..11ee0d6 --- /dev/null +++ b/tests/misc/wc-nbsp.sh @@ -0,0 +1,42 @@ +#!/bin/sh +# Test non breaking space handling + +# Copyright (C) 2019 Free Software Foundation, Inc. + +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. + +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. + +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . + +. 
"${srcdir=.}/tests/init.sh"; path_prepend_ ./src +print_ver_ wc printf + +# Before coreutils 8.31 nbsp was treated as part of a word, +# rather than a word delimiter + +export LC_ALL=en_US.ISO-8859-1 +if test "$(locale charmap 2>/dev/null)" = ISO-8859-1; then + test $(env printf '=\xA0=' | wc -w) = 2 || fail=1 + test $(env printf '=\xA0=' | POSIXLY_CORRECT=1 wc -w) = 1 || fail=1 +fi +export LC_ALL=en_US.UTF-8 +if test "$(locale charmap 2>/dev/null)" = UTF-8; then + test $(env printf '=\u00A0=' | wc -w) = 2 || fail=1 + test $(env printf '=\u2007=' | wc -w) = 2 || fail=1 + test $(env printf '=\u202F=' | wc -w) = 2 || fail=1 + test $(env printf '=\u2060=' | wc -w) = 2 || fail=1 +fi +export LC_ALL=ru_RU.KOI8-R +if test "$(locale charmap 2>/dev/null)" = KOI8-R; then + test $(env printf '=\x9A=' | wc -w) = 2 || fail=1 +fi + +Exit $fail -- 2.9.3 --------------53C7D8BAF26A4083C08F01B1-- From debbugs-submit-bounces@debbugs.gnu.org Sat Mar 09 08:52:59 2019 Received: (at 34524-done) by debbugs.gnu.org; 9 Mar 2019 13:52:59 +0000 Received: from localhost ([127.0.0.1]:36972 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1h2cPL-0003Kq-9Q for submit@debbugs.gnu.org; Sat, 09 Mar 2019 08:52:59 -0500 Received: from mo4-p01-ob.smtp.rzone.de ([81.169.146.164]:15741) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1h2cPG-0003Ke-Uc for 34524-done@debbugs.gnu.org; Sat, 09 Mar 2019 08:52:56 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; t=1552139572; s=strato-dkim-0002; d=clisp.org; h=References:In-Reply-To:Message-ID:Date:Subject:Cc:To:From: X-RZG-CLASS-ID:X-RZG-AUTH:From:Subject:Sender; bh=kwbdct2NAciCUDZF5wQTMrkh0DAjUEC8R3BMSU6zUiQ=; b=LjxUJkdi68w3AxUYT3QPISaFmUYdDfmnOMTPEm/riWO9vIkQawuNStoP5gKv8c6Fs4 qpzUdlevhZTlqnzeafjw/R+FUsNo73W8Y4vfS7h+X1xgFaTGRFxlgsT4lTTWFjmDQ9Wn QyBcVeXXyLTnS/N3zuZdG5dJDY59uSoJUUruZpjia2bbmOU2HUCTfkHy/Fasiu46sAFn Mosg1avAy4y+jvJC5I9O1f5+jRSDbJoSpzXnVQlts3iIeZOCsPD6xcvBH/+oCUBDjSQI 
3fayBPl8AKgBqOLCfFPir6i3iad1soD2w7142RdcuaqUR/skfIbTc24lhBnpycrjl5mn VtPw== X-RZG-AUTH: ":Ln4Re0+Ic/6oZXR1YgKryK8brlshOcZlIWs+iCP5vnk6shH+AHjwLuWOH6f0yZBW" X-RZG-CLASS-ID: mo00 Received: from bruno.haible.de by smtp.strato.de (RZmta 44.13 DYNA|AUTH) with ESMTPSA id 60865ev29Dqozdp (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (curve secp521r1 with 521 ECDH bits, eq. 15360 bits RSA)) (Client did not present a certificate); Sat, 9 Mar 2019 14:52:50 +0100 (CET) From: Bruno Haible To: =?ISO-8859-1?Q?P=E1draig?= Brady Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space Date: Sat, 09 Mar 2019 14:52:49 +0100 Message-ID: <1834081.TtEGrmvcJs@omega> User-Agent: KMail/5.1.3 (Linux/4.4.0-141-generic; KDE/5.18.0; x86_64; ; ) In-Reply-To: <944c5643-0007-bf9b-43eb-a51c003ba1ec@draigBrady.com> References: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com> <810d6133-6e7b-1b7d-98f5-538314aca2db@draigBrady.com> <944c5643-0007-bf9b-43eb-a51c003ba1ec@draigBrady.com> MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="iso-8859-1" X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 34524-done Cc: vampyrebat@gmail.com, 34524-done@debbugs.gnu.org, Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.7 (-)

Hi Pádraig,

> >> In regard to options for enabling various behaviors for wc(1),
> >> I'm thinking we might keep the strict POSIX isspace() behavior
> >> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
> >> by default

Since you plan to add a --words=... option in the future (as suggested
by Paul or me), it would make sense to add this option now, instead
of testing POSIXLY_CORRECT. If you introduce POSIXLY_CORRECT dependent
behaviour now (and need to keep it for backward-compatibility), you'll
have a hard to understand interface: What will the following do?

  env POSIXLY_CORRECT=1 wc --words=unicode
  wc --words=unicode

Bruno

From debbugs-submit-bounces@debbugs.gnu.org Sat Mar 09 22:31:49 2019 Received: (at 34524) by debbugs.gnu.org; 10 Mar 2019 03:31:49 +0000 Received: from localhost ([127.0.0.1]:37570 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1h2pBl-0003yX-Ce for submit@debbugs.gnu.org; Sat, 09 Mar 2019 22:31:49 -0500 Received: from mail.magicbluesmoke.com ([82.195.144.49]:54604) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1h2pBj-0003yO-PU for 34524@debbugs.gnu.org; Sat, 09 Mar 2019 22:31:48 -0500 Received: from localhost.localdomain (unknown [76.21.115.186]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.magicbluesmoke.com (Postfix) with ESMTPSA id 7E3B5AA5F; Sun, 10 Mar 2019 03:31:45 +0000 (GMT) Subject: Re: bug#34524: wc: word count incorrect when words separated only by no-break space To: Bruno Haible References: <5c6a68e1.1c69fb81.2980c.a4e4@mx.google.com> <810d6133-6e7b-1b7d-98f5-538314aca2db@draigBrady.com> <944c5643-0007-bf9b-43eb-a51c003ba1ec@draigBrady.com> <1834081.TtEGrmvcJs@omega> From: =?UTF-8?Q?P=c3=a1draig_Brady?= Message-ID: <5f3b9dde-0a2a-5ab6-0597-95d009a100ea@draigBrady.com> Date: Sat, 9 Mar 2019 19:31:43 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <1834081.TtEGrmvcJs@omega> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 34524 Cc: vampyrebat@gmail.com, Paul Eggert , 34524@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help:
List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) On 09/03/19 05:52, Bruno Haible wrote: > Hi Pádraig, > >>>> In regard to options for enabling various behaviors for wc(1), >>>> I'm thinking we might keep the strict POSIX isspace() behavior >>>> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace() >>>> by default > > Since you plan to add a --words=... option in the future (as suggested > by Paul or me), it would make sense to add this option now, instead > of testing POSIXLY_CORRECT. If you introduce POSIXLY_CORRECT dependent > behaviour now (and need to keep it for backward-compatibility), you'll > have a hard to understand interface: What will the following do? > > env POSIXLY_CORRECT=1 wc --words=unicode > wc --words=unicode Well until we actually support more contextual unicode word separation operation, the --words option parameter would be a bit redundant. Generally no-one would need to use POSIXLY_CORRECT directly with wc, rather setting it globally on a system or script to minimize changes. In the above example --words=unicode would be an explicit option to operate in extension to POSIX, and so POSIXLY_CORRECT would be ignored there. cheers, Pádraig From unknown Mon Jun 23 04:09:38 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Sun, 07 Apr 2019 11:24:05 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator
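
The rule this thread settles on (ASCII white space always separates words, and the four non-breaking code points U+00A0, U+2007, U+202F and U+2060 do too unless POSIX mode is requested) can be sketched outside of wc itself. What follows is only an illustration under stated assumptions, not the code that went into coreutils: it assumes UTF-8 input, uses a small hand-written decoder rather than mbrtowc(), treats only ASCII white space as the "POSIX" class (a real locale's iswspace() may accept more), and does not reproduce wc's isprint() handling. The names count_words_utf8, decode_utf8, is_ascii_space and is_nbspace are hypothetical; the posixly_correct parameter stands in for the flag the patch reads from the environment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The four code points the patch in this thread adds as word separators. */
static bool is_nbspace(uint32_t cp)
{
    return cp == 0x00A0 || cp == 0x2007 || cp == 0x202F || cp == 0x2060;
}

/* ASCII white space only: space, \t, \n, \v, \f, \r (an assumption of
   this sketch; a real locale's iswspace() may classify more). */
static bool is_ascii_space(uint32_t cp)
{
    return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D);
}

/* Decode one UTF-8 sequence from the NUL-terminated string S into *CP and
   return the number of bytes consumed.  Malformed input yields U+FFFD and
   consumes one byte.  Overlong forms are not rejected (simplification). */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp)
{
    size_t len;

    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {
        *cp = s[0] & 0x07;
        len = 4;
    } else {
        *cp = 0xFFFD;
        return 1;
    }

    for (size_t i = 1; i < len; i++) {
        /* A NUL or any non-continuation byte ends the sequence early. */
        if ((s[i] & 0xC0) != 0x80) {
            *cp = 0xFFFD;
            return 1;
        }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Count words in the UTF-8 string TEXT.  When POSIXLY_CORRECT is true,
   only ASCII white space separates words; otherwise the non-breaking
   code points above separate words as well. */
size_t count_words_utf8(const char *text, bool posixly_correct)
{
    assert(text != NULL);

    const unsigned char *p = (const unsigned char *) text;
    size_t words = 0;
    bool in_word = false;

    while (*p != '\0') {
        uint32_t cp;
        p += decode_utf8(p, &cp);

        bool separator = is_ascii_space(cp)
                         || (!posixly_correct && is_nbspace(cp));
        if (separator) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;   /* first character of a new word */
            words++;
        }
    }
    return words;
}
```

With this sketch, count_words_utf8 ("how are\xC2\xA0you\n", false) returns 3, while passing true as the second argument returns 2, which corresponds to the after and before counts for the example in the original report.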