From unknown Sun Jun 22 03:49:01 2025
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
From: bug#20751 <20751@debbugs.gnu.org>
To: bug#20751 <20751@debbugs.gnu.org>
Subject: Status: wc -m doesn't count UTF-8 characters properly
Reply-To: bug#20751 <20751@debbugs.gnu.org>
Date: Sun, 22 Jun 2025 10:49:01 +0000

retitle 20751 wc -m doesn't count UTF-8 characters properly
reassign 20751 coreutils
submitter 20751 valdis.vitolins@odo.lv
severity 20751 normal
tag 20751 notabug
thanks

From debbugs-submit-bounces@debbugs.gnu.org Sat Jun 06 13:11:49 2015
Message-ID: <1433589149.2439.10.camel@vostro>
Subject: wc -m doesn't count UTF-8 characters properly
From: Valdis Vītoliņš
To: submit@debbugs.gnu.org
Date: Sat, 06 Jun 2015 14:12:29 +0300
Organization: Odo
Content-Type: multipart/mixed; boundary="=-9SiK36ltpioJx4yBxu8W"
Mime-Version: 1.0
Reply-To: valdis.vitolins@odo.lv

--=-9SiK36ltpioJx4yBxu8W
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit

Version: wc (GNU coreutils) 8.21

When 'wc -m' is invoked, it should print the character count, but it
counts UTF-8 encoded characters incorrectly. The attached files have 3, 4
and 6 bytes in them, but all have only two UTF-8 encoded characters,
which you can see with any modern text editor.

wc -c shows the correct number of bytes:
wc -c *
3 3bytes.txt
4 4bytes.txt
6 6bytes.txt
13 total

But wc -m shows an incorrect number of characters:
wc -m *
3 3bytes.txt
3 4bytes.txt
3 6bytes.txt
9 total

But it should be:
wc -m *
2 3bytes.txt
2 4bytes.txt
2 6bytes.txt
6 total

I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
GNU/Linux 3.13.0-53-generic kernel

P.S.
If the attachments do not pass through the system, you can test it by
creating files with the following content:

3bytes.txt: aa
4bytes.txt: aā
6bytes.txt: a𐄈

--=-9SiK36ltpioJx4yBxu8W
Content-Disposition: attachment; filename="3bytes.txt"
Content-Type: text/plain; name="3bytes.txt"; charset="UTF-8"
Content-Transfer-Encoding: 7bit

aa

--=-9SiK36ltpioJx4yBxu8W
Content-Disposition: attachment; filename="4bytes.txt"
Content-Type: text/plain; name="4bytes.txt"; charset="UTF-8"
Content-Transfer-Encoding: 8bit

aā

--=-9SiK36ltpioJx4yBxu8W
Content-Disposition: attachment; filename="6bytes.txt"
Content-Type: text/plain; name="6bytes.txt"; charset="UTF-8"
Content-Transfer-Encoding: 8bit

a𐄈

--=-9SiK36ltpioJx4yBxu8W--

From debbugs-submit-bounces@debbugs.gnu.org Sat Jun 06 14:10:31 2015
From: Glenn Morris
To: valdis.vitolins@odo.lv
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
In-Reply-To: <1433589149.2439.10.camel@vostro> ("Valdis Vītoliņš"'s message of "Sat, 06 Jun 2015 14:12:29 +0300")
References: <1433589149.2439.10.camel@vostro>
Date: Sat, 06 Jun 2015 14:10:23 -0400
Message-ID: <8csia4908g.fsf@fencepost.gnu.org>
User-Agent: Gnus (www.gnus.org), GNU Emacs (www.gnu.org/software/emacs/)
MIME-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 8bit
Cc: 20751@debbugs.gnu.org
Reply-To: valdis.vitolins@odo.lv, 20751@debbugs.gnu.org

You mailed submit@debbugs without specifying a Package:, so your bug
report ended up on the help-debbugs list. I have reassigned it to
coreutils. (Please note there is no "wc" package.)

(My mailer is messing up the UTF-8 characters in your report.
Interested parties can see the original at http://debbugs.gnu.org/20751#5 .)

Attachments at http://debbugs.gnu.org/20751#5

From debbugs-submit-bounces@debbugs.gnu.org Sat Jun 06 14:49:32 2015
Message-ID: <1433616556.2526.6.camel@vostro>
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
From: Valdis Vītoliņš
To: 20751@debbugs.gnu.org
Date: Sat, 06 Jun 2015 21:49:16 +0300
In-Reply-To: <8csia4908g.fsf@fencepost.gnu.org>
References: <1433589149.2439.10.camel@vostro> <8csia4908g.fsf@fencepost.gnu.org>
Organization: Odo
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Reply-To: valdis.vitolins@odo.lv

Note that UTF-8 characters can be counted by counting bytes with bit
patterns 0xxxxxxx or 11xxxxxx:
https://en.wikipedia.org/wiki/UTF-8#Description

So the general logic should be that if:
a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
b) the first three bytes of the file are 0xEF 0xBB 0xBF (the UTF-8
   byte order mark)
   https://en.wikipedia.org/wiki/Byte_order_mark

then count bytes with bits 0xxxxxxx and 11xxxxxx.

From debbugs-submit-bounces@debbugs.gnu.org Sat Jun 06 17:43:39 2015
Message-ID: <55736980.1050408@draigBrady.com>
Date: Sat, 06 Jun 2015 22:43:28 +0100
From: Pádraig Brady
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0
MIME-Version: 1.0
To: valdis.vitolins@odo.lv, 20751@debbugs.gnu.org
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
References: <1433589149.2439.10.camel@vostro> <8csia4908g.fsf@fencepost.gnu.org> <1433616556.2526.6.camel@vostro>
In-Reply-To: <1433616556.2526.6.camel@vostro>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

tag 20751 notabug
close 20751
stop

On 06/06/15 19:49, Valdis Vītoliņš wrote:
>>> Version: wc (GNU coreutils) 8.21
>>>
>>> When 'wc -m' is invoked, it should print the character count, but it
>>> counts UTF-8 encoded characters incorrectly. The attached files have 3, 4
>>> and 6 bytes in them, but all have only two UTF-8 encoded characters,
>>> which you can see with any modern text editor.
>>>
>>> wc -c shows the correct number of bytes:
>>> wc -c *
>>> 3 3bytes.txt
>>> 4 4bytes.txt
>>> 6 6bytes.txt
>>> 13 total
>>>
>>> But wc -m shows an incorrect number of characters:
>>> wc -m *
>>> 3 3bytes.txt
>>> 3 4bytes.txt
>>> 3 6bytes.txt
>>> 9 total
>>>
>>> But it should be:
>>> wc -m *
>>> 2 3bytes.txt
>>> 2 4bytes.txt
>>> 2 6bytes.txt
>>> 6 total

I think it's working correctly.
I.E. the \n is included in the count.

thanks,
Pádraig.

From debbugs-submit-bounces@debbugs.gnu.org Sun Jun 07 16:50:48 2015
Message-ID: <1433710227.2472.4.camel@vostro>
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
From: Valdis Vītoliņš
To: 20751@debbugs.gnu.org
Date: Sun, 07 Jun 2015 23:50:27 +0300
In-Reply-To: <55736980.1050408@draigBrady.com>
References: <1433589149.2439.10.camel@vostro> <8csia4908g.fsf@fencepost.gnu.org> <1433616556.2526.6.camel@vostro> <55736980.1050408@draigBrady.com>
Organization: Odo
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 8bit
Reply-To: valdis.vitolins@odo.lv

Thanks for the clarification!
I tested it with a Bash script:

chars=$(wc -m mylog|cut -d ' ' -f1)
lines=$(wc -l mylog|cut -d ' ' -f1)
let chars="$chars - $lines"
echo $chars

and got the same number as given by vim :%s/.//gn
(which is where my confusion came from).

Hopefully this bug description will help others.

>
> I think it's working correctly.
> I.E. the \n is included in the count.
>
> thanks,
> Pádraig.
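
For readers who want to repeat the check from the script above outside the shell, here is a minimal Python sketch of the same subtraction (characters minus newlines). It assumes the input is valid UTF-8, and the name chars_excluding_newlines is made up for illustration; it is not part of coreutils or vim.

```python
def chars_excluding_newlines(data: bytes) -> int:
    """Count code points in UTF-8 data, leaving out newline characters.

    Mirrors "wc -m" minus "wc -l" from the script above. Assumes valid
    UTF-8; decode() raises UnicodeDecodeError otherwise.
    """
    text = data.decode("utf-8")
    return len(text) - text.count("\n")


# Contents of the three sample files from the original report, each with
# the trailing newline that wc -m includes in its count.
samples = {
    "3bytes.txt": "aa\n",
    "4bytes.txt": "aā\n",
    "6bytes.txt": "a𐄈\n",
}

for name, content in samples.items():
    raw = content.encode("utf-8")
    # bytes, characters including newline, characters excluding newline
    print(len(raw), len(content), chars_excluding_newlines(raw), name)
```

Each sample yields 3 characters with the newline and 2 without, which matches both the wc -m output and the count the reporter expected.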
From debbugs-submit-bounces@debbugs.gnu.org Sun Jun 07 17:47:40 2015
Date: Sun, 7 Jun 2015 22:47:29 +0100
From: Stephane Chazelas
To: Valdis Vītoliņš
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
Message-ID: <20150607214729.GB9894@chaz.gmail.com>
References: <1433589149.2439.10.camel@vostro> <8csia4908g.fsf@fencepost.gnu.org> <1433616556.2526.6.camel@vostro>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <1433616556.2526.6.camel@vostro>
User-Agent: Mutt/1.5.21 (2010-09-15)
Cc: 20751@debbugs.gnu.org

2015-06-06 21:49:16 +0300, Valdis Vītoliņš:
> Note that UTF-8 characters can be counted by counting bytes with bit
> patterns 0xxxxxxx or 11xxxxxx:
> https://en.wikipedia.org/wiki/UTF-8#Description
>
> So the general logic should be that if:
> a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
> b) the first three bytes of the file are 0xEF 0xBB 0xBF (the UTF-8
>    byte order mark)
>    https://en.wikipedia.org/wiki/Byte_order_mark
>
> then count bytes with bits 0xxxxxxx and 11xxxxxx.
[...]

Except that only valid characters should be counted. And there,
the definition of a valid character is not always clear.

At least an incorrect UTF-8 encoding can't count as valid
characters. So

printf '\300' | wc -m

should return 0, as 11000000 alone is not a valid character, so we
can't use your algorithm without first verifying the validity of
the input.
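
The difference between the two approaches discussed here can be sketched in Python. This is only an illustration under stated assumptions, not how wc is implemented: count_lead_bytes and count_valid_chars are hypothetical names, and Python's strict "utf-8" codec stands in for whatever validation a real tool would perform.

```python
from typing import Optional


def count_lead_bytes(data: bytes) -> int:
    """Count bytes of the form 0xxxxxxx or 11xxxxxx.

    Equivalent to skipping continuation bytes (10xxxxxx). No validation is
    done, so stray or malformed bytes are counted as characters too.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def count_valid_chars(data: bytes) -> Optional[int]:
    """Return the number of code points, or None if data is not valid UTF-8.

    Assumption: Python's strict decoder is used as the validity check.
    """
    try:
        return len(data.decode("utf-8"))
    except UnicodeDecodeError:
        return None


valid = "aā\n".encode("utf-8")  # 4 bytes, 3 characters
lone = b"\xc0"                  # the same byte as printf '\300'

print(count_lead_bytes(valid), count_valid_chars(valid))
print(count_lead_bytes(lone), count_valid_chars(lone))
```

For well-formed input the two functions agree. For the lone 0xC0 byte, the bit-pattern count reports 1 while the validating count rejects the input, which is the case the printf example is meant to expose.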
Then the UTF-8 encoding of the UTF-16 surrogates (0xD800 to
0xDFFF) should probably be excluded as well:

printf '\355\240\200' | wc -m

should return 0, for instance.

And maybe code points above 0x10FFFF now, since Unicode seems to
have given up on ever defining characters above that (probably
because of the UTF-16 limitation).

Now even in the ranges 0 -> 0xD7FF and 0xE000 -> 0x10FFFF, there
are still thousands of code points that are not defined yet in
the latest Unicode version. I suppose we can imagine locale
definitions where each of the known characters is listed and the
rest rejected...

-- 
Stephane

From unknown Sun Jun 22 03:49:01 2025
Received: (at fakecontrol) by fakecontrolmessage;
To: internal_control@debbugs.gnu.org
From: Debbugs Internal Request
Subject: Internal Control
Message-Id: bug archived.
Date: Mon, 06 Jul 2015 11:24:05 +0000
User-Agent: Fakemail v42.6.9

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator