From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: Mark Roberts Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Thu, 07 Dec 2017 16:27:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 29606@debbugs.gnu.org X-Debbugs-Original-To: bug-coreutils@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.151266398924638 (code B ref -1); Thu, 07 Dec 2017 16:27:02 +0000 Received: (at submit) by debbugs.gnu.org; 7 Dec 2017 16:26:29 +0000 Received: from localhost ([127.0.0.1]:50873 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eMz0G-0006PJ-Hm for submit@debbugs.gnu.org; Thu, 07 Dec 2017 11:26:29 -0500 Received: from eggs.gnu.org ([208.118.235.92]:59006) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eMt8J-0002Jo-KU for submit@debbugs.gnu.org; Thu, 07 Dec 2017 05:10:23 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eMt89-000482-FQ for submit@debbugs.gnu.org; Thu, 07 Dec 2017 05:10:18 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,URIBL_BLOCKED autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:56645) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1eMt89-00047m-Bt for submit@debbugs.gnu.org; Thu, 07 Dec 2017 05:10:13 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:49964) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1eMt85-0002AN-4I for bug-coreutils@gnu.org; Thu, 07 Dec 2017 05:10:13 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1eMt7y-0003qA-SA for bug-coreutils@gnu.org; Thu, 07 Dec 2017 
05:10:08 -0500 Received: from mxrout04.htp-tel.de ([81.14.243.18]:65302) by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1eMt7y-0003nF-HD for bug-coreutils@gnu.org; Thu, 07 Dec 2017 05:10:02 -0500 Received: from mxrin03.htp-tel.de ([81.14.243.120]) by mxrout04.htp-tel.de with ESMTPS id vB7A9w0J026398 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK) for ; Thu, 7 Dec 2017 11:09:58 +0100 (CET) Received: from gold.gold (a89-183-82-77.net-htp.de [89.183.82.77]) by mxrin03.htp-tel.de with ESMTPS id vB7A9vCO000875 (version=TLSv1.2 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO) for ; Thu, 7 Dec 2017 11:09:58 +0100 (CET) Received: from mroberts (helo=localhost) by gold.gold with local-esmtp (Exim 4.80) (envelope-from ) id 1eMt7z-0002mZ-1H for bug-coreutils@gnu.org; Thu, 07 Dec 2017 11:10:03 +0100 Date: Thu, 7 Dec 2017 11:10:02 +0100 (CET) From: Mark Roberts X-X-Sender: mroberts@gold.gold Message-ID: User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; format=flowed; charset=US-ASCII X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (mxrin03.htp-tel.de [172.19.11.6]); Thu, 07 Dec 2017 11:09:58 +0100 (CET) X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] [fuzzy] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Mailman-Approved-At: Thu, 07 Dec 2017 11:26:26 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) Dear maintainers, I am using fold version 8.13 on a Debian 3.2.93-1 > cat filename | fold If 'filename' contains utf8 characters consisting of more than one byte, fold will consider breaking the line inside such a 
character. There is no option to stop it doing that. Except, of course "-s": break at spaces. But that may not be what the user wants. According to man-page, it counts columns by default, not bytes. This seems not to be true. The switch "-b": count bytes, has no influence on the output in my test case. How to fix this? I presume that either (1) the default behavior (counting columns) is not what I expect, namely to count characters instead of bytes. This would have to be clarified in man-page. or (2) that the default isn't what the man-page says it is: possibly the default set in the code is to count bytes. This would be an error. or (3) that 'fold' fails to read my "LANG" environment variable which clearly states a UTF-8 locale. This, in 2017, is an error. Please write back to mroberts@rapid-arts-movement.de if you need example data or clarifications. Thank you, Mark Roberts From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: Assaf Gordon Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Thu, 07 Dec 2017 16:47:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: Mark Roberts , 29606@debbugs.gnu.org Received: via spool by 29606-submit@debbugs.gnu.org id=B29606.151266521026444 (code B ref 29606); Thu, 07 Dec 2017 16:47:02 +0000 Received: (at 29606) by debbugs.gnu.org; 7 Dec 2017 16:46:50 +0000 Received: from localhost ([127.0.0.1]:50883 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eMzJw-0006sQ-Sm for submit@debbugs.gnu.org; Thu, 07 Dec 2017 11:46:50 -0500 Received: from mail-it0-f45.google.com ([209.85.214.45]:47063) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eMzJu-0006sD-VI for 29606@debbugs.gnu.org; Thu, 07 Dec 2017 11:46:47 -0500 Received: by mail-it0-f45.google.com with SMTP id 
t1so16220532ite.5 for <29606@debbugs.gnu.org>; Thu, 07 Dec 2017 08:46:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to:content-language:content-transfer-encoding; bh=R0HclsnHtCH0CZ4Yet/a0dCUkFM0bBF3CC4uiW1GH+w=; b=uFBi5OE4NeGWyEJLeurbfzq8yVht8LOGr74eWkJN4RskSFq1vKnS0eamWeqZVb45IH MeHTe7SavBkPZKB8qSW/4elZ0UosLeEEUv0Y0EQOjZi1MTqntFD2moZyID8HUcT5u4wM HuuEmM3u1YcApWYgDTuh0OzyFs9/jlXlHZF2ibE9+9WRiXcoSjhp+i7ivJqiAChwT8uJ 6fsQqtFkttv2W1tW3fzCcn91zu0szjQnWkIuNk5J74PM4IFGOPVysomJBw6kDDSGmcnl 8bF5sL0x8mZStpHU5Iu+ViufHyriwdto3eoV4fK04DUxVXt9z0uD82LjmgrK+P+1c9M6 MNOQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=R0HclsnHtCH0CZ4Yet/a0dCUkFM0bBF3CC4uiW1GH+w=; b=dAsisVXI9diQFk717B5RniyqhKRbcbHbEyLxdiU1M9bDKhnJyBrQHiboMQC+ly7P2Y D4sK77RcAZD9d+n125tQD9vtUwCQ0Je/tUcHeWYd4Rgc2LizlfDbKFQXYy+lhjNz6lL5 54ES7Tmb7XmFaLd9bOngcLVDG5u6NnSU8SoXYkTE4/jBGvgfXPjc0uzdWCAL8UbrB/hb rb+a/DHQcnqBXGHdPgxnUSUGdcsCQLsoCiX9c5cvtjEFhpN8y7UeEF8LarIGQFtbimUG G6wiS1xlWafrp7PA4UfycKvSFB8/F3B4PXX/cxBEhCfrF4NasJpWcGAbcUbAsGtii6E/ UJ8Q== X-Gm-Message-State: AKGB3mIJW1Y1Nrb1gBsLuV7/KUBmAuWP5CtF6EOSvxUSdv8j53Vkk5bS NntPeuerm9nazwJn7QmEN0AFqkSs X-Google-Smtp-Source: AGs4zMa9NKedWKNPVsZaLwTF9WxWL2zcDOTLsL+49uXq6BYMu8AIj2fOISdrE5WYTgwipEHCNbyPPg== X-Received: by 10.36.190.205 with SMTP id i196mr2118846itf.84.1512665200633; Thu, 07 Dec 2017 08:46:40 -0800 (PST) Received: from [192.168.88.239] (moose.housegordon.com. 
[184.68.105.38]) by smtp.googlemail.com with ESMTPSA id p3sm2913009itc.39.2017.12.07.08.46.39 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Thu, 07 Dec 2017 08:46:39 -0800 (PST) References: From: Assaf Gordon Message-ID: Date: Thu, 7 Dec 2017 09:46:38 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit X-Spam-Score: -0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) Hello, On 2017-12-07 03:10 AM, Mark Roberts wrote: > I am using fold version 8.13 on a Debian 3.2.93-1 Do you mean Debian 7 (Wheezy) with Linux Kernel 3.2.93-1 ? >> cat filename | fold > > If 'filename' contains utf8 characters consisting of more than one byte, > fold will consider breaking the line inside such a character. There is > no option to stop it doing that. That is correct. "fold" currently (as of coreutils version 8.28) does not support UTF-8 characters. > or (3) that 'fold' fails to read my "LANG" environment variable which > clearly states a UTF-8 locale. This, in 2017, is an error. Considering you are using Debian 7 from 2013, and coreutils 8.13 from 2011, the fact it is 2017 is not very relevant. There is an on-going effort to add multibyte/utf8 support to all coreutils programs. You can read more about it here: https://crashcourse.housegordon.org/coreutils-multibyte-support.html The current development patches do have utf8 support in fold. > Please write back [...] if you need example data or clarifications. 
If you'd like to help us test these patches, please try an unofficial development snapshot here: https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz regards, - assaf From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: Mark Roberts Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Thu, 07 Dec 2017 17:36:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: Assaf Gordon Cc: 29606@debbugs.gnu.org Received: via spool by 29606-submit@debbugs.gnu.org id=B29606.151266811430733 (code B ref 29606); Thu, 07 Dec 2017 17:36:02 +0000 Received: (at 29606) by debbugs.gnu.org; 7 Dec 2017 17:35:14 +0000 Received: from localhost ([127.0.0.1]:50925 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eN04o-0007zc-3B for submit@debbugs.gnu.org; Thu, 07 Dec 2017 12:35:14 -0500 Received: from mxrout01.htp-tel.de ([81.14.243.49]:55660) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eMzXH-0007C9-Ug for 29606@debbugs.gnu.org; Thu, 07 Dec 2017 12:00:36 -0500 Received: from mxrin01.htp-tel.de ([81.14.243.120]) by mxrout01.htp-tel.de with ESMTPS id vB7H0SD1012938 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 7 Dec 2017 18:00:28 +0100 (CET) Received: from gold.gold (a89-183-17-13.net-htp.de [89.183.17.13]) by mxrin01.htp-tel.de with ESMTPS id vB7H0S4O012890 (version=TLSv1.2 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Thu, 7 Dec 2017 18:00:28 +0100 (CET) Received: from mroberts (helo=localhost) by gold.gold with local-esmtp (Exim 4.80) (envelope-from ) id 1eMzXF-00009d-Jy; Thu, 07 Dec 2017 18:00:33 +0100 Date: Thu, 7 Dec 2017 18:00:33 +0100 (CET) From: Mark Roberts X-X-Sender: mroberts@gold.gold In-Reply-To: Message-ID: References: User-Agent: Alpine 2.02 (DEB 
1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (mxrin01.htp-tel.de [172.19.11.4]); Thu, 07 Dec 2017 18:00:28 +0100 (CET) X-Spam-Score: -0.7 (/) X-Mailman-Approved-At: Thu, 07 Dec 2017 12:35:11 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) Dear Assaf, thanks for the clarification. Yes, I did mean Debian 7. I didn't realise quite how old my Debian was. I use it eight hours a day and it is stable. > Considering you are using Debian 7 from 2013, and coreutils 8.13 from > 2011, the fact it is 2017 is not very relevant. I hadn't seen it was quite so bad. Thanks for pointing it out. > If you'd like to help us test these patches, please try > an unofficial development snapshot here: > > https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz Will do. 
Mark From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: Mark Roberts Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Thu, 07 Dec 2017 17:36:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: Assaf Gordon Cc: 29606@debbugs.gnu.org Received: via spool by 29606-submit@debbugs.gnu.org id=B29606.151266811430739 (code B ref 29606); Thu, 07 Dec 2017 17:36:02 +0000 Received: (at 29606) by debbugs.gnu.org; 7 Dec 2017 17:35:14 +0000 Received: from localhost ([127.0.0.1]:50927 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eN04o-0007ze-Dd for submit@debbugs.gnu.org; Thu, 07 Dec 2017 12:35:14 -0500 Received: from mxrout01.htp-tel.de ([81.14.243.49]:40622) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eN00W-0007tI-BO for 29606@debbugs.gnu.org; Thu, 07 Dec 2017 12:30:49 -0500 Received: from mxrin01.htp-tel.de ([81.14.243.120]) by mxrout01.htp-tel.de with ESMTPS id vB7HUeZU017419 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Thu, 7 Dec 2017 18:30:41 +0100 (CET) Received: from gold.gold (a89-183-17-13.net-htp.de [89.183.17.13]) by mxrin01.htp-tel.de with ESMTPS id vB7HUecj009100 (version=TLSv1.2 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Thu, 7 Dec 2017 18:30:40 +0100 (CET) Received: from mroberts (helo=localhost) by gold.gold with local-esmtp (Exim 4.80) (envelope-from ) id 1eN00T-0000Ju-SM; Thu, 07 Dec 2017 18:30:45 +0100 Date: Thu, 7 Dec 2017 18:30:45 +0100 (CET) From: Mark Roberts X-X-Sender: mroberts@gold.gold In-Reply-To: Message-ID: References: User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="-1463809791-1020338266-1512667845=:12932" X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 
(mxrin01.htp-tel.de [172.19.11.4]); Thu, 07 Dec 2017 18:30:40 +0100 (CET) X-Spam-Score: -0.7 (/) X-Mailman-Approved-At: Thu, 07 Dec 2017 12:35:11 -0500 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. ---1463809791-1020338266-1512667845=:12932 Content-Type: TEXT/PLAIN; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8BIT Dear Assaf, > If you'd like to help us test these patches, please try > an unofficial development snapshot here: > > https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz I have taken a look and have an unexpected result: fold (version 8.28.39-79242) reacts to my LANG environment variable, which is good, but it ignores the --bytes or -b flag, which is surprising. My test case uses 'echo' to send the German sharp s character, which is a two-byte character, and a newline to 'fold --width 1'. I then use 'head -1' and 'wc --bytes' to count the bytes in line one. If UTF-8 is set, this should strip off one character (two bytes) plus one newline. It does. If UTF-8 is not set, it should strip off one byte and a newline. It does. If 'fold --width 1 --bytes' is used, it should always strip off one byte and a newline, regardless of environment settings. It doesn't. The '--bytes' switch has no effect. Here are the test cases (the new versions of coreutils are in src/): > export LANG="" > src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes 2 This is correct: fold splits the line between the two bytes and puts a newline after each. Counting bytes in the first line gives 2, including the newline. 
> export LANG="de_DE.UTF-8" > src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes 3 This is wrong: fold has kept both bytes of the character on line one, although fold --bytes --width 1 should split after one byte. > export LANG="" > src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes 2 This is correct: without language setting fold treats each byte as a character. > export LANG="de_DE.UTF-8" > src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes 3 This is correct: The two-byte character remains on line one. Have I misunderstood what "fold --bytes" is supposed to mean? Or is this an error? All the best, Mark ---1463809791-1020338266-1512667845=:12932-- From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: Mark Roberts Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Fri, 08 Dec 2017 12:05:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: Assaf Gordon Cc: 29606@debbugs.gnu.org Received: via spool by 29606-submit@debbugs.gnu.org id=B29606.151273466212646 (code B ref 29606); Fri, 08 Dec 2017 12:05:01 +0000 Received: (at 29606) by debbugs.gnu.org; 8 Dec 2017 12:04:22 +0000 Received: from localhost ([127.0.0.1]:51442 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNHOA-0003Hu-Mr for submit@debbugs.gnu.org; Fri, 08 Dec 2017 07:04:22 -0500 Received: from mxrout01.htp-tel.de ([81.14.243.49]:47556) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNHO9-0003Hg-9O for 29606@debbugs.gnu.org; Fri, 08 Dec 2017 07:04:22 -0500 Received: from mxrin04.htp-tel.de ([81.14.243.120]) by mxrout01.htp-tel.de with ESMTPS id vB8C4Ebb019012 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Fri, 8 Dec 2017 13:04:14 +0100 (CET) Received: from gold.gold 
(a89-183-48-230.net-htp.de [89.183.48.230]) by mxrin04.htp-tel.de with ESMTPS id vB8C4DBV024487 (version=TLSv1.2 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Fri, 8 Dec 2017 13:04:14 +0100 (CET) Received: from mroberts (helo=localhost) by gold.gold with local-esmtp (Exim 4.80) (envelope-from ) id 1eNHO8-00037b-BN; Fri, 08 Dec 2017 13:04:20 +0100 Date: Fri, 8 Dec 2017 13:04:20 +0100 (CET) From: Mark Roberts X-X-Sender: mroberts@gold.gold In-Reply-To: Message-ID: References: User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (mxrin04.htp-tel.de [172.19.11.7]); Fri, 08 Dec 2017 13:04:14 +0100 (CET) X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/) Dear Assaf, the reason for the unexpected behavior of 'fold', namely that specifying --bytes doesn't make it count bytes, is evident after a look at the source code. When --bytes is not specified, the program treats '\b', '\r' and '\t' specially. It assumes a tab width of eight (compile-time #define) and attempts to keep track of what the output will look like. This is absolutely not what I expected. But of course, when the program was first written, the words byte and character meant the same thing for printable characters. Printable bytes. I will attempt to suggest an improved text for the man-page so that others will not be surprised. 
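The column tracking described above can be illustrated with a minimal Python sketch. It is not the coreutils source, only a model of the behavior as described in this message: without --bytes, backspace moves one column back, carriage return resets the column to zero, and tab advances to the next multiple of eight. The names `TAB_WIDTH`, `next_column` and `width_of` are hypothetical and chosen for this illustration.

```python
TAB_WIDTH = 8  # assumed tab stop interval, per the message above


def next_column(column, ch, count_bytes):
    """Return the column position after writing ch at the given column."""
    if count_bytes:
        # With --bytes every input byte simply advances the count by one.
        return column + 1
    if ch == "\b":
        # Backspace moves one column back, but never before column 0.
        return max(column - 1, 0)
    if ch == "\r":
        # Carriage return goes back to the start of the line.
        return 0
    if ch == "\t":
        # Tab advances to the next multiple of TAB_WIDTH.
        return column + TAB_WIDTH - column % TAB_WIDTH
    return column + 1


def width_of(text, count_bytes=False):
    """Return the final column after writing every character of text."""
    column = 0
    for ch in text:
        column = next_column(column, ch, count_bytes)
    return column
```

Under these assumptions `width_of("a\tb")` is 9 because the tab jumps to column 8, while `width_of("a\tb", count_bytes=True)` is 3. This is the only difference the flag makes in the sketch, which is consistent with --bytes having no visible effect on input that contains none of the three special characters.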
Mark From unknown Mon Jun 23 15:02:08 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: Mark Roberts Subject: bug#29606: closed (Re: bug#29606: Command 'fold' dangerous with utf-8 input) Message-ID: References: X-Gnu-PR-Message: they-closed 29606 X-Gnu-PR-Package: coreutils Reply-To: 29606@debbugs.gnu.org Date: Sat, 09 Dec 2017 03:16:02 +0000 Content-Type: multipart/mixed; boundary="----------=_1512789362-26422-1" This is a multi-part message in MIME format... ------------=_1512789362-26422-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #29606: Command 'fold' dangerous with utf-8 input which was filed against the coreutils package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 29606@debbugs.gnu.org. --=20 29606: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D29606 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1512789362-26422-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 29606-done) by debbugs.gnu.org; 9 Dec 2017 03:15:24 +0000 Received: from localhost ([127.0.0.1]:53037 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNVbn-0006rJ-TW for submit@debbugs.gnu.org; Fri, 08 Dec 2017 22:15:24 -0500 Received: from mail-io0-f195.google.com ([209.85.223.195]:39187) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNVbl-0006r6-L5 for 29606-done@debbugs.gnu.org; Fri, 08 Dec 2017 22:15:22 -0500 Received: by mail-io0-f195.google.com with SMTP id h12so4359046iof.6 for <29606-done@debbugs.gnu.org>; Fri, 08 Dec 2017 19:15:21 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; 
h=subject:to:cc:references:from:message-id:date:user-agent :mime-version:in-reply-to:content-language:content-transfer-encoding; bh=p9jCdHX8nosZZc7E7J9HZ+L+N2HATllywQ7XO+JfnWM=; b=PyA7NobrMPM4WcFQNC5tyaLyvzqp6s2PHramZwHEoo+f3EBpOHnwDEzQhsIV+CnXs6 UY52l/Pj12R/TP7WbIwi8PI16DbccJ9pL7ANlMiwCKLV1vdqTvSbDaTMeChi3DkQpQu1 Ej2804xJbpoyqiDHmQ+HNftCCnZo5vN+oWKxX80YEcUVfVddHQ0+WWz5Mktefy0Oj85j ySWHDQpPIa37lPE/rn7kCsSZY6dd4cwKRUCl37wFMMEYqxc+bTTLVfEArDBglOVfBezF cYWh+VTuU3Sieh4p4xy/bfhp5ovOV+UToMSEPtck2wro+BSsMSxRyak1UQkCzFHLtWXq 9eeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:to:cc:references:from:message-id:date :user-agent:mime-version:in-reply-to:content-language :content-transfer-encoding; bh=p9jCdHX8nosZZc7E7J9HZ+L+N2HATllywQ7XO+JfnWM=; b=h44En0fJzD9YdYyb6/pKCRghh9EDYzgeV0n2oa1OHLJCXrNCzrmU7wYEwSzj+ziEOF QIT1x6mwCFM0xm1L2a9HkIkjzIvQTqMJr6cwqPdB95t1dOt0mE4COVN/89M0fyLk8Tdo SLQtAomiK7AFVUMSKK15ouH/N/zhdogt61OBOAFA5ku5R6FHlHai+VAdkaXRijDnZOP1 Y7SgLul+2d/uwU911i+Q+04CF3fmBu7NKnI8/fyc8xzg+wGfhwCO+dRvqEZnvoI9poNs qwGI7nQoNLNq37ciBlmxYDGi4xgEbtZTgRZx9OSgOAsl2EO0DjpsKOKMozlT70dO8K65 J82w== X-Gm-Message-State: AKGB3mJpp7jUAtNSiK9P8Lns6/mX95xgmOuEaF0WNCa8UUDp0hSUVaSK Rho4vXkCT0D4ytiOzA+vpBffiVID X-Google-Smtp-Source: AGs4zMbZG+/QRjoaXFzBUUaK+tMWdjEYaSLl1sGoaeAQrCYXep/tK6GUGm2bS0w4PKYe84nGyORngQ== X-Received: by 10.107.102.19 with SMTP id a19mr26490606ioc.108.1512789315502; Fri, 08 Dec 2017 19:15:15 -0800 (PST) Received: from [192.168.88.239] (moose.housegordon.com. 
[184.68.105.38]) by smtp.googlemail.com with ESMTPSA id 140sm2102142itx.3.2017.12.08.19.15.13 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 08 Dec 2017 19:15:13 -0800 (PST) Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input To: Mark Roberts References: From: Assaf Gordon Message-ID: Date: Fri, 8 Dec 2017 20:15:12 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 29606-done Cc: 29606-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Hello Mark, First, thank you for taking the time and effort to test our development snapshot, and reporting results back. This kind of feedback is critical in getting multibyte support ready. Second, I can confirm the behavior you are observing, reproduced here with 'od' for easier output: ## POSIX single-byte locale: $ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An 303 \n 237 \n $ echo "ß" | LC_ALL=C src/fold --width 1 | od -tc -An 303 \n 237 \n ## UTF8 locale: $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An 303 237 \n $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --width 1 | od -tc -An 303 237 \n On 2017-12-08 05:04 AM, Mark Roberts wrote: > When --bytes is not specified, the program treats '\b', '\r' and '\t' > specially. It assumes a tab width of eight (compile-time #define) and > attempts to keep track of what the output will look like. > > This is absolutely not what I expected. 
That is correct, and I share your sentiment: it also took me some time to track down why it behaves this way, and whether it's by design or a bug. > But of course, when the program > was first written, the words byte and character meant the same thing for > printable characters. Printable bytes. The reasoning for this behavior is explained in the Open Group's POSIX standard page for fold, in the "RATIONALE" section: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18 There, it is made clear: "Historical versions of the fold utility assumed 1 byte was one character and occupied one column position when written out. This is no longer always true. [...] Note that although the width for the -b option is in bytes, a line is never split in the middle of a character." Therefore, the current implementation (of the development version) is correct. > I will attempt to suggest an improved text for the man-page so that > others will not be surprised. I agree that once multibyte support is added to fold(1), the man pages, the help screen and texi manual must be updated to clearly indicate the "-b/--bytes" only applies to \b \t \r and never to multibyte characters. If you find the time to send such a patch - great! If not, I will add it sooner or later (hopefully sooner). As such I'm closing this bug report, but further discussion (and patches) are welcome by replying to this thread. 
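The POSIX rule quoted above (width measured in bytes, but a line never split in the middle of a character) can be sketched in Python as follows. This is an illustration under stated assumptions, not the fold implementation: the input is assumed to be valid UTF-8, the function name `fold_bytes` is hypothetical, and a character wider than the requested width is kept whole on its own line, matching the `303 237 \n` output shown earlier.

```python
def fold_bytes(line, width):
    """Split a UTF-8 byte string into pieces of at most width bytes,
    never breaking inside a multibyte character."""
    pieces = []
    current = b""
    for ch in line.decode("utf-8"):
        encoded = ch.encode("utf-8")
        # Start a new piece if this whole character would not fit.
        if current and len(current) + len(encoded) > width:
            pieces.append(current)
            current = b""
        current += encoded
    if current:
        pieces.append(current)
    return pieces
```

With this sketch, `fold_bytes("ß".encode("utf-8"), 1)` returns a single piece holding both bytes of the character, whereas a plain ASCII line such as `b"abcd"` folded at width 2 is split into `b"ab"` and `b"cd"`.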
regards, - assaf ------------=_1512789362-26422-1-- From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: Mark Roberts Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Sat, 09 Dec 2017 13:23:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: Assaf Gordon Cc: 29606-done@debbugs.gnu.org Received: via spool by 29606-done@debbugs.gnu.org id=D29606.151282577130667 (code D ref 29606); Sat, 09 Dec 2017 13:23:02 +0000 Received: (at 29606-done) by debbugs.gnu.org; 9 Dec 2017 13:22:51 +0000 Received: from localhost ([127.0.0.1]:53202 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNf5f-0007yZ-AF for submit@debbugs.gnu.org; Sat, 09 Dec 2017 08:22:51 -0500 Received: from mxrout01.htp-tel.de ([81.14.243.49]:47839) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNf5c-0007yK-Mq for 29606-done@debbugs.gnu.org; Sat, 09 Dec 2017 08:22:49 -0500 Received: from mxrin03.htp-tel.de ([81.14.243.120]) by mxrout01.htp-tel.de with ESMTPS id vB9DMffa006102 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Sat, 9 Dec 2017 14:22:42 +0100 (CET) Received: from gold.gold (a89-183-135-180.net-htp.de [89.183.135.180]) by mxrin03.htp-tel.de with ESMTPS id 
vB9DMfAM003383 (version=TLSv1.2 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO); Sat, 9 Dec 2017 14:22:41 +0100 (CET) Received: from mroberts (helo=localhost) by gold.gold with local-esmtp (Exim 4.80) (envelope-from ) id 1eNf5V-0003XZ-Rj; Sat, 09 Dec 2017 14:22:41 +0100 Date: Sat, 9 Dec 2017 14:22:41 +0100 (CET) From: Mark Roberts X-X-Sender: mroberts@gold.gold In-Reply-To: Message-ID: References: User-Agent: Alpine 2.02 (DEB 1266 2009-07-14) MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.4.3 (mxrin03.htp-tel.de [172.19.11.6]); Sat, 09 Dec 2017 14:22:41 +0100 (CET) X-Spam-Score: -0.7 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

Dear Assaf,

> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.

My suggestion for man-page:
==========================

Old:
---
-b, --bytes
    count bytes rather than columns

New:
---
-b, --bytes
    don't treat \b, \t, and \r specially

My suggestions for info-page:
============================

Old:
---
`-b'
`--bytes'
    Count bytes rather than columns, so that tabs, backspaces, and carriage returns are each counted as taking up one column, just like other characters.

New:
---
`-b'
`--bytes'
    Don't treat \b, \t, and \r specially. Instead tabs, backspaces, and carriage returns are each counted as taking up one column, just like other characters.

My suggestion for --help-output
===============================

Old:
---
-b, --bytes count bytes rather than columns

New:
---
-b, --bytes don't treat \b, \t, and \r specially

Hope this helps.
Mark

From unknown Mon Jun 23 15:02:08 2025 X-Loop: help-debbugs@gnu.org Subject: bug#29606: Command 'fold' dangerous with utf-8 input Resent-From: =?UTF-8?Q?P=C3=A1draig?= Brady Original-Sender: "Debbugs-submit" Resent-CC: bug-coreutils@gnu.org Resent-Date: Sat, 09 Dec 2017 23:51:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 29606 X-GNU-PR-Package: coreutils X-GNU-PR-Keywords: To: 29606@debbugs.gnu.org, assafgordon@gmail.com, mroberts@rapid-arts-movement.de Received: via spool by 29606-submit@debbugs.gnu.org id=B29606.151286344129202 (code B ref 29606); Sat, 09 Dec 2017 23:51:02 +0000 Received: (at 29606) by debbugs.gnu.org; 9 Dec 2017 23:50:41 +0000 Received: from localhost ([127.0.0.1]:54300 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNotF-0007av-E9 for submit@debbugs.gnu.org; Sat, 09 Dec 2017 18:50:41 -0500 Received: from mail.magicbluesmoke.com ([82.195.144.49]:50510) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1eNotD-0007an-MJ for 29606@debbugs.gnu.org; Sat, 09 Dec 2017 18:50:40 -0500 Received: from localhost.localdomain (c-73-158-116-184.hsd1.ca.comcast.net [73.158.116.184]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.magicbluesmoke.com (Postfix) with ESMTPSA id 0B1269AB2; Sat, 9 Dec 2017 23:50:37 +0000 (GMT) References: From: =?UTF-8?Q?P=C3=A1draig?= Brady Message-ID: Date: Sat, 9 Dec 2017 15:50:36 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Spam-Score: 0.0 (/) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/)

On
08/12/17 19:15, Assaf Gordon wrote:
> Hello Mark,
>
> First,
> thank you for taking the time and effort
> to test our development snapshot, and reporting results back.
> This kind of feedback is critical in getting multibyte support ready.
>
>
> Second,
> I can confirm the behavior you are observing, reproduced here
> with 'od' for easier output:
>
> ## POSIX single-byte locale:
>
> $ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
> 303 \n 237 \n
> $ echo "ß" | LC_ALL=C src/fold --width 1 | od -tc -An
> 303 \n 237 \n
>
> ## UTF8 locale:
>
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
> 303 237 \n
>
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --width 1 | od -tc -An
> 303 237 \n
>
>
> On 2017-12-08 05:04 AM, Mark Roberts wrote:
>> When --bytes is not specified, the program treats '\b', '\r' and '\t'
>> specially. It assumes a tab width of eight (compile-time #define) and
>> attempts to keep track of what the output will look like.
>>
>> This is absolutely not what I expected.
>
> That is correct, and I share your sentiment: it also took me some time
> to try and track down why it behaves this way, and whether it's by
> design or a bug.
>
>> But of course, when the program
>> was first written, the words byte and character meant the same thing for
>> printable characters. Printable bytes.
>
> The reasoning for this behavior is explained in the OpenGroup's POSIX
> standard page for fold, in the "RATIONALE" section:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
>
> There, it is made clear:
> "Historical versions of the fold utility assumed 1 byte was one
> character and occupied one column position when written out. This is
> no longer always true.
> [....]
> Note that although the width for the -b option is in bytes, a line is
> never split in the middle of a character."
>
> Therefore, the current implementation (of the development version) is
> correct.
>
>> I will attempt to suggest an improved text for the man-page so that
>> others will not be surprised.
>
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
>
> If you find the time to send such a patch - great!
> If not, I will add it sooner or later (hopefully sooner).
>
> As such I'm closing this bug report, but further discussion (and
> patches) are welcomed by replying to this thread.

Note while splitting in the middle of a character is incorrect,
it doesn't preclude approximate counting in --bytes.
This is the approach the current i18n patch takes:

  $ export LC_ALL=en_CA.UTF-8

  $ echo "ßß" | fold-i18n --bytes --width 1 | od -tc -An
  303 237 \n 303 237 \n \n

  $ echo "ßß" | fold-i18n --bytes --width 2 | od -tc -An
  303 237 \n 303 237 \n \n

  $ echo "ßß" | fold-assaf --bytes --width 2 | od -tc -An
  303 237 303 237 \n

The i18n version of fold also has a --characters option
to operate in the current fold-assaf mode.

I'm not convinced we want to be different from the i18n patch
in this regard at least.

cheers,
Pádraig.
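
For readers following the thread, the two counting modes compared above can be modelled roughly as below. This is only an illustrative sketch in Python, not code from either patch: the function names fold_bytes_whole_chars and fold_chars are made up for this note, the model is inferred solely from the od output quoted above, and it ignores the special handling of \b, \t and \r, display width, and newlines in the input.

```python
def fold_bytes_whole_chars(line, width, encoding="utf-8"):
    """Hypothetical model of byte-budget folding that never splits a character.

    The width is a budget in bytes, but a character whose encoding does not
    fit is moved whole to the next output line instead of being cut.
    """
    out, cur, cur_bytes = [], [], 0
    for ch in line:
        n = len(ch.encode(encoding))       # bytes this character occupies
        if cur and cur_bytes + n > width:  # over budget: start a new line
            out.append("".join(cur))
            cur, cur_bytes = [], 0
        cur.append(ch)                     # an oversized char still goes out whole
        cur_bytes += n
    if cur:
        out.append("".join(cur))
    return out


def fold_chars(line, width):
    """Hypothetical model of folding that counts each character as one unit."""
    return [line[i:i + width] for i in range(0, len(line), width)]


if __name__ == "__main__":
    print(fold_bytes_whole_chars("ßß", 1))  # ['ß', 'ß']
    print(fold_bytes_whole_chars("ßß", 2))  # ['ß', 'ß']
    print(fold_chars("ßß", 2))              # ['ßß']
```

Under these assumptions the first function gives the same line breaks as the fold-i18n --bytes runs quoted above, and the second gives the same break as the fold-assaf run at width 2, which is the difference in behaviour being discussed.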