GNU bug report logs - #29606
Command 'fold' dangerous with utf-8 input


Package: coreutils;

Reported by: Mark Roberts <mroberts <at> rapid-arts-movement.de>

Date: Thu, 7 Dec 2017 16:27:02 UTC

Severity: normal

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 29606 in the body.
You can then email your comments to 29606 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Thu, 07 Dec 2017 16:27:02 GMT)

Acknowledgement sent to Mark Roberts <mroberts <at> rapid-arts-movement.de>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 07 Dec 2017 16:27:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Mark Roberts <mroberts <at> rapid-arts-movement.de>
To: bug-coreutils <at> gnu.org
Subject: Command 'fold' dangerous with utf-8 input
Date: Thu, 7 Dec 2017 11:10:02 +0100 (CET)
Dear maintainers,

I am using fold version 8.13 on a Debian 3.2.93-1

> cat filename | fold

If 'filename' contains utf8 characters consisting of more than one byte, 
fold will consider breaking the line inside such a character. There is no 
option to stop it doing that.
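
The corruption described here can be reproduced without relying on any particular fold version. The sketch below splits the two bytes of "ß" (octal 303 237 in UTF-8) by hand, the way a byte-oriented split at width 1 would, and asks iconv whether each result still decodes as UTF-8. The helper name is_valid_utf8 is made up for illustration, and the sketch assumes an iconv that exits non-zero on malformed input.

```shell
#!/bin/sh
# Hypothetical helper (not part of coreutils): succeed only if stdin
# decodes cleanly as UTF-8. Assumes iconv exits non-zero on invalid input.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "ß" followed by a newline: both bytes of the character stay together.
if printf '\303\237\n' | is_valid_utf8; then
    echo "intact line: valid UTF-8"
fi

# The same two bytes with a newline forced between them, as a
# byte-oriented split at width 1 would produce.
if ! printf '\303\n\237\n' | is_valid_utf8; then
    echo "split line: no longer valid UTF-8"
fi
```

Piping real fold output through the same helper is one way to check whether a given build splits characters.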

Except, of course "-s": break at spaces. But that may not be what the user 
wants.

According to the man page, it counts columns by default, not bytes. This seems 
not to be true. The switch "-b" (count bytes) has no influence on the 
output in my test case.

How to fix this?

I presume that either (1) the default behavior (counting columns) is not 
what I expect, namely to count characters instead of bytes. This would 
have to be clarified in man-page.

or (2) that the default isn't what the man-page says it is: possibly the 
default set in the code is to count bytes. This would be an error.

or (3) that 'fold' fails to read my "LANG" environment variable which 
clearly states a UTF-8 locale. This, in 2017, is an error.


Please write back to mroberts <at> rapid-arts-movement.de if you need example 
data or clarifications.

Thank you,
Mark Roberts




Information forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Thu, 07 Dec 2017 16:47:02 GMT)

Message #8 received at 29606 <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Mark Roberts <mroberts <at> rapid-arts-movement.de>, 29606 <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Thu, 7 Dec 2017 09:46:38 -0700
Hello,

On 2017-12-07 03:10 AM, Mark Roberts wrote:

> I am using fold version 8.13 on a Debian 3.2.93-1

Do you mean Debian 7 (Wheezy) with Linux Kernel 3.2.93-1 ?

>> cat filename | fold
> 
> If 'filename' contains utf8 characters consisting of more than one byte, 
> fold will consider breaking the line inside such a character. There is 
> no option to stop it doing that.

That is correct. "fold" currently (as of coreutils version 8.28) does 
not support UTF-8 characters.

> or (3) that 'fold' fails to read my "LANG" environment variable which 
> clearly states a UTF-8 locale. This, in 2017, is an error.

Considering you are using Debian 7 from 2013,
and coreutils 8.13 from 2011, the fact it is 2017 is not very relevant.

There is an on-going effort to add multibyte/utf8 support to all 
coreutils programs. You can read more about it here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html

The current development patches do have utf8 support in fold.

> Please write back [...] if you need example data or clarifications.

If you'd like to help us test these patches, please try
an unofficial development snapshot here:

https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz



regards,
 - assaf




Information forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Thu, 07 Dec 2017 17:36:02 GMT)

Message #11 received at 29606 <at> debbugs.gnu.org:

From: Mark Roberts <mroberts <at> rapid-arts-movement.de>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29606 <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Thu, 7 Dec 2017 18:00:33 +0100 (CET)
Dear Assaf,

thanks for the clarification. Yes, I did mean Debian 7.

I didn't realise quite how old my Debian was. I use it eight hours a day 
and it is stable.

> Considering you are using Debian 7 from 2013, and coreutils 8.13 from 
> 2011, the fact it is 2017 is not very relevant.

I hadn't seen it was quite so bad. Thanks for pointing it out.

> If you'd like to help us test these patches, please try
> an unofficial development snapshot here:
>
> https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz

Will do.
Mark




Information forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Thu, 07 Dec 2017 17:36:02 GMT)

Message #14 received at 29606 <at> debbugs.gnu.org:

From: Mark Roberts <mroberts <at> rapid-arts-movement.de>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29606 <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Thu, 7 Dec 2017 18:30:45 +0100 (CET)
Dear Assaf,

> If you'd like to help us test these patches, please try
> an unofficial development snapshot here:
>
> https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz

I have taken a look and have an unexpected result:

fold (version 8.28.39-79242) reacts to my LANG environment variable, 
which is good, but it ignores the --bytes or -b flag, which is surprising.

My test case uses 'echo' to send the German sharp s character (ß), which is a 
two-byte character in UTF-8, and a newline to 'fold --width 1'. I then use 
'head -1' and 'wc --bytes' to count the bytes in line one.

If UTF-8 is set, this should strip off one character (two bytes) plus one 
newline. It does.

If UTF-8 is not set, it should strip off one byte and a newline. It does.

If 'fold --width 1 --bytes' is used, it should always strip off one byte 
and a newline, regardless of environment settings. It doesn't. The 
'--bytes' switch has no effect.

Here are the test cases (the new versions of core-utils are in src/):

> export LANG=""
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
2

This is correct: fold splits the line between the two bytes and puts a 
newline after each. Counting bytes in the first line gives 2, including 
the newline.

> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
3

This is wrong: fold has kept both bytes of the character on line one, 
although fold --bytes --width 1 should split after one byte.

> export LANG=""
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
2

This is correct: without language setting fold treats each byte as a 
character.

> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
3

This is correct: The two-byte character remains on line one.


Have I misunderstood what "fold --bytes" is supposed to mean? Or is this 
an error?

All the best,
Mark

Information forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Fri, 08 Dec 2017 12:05:01 GMT)

Message #17 received at 29606 <at> debbugs.gnu.org:

From: Mark Roberts <mroberts <at> rapid-arts-movement.de>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29606 <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Fri, 8 Dec 2017 13:04:20 +0100 (CET)
Dear Assaf,

the reason for the unexpected behavior of 'fold', namely that specifying 
--bytes doesn't make it count bytes, is evident after a look at the source 
code.

When --bytes is not specified, the program treats '\b', '\r' and '\t' 
specially. It assumes a tab width of eight (compile-time #define) and 
attempts to keep track of what the output will look like.
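
The column bookkeeping described here can be sketched in a few lines of shell. This is only an illustration of the idea, not the coreutils source: the function name next_column and its "bs"/"cr"/"tab" labels are invented for the example, and the tab width of eight is taken from the description above. With --bytes, every character would simply take the last branch and advance by one.

```shell
#!/bin/sh
# Illustrative sketch only, not the coreutils source.
# next_column COL KIND prints the column reached after writing one character.
# KIND is "bs", "cr", "tab", or anything else for an ordinary character.
TAB_WIDTH=8   # tab width of eight, as described above

next_column() {
    col=$1
    case $2 in
        bs)  if [ "$col" -gt 0 ]; then col=$((col - 1)); fi ;;  # backspace moves left
        cr)  col=0 ;;                                           # carriage return
        tab) col=$((col + TAB_WIDTH - col % TAB_WIDTH)) ;;      # next tab stop
        *)   col=$((col + 1)) ;;                                # ordinary character
    esac
    echo "$col"
}

next_column 3 tab    # prints 8
next_column 8 tab    # prints 16
next_column 5 bs     # prints 4
next_column 5 cr     # prints 0
next_column 5 x      # prints 6
```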

This is absolutely not what I expected. But of course, when the program 
was first written, the words byte and character meant the same thing 
for printable characters. Printable bytes.

I will attempt to suggest an improved text for the man-page so that 
others will not be surprised.

Mark




Reply sent to Assaf Gordon <assafgordon <at> gmail.com>:
You have taken responsibility. (Sat, 09 Dec 2017 03:16:02 GMT)

Notification sent to Mark Roberts <mroberts <at> rapid-arts-movement.de>:
bug acknowledged by developer. (Sat, 09 Dec 2017 03:16:02 GMT)

Message #22 received at 29606-done <at> debbugs.gnu.org:

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Mark Roberts <mroberts <at> rapid-arts-movement.de>
Cc: 29606-done <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Fri, 8 Dec 2017 20:15:12 -0700
Hello Mark,

First,
thank you for taking the time and effort
to test our development snapshot, and reporting results back.
This kind of feedback is critical in getting multibyte support ready.


Second,
I can confirm the behavior you are observing, reproduced here
with 'od' for easier output:

## POSIX single-byte locale:

$ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
 303  \n 237  \n
$ echo "ß" | LC_ALL=C src/fold         --width 1 | od -tc -An
 303  \n 237  \n

## UTF8 locale:

$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
 303 237  \n

$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold         --width 1 | od -tc -An
 303 237  \n
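
For readers unfamiliar with this output: od -tc prints non-ASCII bytes in octal, so 303 237 are the two bytes that encode "ß" in UTF-8. As a worked check, the code point can be recovered with shell arithmetic. The variable name code_point is chosen for illustration only.

```shell
#!/bin/sh
# A two-byte UTF-8 sequence has the form 110xxxxx 10xxxxxx.
# Shell arithmetic reads a leading 0 as octal, matching od's output.
code_point=$(( ((0303 & 0x1F) << 6) | (0237 & 0x3F) ))
printf 'U+%04X\n' "$code_point"   # prints U+00DF, LATIN SMALL LETTER SHARP S
```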


On 2017-12-08 05:04 AM, Mark Roberts wrote:
> When --bytes is not specified, the program treats '\b', '\r' and '\t' 
> specially. It assumes a tab width of eight (compile-time #define) and 
> attempts to keep track of what the output will look like.
> 
> This is absolutely not what I expected.

That is correct, and I share your sentiment: it also took me some time
to try and track down why it behaves this way, and whether it's by 
design or a bug.

> But of course, when the program 
> was first written, the words byte and character meant the same thing for 
> printable characters. Printable bytes.

The reasoning for this behavior is explained in the OpenGroup's POSIX 
standard page for fold, in the "RATIONALE" section:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18

There, it is made clear:
  "Historical versions of the fold utility assumed 1 byte was one
  character and occupied one column position when written out. This is
  no longer always true.
  [....]
  Note that although the width for the -b option is in bytes, a line is
  never split in the middle of a character."

Therefore, the current implementation (of the development version) is 
correct.

> I will attempt to suggest an improved text for the man-page so that 
> others will not be surprised.

I agree that once multibyte support is added to fold(1), the man pages,
the help screen and texi manual must be updated to clearly
indicate that "-b/--bytes" only applies to \b \t \r and never to
multibyte characters.

If you find the time to send such a patch - great!
If not, I will add it sooner or later (hopefully sooner).

As such I'm closing this bug report, but further discussion (and
patches) are welcomed by replying to this thread.

regards,
 - assaf






Information forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Sat, 09 Dec 2017 13:23:02 GMT)

Message #25 received at 29606-done <at> debbugs.gnu.org:

From: Mark Roberts <mroberts <at> rapid-arts-movement.de>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29606-done <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Sat, 9 Dec 2017 14:22:41 +0100 (CET)
Dear Assaf,

> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.

My suggestion for man-page:
==========================

Old:
---

-b, --bytes
              count bytes rather than columns


New:
---

-b, --bytes
              don't treat \b, \t, and \r specially


My suggestions for info-page:
============================

Old:
---

`-b'
`--bytes'
     Count bytes rather than columns, so that tabs, backspaces, and
     carriage returns are each counted as taking up one column, just
     like other characters.


New:
---

`-b'
`--bytes'
     Don't treat \b, \t, and \r specially. Instead tabs, backspaces, and
     carriage returns are each counted as taking up one column, just
     like other characters.


My suggestion for --help-output
===============================

Old:
---

  -b, --bytes         count bytes rather than columns


New:
---

  -b, --bytes         don't treat \b, \t, and \r specially



Hope this helps.
Mark




Information forwarded to bug-coreutils <at> gnu.org:
bug#29606; Package coreutils. (Sat, 09 Dec 2017 23:51:02 GMT)

Message #28 received at 29606 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: 29606 <at> debbugs.gnu.org, assafgordon <at> gmail.com,
 mroberts <at> rapid-arts-movement.de
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Sat, 9 Dec 2017 15:50:36 -0800
On 08/12/17 19:15, Assaf Gordon wrote:
> [...]
> The reasoning for this behavior is explained in the OpenGroup's POSIX 
> standard page for fold, in the "RATIONALE" section:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
> 
> There, it is made clear:
>    "Historical versions of the fold utility assumed 1 byte was one
>    character and occupied one column position when written out. This is
>    no longer always true.
>    [....]
>    Note that although the width for the -b option is in bytes, a line is
>    never split in the middle of a character."
> 
> Therefore, the current implementation (of the development version) is 
> correct.
> 
>> I will attempt to suggest an improved text for the man-page so that 
>> others will not be surprised.
> 
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate that "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
> 
> If you find the time to send such a patch - great!
> If not, I will add it sooner or later (hopefully sooner).
> 
> As such I'm closing this bug report, but further discussion (and
> patches) are welcomed by replying to this thread.

Note that while splitting in the middle of a character is incorrect,
that doesn't preclude approximate counting in --bytes.
This is the approach the current i18n patch takes:

$ export LC_ALL=en_CA.UTF-8
$ echo "ßß" | fold-i18n --bytes --width 1 | od -tc -An
 303 237  \n 303 237  \n  \n
$ echo "ßß" | fold-i18n --bytes --width 2 | od -tc -An
 303 237  \n 303 237  \n  \n
$ echo "ßß" | fold-assaf --bytes --width 2 | od -tc -An
 303 237 303 237  \n

The i18n version of fold also has a --characters option
to operate in the current fold-assaf mode.
I'm not convinced we want to be different from the i18n patch in this regard at least.
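
The "approximate counting" idea can be sketched in plain shell: measure the width in bytes, but only start a new line at a character boundary. This is an illustration under stated assumptions, not the code of the i18n patch. The function names utf8_len and fold_bytes_keep_chars are hypothetical, the input is assumed to be a single line of valid UTF-8, and each folded chunk is printed as hex bytes rather than raw text.

```shell
#!/bin/sh
# Illustrative sketch only; not the i18n patch. Assumes valid UTF-8 input.

# utf8_len BYTE: print the length of the character that starts with
# BYTE (a decimal value), or 0 if BYTE is a continuation byte (10xxxxxx).
utf8_len() {
    if   [ "$1" -lt 128 ]; then echo 1
    elif [ "$1" -ge 240 ]; then echo 4
    elif [ "$1" -ge 224 ]; then echo 3
    elif [ "$1" -ge 192 ]; then echo 2
    else echo 0
    fi
}

# fold_bytes_keep_chars WIDTH: read one line from stdin and print each
# folded chunk as hex bytes, one chunk per output line.
fold_bytes_keep_chars() {
    width=$1
    count=0
    line=
    for byte in $(od -An -v -tu1); do
        if [ "$byte" -eq 10 ]; then continue; fi   # drop the input newline
        len=$(utf8_len "$byte")
        # Break only before the first byte of a character that would not fit.
        if [ "$len" -gt 0 ] && [ "$count" -gt 0 ] &&
           [ $((count + len)) -gt "$width" ]; then
            printf '%s\n' "$line"
            line=
            count=0
        fi
        line="$line $(printf '%02x' "$byte")"
        count=$((count + 1))
    done
    if [ -n "$line" ]; then printf '%s\n' "$line"; fi
}

printf '\303\237\303\237\n' | fold_bytes_keep_chars 2
printf '\303\237\303\237\n' | fold_bytes_keep_chars 4
```

With a width of 1 or 2 the two characters land on separate lines, which is the same grouping as the fold-i18n runs above. With a width of 4 both fit on one line. A character wider than the requested width is emitted whole rather than split.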

cheers,
Pádraig.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 07 Jan 2018 12:24:04 GMT)

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.