GNU bug report logs - #29606
Command 'fold' dangerous with utf-8 input
Report forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Thu, 07 Dec 2017 16:27:02 GMT)
Acknowledgement sent to Mark Roberts <mroberts <at> rapid-arts-movement.de>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 07 Dec 2017 16:27:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Dear maintainers,
I am using fold version 8.13 on a Debian 3.2.93-1
> cat filename | fold
If 'filename' contains utf8 characters consisting of more than one byte,
fold will consider breaking the line inside such a character. There is no
option to stop it doing that.
Except, of course "-s": break at spaces. But that may not be what the user
wants.
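To make the failure mode concrete, here is a minimal Python sketch of a purely byte-oriented fold. It is an illustration only, not GNU fold's code, and the helper name `fold_raw_bytes` is made up for this example. It shows that cutting by byte count can land between the two bytes of 'ß' and leave both pieces as invalid UTF-8.

```python
# Illustrative sketch only: a naive byte-oriented fold, not GNU fold's code.
def fold_raw_bytes(data: bytes, width: int) -> list:
    """Cut data into chunks of at most `width` bytes, ignoring character boundaries."""
    return [data[i:i + width] for i in range(0, len(data), width)]


# "Stra\u00dfe" is "Strasse" written with a sharp s; the sharp s encodes as C3 9F.
line = "Stra\u00dfe".encode("utf-8")
chunks = fold_raw_bytes(line, 5)  # the cut falls between C3 and 9F

for chunk in chunks:
    try:
        print(chunk.decode("utf-8"))
    except UnicodeDecodeError:
        print("invalid UTF-8:", chunk)
```

Both chunks fail to decode here, because the first ends with a lone lead byte and the second starts with a lone continuation byte.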
According to the man page, it counts columns by default, not bytes. This
does not seem to be true. The switch "-b" (count bytes) has no influence on
the output in my test case.
How to fix this?
I presume that either (1) the default behavior (counting columns) is not
what I expect, namely to count characters instead of bytes. This would
have to be clarified in the man page.
or (2) that the default isn't what the man page says it is: possibly the
default set in the code is to count bytes. This would be an error.
or (3) that 'fold' fails to read my "LANG" environment variable which
clearly states a UTF-8 locale. This, in 2017, is an error.
Please write back to mroberts <at> rapid-arts-movement.de if you need example
data or clarifications.
Thank you,
Mark Roberts
Information forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Thu, 07 Dec 2017 16:47:02 GMT)
Message #8 received at 29606 <at> debbugs.gnu.org:
Hello,
On 2017-12-07 03:10 AM, Mark Roberts wrote:
> I am using fold version 8.13 on a Debian 3.2.93-1
Do you mean Debian 7 (Wheezy) with Linux kernel 3.2.93-1?
>> cat filename | fold
>
> If 'filename' contains utf8 characters consisting of more than one byte,
> fold will consider breaking the line inside such a character. There is
> no option to stop it doing that.
That is correct. "fold" currently (as of coreutils version 8.28) does
not support UTF-8 characters.
> or (3) that 'fold' fails to read my "LANG" environment variable which
> clearly states a UTF-8 locale. This, in 2017, is an error.
Considering you are using Debian 7 from 2013,
and coreutils 8.13 from 2011, the fact it is 2017 is not very relevant.
There is an ongoing effort to add multibyte/UTF-8 support to all
coreutils programs. You can read more about it here:
https://crashcourse.housegordon.org/coreutils-multibyte-support.html
The current development patches do have utf8 support in fold.
> Please write back [...] if you need example data or clarifications.
If you'd like to help us test these patches, please try
an unofficial development snapshot here:
https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz
regards,
- assaf
Information forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Thu, 07 Dec 2017 17:36:02 GMT)
Message #11 received at 29606 <at> debbugs.gnu.org:
Dear Assaf,
thanks for the clarification. Yes, I did mean Debian 7.
I didn't realise quite how old my Debian was. I use it eight hours a day
and it is stable.
> Considering you are using Debian 7 from 2013, and coreutils 8.13 from
> 2011, the fact it is 2017 is not very relevant.
I hadn't seen it was quite so bad. Thanks for pointing it out.
> If you'd like to help us test these patches, please try
> an unofficial development snapshot here:
>
> https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz
Will do.
Mark
Information forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Thu, 07 Dec 2017 17:36:02 GMT)
Message #14 received at 29606 <at> debbugs.gnu.org:
Dear Assaf,
> If you'd like to help us test these patches, please try
> an unofficial development snapshot here:
>
> https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz
I have taken a look and have an unexpected result:
fold (version 8.28.39-79242) reacts to my LANG environment variable,
which is good, but it ignores the --bytes or -b flag, which is surprising.
My test case uses 'echo' to send the German sharp s character, which is a
two byte character, and a newline to 'fold --width 1'. I then use 'head
-1' and 'wc --bytes' to count the bytes in line one.
If UTF-8 is set, this should strip off one character (two bytes) plus one
newline. It does.
If UTF-8 is not set, it should strip off one byte and a newline. It does.
If 'fold --width 1 --bytes' is used, it should always strip off one byte
and a newline, regardless of environment settings. It doesn't. The
'--bytes' switch has no effect.
Here are the test cases (the new versions of coreutils are in src/):
> export LANG=""
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
2
This is correct: fold splits the line between the two bytes and puts a
newline after each. Counting bytes in the first line gives 2, including
the newline.
> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
3
This is wrong: fold has kept both bytes of the character on line one,
although fold --bytes --width 1 should split after one byte.
> export LANG=""
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
2
This is correct: without language setting fold treats each byte as a
character.
> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
3
This is correct: The two-byte character remains on line one.
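The byte counts in these test cases can be checked with a small Python model. This is only a sketch of the arithmetic, not of fold itself: the function name `first_line_byte_count` and the boolean `utf8_locale` are made up here, and the model assumes that a non-UTF-8 locale treats every byte as one character.

```python
# Sketch of the arithmetic behind the test cases above; not fold's real logic.
def first_line_byte_count(data: bytes, width: int, utf8_locale: bool) -> int:
    """Bytes in the first folded line, including its trailing newline."""
    if utf8_locale:
        # One unit per decoded character, re-encoded so its byte length is visible.
        units = [ch.encode("utf-8") for ch in data.decode("utf-8")]
    else:
        # Assumption: without a UTF-8 locale, each byte is its own character.
        units = [bytes([b]) for b in data]
    first_line = b"".join(units[:width]) + b"\n"
    return len(first_line)


sharp_s = b"\xc3\x9f"  # UTF-8 encoding of the German sharp s
print(first_line_byte_count(sharp_s, 1, utf8_locale=False))  # 2, as with LANG=""
print(first_line_byte_count(sharp_s, 1, utf8_locale=True))   # 3, as with a UTF-8 locale
```

Under this model, a --bytes mode that really counted bytes would give 2 in both locales, which is the result the report expected.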
Have I misunderstood what "fold --bytes" is supposed to mean? Or is this
an error?
All the best,
Mark
Information forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Fri, 08 Dec 2017 12:05:01 GMT)
Message #17 received at 29606 <at> debbugs.gnu.org:
Dear Assaf,
the reason for the unexpected behavior of 'fold', namely that specifying
--bytes doesn't make it count bytes, is evident after a look at the source
code.
When --bytes is not specified, the program treats '\b', '\r' and '\t'
specially. It assumes a tab width of eight (compile-time #define) and
attempts to keep track of what the output will look like.
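The column tracking described above can be sketched in Python roughly as follows. This is a paraphrase of the described behavior, not a copy of the C source; the names `adjust_column`, `display_width` and `TAB_WIDTH` are chosen for this illustration.

```python
TAB_WIDTH = 8  # fixed tab stop, as described in the message above


def adjust_column(column: int, ch: str, count_bytes: bool = False) -> int:
    """Return the output column after writing ch, starting from `column`."""
    if not count_bytes:
        if ch == "\b":
            return max(column - 1, 0)  # backspace moves left, never below 0
        if ch == "\r":
            return 0                   # carriage return goes back to column 0
        if ch == "\t":
            return column + TAB_WIDTH - column % TAB_WIDTH  # next tab stop
    return column + 1                  # with --bytes, everything counts as 1


def display_width(text: str, count_bytes: bool = False) -> int:
    """Column reached after writing all of `text` from column 0."""
    column = 0
    for ch in text:
        column = adjust_column(column, ch, count_bytes)
    return column


print(display_width("ab\tc"))                    # 9: the tab jumps from 2 to 8
print(display_width("ab\tc", count_bytes=True))  # 4: each character counts as 1
```

Seen this way, the only thing the flag changes is whether these three control characters get special column treatment.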
This is absolutely not what I expected. But of course, when the program
was first written, the words byte and character meant the same thing
for printable characters. Printable bytes.
I will attempt to suggest an improved text for the man-page so that
others will not be surprised.
Mark
Reply sent to Assaf Gordon <assafgordon <at> gmail.com>: You have taken responsibility. (Sat, 09 Dec 2017 03:16:02 GMT)
Notification sent to Mark Roberts <mroberts <at> rapid-arts-movement.de>: bug acknowledged by developer. (Sat, 09 Dec 2017 03:16:02 GMT)
Message #22 received at 29606-done <at> debbugs.gnu.org:
Hello Mark,
First,
thank you for taking the time and effort
to test our development snapshot, and reporting results back.
This kind of feedback is critical in getting multibyte support ready.
Second,
I can confirm the behavior you are observing, reproduced here
with 'od' for easier output:
## POSIX single-byte locale:
$ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
303 \n 237 \n
$ echo "ß" | LC_ALL=C src/fold --width 1 | od -tc -An
303 \n 237 \n
## UTF8 locale:
$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
303 237 \n
$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --width 1 | od -tc -An
303 237 \n
On 2017-12-08 05:04 AM, Mark Roberts wrote:
> When --bytes is not specified, the program treats '\b', '\r' and '\t'
> specially. It assumes a tab width of eight (compile-time #define) and
> attempts to keep track of what the output will look like.
>
> This is absolutely not what I expected.
That is correct, and I share your sentiment: it also took me some time
to try and track down why it behaves this way, and whether it's by
design or a bug.
> But of course, when the program
> was first written, the words byte and character meant the same thing for
> printable characters. Printable bytes.
The reasoning for this behavior is explained in the OpenGroup's POSIX
standard page for fold, in the "RATIONALE" section:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
There, it is made clear:
"Historical versions of the fold utility assumed 1 byte was one
character and occupied one column position when written out. This is
no longer always true.
[....]
Note that although the width for the -b option is in bytes, a line is
never split in the middle of a character."
Therefore, the current implementation (of the development version) is
correct.
> I will attempt to suggest an improved text for the man-page so that
> others will not be surprised.
I agree that once multibyte support is added to fold(1), the man pages,
the help screen and texi manual must be updated to clearly
indicate the "-b/--bytes" only applies to \b \t \r and never to
multibyte characters.
If you find the time to send such a patch - great!
If not, I will add it sooner or later (hopefully sooner).
As such I'm closing this bug report, but further discussion (and
patches) are welcomed by replying to this thread.
regards,
- assaf
Information forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Sat, 09 Dec 2017 13:23:02 GMT)
Message #25 received at 29606-done <at> debbugs.gnu.org:
Dear Assaf,
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
My suggestion for man-page:
==========================
Old:
---
-b, --bytes
count bytes rather than columns
New:
---
-b, --bytes
don't treat \b, \t, and \r specially
My suggestions for info-page:
============================
Old:
---
`-b'
`--bytes'
Count bytes rather than columns, so that tabs, backspaces, and
carriage returns are each counted as taking up one column, just
like other characters.
New:
---
`-b'
`--bytes'
Don't treat \b, \t, and \r specially. Instead tabs, backspaces, and
carriage returns are each counted as taking up one column, just
like other characters.
My suggestion for --help-output
===============================
Old:
---
-b, --bytes count bytes rather than columns
New:
---
-b, --bytes don't treat \b, \t, and \r specially
Hope this helps.
Mark
Information forwarded to bug-coreutils <at> gnu.org: bug#29606; Package coreutils. (Sat, 09 Dec 2017 23:51:02 GMT)
Message #28 received at 29606 <at> debbugs.gnu.org:
On 08/12/17 19:15, Assaf Gordon wrote:
> Hello Mark,
>
> First,
> thank you for taking the time and effort
> to test our development snapshot, and reporting results back.
> This kind of feedback is critical in getting multibyte support ready.
>
>
> Second,
> I can confirm the behavior you are observing, reproduced here
> with 'od' for easier output:
>
> ## POSIX single-byte locale:
>
> $ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
> 303 \n 237 \n
> $ echo "ß" | LC_ALL=C src/fold --width 1 | od -tc -An
> 303 \n 237 \n
>
> ## UTF8 locale:
>
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
> 303 237 \n
>
> $ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --width 1 | od -tc -An
> 303 237 \n
>
>
> On 2017-12-08 05:04 AM, Mark Roberts wrote:
>> When --bytes is not specified, the program treats '\b', '\r' and '\t'
>> specially. It assumes a tab width of eight (compile-time #define) and
>> attempts to keep track of what the output will look like.
>>
>> This is absolutely not what I expected.
>
> That is correct, and I share your sentiment: it also took me some time
> to try and track down why it behaves this way, and whether it's by
> design or a bug.
>
>> But of course, when the program
>> was first written, the words byte and character meant the same thing for
>> printable characters. Printable bytes.
>
> The reasoning for this behavior is explained in the OpenGroup's POSIX
> standard page for fold, in the "RATIONALE" section:
> http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18
>
> There, it is made clear:
> "Historical versions of the fold utility assumed 1 byte was one
> character and occupied one column position when written out. This is
> no longer always true.
> [....]
> Note that although the width for the -b option is in bytes, a line is
> never split in the middle of a character."
>
> Therefore, the current implementation (of the development version) is
> correct.
>
>> I will attempt to suggest an improved text for the man-page so that
>> others will not be surprised.
>
> I agree that once multibyte support is added to fold(1), the man pages,
> the help screen and texi manual must be updated to clearly
> indicate the "-b/--bytes" only applies to \b \t \r and never to
> multibyte characters.
>
> If you find the time to send such a patch - great!
> If not, I will add it sooner or later (hopefully sooner).
>
> As such I'm closing this bug report, but further discussion (and
> patches) are welcomed by replying to this thread.
Note that while splitting in the middle of a character is incorrect,
that doesn't preclude approximate byte counting with --bytes.
This is the approach the current i18n patch takes:
$ export LC_ALL=en_CA.UTF-8
$ echo "ßß" | fold-i18n --bytes --width 1 | od -tc -An
303 237 \n 303 237 \n \n
$ echo "ßß" | fold-i18n --bytes --width 2 | od -tc -An
303 237 \n 303 237 \n \n
$ echo "ßß" | fold-assaf --bytes --width 2 | od -tc -An
303 237 303 237 \n
The i18n version of fold also has a --characters option
to operate in the current fold-assaf mode.
I'm not convinced we want to be different from the i18n patch in this regard at least.
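The difference between the two approaches can be sketched in Python. This is a simplified model under stated assumptions, not either patch's code: the function `fold_whole_chars` and its `unit` parameter are hypothetical, and newline handling is left out, so the extra trailing empty line in the fold-i18n output above is not reproduced.

```python
# Simplified model of the two modes discussed above; not the patches' code.
def fold_whole_chars(text: str, width: int, unit: str = "chars") -> list:
    """Fold text without ever splitting a character.

    unit="chars": every character costs 1 (like the fold-assaf output).
    unit="bytes": every character costs its UTF-8 length, so the byte
    count is approximate when a character is wider than the budget.
    """
    lines, current, used = [], "", 0
    for ch in text:
        cost = len(ch.encode("utf-8")) if unit == "bytes" else 1
        # Break before ch if it would exceed the budget, but never emit
        # an empty line: an oversized character still goes out whole.
        if current and used + cost > width:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += cost
    if current:
        lines.append(current)
    return lines


two_sharp_s = "\u00df\u00df"  # two sharp s characters, four bytes in UTF-8
print(fold_whole_chars(two_sharp_s, 2, unit="bytes"))  # two lines, one character each
print(fold_whole_chars(two_sharp_s, 2, unit="chars"))  # one line with both characters
```

With a width of 1 and unit="bytes", each character still goes out whole on its own line, which is the "approximate counting" case: the line is longer than the byte budget, but no character is split.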
cheers,
Pádraig.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 07 Jan 2018 12:24:04 GMT)