GNU bug report logs - #29606
Command 'fold' dangerous with utf-8 input
Message #14 received at 29606 <at> debbugs.gnu.org (full text, mbox):
Dear Assaf,
> If you'd like to help us test these patches, please try
> an unofficial development snapshot here:
>
> https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz
I have taken a look and have an unexpected result:
fold (version 8.28.39-79242) reacts to my LANG environment variable,
which is good, but it ignores the --bytes or -b flag, which is surprising.
My test case uses 'echo' to send the German sharp s character, which is a
two-byte character in UTF-8, and a newline to 'fold --width 1'. I then use
'head -1' and 'wc --bytes' to count the bytes in line one.
If LANG selects a UTF-8 locale, this should strip off one character (two
bytes) plus one newline. It does.
If LANG does not select a UTF-8 locale, it should strip off one byte and a
newline. It does.
If 'fold --width 1 --bytes' is used, it should always strip off one byte
and a newline, regardless of environment settings. It doesn't. The
'--bytes' switch has no effect.
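The expected byte counts can be checked independently of the experimental build. The following is a minimal sketch, assuming a POSIX shell and whatever fold is installed on the system (not the snapshot in src/). The variable names and the octal spelling of the sharp s are illustrative choices, not part of the original test.

```shell
# Sharp s (U+00DF) is the two bytes 0xC3 0x9F in UTF-8. Writing it as octal
# escapes keeps the script independent of the encoding of this file.
sharp_s_line() { printf '\303\237\n'; }

# Total size of the test input: two bytes for the character plus a newline.
# tr strips the padding that some wc implementations add.
input_bytes=$(sharp_s_line | wc -c | tr -d ' ')

# In the C locale every byte counts as one character, so folding at width 1
# should leave a single byte plus a newline on the first line.
first_line_bytes=$(sharp_s_line | LC_ALL=C fold -b -w 1 | head -n 1 | wc -c | tr -d ' ')

echo "input: ${input_bytes} bytes, first folded line: ${first_line_bytes} bytes"
```

With a byte-oriented fold this should report 3 bytes of input and 2 bytes on the first folded line, which is the result expected from '--bytes' in any locale.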
Here are the test cases (the new versions of the coreutils programs are in src/):
> export LANG=""
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
2
This is correct: fold splits the line between the two bytes and puts a
newline after each. Counting bytes in the first line gives 2, including
the newline.
> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
3
This is wrong: fold has kept both bytes of the character on line one,
although fold --bytes --width 1 should split after one byte.
> export LANG=""
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
2
This is correct: without a language setting, fold treats each byte as a
character.
> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
3
This is correct: The two-byte character remains on line one.
Have I misunderstood what "fold --bytes" is supposed to mean? Or is this
an error?
All the best,
Mark
This bug report was last modified 7 years and 169 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.