#29606 - Command 'fold' dangerous with utf-8 input

GNU bug report logs - #29606
Command 'fold' dangerous with utf-8 input

Reported by: Mark Roberts <mroberts <at> rapid-arts-movement.de>

Date: Thu, 7 Dec 2017 16:27:02 UTC

Severity: normal

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

Message #14 received at 29606 <at> debbugs.gnu.org (full text, mbox):

From: Mark Roberts <mroberts <at> rapid-arts-movement.de>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: 29606 <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Thu, 7 Dec 2017 18:30:45 +0100 (CET)


Dear Assaf,

> If you'd like to help us test these patches, please try
> an unofficial development snapshot here:
>
> https://files.housegordon.org/src/coreutils-multibyte-experimental-8.28.39-79242.tar.xz

I have taken a look and have an unexpected result: fold (version
8.28.39-79242) reacts to my LANG environment variable, which is good,
but it ignores the --bytes or -b flag, which is surprising.

My test case uses 'echo' to send the German sharp s character (ß), which
is a two-byte character in UTF-8, followed by a newline to
'fold --width 1'. I then use 'head -1' and 'wc --bytes' to count the
bytes in line one.

If UTF-8 is set, this should strip off one character (two bytes) plus
one newline. It does.

If UTF-8 is not set, it should strip off one byte and a newline. It
does.

If 'fold --width 1 --bytes' is used, it should always strip off one byte
and a newline, regardless of environment settings. It doesn't. The
'--bytes' switch has no effect.

Here are the test cases (the new versions of coreutils are in src/):

> export LANG=""
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
2

This is correct: fold splits the line between the two bytes and puts a
newline after each. Counting bytes in the first line gives 2, including
the newline.

> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --bytes --width 1 | src/head -1 | src/wc --bytes
3

This is wrong: fold has kept both bytes of the character on line one,
although fold --bytes --width 1 should split after one byte.

> export LANG=""
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
2

This is correct: without a language setting, fold treats each byte as a
character.

> export LANG="de_DE.UTF-8"
> src/echo ß | src/fold --width 1 | src/head -1 | src/wc --bytes
3

This is correct: the two-byte character remains on line one.

Have I misunderstood what "fold --bytes" is supposed to mean? Or is this
an error?

All the best,
Mark
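The four test cases in the message can be condensed into one small helper for re-running against any fold binary. This is only a sketch and not part of the original report: the function name first_line_bytes is made up for illustration, it calls whatever 'fold' is first on PATH rather than the src/ build, it uses the short options -b and -w, and it sets LC_ALL (which overrides LANG) so the result does not depend on the caller's environment. The ß is written as octal escapes so the script does not depend on the encoding of the file it is saved in.

```shell
#!/bin/sh
# first_line_bytes LOCALE [fold options...]
# Prints the number of bytes (newline included) in the first line that
# fold produces for the input "ß\n". Hypothetical helper for illustration.
first_line_bytes() {
    loc=$1
    shift
    # \303\237 is the UTF-8 encoding of ß (bytes 0xC3 0x9F).
    printf '\303\237\n' \
        | LC_ALL="$loc" fold "$@" \
        | head -n 1 \
        | wc -c \
        | tr -d ' '   # some wc implementations pad the count with spaces
}

# Byte mode: the report expects 2 in both locales (one byte plus newline).
echo "C,           -b -w 1: $(first_line_bytes C -b -w 1)"
echo "de_DE.UTF-8, -b -w 1: $(first_line_bytes de_DE.UTF-8 -b -w 1)"

# Character mode: the result depends on the locale and on whether this
# fold is multibyte-aware, so no expected value is asserted here.
echo "C,           -w 1:    $(first_line_bytes C -w 1)"
echo "de_DE.UTF-8, -w 1:    $(first_line_bytes de_DE.UTF-8 -w 1)"
```

If the de_DE.UTF-8 locale is not installed, fold will typically fall back to the C locale, so 'locale -a' is worth checking before reading anything into the character-mode lines.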

This bug report was last modified 7 years and 169 days ago.

