GNU bug report logs - #29606
Command 'fold' dangerous with utf-8 input


Package: coreutils

Reported by: Mark Roberts <mroberts <at> rapid-arts-movement.de>

Date: Thu, 7 Dec 2017 16:27:02 UTC

Severity: normal

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.



Message #22 received at 29606-done <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Mark Roberts <mroberts <at> rapid-arts-movement.de>
Cc: 29606-done <at> debbugs.gnu.org
Subject: Re: bug#29606: Command 'fold' dangerous with utf-8 input
Date: Fri, 8 Dec 2017 20:15:12 -0700

Hello Mark,

First,
thank you for taking the time and effort
to test our development snapshot and to report the results back.
This kind of feedback is critical in getting multibyte support ready.


Second,
I can confirm the behavior you are observing. It is reproduced here
with 'od', which makes the individual output bytes easier to see:

## POSIX single-byte locale:

$ echo "ß" | LC_ALL=C src/fold --bytes --width 1 | od -tc -An
 303  \n 237  \n
$ echo "ß" | LC_ALL=C src/fold         --width 1 | od -tc -An
 303  \n 237  \n

## UTF-8 locale:

$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold --bytes --width 1 | od -tc -An
 303 237  \n

$ echo "ß" | LC_ALL=en_CA.UTF-8 src/fold         --width 1 | od -tc -An
 303 237  \n
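
For readers unfamiliar with the 'od' output: the octal values 303 and 237
are the two bytes of the UTF-8 encoding of "ß" (U+00DF). A small Python
check, given only as an illustration (the helper name od_octal is made up
for this sketch and is not part of any tool):

```python
def od_octal(text, encoding="utf-8"):
    """Return each encoded byte as a 3-digit octal string,
    the way 'od -tc' prints bytes that are not printable."""
    return [format(byte, "03o") for byte in text.encode(encoding)]


print(od_octal("ß"))  # prints ['303', '237']
```

So a fold that treats input as single bytes puts a newline between the two
halves of one character, while a multibyte-aware fold keeps them together.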


On 2017-12-08 05:04 AM, Mark Roberts wrote:
> When --bytes is not specified, the program treats '\b', '\r' and '\t' 
> specially. It assumes a tab width of eight (compile-time #define) and 
> attempts to keep track of what the output will look like.
> 
> This is absolutely not what I expected.

That is correct, and I share your sentiment: it also took me some time
to try and track down why it behaves this way, and whether it's by 
design or a bug.
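
The column tracking described above can be sketched roughly as follows.
This is a simplified Python illustration, not the coreutils source: the
names TAB_WIDTH, advance_column and display_column are chosen for this
example, and every other character is assumed to occupy exactly one column.

```python
TAB_WIDTH = 8  # assumed fixed tab stop, as mentioned in the quoted text


def advance_column(column, ch):
    """Return the output column after writing ch at the given column."""
    if ch == "\b":
        # Backspace moves one column left, but never before column 0.
        return max(column - 1, 0)
    if ch == "\r":
        # Carriage return goes back to the start of the line.
        return 0
    if ch == "\t":
        # Tab advances to the next multiple of TAB_WIDTH.
        return column + TAB_WIDTH - column % TAB_WIDTH
    # Assumption: any other character takes one column.
    return column + 1


def display_column(text):
    """Return the column reached after writing all of text from column 0."""
    column = 0
    for ch in text:
        column = advance_column(column, ch)
    return column


print(display_column("ab\tc"))  # prints 9: a, b, tab to column 8, then c
```

With --bytes, none of these special cases would apply and each byte would
simply count as one position.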

> But of course, when the program 
> was first written, the words byte and character meant the same thing for 
> printable characters. Printable bytes.

The reasoning for this behavior is explained in the Open Group's POSIX 
standard page for fold, in the "RATIONALE" section:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/fold.html#tag_20_48_18

There, it is made clear:
  "Historical versions of the fold utility assumed 1 byte was one
  character and occupied one column position when written out. This is
  no longer always true.
  [....]
  Note that although the width for the -b option is in bytes, a line is
  never split in the middle of a character."

Therefore, the current implementation (of the development version) is 
correct.
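
To illustrate the rule "the width is in bytes, but a line is never split in
the middle of a character", here is a minimal Python sketch. It is not the
coreutils implementation: the function name fold_bytes is hypothetical, and
the "latin-1" encoding is used only as a stand-in for a single-byte locale
in which every byte is its own character.

```python
def fold_bytes(line, width, encoding="utf-8"):
    """Split line (bytes) into chunks of at most width bytes without
    cutting a character in two. A character longer than width is
    emitted whole in a chunk of its own."""
    chunks = []
    current = b""
    for ch in line.decode(encoding):
        encoded = ch.encode(encoding)
        # Start a new chunk if this character would overflow the width.
        if current and len(current) + len(encoded) > width:
            chunks.append(current)
            current = b""
        current += encoded
    if current:
        chunks.append(current)
    return chunks


data = "ß".encode("utf-8")
print(fold_bytes(data, 1))             # prints [b'\xc3\x9f']
print(fold_bytes(data, 1, "latin-1"))  # prints [b'\xc3', b'\x9f']
```

The first call corresponds to the UTF-8 locale output shown earlier (both
bytes stay on one line), the second to the C locale output (one byte per
line).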

> I will attempt to suggest an improved text for the man-page so that 
> others will not be surprised.

I agree that once multibyte support is added to fold(1), the man page,
the help screen and the texi manual must be updated to clearly
indicate that "-b/--bytes" only affects the handling of \b, \t and \r,
and never causes a multibyte character to be split.

If you find the time to send such a patch - great!
If not, I will add it sooner or later (hopefully sooner).

As such I'm closing this bug report, but further discussion (and
patches) are welcome; just reply to this thread.

regards,
 - assaf






This bug report was last modified 7 years and 169 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.