GNU bug report logs - #54124
fmt inserts garbage in certain cases?

Package: coreutils;

Reported by: "JD" <john1doe <at> ya.ru>

Date: Wed, 23 Feb 2022 11:28:01 UTC

Severity: normal

To reply to this bug, email your comments to 54124 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#54124; Package coreutils. (Wed, 23 Feb 2022 11:28:01 GMT)

Acknowledgement sent to "JD" <john1doe <at> ya.ru>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 23 Feb 2022 11:28:01 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: "JD" <john1doe <at> ya.ru>
To: <bug-coreutils <at> gnu.org>
Subject: fmt inserts garbage in certain cases?
Date: Wed, 23 Feb 2022 12:58:54 +0200
Hi!

I have fmt from coreutils 8.32.1 installed via MacPorts.

If I run the following command: `echo х х х х х х х х х х х х х х х х х х х х х х х х х х | gfmt -sw 10` (which is just echoing 26 Cyrillic 'х' ('kha') letters), I get the following results:

https://i.imgur.com/yRx7uuz.png (iTerm2) 
https://i.imgur.com/7oQ0UPz.png (iTerm2 if passed via `more`) 
https://i.imgur.com/UlLrEMy.png (Alacritty)

And if I delete just two 'х' letters, like this: `echo х х х х х х х х х х х х х х х х х х х х х х х х | gfmt -sw 10`, everything shows just fine: https://i.imgur.com/DwuWxyx.png

Would be grateful for any advice :)



-- 
JD




Information forwarded to bug-coreutils <at> gnu.org:
bug#54124; Package coreutils. (Wed, 23 Feb 2022 17:57:02 GMT)

Message #8 received at 54124 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: JD <john1doe <at> ya.ru>, 54124 <at> debbugs.gnu.org
Subject: Re: bug#54124: fmt inserts garbage in certain cases?
Date: Wed, 23 Feb 2022 17:55:49 +0000
On 23/02/2022 10:58, JD wrote:
> Hi!
> 
> I have fmt from coreutils 8.32.1 installed via MacPorts.
> 
> If I run the following command: `echo х х х х х х х х х х х х х х х х х х х х х х х х х х | gfmt -sw 10` (which is just echoing 26 Cyrillic 'х' ('kha') letters), I get the following results:
> 
> https://i.imgur.com/yRx7uuz.png (iTerm2)
> https://i.imgur.com/7oQ0UPz.png (iTerm2 if passed via `more`)
> https://i.imgur.com/UlLrEMy.png (Alacritty)
> 
> And if I delete just two 'х' letters, like this: `echo х х х х х х х х х х х х х х х х х х х х х х х х | gfmt -sw 10`, everything shows just fine: https://i.imgur.com/DwuWxyx.png
> 
> Would be grateful for any advice :)

The issue here is that (on macOS 10.15.7 at least),
isspace(0x85) returns true for UTF-8 locales
(but not for "C" or "iso8859-1" locales).
BTW iscntrl() returns true for 0x85 in all non-C locales
on both Linux and macOS.

Now gnulib says wrt isspace() that:

"This function's behaviour depends on the locale, but does not support
the multibyte characters that occur in strings in locales with
@code{MB_CUR_MAX > 1} (this includes all the common UTF-8 locales)."

I think isspace(0x85) returning true on macOS is a bug,
but we should probably avoid isspace() in fmt altogether
given its inconsistency with multibyte locales.
The attached uses c_isspace() instead.
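
A minimal sketch of the distinction, not the attached patch: ascii_isspace() below is a hypothetical stand-in for gnulib's c_isspace(), which accepts only the six ASCII whitespace characters regardless of locale, and libc_isspace_0x85() is an illustrative helper that asks the C library about byte 0x85 under a named locale. Any locale name other than "C" (for example "en_US.UTF-8") is an assumption and may not exist on a given system.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for gnulib's c_isspace(): true only for the six
   ASCII whitespace characters, whatever the current locale is.  */
bool
ascii_isspace (int c)
{
  return c == ' ' || c == '\t' || c == '\n'
         || c == '\v' || c == '\f' || c == '\r';
}

/* Illustrative helper: return 1 if the C library's isspace() accepts
   byte 0x85 in LOCALE_NAME, 0 if it does not, and -1 if the locale
   cannot be selected.  The previous locale is not restored.  */
int
libc_isspace_0x85 (const char *locale_name)
{
  /* A NULL name would only query the current locale, not select one.  */
  assert (locale_name != NULL);
  if (setlocale (LC_ALL, locale_name) == NULL)
    return -1;
  return isspace (0x85) != 0;
}
```

libc_isspace_0x85 ("C") returns 0 on any conforming system, while a UTF-8 locale reportedly yields 1 on macOS 10.15.7. ascii_isspace (0x85) is false in every locale.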

cheers,
Pádraig
[fmt-utf8-macOS.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#54124; Package coreutils. (Thu, 24 Feb 2022 01:31:01 GMT)

Message #11 received at 54124 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: JD <john1doe <at> ya.ru>, 54124 <at> debbugs.gnu.org
Subject: Re: bug#54124: fmt inserts garbage in certain cases?
Date: Thu, 24 Feb 2022 01:29:56 +0000
On 23/02/2022 17:55, Pádraig Brady wrote:

> I think isspace(0x85) returning true on macOS is a bug,

Bug is a bit of a strong word here.

A digression into why 0x85 is being treated specially here.
Note Cyrillic kha "х" is encoded in UTF-8 as:
 $ printf '\u0445' | od -tx1
 0000000 d1 85

What I think is happening is that \u0085 represents "Next Line" in Unicode.
This is present in Unicode to support mapping to/from the corresponding char in EBCDIC,
which had a distinct char for this in addition to CR and LF.
Given isspace('\n') returns true, it makes some sense that isspace("Next Line")
would return true, and I guess through implementation details
isspace(int) is operating on UTF-32 on macOS in UTF-8 locales
and so returns true for this value.
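
To make the byte-level effect concrete, here is a small sketch, assuming a splitter that examines one byte at a time (an assumption about how the breakage arises, not code taken from fmt). quirky_space() is a hypothetical model of the reported macOS behaviour, ascii_space() models the c_isspace() set, and first_word_len() and two_kha are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*space_pred) (unsigned char);

/* ASCII whitespace only: space, plus the contiguous range \t \n \v \f \r
   (contiguous in ASCII, which this sketch assumes).  */
bool
ascii_space (unsigned char c)
{
  return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Hypothetical model of the reported behaviour: byte 0x85 also counts.  */
bool
quirky_space (unsigned char c)
{
  return ascii_space (c) || c == 0x85;
}

/* Length in bytes of the first word in S[0..LEN), after skipping any
   leading separators, where IS_SPACE decides what a separator is.  */
size_t
first_word_len (const unsigned char *s, size_t len, space_pred is_space)
{
  assert (s != NULL && is_space != NULL);
  size_t i = 0;
  while (i < len && is_space (s[i]))
    i++;
  size_t start = i;
  while (i < len && !is_space (s[i]))
    i++;
  return i - start;
}

/* "х х" encoded in UTF-8: d1 85, space, d1 85.  */
const unsigned char two_kha[] = { 0xd1, 0x85, 0x20, 0xd1, 0x85 };
```

With ascii_space the first word is the full two-byte sequence d1 85. With quirky_space it is the lone lead byte d1, which is not valid UTF-8 on its own and would plausibly be displayed as garbage.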

BTW 0xA0 is the only other value that isspace() returns true for
(other than the standard c_isspace() values of course).
This is non breaking space, so it's best we don't split on it anyway.
I.e. this is another benefit to the change.
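
The claim about 0xA0 can be checked with a scan like the following sketch. extra_space_bytes() and HIGH_BYTES are illustrative names. The result depends on the platform and on the locale selected with setlocale() beforehand, so no output is shown.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define HIGH_BYTES 128

/* Store in OUT every byte value in 0x80..0xFF that isspace() accepts in
   the current locale, and return how many were found.  All values in
   that range are representable as unsigned char, so the calls are
   within isspace()'s domain.  */
size_t
extra_space_bytes (int out[HIGH_BYTES])
{
  size_t n = 0;
  for (int c = 0x80; c <= 0xFF; c++)
    if (isspace (c))
      {
        assert (n < HIGH_BYTES);
        out[n++] = c;
      }
  return n;
}
```

In the "C" locale the function returns 0. Going by the message above, a macOS UTF-8 locale would be expected to report 0x85 and 0xA0.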

I still think using c_isspace() to avoid this issue is best,
and intend to push the change tomorrow.

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#54124; Package coreutils. (Thu, 24 Feb 2022 03:07:01 GMT)

Message #14 received at 54124 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Pádraig Brady <P <at> draigBrady.com>, JD <john1doe <at> ya.ru>,
 54124 <at> debbugs.gnu.org
Subject: Re: bug#54124: fmt inserts garbage in certain cases?
Date: Wed, 23 Feb 2022 19:06:29 -0800
On 2/23/22 17:29, Pádraig Brady wrote:
> Given isspace('\n') returns true, then it makes some sense that 
> isspace("Next Line")
> would return true,

POSIX says that the application must ensure that the argument to isspace is 
either EOF or "a character representable as an unsigned char", and 
arguably since 0x85 is neither of those things the behavior of 
isspace(0x85) is undefined.

However, the C standard does not have this wording, and since POSIX is 
supposed to defer to the C standard here, this appears to be a bug in 
POSIX (as well as a bug in macOS). It's understandable if the Apple C 
library's developers got confused by the POSIX wording.
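
Whatever the resolution of the wording question, a common defensive idiom is to convert a plain char to unsigned char before calling isspace(), since char may be signed and a negative value other than EOF is outside the function's domain. A sketch, with ctype_arg() and char_isspace() as illustrative names:

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>

/* Convert a plain char to a value that is valid as an argument to the
   <ctype.h> functions, i.e. one in the range 0..UCHAR_MAX.  */
int
ctype_arg (char ch)
{
  int v = (unsigned char) ch;
  /* Documents the range the <ctype.h> functions require.  */
  assert (0 <= v && v <= UCHAR_MAX);
  return v;
}

/* Locale-dependent whitespace test that is safe for any plain char.  */
bool
char_isspace (char ch)
{
  return isspace (ctype_arg (ch)) != 0;
}
```

This keeps the argument in range but does not by itself change how a given locale classifies byte 0x85.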




This bug report was last modified 3 years and 119 days ago.
