GNU bug report logs - #32236
df header corrupted with LANG=zh_TW.UTF-8 on macOS

Package: coreutils;

Reported by: Chih-Hsuan Yen <yan12125 <at> gmail.com>

Date: Sat, 21 Jul 2018 16:10:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

From: Bruno Haible <bruno <at> clisp.org>
To: bug-gnulib <at> gnu.org
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, Pádraig Brady <P <at> draigbrady.com>, 32236 <at> debbugs.gnu.org
Subject: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 00:43:42 +0200

Hi Pádraig,

> I've attached a gnulib patch to document for iscntrl at least.

> +This function does not support arguments outside of the range of the
> +unsigned char type in locales with large character sets, on some platforms.
> +OS X 10.5 will return non zero for characters >= 0x80 in UTF-8 locales.

In UTF-8 locales, arguments >= 0x80 are invalid arguments for iscntrl().

POSIX [1] says
  "The c argument is a type int, the value of which the application shall
   ensure is a character representable as an unsigned char or equal to the
   value of the macro EOF. If the argument has any other value, the behavior
   is undefined."

The term "character" is defined here [2]:
  "A sequence of one or more bytes representing a single graphic symbol or
   control code."

So, in a UTF-8 locale, a "character representable as an unsigned char"
is a byte sequence of length 1, where the single byte has a value in the
range 0x00..0x7F.

For invalid values "the behavior is undefined." You were expecting a value 0.
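
To illustrate the point about valid arguments, here is a minimal sketch of a
guard that only hands single-byte characters to iscntrl(). The helper name
ascii_byte_iscntrl is hypothetical (it is not part of gnulib or coreutils),
and treating bytes >= 0x80 as "not a control character" is an assumption made
for this example, not something POSIX prescribes.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper, assuming a UTF-8 locale: only bytes 0x00..0x7F are
   complete characters there, so only those are passed to iscntrl().
   Bytes >= 0x80 are parts of multibyte sequences; this sketch reports them
   as "not a control character" instead of invoking undefined behavior.  */
static bool
ascii_byte_iscntrl (unsigned char b)
{
  if (b > 0x7F)
    return false;
  return iscntrl (b) != 0;
}
```

With this guard, a lead byte such as 0xE6 never reaches iscntrl(), so the
platform-specific result for it no longer matters.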

Now, in the gnulib documentation, what we mention as portability problems
are the cases where
  - the behaviour for valid arguments is different on different platforms, or
  - the boundary between valid and invalid arguments is fuzzy and depends on
    the platform.
IMO there's no point in documenting that a function _really_ has undefined
behaviour when POSIX says that it has undefined behaviour.

> I've also attached an alternative patch for df (in your name).

This patch is correct (because the characters that you test for in c_iscntrl
are 0x00..0x1F, 0x7F, which don't occur as the second or a later byte of a
multibyte character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS
encodings).
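
A rough sketch of the byte-wise approach described above, for illustration
only. The function names ascii_iscntrl and hide_ascii_controls are
hypothetical stand-ins (the first mimics what c_iscntrl tests for), and the
'?' replacement character is an assumption of this example.

```c
#include <stdbool.h>

/* Stand-in for a locale-independent control test: true only for the
   ASCII control codes 0x00..0x1F and 0x7F.  */
static bool
ascii_iscntrl (int c)
{
  return (0x00 <= c && c <= 0x1F) || c == 0x7F;
}

/* Replace every ASCII control byte in CELL with '?', in place.
   Bytes >= 0x80 are left untouched, so multibyte characters survive.  */
static char *
hide_ascii_controls (char *cell)
{
  for (char *p = cell; *p != '\0'; p++)
    if (ascii_iscntrl ((unsigned char) *p))
      *p = '?';
  return cell;
}
```

Because the tested byte values never appear as trailing bytes in the encodings
listed, scanning byte by byte cannot damage a multibyte character.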

But it does not catch control characters outside of the ASCII range. It would
make sense to catch these as well. If you want to do that,
'hide_problematic_chars' needs to be rewritten as a loop that iterates across
the multibyte characters, for example with the 'mbiter' module in
combination with the mb_iscntrl function from the 'mbchar' module, or
directly with mbrtowc() and iswcntrl().
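
A minimal sketch of the mbrtowc() and iswcntrl() variant, under stated
assumptions: the name hide_control_chars_mb is hypothetical, the caller is
assumed to have called setlocale (LC_ALL, "") beforehand, and replacing each
control character or undecodable byte with a single '?' is a choice made for
this example, not necessarily what df should do.

```c
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Walk CELL one multibyte character at a time and rewrite it in place.
   Control characters (ASCII or not) become a single '?'; bytes that do
   not decode in the current locale also become '?' (an assumption).
   The result is never longer than the input.  */
static char *
hide_control_chars_mb (char *cell)
{
  size_t left = strlen (cell);
  char *p = cell;   /* read position */
  char *q = cell;   /* write position, always <= p */
  mbstate_t state;
  memset (&state, 0, sizeof state);

  while (left > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, p, left, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: consume one byte and
             restart decoding from the initial shift state.  */
          memset (&state, 0, sizeof state);
          *q++ = '?';
          n = 1;
        }
      else
        {
          if (n == 0)
            n = 1;  /* defensive; NUL is excluded by strlen above */
          if (iswcntrl ((wint_t) wc))
            *q++ = '?';
          else
            {
              memmove (q, p, n);
              q += n;
            }
        }
      p += n;
      left -= n;
    }
  *q = '\0';
  return cell;
}
```

In a UTF-8 locale this would, for instance, also catch U+0085 (encoded as
0xC2 0x85), provided the platform's iswcntrl() classifies it as a control
character.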

Bruno

[1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/iscntrl.html
[2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87




This bug report was last modified 6 years and 160 days ago.
