GNU bug report logs - #32236
df header corrupted with LANG=zh_TW.UTF-8 on macOS

Previous Next

Package: coreutils;

Reported by: Chih-Hsuan Yen <yan12125 <at> gmail.com>

Date: Sat, 21 Jul 2018 16:10:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.

Full log


Message #41 received at 32236 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: Chih-Hsuan Yen <yan12125 <at> gmail.com>, bug-gnulib <at> gnu.org,
 32236 <at> debbugs.gnu.org
Subject: Re: bug#32236: df header corrupted with LANG=zh_TW.UTF-8 on macOS
Date: Sun, 22 Jul 2018 23:40:35 +0200
Pádraig Brady wrote:
> > This patch is correct (because the characters that you test for in c_iscntrl
> > are 0x00..0x1F, 0x7F, which don't occur as second or later byte in a multibyte
> > character in the EUC-JP, EUC-KR, GB2312, EUC-TW, GB18030, SJIS encodings).
> 
> ... It might be worth mentioning this subtle point in the c_iscntrl() docs?
> "Note this identifies all single byte control chars even in multibyte encodings".

Only in the multibyte encodings that are currently in use. We never know what
kinds of features or misfeatures new multibyte encodings will come up with:
Before GB18030 was introduced, it was a common feature of all multibyte encodings
(including SJIS) that ASCII characters in the range 0x00..0x3F never occur as
second or later byte in a multibyte character. Well, GB18030 broke this assumption.

So, it is dangerous to rely on this property. Therefore I wouldn't like to
document it in the c_iscntrl() documentation.

Bruno





This bug report was last modified 6 years and 160 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.