GNU bug report logs - #32267
multibyte: dd: add lcase/ucase multibyte support

Reported by: Ralph Corderoy <ralph <at> inputplus.co.uk>

Date: Wed, 25 Jul 2018 08:12:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 32267 AT debbugs.gnu.org.

Report forwarded to bug-coreutils <at> gnu.org:
bug#32267; Package coreutils. (Wed, 25 Jul 2018 08:12:02 GMT)

Acknowledgement sent to Ralph Corderoy <ralph <at> inputplus.co.uk>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 25 Jul 2018 08:12:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Ralph Corderoy <ralph <at> inputplus.co.uk>
To: bug-coreutils <at> gnu.org
Subject: dd's ucase and lcase and LC_CTYPE.
Date: Wed, 25 Jul 2018 09:11:11 +0100

Hi,

Of dd(1), POSIX says

    http://pubs.opengroup.org/onlinepubs/9699919799/utilities/dd.html
    lcase
        Map uppercase characters specified by the LC_CTYPE keyword
        tolower to the corresponding lowercase character.  Characters
        for which no mapping is specified shall not be modified by this
        conversion. 

and similarly for `ucase'.

But dd in coreutils 8.29-1 on Arch Linux just has a simple 256-byte
translation table that's mapped through tolower(3) or toupper(3).

http://pubs.opengroup.org/onlinepubs/9699919799/functions/tolower.html
describes tolower(3) as handling only `unsigned char' or EOF, and being
the identity function on all values where there isn't a lowercase letter
for the uppercase value.
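
The per-byte behaviour described above can be observed directly. This is a minimal sketch, assuming a dd on PATH that supports `conv=lcase` (GNU coreutils does) and forcing the C locale with `LC_ALL=C` so that only A-Z have a lowercase mapping. The helper name `hex` is hypothetical and exists only for this illustration.

```shell
# Hypothetical helper: print stdin as a run of lowercase hex digits.
hex() { od -An -tx1 | tr -d ' \n'; }

# Input bytes: 'A' 'b' '1' 0xC9.
# 0xC9 is E-acute in ISO-8859-1, but has no case mapping in the C locale,
# so tolower(3) acts as the identity on it.
lcase_c=$(printf 'Ab1\311' | LC_ALL=C dd conv=lcase 2>/dev/null | hex)
echo "$lcase_c"
```

Under these assumptions only the first byte changes (0x41 to 0x61); the remaining three bytes pass through untouched.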

This deviation isn't documented AFAICS.  It means ASCII and ISO-8859-1
are re-cased just fine.  UTF-8 has its ASCII subset altered, and other
bytes left alone, so the end result is valid UTF-8, but not fully
re-cased.  But charmaps like /usr/share/i18n/charmaps/CP949.gz,
https://en.wikipedia.org/wiki/Unified_Hangul_Code, have variable-length
byte sequences where 0x41, for example, isn't always an ASCII `A' and
thus shouldn't become 0x61, `a'.
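
A small sketch of both cases at the byte level, again assuming a dd that supports `conv=lcase`. `LC_ALL=C` is used only to make the run reproducible; it is an assumption that the per-byte effect on these inputs matches what dd does under a UTF-8 or CP949 locale. The helper `hex` is hypothetical, and 0x81 0x41 is assumed here to be a valid two-byte CP949 sequence (lead byte, then trail byte).

```shell
# Hypothetical helper: print stdin as a run of lowercase hex digits.
hex() { od -An -tx1 | tr -d ' \n'; }

# UTF-8: ASCII 'A' followed by U+2C7E (bytes 0xE2 0xB1 0xBE).
# Only the ASCII byte is lowered; the multibyte character is left as-is,
# so the result is still valid UTF-8 but not fully re-cased.
utf8_out=$(printf 'A\342\261\276' | LC_ALL=C dd conv=lcase 2>/dev/null | hex)

# CP949: lead byte 0x81 followed by trail byte 0x41 (one character).
# The trail byte is treated as ASCII 'A' and rewritten to 0x61,
# which turns the pair into a different character.
cp949_out=$(printf '\201\101' | LC_ALL=C dd conv=lcase 2>/dev/null | hex)

echo "utf8:  $utf8_out"
echo "cp949: $cp949_out"
```

The second result is the problematic one: a byte that is not an `A` in context is still mapped as if it were.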

Aside from improving the documentation, actually fixing dd to match
POSIX will need to handle the re-cased character being a different
number of bytes; particularly noticeable if the output file is the input
file with `conv=notrunc'.

    $ locale | grep LC_CTYPE
    LC_CTYPE="en_GB.utf8"
    $
    $ sed 'l; s/./\u&/; l' <<<ȿ
    \310\277$
    \342\261\276$
    Ȿ
    $ sed 'l; s/./\l&/; l' <<<Ȿ
    \342\261\276$
    \310\277$
    ȿ
    $
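
The size change in the sed output above can be checked without any locale support, by counting the bytes of each encoding. This sketch only reuses the octal escapes that `l` printed; the variable names are hypothetical.

```shell
# Byte sequences copied from the sed `l` output:
#   \310\277      lowercase form (U+023F)
#   \342\261\276  uppercase form (U+2C7E)
lower_len=$(printf '\310\277' | wc -c | tr -d ' ')
upper_len=$(printf '\342\261\276' | wc -c | tr -d ' ')

echo "lowercase: $lower_len bytes, uppercase: $upper_len bytes"
```

Upper-casing this character therefore grows the data by one byte, and lower-casing shrinks it by one, which is why an in-place conversion with `conv=notrunc` could not simply overwrite each block at its original offset.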

-- 
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy

Information forwarded to bug-coreutils <at> gnu.org:
bug#32267; Package coreutils. (Thu, 26 Jul 2018 09:22:02 GMT)

Message #8 received at 32267 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Ralph Corderoy <ralph <at> inputplus.co.uk>, 32267 <at> debbugs.gnu.org
Subject: Re: bug#32267: dd's ucase and lcase and LC_CTYPE.
Date: Thu, 26 Jul 2018 02:21:35 -0700

Yes, this is a known issue with dd as with many other coreutils programs. 
Strictly speaking as I understand it, it is not a deviation from POSIX, since 
POSIX does not require support for locales with multibyte encodings. Still, it 
would be nice to fix dd at some point, although it'd be a pain to do correctly 
and efficiently and it's long been low priority since hardly anybody needs or 
uses this feature on any platform.

Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:45:01 GMT)

Changed bug title to 'multibyte: dd: add lcase/ucase multibyte support' from 'dd's ucase and lcase and LC_CTYPE.'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:45:02 GMT)

This bug report was last modified 6 years and 285 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
