GNU bug report logs - #32267: multibyte: dd: add lcase/ucase multibyte support
To reply to this bug, email your comments to 32267 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org: bug#32267; Package coreutils. (Wed, 25 Jul 2018 08:12:02 GMT)
Acknowledgement sent to Ralph Corderoy <ralph <at> inputplus.co.uk>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 25 Jul 2018 08:12:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
Of dd(1), POSIX says
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/dd.html

    lcase
        Map uppercase characters specified by the LC_CTYPE keyword
        tolower to the corresponding lowercase character. Characters
        for which no mapping is specified shall not be modified by this
        conversion.

and similarly for `ucase'.
But dd in coreutils 8.29-1 on Arch Linux just has a simple 256-byte
translation table that's mapped through tolower(3) or toupper(3).
http://pubs.opengroup.org/onlinepubs/9699919799/functions/tolower.html
describes tolower(3) as handling only `unsigned char' or EOF, and being
the identity function on all values where there isn't a lowercase letter
for the uppercase value.
This deviation isn't documented AFAICS. It means ASCII and ISO-8859-1
are re-cased just fine. UTF-8 has its ASCII subset altered, and other
bytes left alone, so the end result is valid UTF-8, but not fully
re-cased. But charmaps like /usr/share/i18n/charmaps/CP949.gz,
https://en.wikipedia.org/wiki/Unified_Hangul_Code, have variable-length
byte sequences where 0x41, for example, isn't always an ASCII `A' and
thus shouldn't become 0x61, `a'.
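
The three cases above can be reproduced with a minimal Python sketch. It is a model of a byte-wise table, not dd's actual code: the names `LCASE_TABLE` and `bytewise_lcase` are hypothetical, and the table assumes that only ASCII A-Z have a lowercase mapping, as in the C locale.

```python
# Hypothetical model of a byte-wise conv=lcase; not taken from dd's source.
# Assumption: only ASCII A-Z have a lowercase mapping, as in the C locale.
LCASE_TABLE = bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in range(256))


def bytewise_lcase(data: bytes) -> bytes:
    """Map every byte independently through the 256-entry table."""
    return data.translate(LCASE_TABLE)


# ASCII is re-cased as expected.
ascii_out = bytewise_lcase(b"HELLO")

# UTF-8: only the ASCII subset changes. The two bytes encoding U+00C9 are
# left alone, so the result is valid UTF-8 but not fully lowercased.
utf8_out = bytewise_lcase("CAF\u00c9".encode("utf-8")).decode("utf-8")

# CP949: 0x81 0x41 is one two-byte character whose trail byte equals
# ASCII 'A'. The table rewrites it to 0x81 0x61, a different character.
cp949_in = b"\x81\x41"
cp949_out = bytewise_lcase(cp949_in)
before = cp949_in.decode("cp949")
after = cp949_out.decode("cp949", errors="replace")

print(ascii_out, utf8_out, before, after)
```

The CP949 case is the one that corrupts data: the input held a single Hangul syllable and no Latin letter at all, yet the output is a different syllable.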
Aside from improving the documentation, actually fixing dd to match
POSIX will need to handle the re-cased character being a different
number of bytes; particularly noticeable if the output file is the input
file with `conv=notrunc'.
    $ locale | grep LC_CTYPE
    LC_CTYPE="en_GB.utf8"
    $
    $ sed 'l; s/./\u&/; l' <<<ȿ
    \310\277$
    \342\261\276$
    Ȿ
    $ sed 'l; s/./\l&/; l' <<<Ȿ
    \342\261\276$
    \310\277$
    ȿ
    $
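
The change in length that the sed transcript shows can also be checked with a short Python sketch. It assumes that Python's built-in Unicode case mapping agrees with the locale's mapping for this character pair; the variable names are illustrative only.

```python
# Assumption: Python's built-in Unicode case mapping stands in for the
# locale's toupper mapping that sed applied in the transcript.
lower = "\u023f"       # LATIN SMALL LETTER S WITH SWASH TAIL
upper = lower.upper()  # LATIN CAPITAL LETTER S WITH SWASH TAIL

lower_bytes = lower.encode("utf-8")  # octal \310\277
upper_bytes = upper.encode("utf-8")  # octal \342\261\276

# A character-aware conv=ucase would emit more bytes than it read here.
growth = len(upper_bytes) - len(lower_bytes)

print(len(lower_bytes), len(upper_bytes), growth)
```

With `conv=notrunc` on a file that is both input and output, that extra byte would land on data not yet read, which is why the length change matters.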
--
Cheers, Ralph.
https://plus.google.com/+RalphCorderoy
Information forwarded to bug-coreutils <at> gnu.org: bug#32267; Package coreutils. (Thu, 26 Jul 2018 09:22:02 GMT)
Message #8 received at 32267 <at> debbugs.gnu.org:
Yes, this is a known issue with dd, as with many other coreutils programs.
Strictly speaking, as I understand it, it is not a deviation from POSIX, since
POSIX does not require support for locales with multibyte encodings. Still, it
would be nice to fix dd at some point, although it would be a pain to do
correctly and efficiently, and it has long been low priority since hardly
anybody needs or uses this feature on any platform.
Severity set to 'wishlist' from 'normal'. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:45:01 GMT)
Changed bug title to 'multibyte: dd: add lcase/ucase multibyte support' from 'dd's ucase and lcase and LC_CTYPE.' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 30 Oct 2018 03:45:02 GMT)
This bug report was last modified 6 years and 233 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.