GNU bug report logs - #17196
multibyte: printf: %s counts bytes instead of characters
[adding the Austin Group]
On 04/07/2014 07:08 AM, Pádraig Brady wrote:
> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>> Pádraig Brady wrote:
>>> Yes printf follows the C standard which only considers bytes.
>>> ...
>>> I don't think we'd be able to change the current operation of printf
>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>> the existing multibyte character aware alignment/truncation code in:
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>
>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>> that ksh uses the L modifier.
>>
>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>
>> Dan Douglas wrote:
>> > ksh93 already has this feature using the "L" modifier:
>> >
>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>> > ★★★
>>
>> At least there is prior art for it.
>
> So we can count bytes, chars or cells (graphemes).
>
> Thinking a bit more about it, I think shell level printf
> should be dealing in text of the current encoding and counting cells.
> In the edge case where you want to deal in bytes one can do:
> LC_ALL=C printf ...
>
> I see that ksh behaves as I would expect and counts cells,
> though requires the explicit %L enabler:
> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
> á★★
> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
> A★
> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
> A
>
> zsh seems to just count characters:
> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
> á★
> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
> á★
> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
> A★★
>
> I see that dash gives invalid directive for any of %ls %Ls %S.
>
> Pity there is no consensus here.
> Personally I would go for:
> printf '%3s' 'blah' # count cells
> printf '%3Ls' 'blah' # count chars
> LANG=C printf '%3Ls' 'blah' # count bytes
> LANG=C printf '%3s' 'blah' # count bytes
Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
and currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.
Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.
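
To make the byte-counting problem concrete, here is a minimal sketch of
what a byte-based precision does to a multibyte string. It assumes a
POSIX sh with printf, od and tr available; the helper name
truncate_bytes_hex is made up for this illustration and is not part of
any shell or of coreutils. The C locale is forced so that the precision
is counted in bytes no matter which printf implementation runs it.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): truncate a string to N bytes,
# as %.Ns does in the C locale, and print the surviving bytes in hex so
# that a split multibyte character becomes visible.
truncate_bytes_hex() {
    # $1 = byte limit, $2 = string to truncate
    LC_ALL=C printf "%.${1}s" "$2" | od -An -tx1 | tr -d ' \n'
}

# U+2605 BLACK STAR is three bytes in UTF-8: e2 98 85.
star=$(printf '\342\230\205')

truncate_bytes_hex 3 "$star$star"; echo   # e29885   (one complete star)
truncate_bytes_hex 4 "$star$star"; echo   # e29885e2 (star plus a dangling lead byte)
```

With a limit of 4, the output ends in a lone e2 byte, which is not valid
UTF-8 on its own. A character-counting conversion such as ksh's %Ls
would be expected to stop at a character boundary instead.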
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
This bug report was last modified 6 years and 250 days ago.