GNU bug report logs - #17196
multibyte: printf: %s counts bytes instead of characters
On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
>
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>> Dan Douglas wrote:
>>> > ksh93 already has this feature using the "L" modifier:
>>> >
>>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>> > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>> LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>> $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>> á★★
>> $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>> A★
>> $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>> A
>>
>> zsh seems to just count characters:
>> $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>> á★
>> $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>> á★
>> $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>> A★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>> printf '%3s' 'blah' # count cells
>> printf '%3Ls' 'blah' # count chars
>> LANG=C printf '%3Ls' 'blah' # count bytes
>> LANG=C printf '%3s' 'blah' # count bytes
>
> Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined. But I would LOVE to have a
> standardized spelling for counting characters instead of bytes. The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.
Note that ksh seems to count cells, rather than characters, with %Ls.
> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes. Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.
At the shell level I expect that the vast majority
of uses would prefer to specify cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk already adjust
the meaning of %s according to the locale
(albeit counting chars rather than cells).
But it's a fair point that there may be scripts
that don't account for the zsh behavior.
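
To make the byte versus character distinction concrete, here is a minimal
sketch that counts both for one string without depending on the current
locale. The helper names count_bytes and count_chars are hypothetical, and the
character count assumes the input is valid UTF-8. Counting cells would also
need per-character width data (for example from wcwidth()), which this sketch
does not attempt.

```shell
# Sample text built from octal escapes so the script itself stays ASCII:
# "a", U+0301 COMBINING ACUTE ACCENT (2 bytes), U+2605 BLACK STAR (3 bytes).
sample=$(printf 'a\314\201\342\230\205')

# Hypothetical helper: number of bytes in $1, independent of locale.
# The trailing tr strips the padding some wc implementations emit.
count_bytes() {
    printf '%s' "$1" | LC_ALL=C wc -c | tr -d ' '
}

# Hypothetical helper: number of UTF-8 encoded characters in $1.
# Assumes valid UTF-8: every character has exactly one byte outside
# the continuation range 0x80-0xBF (octal 200-277), so deleting the
# continuation bytes leaves one byte per character.
count_chars() {
    printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | LC_ALL=C wc -c | tr -d ' '
}

printf 'bytes=%s chars=%s\n' "$(count_bytes "$sample")" "$(count_chars "$sample")"
# prints: bytes=6 chars=3
```

The combining accent is a character of its own here, which is why a character
count and a cell count can differ for the same text.
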
If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:
printf '%3s' 'blah' # count bytes
printf '%3Ls' 'blah' # count cells
LANG=C printf '%3Ls' 'blah' # count bytes
This has the disadvantage of not degrading gracefully
on dash for example where %Ls is rejected.
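
One way a script could cope with that is to probe for %Ls at run time and
fall back to plain %s. This is only a sketch: have_Ls and truncate_text are
hypothetical names, and a shell that accepts %Ls is not guaranteed to count
cells (or even characters) with it, so the unit of truncation still depends
on the shell in use.

```shell
# Probe once whether this shell's printf accepts the %Ls extension.
# The subshell guards against shells that treat a bad directive as fatal.
if (printf '%.1Ls' x) >/dev/null 2>&1; then
    have_Ls=yes
else
    have_Ls=no
fi

# Hypothetical helper: print $2 truncated to at most $1 units.
# Assumes $1 is a non-negative integer. With %Ls the unit is whatever
# the shell implements for it; with the %s fallback it is bytes in
# shells that follow the byte-counting behavior.
truncate_text() {
    if [ "$have_Ls" = yes ]; then
        printf "%.${1}Ls\n" "$2"
    else
        printf "%.${1}s\n" "$2"
    fi
}

truncate_text 3 'blah'
# prints: bla
```

For ASCII-only input such as 'blah', bytes, characters and cells coincide, so
both branches give the same result; the difference only shows up with
multibyte text.
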
thanks,
Pádraig.
This bug report was last modified 6 years and 250 days ago.