GNU bug report logs - #17196
multibyte: printf: %s counts bytes instead of characters


Package: coreutils;

Reported by: Jan Novak <jn <at> turbo.sk>

Date: Sat, 5 Apr 2014 23:22:01 UTC

Severity: wishlist


From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l <at> opengroup.org>, Bob Proulx <bob <at> proulx.com>, Jan Novak <jn <at> turbo.sk>
Subject: bug#17196: UTF-8 printf string formating  problem
Date: Tue, 08 Apr 2014 01:11:13 +0100
On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
> 
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>>   Dan Douglas wrote:
>>>   > ksh93 already has this feature using the "L" modifier:
>>>   > 
>>>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>>   > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>>   A★
>>   $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>>   A
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>>   A★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah'  # count cells
>>   printf '%3Ls' 'blah' # count chars
>>   LANG=C printf '%3Ls' 'blah' # count bytes
>>   LANG=C printf '%3s' 'blah'  # count bytes
> 
> Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined.  But I would LOVE to have a
> standardized spelling for counting characters instead of bytes.  The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.

Note that ksh seems to count cells, not characters, with %Ls.

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes.  Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority
of uses would prefer to specify cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk already adjust
the meaning of %s according to the locale
(albeit to count chars rather than cells).

But it's a fair point that there may be scripts
that rely on %s counting bytes and don't account for the zsh behavior.
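
For reference, here is a minimal sketch of the byte-counting behavior that such scripts would be relying on, and of why it is awkward for multibyte text. The helper names truncate_bytes and to_hex are made up for this illustration, and it assumes od and tr are available. It forces the C locale so that the precision is counted in bytes regardless of which printf is in use:

```shell
# Two U+2605 BLACK STAR characters, written as octal UTF-8 bytes so the
# script itself stays pure ASCII (3 bytes per star, 6 bytes in total).
stars=$(printf '\342\230\205\342\230\205')

# Hypothetical helper: print the first N bytes of STRING, which is what
# a %.Ns precision selects in the C locale.
truncate_bytes() {
    LC_ALL=C printf "%.${1}s" "$2"
}

# Hypothetical helper: dump stdin as one contiguous hex string.
to_hex() {
    od -An -tx1 | tr -d ' \n'
}

# A precision of 4 keeps one whole star (e2 98 85) plus a stray lead
# byte (e2) of the second one, i.e. the output ends mid-character.
truncate_bytes 4 "$stars" | to_hex
echo
```

A precision of 3 or 6 happens to land on a character boundary here, but the caller has to know the encoding to pick such a value.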

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

   printf '%3s' 'blah'  # count bytes
   printf '%3Ls' 'blah' # count cells
   LANG=C printf '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on shells such as dash, where %Ls is rejected.
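
A script could work around that by probing for the directive first. The following is only a sketch of one possible approach, not something any of the shells above document. The names trunc_fmt and trunc3 are invented for the example. It also only selects a directive that the local printf accepts; what %Ls then counts (bytes, chars or cells) still depends on the implementation, as shown earlier in the thread.

```shell
# Choose a format for "at most 3 units of text": prefer %.3Ls where the
# printf in use accepts the L modifier, else fall back to plain %.3s.
# The probe runs in a subshell so that a rejected directive cannot
# abort the calling script.
if (printf '%.1Ls' x) >/dev/null 2>&1; then
    trunc_fmt='%.3Ls\n'
else
    trunc_fmt='%.3s\n'
fi

# Hypothetical helper built on whichever format was selected.
trunc3() {
    printf "$trunc_fmt" "$1"
}

# For plain ASCII input bytes, chars and cells all coincide, so this
# prints "bla" whichever branch was taken.
trunc3 blah
```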

thanks,
Pádraig.




This bug report was last modified 6 years and 250 days ago.
