#17196 - multibyte: printf: %s counts bytes instead of characters

GNU bug report logs - #17196
multibyte: printf: %s counts bytes instead of characters

Reported by: Jan Novak <jn <at> turbo.sk>

Date: Sat, 5 Apr 2014 23:22:01 UTC

Severity: wishlist


From: Pádraig Brady <P <at> draigBrady.com>
To: Eric Blake <eblake <at> redhat.com>
Cc: 17196 <at> debbugs.gnu.org, Austin Group <austin-group-l <at> opengroup.org>, Bob Proulx <bob <at> proulx.com>, Jan Novak <jn <at> turbo.sk>
Subject: bug#17196: UTF-8 printf string formating problem
Date: Tue, 08 Apr 2014 01:11:13 +0100

On 04/07/2014 10:57 PM, Eric Blake wrote:
> [adding the Austin Group]
>
> On 04/07/2014 07:08 AM, Pádraig Brady wrote:
>> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>>> Pádraig Brady wrote:
>>>> Yes printf follows the C standard which only considers bytes.
>>>> ...
>>>> I don't think we'd be able to change the current operation of printf
>>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>>> the existing multibyte character aware alignment/truncation code in:
>>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>>
>>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>>> that ksh uses the L modifier.
>>>
>>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>>
>>> Dan Douglas wrote:
>>> > ksh93 already has this feature using the "L" modifier:
>>> >
>>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>>> > ★★★
>>>
>>> At least there is prior art for it.
>>
>> So we can count bytes, chars or cells (graphemes).
>>
>> Thinking a bit more about it, I think shell level printf
>> should be dealing in text of the current encoding and counting cells.
>> In the edge case where you want to deal in bytes one can do:
>>   LC_ALL=C printf ...
>>
>> I see that ksh behaves as I would expect and counts cells,
>> though requires the explicit %L enabler:
>>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★★
>>   $ ksh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★
>>   $ ksh -c "printf '%.3Ls\n' $'ＡＡ\u2605\u2605\u2605'"
>>   Ａ
>>
>> zsh seems to just count characters:
>>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>>   á★
>>   $ zsh -c "printf '%.3Ls\n' $'Ａ\u2605\u2605\u2605'"
>>   Ａ★★
>>
>> I see that dash gives invalid directive for any of %ls %Ls %S.
>>
>> Pity there is no consensus here.
>> Personally I would go for:
>>   printf '%3s' 'blah' # count cells
>>   printf '%3Ls' 'blah' # count chars
>>   LANG=C printf '%3Ls' 'blah' # count bytes
>>   LANG=C printf '%3s' 'blah' # count bytes
>
> Hmm. POSIX requires support for %ls (aka %S) according to byte counts,
> and currently states that %Ls is undefined. But I would LOVE to have a
> standardized spelling for counting characters instead of bytes. The
> extension %Ls looks like a good candidate for standardization, precisely
> because counting characters when printing a multibyte string is more
> useful than counting bytes (you do NOT want to end in the middle of a
> multibyte character), and because ksh offers it as existing practice.

Note that ksh seems to count cells with %Ls.

> Your idea for counting "cells" (by which I'm assuming you mean one or
> more characters that all display within the same cell of the terminal,
> as if the end user saw only one grapheme), on the other hand, does not
> seem to have any precedent, and I would strongly object to having %s
> count by cells because %s already has a standardized (if unfortunate)
> meaning of counting by bytes. Maybe yet another extension is warranted
> (perhaps %LLs?) as a new notion for counting by cells instead of
> characters, but it's harder to justify that without existing practice.

At the shell level I expect that the vast majority of uses
would prefer to be specifying cell counts.
I thought there might not be many backwards compat issues
with doing that, especially since zsh and gawk adjust
the meaning of %s according to the locale
(albeit for char rather than cell count).
But it's a fair point that there may be scripts
that don't consider the zsh behavior.

If we had to make it explicit for backwards compat reasons,
then I suppose counting by characters is the least useful,
so we could just standardize the existing ksh behavior and have:

  printf '%3s' 'blah' # count bytes
  printf '%3Ls' 'blah' # count cells
  LANG=C printf '%3Ls' 'blah' # count bytes

This has the disadvantage of not degrading gracefully
on dash, for example, where %Ls is rejected.

thanks,
Pádraig.
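
Editorial illustration, not part of the original message: the three counting units discussed in this thread (bytes, characters, cells) can be sketched in Python. This is only an approximation under stated assumptions. It does not reproduce how ksh, zsh, or coreutils implement printf. The function names are hypothetical, the text is assumed to be UTF-8, and cell width is estimated from the Unicode database (combining marks count as 0 cells, wide and fullwidth characters as 2, ambiguous-width characters such as U+2605 as 1), which may differ from what a given terminal or wcwidth() reports.

```python
import unicodedata


def cell_width(ch):
    """Estimate how many terminal cells one code point occupies (assumption)."""
    if unicodedata.combining(ch):
        return 0  # combining marks are drawn over the preceding character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def truncate_bytes(s, n, encoding="utf-8"):
    """Keep at most n encoded bytes, dropping any partial trailing character."""
    return s.encode(encoding)[:n].decode(encoding, errors="ignore")


def truncate_chars(s, n):
    """Keep at most n code points, regardless of how wide they display."""
    return s[:n]


def truncate_cells(s, n):
    """Keep as many leading code points as fit in n estimated cells."""
    out = []
    used = 0
    for ch in s:
        w = cell_width(ch)
        if used + w > n:
            break
        out.append(ch)
        used += w
    return "".join(out)


if __name__ == "__main__":
    samples = [
        "a\u0301\u2605\u2605\u2605",      # 'a' + combining acute, then stars
        "\uff21\u2605\u2605\u2605",       # fullwidth 'A', then stars
        "\uff21\uff21\u2605\u2605\u2605",  # two fullwidth 'A', then stars
    ]
    for s in samples:
        print(repr(truncate_bytes(s, 3)),
              repr(truncate_chars(s, 3)),
              repr(truncate_cells(s, 3)))
```

Under these assumptions, truncating to 3 cells keeps the same code points as the ksh %.3Ls results quoted above, and truncating to 3 characters keeps the same code points as the zsh results. Truncating to 3 bytes keeps only the "a" plus its combining accent in the first sample, which is the behavior the original report objected to.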

This bug report was last modified 6 years and 301 days ago.

