[adding the Austin Group]

On 04/07/2014 07:08 AM, Pádraig Brady wrote:
> On 04/06/2014 07:24 PM, Bob Proulx wrote:
>> Pádraig Brady wrote:
>>> Yes printf follows the C standard which only considers bytes.
>>> ...
>>> I don't think we'd be able to change the current operation of printf
>>> due to backwards compat reasons? Though we might be able to somehow leverage
>>> the existing multibyte character aware alignment/truncation code in:
>>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
>>
>> Dan Douglas pointed out in the corresponding discussion in bug-bash
>> that ksh uses the L modifier.
>>
>> http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
>>
>> Dan Douglas wrote:
>> > ksh93 already has this feature using the "L" modifier:
>> >
>> > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>> > ★★★
>>
>> At least there is prior art for it.
>
> So we can count bytes, chars or cells (graphemes).
>
> Thinking a bit more about it, I think shell level printf
> should be dealing in text of the current encoding and counting cells.
> In the edge case where you want to deal in bytes one can do:
>   LC_ALL=C printf ...
>
> I see that ksh behaves as I would expect and counts cells,
> though requires the explicit %L enabler:
>   $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★★
>   $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>   A★
>   $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
>   A
>
> zsh seems to just count characters:
>   $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
>   á★
>   $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
>   A★★
>
> I see that dash gives invalid directive for any of %ls %Ls %S.
>
> Pity there is no consensus here.
> Personally I would go for:
>   printf '%3s' 'blah'  # count cells
>   printf '%3Ls' 'blah' # count chars
>   LANG=C '%3Ls' 'blah' # count bytes
>   LANG=C '%3s' 'blah'  # count bytes

Hmm.
POSIX requires support for %ls (aka %S) according to byte counts, and
currently states that %Ls is undefined. But I would LOVE to have a
standardized spelling for counting characters instead of bytes. The
extension %Ls looks like a good candidate for standardization, precisely
because counting characters when printing a multibyte string is more
useful than counting bytes (you do NOT want to end in the middle of a
multibyte character), and because ksh offers it as existing practice.

Your idea for counting "cells" (by which I'm assuming you mean one or
more characters that all display within the same cell of the terminal,
as if the end user saw only one grapheme), on the other hand, does not
seem to have any precedent, and I would strongly object to having %s
count by cells because %s already has a standardized (if unfortunate)
meaning of counting by bytes. Maybe yet another extension is warranted
(perhaps %LLs?) as a new notion for counting by cells instead of
characters, but it's harder to justify that without existing practice.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
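
For readers who want the three counting units from this thread (bytes,
characters, cells) side by side, here is a rough Python sketch. It is
only an illustration: the function names are made up for this example,
the cell width rule is a simplified wcwidth-style guess (combining marks
take 0 cells, East Asian Wide/Fullwidth take 2, everything else 1), and
it does not try to reproduce what ksh, zsh, or any printf actually does.

```python
import unicodedata


def truncate_bytes(s, n, encoding="utf-8"):
    """Keep the first n bytes of the encoded string.

    This is the byte-counting behaviour; it can stop in the middle of a
    multibyte character, so the result is returned as bytes.
    """
    return s.encode(encoding)[:n]


def truncate_chars(s, n):
    """Keep the first n characters (code points)."""
    return s[:n]


def cell_width(ch):
    """Approximate number of terminal cells for one code point.

    Assumption for this sketch only: a simplified wcwidth-like rule.
    """
    if unicodedata.combining(ch):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def truncate_cells(s, n):
    """Keep as many leading characters as fit into n terminal cells."""
    out = []
    used = 0
    for ch in s:
        w = cell_width(ch)
        if used + w > n:
            break
        used += w
        out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # 'a' + COMBINING ACUTE ACCENT + three BLACK STARs, as in the thread.
    sample = "a\u0301\u2605\u2605\u2605"
    print(truncate_bytes(sample, 4))   # ends inside the first star's bytes
    print(truncate_chars(sample, 3))   # a, accent, one star
    print(truncate_cells(sample, 3))   # a+accent share a cell, then two stars
```

With the sample string, a 4-byte limit ends after the first byte of the
first star, so the result is no longer valid UTF-8. A 3-character limit
keeps one star and a 3-cell limit keeps two, because the combining
accent occupies no cell of its own under the rule assumed here.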