GNU bug report logs - #17196
multibyte: printf: %s counts bytes instead of characters


Package: coreutils;

Reported by: Jan Novak <jn <at> turbo.sk>

Date: Sat, 5 Apr 2014 23:22:01 UTC

Severity: wishlist


From: Pádraig Brady <P <at> draigBrady.com>
To: Bob Proulx <bob <at> proulx.com>
Cc: 17196 <at> debbugs.gnu.org, Jan Novak <jn <at> turbo.sk>
Subject: bug#17196: UTF-8 printf string formating  problem
Date: Mon, 07 Apr 2014 14:08:07 +0100

On 04/06/2014 07:24 PM, Bob Proulx wrote:
> Pádraig Brady wrote:
>> Yes printf follows the C standard which only considers bytes.
>> ...
>> I don't think we'd be able to change the current operation of printf
>> due to backwards compat reasons? Though we might be able to somehow leverage
>> the existing multibyte character aware alignment/truncation code in:
>> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=gl/lib/mbsalign.c;hb=HEAD
> 
> Dan Douglas pointed out in the corresponding discussion in bug-bash
> that ksh uses the L modifier.
> 
>   http://lists.gnu.org/archive/html/bug-bash/2014-04/msg00021.html
> 
>   Dan Douglas wrote:
>   > ksh93 already has this feature using the "L" modifier:
>   > 
>   > ksh -c "printf '%.3Ls\n' $'\u2605\u2605\u2605\u2605\u2605'"
>   > ★★★
> 
> At least there is prior art for it.

So we can count bytes, chars or cells (graphemes).
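
As a rough illustration of the first two units, the sketch below counts bytes and characters for one sample string. The sample string and the variable names are made up for illustration. The character count assumes UTF-8 input and works by dropping continuation bytes. Cell width is not computed, since I know of no portable shell tool for it.

```shell
# Hypothetical sample: 'a' + U+0301 (combining acute) + U+2605 (black star),
# written as octal escapes of the UTF-8 bytes so the script stays ASCII.
s=$(printf 'a\314\201\342\230\205')

# wc -c counts bytes regardless of locale.
nbytes=$(printf '%s' "$s" | wc -c | tr -d ' ')

# Assuming UTF-8 input: every character has exactly one byte that is not a
# continuation byte (0x80-0xBF), so deleting those leaves one byte per character.
nchars=$(printf '%s' "$s" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' ')

echo "bytes=$nbytes chars=$nchars"
```

The string would normally occupy two cells on a terminal (the combining accent adds no width), which is the third number a cell-aware printf would need.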

Thinking a bit more about it, I think shell level printf
should be dealing in text of the current encoding and counting cells.
In the edge case where you want to deal in bytes one can do:
  LC_ALL=C printf ...
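
A minimal sketch of that byte-oriented edge case, assuming a POSIX shell and a made-up sample string of two stars (3 bytes each in UTF-8):

```shell
# Hypothetical sample: two U+2605 characters as octal-escaped UTF-8 bytes.
s=$(printf '\342\230\205\342\230\205')

# In the C locale the precision is a byte count, so .3 keeps the first
# three bytes, which here happen to be exactly one star.
one_star=$(LC_ALL=C printf '%.3s' "$s")

# Confirm the size of what was kept; wc -c always counts bytes.
kept=$(printf '%s' "$one_star" | wc -c | tr -d ' ')
echo "bytes kept: $kept"
```

A precision that is not a multiple of 3 would cut a star in half here, which is the hazard of byte counting on multibyte text.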

I see that ksh behaves as I would expect and counts cells,
though it requires the explicit L modifier:
  $ ksh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
  á★★
  $ ksh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
  A★
  $ ksh -c "printf '%.3Ls\n' $'AA\u2605\u2605\u2605'"
  A

zsh seems to just count characters:
  $ zsh -c "printf '%.3Ls\n' $'a\u0301\u2605\u2605\u2605'"
  á★
  $ zsh -c "printf '%.3s\n' $'a\u0301\u2605\u2605\u2605'"
  á★
  $ zsh -c "printf '%.3Ls\n' $'A\u2605\u2605\u2605'"
  A★★

I see that dash reports "invalid directive" for each of %ls, %Ls and %S.

It's a pity there is no consensus here.
Personally I would go for:
  printf '%3s' 'blah'  # count cells
  printf '%3Ls' 'blah' # count chars
  LANG=C printf '%3Ls' 'blah' # count bytes
  LANG=C printf '%3s' 'blah'  # count bytes

Pádraig.





This bug report was last modified 6 years and 250 days ago.


