Rich Felker wrote:

 |On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
 |> Eric Blake wrote:
 |>|Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |>|and currently states that %Ls is undefined.  But I would LOVE to have a
 |>|standardized spelling for counting characters instead of bytes.  The
 |>|extension %Ls looks like a good candidate for standardization, precisely
 |>|because counting characters when printing a multibyte string is more
 |>|useful than counting bytes (you do NOT want to end in the middle of a
 |>|multibyte character), and because ksh offers it as existing practice.
 |>|
 |>|Your idea for counting "cells" (by which I'm assuming you mean one or
 |>|more characters that all display within the same cell of the terminal,
 |>|as if the end user saw only one grapheme), on the other hand, does not
 |>|seem to have any precedent, and I would strongly object to having %s [.]
 |>
 |> I see you are trying to use the word "character" for code points and
 |> reserve the term "grapheme" for user-perceived characters.  This is in
 |> line with the GNU C library, whose existing practice is to let
 |> wcwidth(3) return 1 for accents and other combining code points, as
 |> well as for so-called (Unicode) noncharacters.  And who would call
 |> wcwidth(3) on something that is not about to be drawn to the screen?
 |> And, of course, which terminal will, on its own, compose code points
 |> written via standard I/O into characters?  For quite a while I have
 |> thought it is up to the input methods to combine input into something
 |> precomposed, so that POSIX programs can finally work with it.
 |
 |Many languages do not have precomposed forms for all the character
 |sequences they need, and for some it would not even be practical to
 |have precomposed forms; that would force the use of complex input
 |methods instead of simple keyboard maps.
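Rich's point can be made concrete. A minimal Python sketch (using the standard unicodedata module, purely for illustration of the Unicode facts under discussion, not of the C APIs): U+0075 U+0308 has a precomposed form (U+00FC), but U+0071 U+0323 (q with a combining dot below, used e.g. in some transliteration schemes) has none, so NFC normalization must leave it decomposed:

```python
import unicodedata

# 'u' + COMBINING DIAERESIS: a precomposed form exists (U+00FC)
decomposed = "u\u0308"
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), hex(ord(composed)))   # single code point, U+00FC

# 'q' + COMBINING DOT BELOW: Unicode defines no precomposed form,
# so NFC has nothing to compose it to and it stays a two-code-point sequence
no_precomposed = "q\u0323"
print(len(unicodedata.normalize("NFC", no_precomposed)))
```

So any scheme that relies on input methods always delivering precomposed characters cannot cover such sequences at all.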
And of course, with UTF-8, decomposed forms of characters from an immense number of languages can occur, at least in theory, in e.g. a text file. The German «ü» could very well be the precomposed U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), but also the decomposed sequence U+0075 U+0308 «u ̈», depending on where it came from.

And note that my vim(1) automatically composed U+00FC when I tried to input the latter sequence; I had to enter each code point separately and join them afterwards to get «u» plus an actually non-combining diaeresis. (In fact, one «combining with a space».)

Of course, a wcwidth(3) of 1 for U+0308 is much better than 0 when it really produces something visible. Even better would nonetheless be the full picture: a termios(4) IUTF8 flag, some extended xywidth(3) that returns a tuple of {[East Asian Width indication,] is-combining, width-if-non-combining}, and, best of all, some composition function.

I don't think that «user-perceived characters don't have any precedent». A whole lot of development in the past decade on the winning side (that is, the other :) was exactly that: making software barrier-free. If POSIX beams itself onto UTF-8, it should really consider offering a way to act on what the user actually deals with. And in the Unicode world -- isn't that what this bug report is about? -- that is not necessarily an mbrlen(3)-division of bytes.

--steffen
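P.S.: The three competing counts (bytes, code points, display cells) can be shown side by side. A minimal Python sketch; since Python's standard library has no wcwidth(3), counting non-combining code points via unicodedata.combining() is used here only as a rough proxy for cell width (it ignores East Asian wide characters, which a real wcwidth(3) must handle):

```python
import unicodedata

s = "u\u0308"                              # decomposed «ü»: 'u' + COMBINING DIAERESIS

n_bytes = len(s.encode("utf-8"))           # what %s counts today
n_codepoints = len(s)                      # what %ls / a proposed %Ls would count
# rough proxy for terminal cells: code points that are not combining marks
n_cells = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(n_bytes, n_codepoints, n_cells)      # three different answers for one glyph
```

A byte-based field width can split the 3-byte sequence mid-character; a code-point width still overcounts the single visible glyph; only the cell count matches what the user sees.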