Rich Felker wrote:

 |On Wed, Apr 09, 2014 at 02:49:37PM +0200, Steffen Nurpmeso wrote:
 |> Eric Blake wrote:
 |>|Hmm.  POSIX requires support for %ls (aka %S) according to byte counts,
 |>|and currently states that %Ls is undefined.  But I would LOVE to have a
 |>|standardized spelling for counting characters instead of bytes.  The
 |>|extension %Ls looks like a good candidate for standardization, precisely
 |>|because counting characters when printing a multibyte string is more
 |>|useful than counting bytes (you do NOT want to end in the middle of a
 |>|multibyte character), and because ksh offers it as existing practice.
 |>|
 |>|Your idea for counting "cells" (by which I'm assuming you mean one or
 |>|more characters that all display within the same cell of the terminal,
 |>|as if the end user saw only one grapheme), on the other hand, does not
 |>|seem to have any precedent, and I would strongly object to having %s [.]
 |>
 |> I see you are trying to use the word "character" for code points and
 |> reserve the term "grapheme" for user-perceived characters.  This is in
 |> line with the GNU C library, whose existing practice is to let
 |> wcwidth(3) return 1 for accents and other combining code points, as
 |> well as for so-called (Unicode) noncharacters.  And who would call
 |> wcwidth(3) on something that is not about to be drawn to the screen?
 |> And, of course, which terminal will, on its own, compose code points
 |> written via standard I/O into characters?  For quite a while I have
 |> thought it is up to the input methods to combine input into something
 |> precomposed, so that POSIX programs can finally work with it.
 |
 |Many languages do not have precomposed forms for all the character
 |sequences they need, and for some it would not even be practical to
 |have precomposed forms; that would force the use of complex input
 |methods instead of simple keyboard maps.
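Rich's point can be made concrete. A minimal Python sketch (using the standard unicodedata module, purely for illustration of the Unicode facts under discussion, not of the C APIs): U+0075 U+0308 has a precomposed form (U+00FC), but U+0071 U+0323 (q with a combining dot below, used e.g. in some transliteration schemes) has none, so NFC normalization must leave it decomposed:

```python
import unicodedata

# 'u' + COMBINING DIAERESIS: a precomposed form exists (U+00FC)
decomposed = "u\u0308"
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), hex(ord(composed)))   # single code point, U+00FC

# 'q' + COMBINING DOT BELOW: Unicode defines no precomposed form,
# so NFC has nothing to compose it to and it stays a two-code-point sequence
no_precomposed = "q\u0323"
print(len(unicodedata.normalize("NFC", no_precomposed)))
```

So any scheme that relies on input methods always delivering precomposed characters cannot cover such sequences at all.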
And of course, with UTF-8, decomposed forms of characters from an immense number of languages can occur, at least in theory, in e.g. a text file. The German «ü» could very well be the precomposed U+00FC (LATIN SMALL LETTER U WITH DIAERESIS), but also the decomposed sequence U+0075 U+0308 «u ̈», depending on where it came from.

And note that my vim(1) automatically composed U+00FC when I tried to input the latter sequence; I had to enter each code point separately and join them afterwards to get «u» plus an actually non-combining diaeresis. (In fact, one «combining with a space».)

Of course, a wcwidth(3) of 1 for U+0308 is much better than 0 when it really produces something visible. Even better would nonetheless be the full picture: a termios(4) IUTF8 flag, some extended xywidth(3) that returns a tuple of {[East Asian Width indication,] is-combining, width-if-non-combining}, and, best of all, some composition function.

I don't think that «user-perceived characters don't have any precedent». A whole lot of development in the past decade on the winning side (that is, the other :) was exactly that: making software barrier-free. If POSIX beams itself onto UTF-8, it should really consider offering a way to act on what the user actually deals with. And in the Unicode world -- isn't that what this bug report is about? -- that is not necessarily an mbrlen(3)-division of bytes.

--steffen
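P.S.: The three competing counts (bytes, code points, display cells) can be shown side by side. A minimal Python sketch; since Python's standard library has no wcwidth(3), counting non-combining code points via unicodedata.combining() is used here only as a rough proxy for cell width (it ignores East Asian wide characters, which a real wcwidth(3) must handle):

```python
import unicodedata

s = "u\u0308"                              # decomposed «ü»: 'u' + COMBINING DIAERESIS

n_bytes = len(s.encode("utf-8"))           # what %s counts today
n_codepoints = len(s)                      # what %ls / a proposed %Ls would count
# rough proxy for terminal cells: code points that are not combining marks
n_cells = sum(1 for ch in s if unicodedata.combining(ch) == 0)

print(n_bytes, n_codepoints, n_cells)      # three different answers for one glyph
```

A byte-based field width can split the 3-byte sequence mid-character; a code-point width still overcounts the single visible glyph; only the cell count matches what the user sees.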