GNU bug report logs - #7960
multibyte: fmt: fix formatting multibyte text (bug #7372)

Package: coreutils

Reported by: Kostya Stopani <hatta <at> depni.sinp.msu.ru>

Date: Wed, 2 Feb 2011 14:42:01 UTC

Severity: normal

Tags: moreinfo, patch

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

From: Eric Blake <eblake <at> redhat.com>
To: Kostya Stopani <hatta <at> depni.sinp.msu.ru>, 7960 <at> debbugs.gnu.org
Subject: bug#7960: [PATCH] fmt: fix formatting multibyte text (bug #7372)
Date: Wed, 02 Feb 2011 14:33:44 -0700
[readding the list]

On 02/02/2011 02:11 PM, Kostya Stopani wrote:
> On Wed, Feb 02, 2011 at 10:15:53AM -0700, Eric Blake wrote:
> 
>> Thanks for the patch.  However, it's not trivial, so it would need
>> copyright assignment.
> 
> Oh boy... Anyway I don't mind signing papers, if you (or whoever)
> don't mind bothering with it.

OK, I'll send you those details off-list.

> 
>> Furthermore, there are already known issues where upstream coreutils
>> is lacking multibyte character support, but a solution has to be
>> both maintainable and no-impact to the single-byte locale case.
> 
> I believe this patch doesn't break single-byte behavior because no
> conversion takes place. mbsnrtowcs() is used only to count
> characters. I've tested various cases (8-bit encoding was KOI8-R):
> 
> |--------+---------------+--------------------------|
> | Locale | Text encoding | Result                   |
> |--------+---------------+--------------------------|
> | UTF-8  | UTF-8         | old fmt: text too narrow |
> |        |               | new fmt: ok              |
> |--------+---------------+--------------------------|
> | UTF-8  | 8-bit         | same                     |
> |--------+---------------+--------------------------|
> | 8-bit  | UTF-8         | same                     |
> |--------+---------------+--------------------------|
> | 8-bit  | 8-bit         | same                     |
> |--------+---------------+--------------------------|
> 
> From my point of view the alternative is to convert everything to
> wchar_t, which imposes the need to keep track of conversion errors and
> gracefully fall back to single-byte.

Keeping things in multibyte rather than converting to wchar_t is the way
to go (especially given the ongoing discussion of how to handle the fact
that on cygwin, wchar_t is UTF-16 and thus still multi-unit as an
extension to POSIX, with all sorts of ramifications to programs that
expect POSIX semantics).
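
To illustrate the counting approach described in the quoted text, here is a minimal sketch. It is not the code from the patch: the helper name `count_chars` and its fallback policy are assumptions made for this example. It calls mbsnrtowcs() with a null destination, so no wide characters are stored and the text itself stays in multibyte form.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not taken from the patch): return the number of
   characters in the first LEN bytes of BUF, as decoded in the current
   locale's encoding.  If the bytes are not valid in that encoding, fall
   back to the byte count, which matches single-byte behavior.  */
size_t
count_chars (const char *buf, size_t len)
{
  assert (buf != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* begin in the initial shift state */

  const char *src = buf;
  /* With a null destination nothing is converted or stored; the call
     only reports how many wide characters the input would produce.
     Counting stops early if a null byte is seen within LEN bytes.  */
  size_t n = mbsnrtowcs (NULL, &src, len, 0, &state);

  return n == (size_t) -1 ? len : n;
}
```

A caller would need setlocale (LC_ALL, "") beforehand for the locale's encoding to take effect; without it the program runs in the "C" locale. Note that this counts characters, not display columns, so wide or zero-width characters would need separate handling.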

-- 
Eric Blake   eblake <at> redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

This bug report was last modified 6 years and 264 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.