GNU bug report logs - #24924
multibyte: pr has no concept of wide characters

Previous Next

Package: coreutils;

Reported by: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>

Date: Fri, 11 Nov 2016 16:12:01 UTC

Severity: wishlist

Full log


View this message in rfc822 format

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 24924 <at> debbugs.gnu.org
Subject: bug#24924: GNU pr only working with singlebyte 1-width characters
Date: Thu, 1 Dec 2016 08:49:39 +0000
2016-12-01 07:04:05 +0000, Stephane Chazelas:
> 2016-11-30 18:37:05 -0800, Paul Eggert:
> [...]
> > In the meantime if you could submit a patch for the
> > documentation that should fix the immediate documentation
> > problem.
> [...]
> 
> What about:
[...]
> +Please note that @command{pr} currently doesn't support multi-byte characters
> +or non-ASCII characters that have a null or double width. If such characters
> +occur in the input or column separators, column alignment may be off or lines
> +may exceed the page width. There is also no provision to support bidirectional
> +text.
[...]

Actually, it seems it can also truncate lines in the middle of
some characters though it seems it's confined to multibyte
characters that have byte values <= 127 like:

$ locale charmap
BIG5-HKSCS
$ printf '\ue9\ue9\ue9\n' | pr -w5 -t2 | hd
00000000  88 6d 88 6d 88 0a                                 |.m.m..|
00000006

See how that third é (0x88 0x6d in BIG5-HKSCS) was truncated in
the middle.

It's as if it was considering all byte values >= 128 as having
zero width in multi-byte locales (and only in multi-byte
locales, that doesn't seem to occur in single-byte ones).

So maybe:

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index cc85f22..15088ce 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -1838,6 +1838,13 @@ For single
 column output no line truncation occurs by default.  Use @option{-W} option to
 truncate lines in that case.
 
+Please note that @command{pr} currently doesn't support multi-byte characters
+or non-ASCII characters that have a null or double width. If such characters
+occur in the input or column separators, column alignment may be off or lines
+may exceed the page width, or truncation may occur in the middle of some
+characters producing invalid text output. There is also no provision to support
+bidirectional text.
+
 The following changes were made in version 1.22i and apply to later
 versions of @command{pr}:
 @c FIXME: this whole section here sounds very awkward to me. I

-- 
Stephane




This bug report was last modified 6 years and 231 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.