GNU bug report logs - #11187
multibyte: fmt: Fix incorrect width handling of multibyte

Previous Next

Package: coreutils;

Reported by: Vladimir 'φ-coder/phcoder' Serbinenko <phcoder <at> gmail.com>

Date: Thu, 5 Apr 2012 18:23:01 UTC

Severity: wishlist

Tags: patch

To reply to this bug, email your comments to 11187 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-coreutils <at> gnu.org:
bug#11187; Package coreutils. (Thu, 05 Apr 2012 18:23:01 GMT) Full text and rfc822 format available.

Acknowledgement sent to Vladimir 'φ-coder/phcoder' Serbinenko <phcoder <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Thu, 05 Apr 2012 18:23:01 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Vladimir 'φ-coder/phcoder' Serbinenko
	<phcoder <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: [PATCH] Fix incorrect width handling of multibyte characters in fmt
Date: Thu, 05 Apr 2012 20:10:53 +0200
[Message part 1 (text/plain, inline)]
Currently fmt assumes that 1 byte= 1 column which creates wrongly
formatted strings. Attached patch fixes it

-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko

[fmt_width.diff (text/x-diff, attachment)]
[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#11187; Package coreutils. (Thu, 05 Apr 2012 19:10:02 GMT) Full text and rfc822 format available.

Message #8 received at 11187 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Vladimir 'φ-coder/phcoder' Serbinenko
	<phcoder <at> gmail.com>
Cc: 11187 <at> debbugs.gnu.org
Subject: Re: bug#11187: [PATCH] Fix incorrect width handling of multibyte
	characters in fmt
Date: Thu, 05 Apr 2012 21:09:12 +0200
Vladimir "'φ-coder/phcoder'" Serbinenko wrote:
> Currently fmt assumes that 1 byte= 1 column which creates wrongly
> formatted strings. Attached patch fixes it

Hi Vlad,

Thank you for contributing.
This is a large enough change that we'll need an FSF copyright
assignment from you.  If you haven't already sent in the one for
gnulib, please just add coreutils to the list of affected projects.
(you can do up to 4 projects at a time)

Here are a few suggested adjustments:

It'd be great to have a test-suite addition that fails without
your patch, yet succeeds with it.  If you simply provide small
sample input/output pairs along with a selected locale name, we
can convert that to an actual test suite script for you.

Also, since this is a NEWS-worthy change, it is customary to
add an entry in the NEWS file, too.  I'd put this in a section
entitled "Improvements".

> diff --git a/src/fmt.c b/src/fmt.c
> index 89d13a6..56f7c0b 100644
> --- a/src/fmt.c
> +++ b/src/fmt.c
> @@ -20,6 +20,7 @@
>  #include <stdio.h>
>  #include <sys/types.h>
>  #include <getopt.h>
> +#include <wchar.h>
>
>  /* Redefine.  Otherwise, systems (Unicos for one) with headers that define
>     it to be a type get syntax errors for the variable declaration below.  */
> @@ -135,6 +136,7 @@ struct Word
>
>      const char *text;		/* the text of the word */
>      int length;			/* length of this word */
> +    int width;

Please don't follow the bad example there.  This value is always
unsigned, so it is better to use size_t, to match the type of your
new get_display_width function.

>      int space;			/* the size of the following space */
>      unsigned int paren:1;	/* starts with open paren */
>      unsigned int period:1;	/* ends in [.?!])* */
> @@ -259,6 +261,42 @@ static int next_prefix_indent;
>     paragraphs chosen by fmt_paragraph().  */
>  static int last_line_length;
>

Please add a comment saying what the function does/returns,
and naming/describing the arguments.

> +static size_t
> +get_display_width (const char *beg, const char *end)
> +{
> +  const char *ptr;
> +  size_t r = 0;
> +  mbstate_t ps;
> +
> +  memset (&ps, 0, sizeof (ps));

We prefer to initialize mbstate_t variables all on one line, like this:

    mbstate_t ps = { 0, };

> +  for (ptr = beg; *ptr && ptr < end; )
> +    {
> +      wchar_t wc;
> +      size_t s;

Oops.  You've used TABs for indentation.
Note how mixing TABs and spaces makes the indentation look invalid.
Please use only spaces instead.

> +      s = mbrtowc (&wc, ptr, end - ptr, &ps);
> +      if (s == (size_t) -1)
> +	break;
> +      if (s == (size_t) -2)
> +	{
> +	  ptr++;
> +	  r++;
> +	  continue;
> +	}
> +      if (wc == '\e' && ptr + 3 < end
> +	  && ptr[1] == '[' && (ptr[2] == '0' || ptr[2] == '1')
> +	  && ptr[3] == 'm')
> +	{
> +	  ptr += 4;
> +	  continue;
> +	}
> +      r += wcwidth (wc);
> +      ptr += s;
> +    }
> +  return r;
> +}
> +
>  void
>  usage (int status)
>  {
> @@ -669,7 +707,9 @@ get_line (FILE *f, int c)
>            c = getc (f);
>          }
>        while (c != EOF && !isspace (c));
> -      in_column += word_limit->length = wptr - word_limit->text;
> +      word_limit->length = wptr - word_limit->text;
> +      in_column += word_limit->width = get_display_width (word_limit->text,
> +							  wptr);
>        check_punctuation (word_limit);
>
>        /* Scan inter-word space.  */
> @@ -871,13 +911,13 @@ fmt_paragraph (void)
>            if (w == word_limit)
>              break;
>
> -          len += (w - 1)->space + w->length;	/* w > start >= word */
> +          len += (w - 1)->space + w->width;	/* w > start >= word */
>          }
>        while (len < max_width);
>        start->best_cost = best + base_cost (start);
>      }
>
> -  word_limit->length = saved_length;
> +  word_limit->width = saved_length;
>  }
>
>  /* Return the constant component of the cost of breaking before the
> @@ -902,13 +942,13 @@ base_cost (WORD *this)
>        else if ((this - 1)->punct)
>          cost -= PUNCT_BONUS;
>        else if (this > word + 1 && (this - 2)->final)
> -        cost += WIDOW_COST ((this - 1)->length);
> +        cost += WIDOW_COST ((this - 1)->width);
>      }
>
>    if (this->paren)
>      cost -= PAREN_BONUS;
>    else if (this->final)
> -    cost += ORPHAN_COST (this->length);
> +    cost += ORPHAN_COST (this->width);
>
>    return cost;
>  }
> @@ -983,7 +1023,7 @@ put_word (WORD *w)
>    s = w->text;
>    for (n = w->length; n != 0; n--)
>      putchar (*s++);
> -  out_column += w->length;
> +  out_column += w->width;
>  }
>
>  /* Output to stdout SPACE spaces, or equivalent tabs.  */




Information forwarded to bug-coreutils <at> gnu.org:
bug#11187; Package coreutils. (Thu, 05 Apr 2012 23:02:02 GMT) Full text and rfc822 format available.

Message #11 received at 11187 <at> debbugs.gnu.org (full text, mbox):

From: Vladimir 'φ-coder/phcoder' Serbinenko
	<phcoder <at> gmail.com>
To: Jim Meyering <jim <at> meyering.net>
Cc: 11187 <at> debbugs.gnu.org
Subject: Re: bug#11187: [PATCH] Fix incorrect width handling of multibyte
	characters in fmt
Date: Fri, 06 Apr 2012 01:00:31 +0200
[Message part 1 (text/plain, inline)]
On 05.04.2012 21:09, Jim Meyering wrote:
> Vladimir "'φ-coder/phcoder'" Serbinenko wrote:
>> Currently fmt assumes that 1 byte= 1 column which creates wrongly
>> formatted strings. Attached patch fixes it
> Hi Vlad,
>
> Thank you for contributing.
> This is a large enough change that we'll need an FSF copyright
> assignment from you.  If you haven't already sent in the one for
> gnulib, please just add coreutils to the list of affected projects.
> (you can do up to 4 projects at a time)
Ok, will do so. I'll also wait till more or less definitive version is
ready for gnulib before updating the one for coreutils.
Can I add TP in the same time?
-- 
Regards
Vladimir 'φ-coder/phcoder' Serbinenko

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#11187; Package coreutils. (Fri, 06 Apr 2012 06:36:02 GMT) Full text and rfc822 format available.

Message #14 received at 11187 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Vladimir 'φ-coder/phcoder' Serbinenko
	<phcoder <at> gmail.com>
Cc: 11187 <at> debbugs.gnu.org
Subject: Re: bug#11187: [PATCH] Fix incorrect width handling of multibyte
	characters in fmt
Date: Fri, 06 Apr 2012 08:34:46 +0200
Vladimir 'φ-coder/phcoder' Serbinenko wrote:
> On 05.04.2012 21:09, Jim Meyering wrote:
>> Vladimir "'φ-coder/phcoder'" Serbinenko wrote:
>>> Currently fmt assumes that 1 byte= 1 column which creates wrongly
>>> formatted strings. Attached patch fixes it
>> Hi Vlad,
>>
>> Thank you for contributing.
>> This is a large enough change that we'll need an FSF copyright
>> assignment from you.  If you haven't already sent in the one for
>> gnulib, please just add coreutils to the list of affected projects.
>> (you can do up to 4 projects at a time)
> Ok, will do so. I'll also wait till more or less definitive version is
> ready for gnulib before updating the one for coreutils.
> Can I add TP in the same time?

Translation Project?  I don't know if they use the same forms/addresses.
Please ask coordinator <at> translationproject.org to be sure.




Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 19 Oct 2018 01:19:01 GMT) Full text and rfc822 format available.

Changed bug title to 'multibyte: fmt: Fix incorrect width handling of multibyte' from '[PATCH] Fix incorrect width handling of multibyte characters in fmt' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 19 Oct 2018 01:19:01 GMT) Full text and rfc822 format available.

This bug report was last modified 6 years and 325 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.