GNU bug report logs - #11187
multibyte: fmt: Fix incorrect width handling of multibyte

Previous Next

Package: coreutils;

Reported by: Vladimir 'φ-coder/phcoder' Serbinenko <phcoder <at> gmail.com>

Date: Thu, 5 Apr 2012 18:23:01 UTC

Severity: wishlist

Tags: patch

Full log


Message #8 received at 11187 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Vladimir 'φ-coder/phcoder' Serbinenko
	<phcoder <at> gmail.com>
Cc: 11187 <at> debbugs.gnu.org
Subject: Re: bug#11187: [PATCH] Fix incorrect width handling of multibyte
	characters in fmt
Date: Thu, 05 Apr 2012 21:09:12 +0200
Vladimir "'φ-coder/phcoder'" Serbinenko wrote:
> Currently fmt assumes that 1 byte= 1 column which creates wrongly
> formatted strings. Attached patch fixes it

Hi Vlad,

Thank you for contributing.
This is a large enough change that we'll need an FSF copyright
assignment from you.  If you haven't already sent in the one for
gnulib, please just add coreutils to the list of affected projects.
(you can do up to 4 projects at a time)

Here are a few suggested adjustments:

It'd be great to have a test-suite addition that fails without
your patch, yet succeeds with it.  If you simply provide small
sample input/output pairs along with a selected locale name, we
can convert that to an actual test suite script for you.

Also, since this is a NEWS-worthy change, it is customary to
add an entry in the NEWS file, too.  I'd put this in a section
entitled "Improvements".

> diff --git a/src/fmt.c b/src/fmt.c
> index 89d13a6..56f7c0b 100644
> --- a/src/fmt.c
> +++ b/src/fmt.c
> @@ -20,6 +20,7 @@
>  #include <stdio.h>
>  #include <sys/types.h>
>  #include <getopt.h>
> +#include <wchar.h>
>
>  /* Redefine.  Otherwise, systems (Unicos for one) with headers that define
>     it to be a type get syntax errors for the variable declaration below.  */
> @@ -135,6 +136,7 @@ struct Word
>
>      const char *text;		/* the text of the word */
>      int length;			/* length of this word */
> +    int width;

Please don't follow the bad example there.  This value is always
unsigned, so it is better to use size_t, to match the type of your
new get_display_width function.

>      int space;			/* the size of the following space */
>      unsigned int paren:1;	/* starts with open paren */
>      unsigned int period:1;	/* ends in [.?!])* */
> @@ -259,6 +261,42 @@ static int next_prefix_indent;
>     paragraphs chosen by fmt_paragraph().  */
>  static int last_line_length;
>

Please add a comment saying what the function does/returns,
and naming/describing the arguments.

> +static size_t
> +get_display_width (const char *beg, const char *end)
> +{
> +  const char *ptr;
> +  size_t r = 0;
> +  mbstate_t ps;
> +
> +  memset (&ps, 0, sizeof (ps));

We prefer to initialize mbstate_t variables all on one line, like this:

    mbstate_t ps = { 0, };

> +  for (ptr = beg; *ptr && ptr < end; )
> +    {
> +      wchar_t wc;
> +      size_t s;

Oops.  You've used TABs for indentation.
Note how mixing TABs and spaces makes the indentation look invalid.
Please use only spaces instead.

> +      s = mbrtowc (&wc, ptr, end - ptr, &ps);
> +      if (s == (size_t) -1)
> +	break;
> +      if (s == (size_t) -2)
> +	{
> +	  ptr++;
> +	  r++;
> +	  continue;
> +	}
> +      if (wc == '\e' && ptr + 3 < end
> +	  && ptr[1] == '[' && (ptr[2] == '0' || ptr[2] == '1')
> +	  && ptr[3] == 'm')
> +	{
> +	  ptr += 4;
> +	  continue;
> +	}
> +      r += wcwidth (wc);
> +      ptr += s;
> +    }
> +  return r;
> +}
> +
>  void
>  usage (int status)
>  {
> @@ -669,7 +707,9 @@ get_line (FILE *f, int c)
>            c = getc (f);
>          }
>        while (c != EOF && !isspace (c));
> -      in_column += word_limit->length = wptr - word_limit->text;
> +      word_limit->length = wptr - word_limit->text;
> +      in_column += word_limit->width = get_display_width (word_limit->text,
> +							  wptr);
>        check_punctuation (word_limit);
>
>        /* Scan inter-word space.  */
> @@ -871,13 +911,13 @@ fmt_paragraph (void)
>            if (w == word_limit)
>              break;
>
> -          len += (w - 1)->space + w->length;	/* w > start >= word */
> +          len += (w - 1)->space + w->width;	/* w > start >= word */
>          }
>        while (len < max_width);
>        start->best_cost = best + base_cost (start);
>      }
>
> -  word_limit->length = saved_length;
> +  word_limit->width = saved_length;
>  }
>
>  /* Return the constant component of the cost of breaking before the
> @@ -902,13 +942,13 @@ base_cost (WORD *this)
>        else if ((this - 1)->punct)
>          cost -= PUNCT_BONUS;
>        else if (this > word + 1 && (this - 2)->final)
> -        cost += WIDOW_COST ((this - 1)->length);
> +        cost += WIDOW_COST ((this - 1)->width);
>      }
>
>    if (this->paren)
>      cost -= PAREN_BONUS;
>    else if (this->final)
> -    cost += ORPHAN_COST (this->length);
> +    cost += ORPHAN_COST (this->width);
>
>    return cost;
>  }
> @@ -983,7 +1023,7 @@ put_word (WORD *w)
>    s = w->text;
>    for (n = w->length; n != 0; n--)
>      putchar (*s++);
> -  out_column += w->length;
> +  out_column += w->width;
>  }
>
>  /* Output to stdout SPACE spaces, or equivalent tabs.  */




This bug report was last modified 6 years and 326 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.