GNU bug report logs - #9252
cut does not yet support unicode characters

Package: coreutils

Reported by: Danilo Moraes <moraesdno <at> gmail.com>

Date: Sat, 6 Aug 2011 01:54:06 UTC

Severity: normal

Tags: notabug

Merged with 9253

Done: Bob Proulx <bob <at> proulx.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9252 in the body.
You can then email your comments to 9252 AT debbugs.gnu.org in the normal way.


Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9252; Package coreutils. (Sat, 06 Aug 2011 01:54:06 GMT)

Acknowledgement sent to Danilo Moraes <moraesdno <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 06 Aug 2011 01:54:06 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Danilo Moraes <moraesdno <at> gmail.com>
To: bug-coreutils <at> gnu.org
Subject: a bug in cut
Date: Fri, 5 Aug 2011 12:39:14 -0300
I have found a little bug (I guess). See this:

a=danilo
echo $a | cut -c -5 # shows danil

a=dánilo
echo $a | cut -c 5 # shows dáni

The -b option behaves the same way. cut is not handling letters with accents.

I read in infopages this:

`-c CHARACTER-LIST'
`--characters=CHARACTER-LIST'
     Select for printing only the characters in positions listed in
     CHARACTER-LIST.  The same as `-b' for now, but
     internationalization will change that.  Tabs and backspaces are
     treated like any other character; they take up 1 character.  If an
     output delimiter is specified, (see the description of
     `--output-delimiter'), then output that string between ranges of
     selected bytes.

"The same as `-b' for now, but
     internationalization will change that." Does this solve my problem?
How does it work?

Thanks,

Danilo S. Morães

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9252; Package coreutils. (Sat, 06 Aug 2011 17:21:02 GMT)

Message #8 received at 9252 <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: Danilo Moraes <moraesdno <at> gmail.com>
Cc: 9252 <at> debbugs.gnu.org
Subject: Re: bug#9252: a bug in cut
Date: Sat, 6 Aug 2011 11:19:06 -0600
forcemerge 9252 9253
retitle 9252 cut does not yet support unicode characters
tags 9252 + notabug
close 9252
thanks

Danilo Moraes wrote:
> I have found a little bug (I guess). See this:

Thank you for the report.  You have discovered that coreutils does not
yet have localization support for wide characters.

> a=danilo
> echo $a | cut -c -5 # shows danil

  $ echo "danilo" | od -tx1 -c
  0000000  64  61  6e  69  6c  6f  0a
            d   a   n   i   l   o  \n

> a=dánilo
> echo $a | cut -c 5 # shows dáni

I think you meant "cut -c-5" there.
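
To make the difference between the two list forms concrete, here is a
small sketch using the ASCII-only string from the report.  The variable
names are only for illustration and are not part of the original report.

```shell
# A bare number selects a single position; a leading dash selects a
# range that starts at position 1.
word=danilo

single=$(printf '%s\n' "$word" | cut -c 5)    # only the 5th character
prefix=$(printf '%s\n' "$word" | cut -c -5)   # characters 1 through 5

printf 'cut -c 5  gives: %s\n' "$single"
printf 'cut -c -5 gives: %s\n' "$prefix"
```

With plain ASCII input, bytes and characters coincide, so -b and -c
give the same answer here.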

  $ echo "dánilo" | od -tx1 -c
  0000000  64  c3  a1  6e  69  6c  6f  0a
            d 303 241   n   i   l   o  \n

As you can see, accented characters are not simple single-byte
characters.  The od output shows their byte values.  The accented 'a'
is two bytes wide in UTF-8.  Because -c currently counts bytes, cut
treats it as two characters.
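
The same point can be checked without od.  The sketch below writes the
accented letter as the octal escapes \303\241 (the two bytes that od
printed) so that it does not depend on the terminal's encoding.  The
variable names are illustrative only.

```shell
# "dánilo" in UTF-8, with the accented letter written as its two bytes
# (octal 303 241).
word=$(printf 'd\303\241nilo')

# wc -c counts bytes: 7 for this 6-letter word.
nbytes=$(printf '%s' "$word" | wc -c | tr -d ' ')
printf 'bytes: %s\n' "$nbytes"

# cut -b -5 keeps the first five bytes: d, the two bytes of the
# accented letter, n, i.  That is only four visible letters.
first5=$(printf '%s\n' "$word" | cut -b -5)
printf 'first five bytes: %s\n' "$first5"
```

In a UTF-8 locale, wc -m would typically report 6 characters for the
same string, which is the count the reporter expected cut -c to use.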

> The -b option behaves the same way. cut is not handling letters with accents.

Sorry but that code has not yet been written.

> I read in infopages this:

Thank you for consulting the documentation!  And I say that
seriously.  So many people ignore it.  It is pleasant to hear that you
read it.

> `-c CHARACTER-LIST'
> `--characters=CHARACTER-LIST'
>      Select for printing only the characters in positions listed in
>      CHARACTER-LIST.  The same as `-b' for now, but
>      internationalization will change that.  Tabs and backspaces are
>      treated like any other character; they take up 1 character.  If an
>      output delimiter is specified, (see the description of
>      `--output-delimiter'), then output that string between ranges of
>      selected bytes.
> 
> "The same as `-b' for now, but
>      internationalization will change that." Does this solve my problem?
> How does it work?

Note that it says "internationalization /will/ change that", which
means it will change in the future.  It is a future-tense assertion.
It has not yet happened.  Once the code is written and put into
coreutils, -c will count characters rather than bytes.

Note that some software distributions have patches that add Unicode
support to coreutils.  But so far none of those patches have been
deemed appropriate to install in the upstream source because of
maintainability issues such as code duplication.
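
In the meantime, one possible workaround is to convert the text to a
single-byte encoding before cutting.  This is only a sketch: the helper
name first_chars is hypothetical, and it assumes that every character in
the input exists in ISO-8859-1 (true for "dánilo", not for arbitrary
Unicode text; iconv will report an error otherwise).

```shell
# first_chars N: hypothetical helper that prints the first N characters
# of each UTF-8 input line.
# Assumption: all input characters are representable in ISO-8859-1, so
# after conversion one byte is exactly one character.
first_chars() {
    iconv -f UTF-8 -t ISO-8859-1 |
        LC_ALL=C cut -b "-$1" |
        iconv -f ISO-8859-1 -t UTF-8
}

# The accented letter is written as its two UTF-8 bytes (octal 303 241).
result=$(printf 'd\303\241nilo\n' | first_chars 5)
printf '%s\n' "$result"
```

Here the byte-oriented cut sees six single-byte characters, so the
first five come back as d, the accented a, n, i, l once the text is
converted back to UTF-8.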

Because this is not a bug in cut and is also a well known issue I am
going to go ahead and close the report.  But that does not mean no
further discussion is possible.  Please feel free to respond.
Discussion may still continue and is encouraged.

Bob




Forcibly Merged 9252 9253. Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Sat, 06 Aug 2011 17:21:02 GMT)

Changed bug title to 'cut does not yet support unicode characters' from 'a bug in cut' Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Sat, 06 Aug 2011 17:21:02 GMT)

Added tag(s) notabug. Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Sat, 06 Aug 2011 17:21:02 GMT)

bug closed, send any further explanations to 9252 <at> debbugs.gnu.org and Danilo Moraes <moraesdno <at> gmail.com> Request was from Bob Proulx <bob <at> proulx.com> to control <at> debbugs.gnu.org. (Sat, 06 Aug 2011 17:21:03 GMT)

Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils <at> gnu.org:
bug#9252; Package coreutils. (Sat, 06 Aug 2011 20:21:02 GMT)

Message #19 received at 9252 <at> debbugs.gnu.org:

From: Bob Proulx <bob <at> proulx.com>
To: Danilo Moraes <moraesdno <at> gmail.com>
Cc: 9252 <at> debbugs.gnu.org
Subject: Re: bug#9252: a bug in cut
Date: Sat, 6 Aug 2011 14:19:08 -0600
Danilo,

> Thanks for replying so quickly. Now I understand what cut was doing with my
> string. :)
> I'm Brazilian and my English is very, very weak.
> 
> > Note that it says "internationalization /will/ change that" which
> > means will change it in the future.  It is a future tense assertion.
> 
> This is the proof. I read it but did not pay attention to the "will". hehe
> 
> Once more, thanks for replying.

Happy to help!

Bob




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 04 Sep 2011 11:24:04 GMT)

This bug report was last modified 13 years and 293 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.