From: Danilo Moraes
To: bug-coreutils@gnu.org
Date: Fri, 5 Aug 2011 13:22:18 -0300
Subject: bug in cut - more information

I have found a little bug in cut (I guess). See this:

a=danilo
echo $a | cut -c -5 # shows danil

a=dánilo
echo $a | cut -c 5 # shows dáni

The -b option works the same way. cut is ignoring the letters with
accentuation.

I read this in the info pages:

`-c CHARACTER-LIST'
`--characters=CHARACTER-LIST'
     Select for printing only the characters in positions listed in
     CHARACTER-LIST.  The same as `-b' for now, but
     internationalization will change that.  Tabs and backspaces are
     treated like any other character; they take up 1 character.  If an
     output delimiter is specified, (see the description of
     `--output-delimiter'), then output that string between ranges of
     selected bytes.

"The same as `-b' for now, but internationalization will change that."
Has that not been changed yet?

This is my locale:

LANG=pt_BR.UTF-8
LANGUAGE=pt_BR:pt:en
LC_CTYPE="pt_BR.UTF-8"
LC_NUMERIC="pt_BR.UTF-8"
LC_TIME="pt_BR.UTF-8"
LC_COLLATE="pt_BR.UTF-8"
LC_MONETARY="pt_BR.UTF-8"
LC_MESSAGES="pt_BR.UTF-8"
LC_PAPER="pt_BR.UTF-8"
LC_NAME="pt_BR.UTF-8"
LC_ADDRESS="pt_BR.UTF-8"
LC_TELEPHONE="pt_BR.UTF-8"
LC_MEASUREMENT="pt_BR.UTF-8"
LC_IDENTIFICATION="pt_BR.UTF-8"
LC_ALL=

and the cut version is: cut (GNU coreutils) 7.4

Thanks,

Danilo S. Morães


From: Bob Proulx
To: Danilo Moraes
Cc: 9252@debbugs.gnu.org
Date: Sat, 6 Aug 2011 11:19:06 -0600
Subject: Re: bug#9252: a bug in cut
Message-ID: <20110806171906.GB16380@hysteria.proulx.com>

forcemerge 9252 9253
retitle 9252 cut does not yet support unicode characters
tags 9252 + notabug
close 9252
thanks

Danilo Moraes wrote:
> I have found a little bug (i guess). See that:

Thank you for the report.  You have discovered that coreutils does not
yet have localization support for wide characters.

> a=danilo
> echo $a | cut -c -5 # shows danil

  $ echo "danilo" | od -tx1 -c
  0000000  64  61  6e  69  6c  6f  0a
            d   a   n   i   l   o  \n

> a=dánilo
> echo $a | cut -c 5 # shows dáni

I think you meant "cut -c-5" there.

  $ echo "dánilo" | od -tx1 -c
  0000000  64  c3  a1  6e  69  6c  6f  0a
            d 303 241   n   i   l   o  \n

As you can see, accented characters are not simple single-byte
characters.  The od output shows their byte values.  The accented 'a'
is two bytes wide.  This is why cut counts it as two positions rather
than one.

> The option -b equal works. The cut is ignoring the letters with acentuation.

Sorry, but that code has not yet been written.

> I read in infopages this:

Thank you for consulting the documentation!  And I say that seriously.
So many people ignore it.  It is pleasant to hear that you read it.

> `-c CHARACTER-LIST'
> `--characters=CHARACTER-LIST'
>      Select for printing only the characters in positions listed in
>      CHARACTER-LIST.  The same as `-b' for now, but
>      internationalization will change that.  Tabs and backspaces are
>      treated like any other character; they take up 1 character.  If an
>      output delimiter is specified, (see the description of
>      `--output-delimiter'), then output that string between ranges of
>      selected bytes.
>
> "The same as `-b' for now, but
> internationalization will change that." this solves my problem? How it
> works?

Note that it says "internationalization /will/ change that", which is
a statement about the future.  It has not yet happened.  Once the code
is written and put into coreutils, -c will select characters instead
of bytes.

Note that some software distributions carry patches that add unicode
support to coreutils.  But so far none of those patches has been
deemed appropriate to install in the upstream source, because of
maintainability problems such as code duplication.

Because this is not a bug in cut and is also a well-known issue, I am
going to go ahead and close the report.  But that does not mean no
further discussion is possible.  Please feel free to respond.
Discussion may still continue and is encouraged.

Bob


From: Debbugs Internal Request
To: internal_control@debbugs.gnu.org
Date: Sun, 04 Sep 2011 11:24:04 +0000
Subject: Internal Control
Message-Id: bug archived.

# This is a fake control message.
#
# The action:
# bug archived.
thanks
# This fakemail brought to you by your local debbugs
# administrator
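
Editor's illustration (not part of the original thread): the difference between selecting bytes and selecting characters, as discussed in the report above, can be reproduced with a minimal Python sketch. The helper names cut_bytes and cut_chars are hypothetical and only mimic the two behaviours; they are not how cut is implemented, and UTF-8 input is assumed.

```python
def cut_bytes(line: str, count: int, encoding: str = "utf-8") -> str:
    """Keep the first `count` bytes of `line`, as a byte-oriented cut would.

    Hypothetical helper for illustration only.
    """
    raw = line.encode(encoding)[:count]
    # errors="replace" avoids an exception if the slice ends inside a
    # multibyte character; the damaged tail is shown as U+FFFD instead.
    return raw.decode(encoding, errors="replace")


def cut_chars(line: str, count: int) -> str:
    """Keep the first `count` characters (code points) of `line`."""
    return line[:count]


if __name__ == "__main__":
    # "d\u00e1nilo" is "danilo" with an accented 'a' (U+00E1), which
    # UTF-8 encodes as the two bytes c3 a1.
    for word in ("danilo", "d\u00e1nilo"):
        print(
            word,
            "chars:", len(word),
            "bytes:", len(word.encode("utf-8")),
            "first 5 bytes:", cut_bytes(word, 5),
            "first 5 chars:", cut_chars(word, 5),
        )
```

Under these assumptions the byte-oriented helper returns one visible letter fewer for the accented word, matching what the reporter saw, while the character-oriented helper returns five letters for both words.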