GNU bug report logs - #21395
multibyte: cut and Spanish characters

Package: coreutils

Reported by: Michael Lee <michaellee213 <at> yahoo.com>

Date: Wed, 2 Sep 2015 00:54:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 21395 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#21395; Package coreutils. (Wed, 02 Sep 2015 00:54:02 GMT)

Acknowledgement sent to Michael Lee <michaellee213 <at> yahoo.com>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 02 Sep 2015 00:54:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Michael Lee <michaellee213 <at> yahoo.com>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: Bug with cut and Spanish characters from text file with UTF-8 encoding
Date: Wed, 2 Sep 2015 00:41:09 +0000 (UTC)
To whom it may concern:
Before describing this possible bug, here is how it was tested.
The encoding of each Spanish text file was determined by opening it in vi and using ":set" to view the encoding.

Text files containing Spanish letters were used in this test.  First, the locale in the bash shell was set to UTF-8 (the default on Ubuntu) and the first test file was encoded in Latin1.  Under these conditions, "head spanish.txt" and "tail spanish.txt" printed blank spaces in place of the accented Spanish letters.
After spanish.txt was converted from Latin1 to UTF-8 with iconv, the test was repeated, and head and tail then displayed the Spanish letters correctly.  When the "cut" command was added to the pipeline, the blank spaces returned.
For example, "head -n 50 spanish.txt | cut -c 1" or "tail -n 50 spanish.txt | cut -c 1" prints a blank space wherever the first character of a line is an accented letter.  Using only head or tail shows the Spanish letters correctly; adding cut does not.

When using cut as, "cut -c 1" with a text file with Spanish characters, it does not display those characters.
For example, the character ã or á will not display if it is the first character and the file is trimmed using the cut command.
Converting the file from Latin1 to UTF-8 solved the problem with head and tail, but not cut.
The cut command does not seem to output the special letters/characters correctly.
Is there an environment variable that could fix this, or could it be a bug?
Thank you for your time.
Sincerely,
Michael Lee
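
A note on the first half of the report: the blanks seen with head and tail on the Latin1 file are consistent with an encoding mismatch rather than with anything those tools do. In Latin1, á is the single byte 0xE1, which is not a valid sequence on its own in UTF-8, so a UTF-8 terminal cannot render it. A minimal Python sketch of that reading follows; the sample word is made up for illustration and is not taken from the reporter's file.

```python
# Hypothetical sample text; any word with an accented letter works.
word = "árbol"

latin1_bytes = word.encode("latin-1")  # á becomes one byte, 0xE1
utf8_bytes = word.encode("utf-8")      # á becomes two bytes, 0xC3 0xA1

print(latin1_bytes)  # b'\xe1rbol'
print(utf8_bytes)    # b'\xc3\xa1rbol'

# A UTF-8 terminal effectively decodes whatever it receives as UTF-8.
# The lone 0xE1 byte is invalid there and turns into a replacement character.
shown_latin1 = latin1_bytes.decode("utf-8", errors="replace")
shown_utf8 = utf8_bytes.decode("utf-8", errors="replace")

print(shown_latin1 == word)  # False
print(shown_utf8 == word)    # True
```

This would explain why converting the file with iconv was enough for head and tail, which pass bytes through unchanged.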
 
[Message part 2 (text/html, inline)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#21395; Package coreutils. (Wed, 02 Sep 2015 11:04:02 GMT)

Message #8 received at 21395 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Michael Lee <michaellee213 <at> yahoo.com>, 21395 <at> debbugs.gnu.org
Subject: Re: bug#21395: Bug with cut and Spanish characters from text file with UTF-8 encoding
Date: Wed, 02 Sep 2015 12:03:10 +0100
On 02/09/15 01:41, Michael Lee wrote:
> When using cut as, "cut -c 1" with a text file with Spanish characters, it does not display those characters.
> For example, the character ã or á will not display if it is the first character and the file is trimmed using the cut command.

Debian/Ubuntu do not apply the i18n patch that Fedora/RHEL/SUSE, for example,
carry, and so cut there does not support multi-byte characters. That i18n
patch is problematic and incomplete, though, and there are plans to bring
the functionality upstream at some stage:

http://www.pixelbeat.org/docs/coreutils_i18n/

cheers,
Pádraig
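
To illustrate the reply above: without multi-byte support, "cut -c 1" keeps the first byte of each line, and for a UTF-8 á that is only half of the character. The Python sketch below mimics both behaviours. The helpers cut_first_byte and cut_first_char are hypothetical names written for this illustration, not coreutils code, and the sample word is made up.

```python
def cut_first_byte(line: bytes) -> bytes:
    """Mimic a byte-oriented `cut -c 1`: keep only the first byte."""
    return line[:1]


def cut_first_char(line: bytes, encoding: str = "utf-8") -> bytes:
    """Character-aware variant: decode, keep one character, re-encode."""
    return line.decode(encoding)[:1].encode(encoding)


# Hypothetical UTF-8 input line starting with an accented letter.
line = "ánimo".encode("utf-8")

print(cut_first_byte(line))  # b'\xc3'
print(cut_first_char(line))  # b'\xc3\xa1'
```

The lone 0xC3 byte is not valid UTF-8, which would explain why the terminal shows a blank or placeholder instead of the letter. Where a multi-byte-aware cut is not available, slicing in a tool that decodes the input first, as cut_first_char does, is one possible workaround.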




Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:11:02 GMT)

Changed bug title to 'multibyte: cut and Spanish characters' from 'Bug with cut and Spanish characters from text file with UTF-8 encoding' Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 24 Oct 2018 21:11:02 GMT)

This bug report was last modified 6 years and 296 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.