GNU bug report logs - #31185
Why is there no full support for Unicode?

Reported by: Keepun <keepun <at> gmail.com>

Date: Mon, 16 Apr 2018 22:02:01 UTC

Severity: normal

To reply to this bug, email your comments to 31185 AT debbugs.gnu.org.

Report forwarded to bug-diffutils <at> gnu.org:
bug#31185; Package diffutils. (Mon, 16 Apr 2018 22:02:01 GMT)

Acknowledgement sent to Keepun <keepun <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-diffutils <at> gnu.org. (Mon, 16 Apr 2018 22:02:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Keepun <keepun <at> gmail.com>
To: bug-diffutils <at> gnu.org
Subject: Why is there no full support for Unicode?
Date: Tue, 17 Apr 2018 00:38:10 +0300

Why is there no full support for Unicode?

The encoding can be determined from the BOM.

A file should be reported as binary only after checking it for 0x00 
bytes.

The BOM is part of the Unicode standard. 
http://www.unicode.org/faq/utf_bom.html#bom4

Files with encoding greater than 8 bits without BOM at the beginning can 
be immediately identified as binary.

My function in C#:

// Requires: using System.IO; using System.Text;

/// <summary>
/// Detects a stream's text encoding from its BOM. If there is no BOM,
/// scans up to the first 100000 bytes for 0x00 to decide text vs. binary.
/// </summary>
/// <param name="stream">A readable, seekable stream.</param>
/// <returns>null - binary</returns>
public static Encoding GetEncodingStream(Stream stream)
{
    BinaryReader bin = new BinaryReader(stream);
    byte[] bom = new byte[4];
    bin.BaseStream.Seek(0, SeekOrigin.Begin);
    // n may be less than 4 for very short files; unread bytes stay 0x00.
    int n = bin.BaseStream.Read(bom, 0, bom.Length);
    bin.BaseStream.Seek(0, SeekOrigin.Begin);
    if (bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF) {
        return new UTF32Encoding(true, true); // UTF-32, big-endian
    } else if (bom[0] == 0xFE && bom[1] == 0xFF) {
        return new UnicodeEncoding(true, true); // UTF-16, big-endian
    } else if (bom[0] == 0xFF && bom[1] == 0xFE) {
        // FF FE 00 00 is the UTF-32 LE BOM; FF FE alone is the UTF-16 LE BOM.
        // Require 4 bytes actually read so a 2-byte file is not taken for UTF-32.
        if (n >= 4 && bom[2] == 0x00 && bom[3] == 0x00) {
            return new UTF32Encoding(false, true); // UTF-32, little-endian
        } else {
            return new UnicodeEncoding(false, true); // UTF-16, little-endian
        }
    } else if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) {
        return new UTF8Encoding(true); // UTF-8 with BOM
    } else {
        // No BOM: treat the data as binary if the first 100000 bytes contain 0x00.
        long fsize = bin.BaseStream.Length;
        if (fsize > 100000) {
            fsize = 100000;
        }
        byte[] bts = new byte[fsize];
        bin.BaseStream.Seek(0, SeekOrigin.Begin);
        int read = bin.BaseStream.Read(bts, 0, (int)fsize);
        bin.BaseStream.Seek(0, SeekOrigin.Begin);
        for (int x = 0; x < read; x++) {
            if (bts[x] == 0) {
                return null; // binary
            }
        }

        return Encoding.Default;
    }
}
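
A note on check order in BOM sniffing of this kind: the UTF-32 little-endian BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE), so the longer signature has to be tested first. The following is only a minimal, self-contained sketch of that idea working on a byte array. The names BomSniffer and Detect are hypothetical and are not part of diffutils or of the function above.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class BomSniffer
{
    // Signatures are ordered so that FF FE 00 00 (UTF-32 LE) is tested
    // before its two-byte prefix FF FE (UTF-16 LE).
    private static readonly (byte[] Bom, string Name)[] Signatures =
    {
        (new byte[] { 0x00, 0x00, 0xFE, 0xFF }, "UTF-32BE"),
        (new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, "UTF-32LE"),
        (new byte[] { 0xEF, 0xBB, 0xBF }, "UTF-8"),
        (new byte[] { 0xFE, 0xFF }, "UTF-16BE"),
        (new byte[] { 0xFF, 0xFE }, "UTF-16LE"),
    };

    // Returns the encoding name implied by a leading BOM, or null if none matches.
    public static string Detect(byte[] head)
    {
        foreach (var (bom, name) in Signatures)
        {
            if (head.Length < bom.Length) continue; // too short for this signature
            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (head[i] != bom[i]) { match = false; break; }
            }
            if (match) return name;
        }
        return null;
    }

    public static void Main()
    {
        Console.WriteLine(Detect(new byte[] { 0xFF, 0xFE, 0x00, 0x00 }));     // UTF-32LE
        Console.WriteLine(Detect(new byte[] { 0xFF, 0xFE, 0x41, 0x00 }));     // UTF-16LE
        Console.WriteLine(Detect(new byte[] { 0x68, 0x69 }) ?? "no BOM");     // no BOM
    }
}
```

One inherent ambiguity remains: a UTF-16 LE file whose first character is U+0000 also starts with FF FE 00 00, so no ordering can tell the two cases apart from the BOM alone.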

Information forwarded to bug-diffutils <at> gnu.org:
bug#31185; Package diffutils. (Tue, 17 Apr 2018 07:38:01 GMT)

Message #8 received at 31185 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Keepun <keepun <at> gmail.com>, 31185 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for
 Unicode?
Date: Tue, 17 Apr 2018 00:37:18 -0700

Keepun wrote:
> Files with encoding greater than 8 bits without BOM at the beginning can be 
> immediately identified as binary.

No, the BOM is not required or recommended in UTF-8, so it would be a mistake to 
identify GNU/Linux text files as binary merely because they lack a BOM. 
Typically these files do not have a BOM, and when they do one of the first 
things many users do is remove the BOM because it can cause trouble in practice.

Diffutils does not support UTF-16, where a BOM would make more sense, and there 
are no plans to add support for UTF-16 (or for UTF-32, for that matter).
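
To illustrate the point about BOM-less UTF-8: such data can still be recognized as text by checking that it decodes as strict UTF-8. The following is a minimal sketch only, in C# to match the snippet in the report. The names Utf8Check and IsValidUtf8 are hypothetical, and this is not a description of how diffutils classifies files.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration only.
public static class Utf8Check
{
    // Returns true if the bytes decode as strict UTF-8, with or without a BOM.
    public static bool IsValidUtf8(byte[] data)
    {
        // Arguments: encoderShouldEmitUTF8Identifier = false,
        // throwOnInvalidBytes = true (invalid sequences raise an exception).
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // GetBytes does not prepend a BOM.
        byte[] utf8NoBom = Encoding.UTF8.GetBytes("na\u00EFve caf\u00E9");
        // "café" encoded as ISO-8859-1: the lone 0xE9 is not valid UTF-8.
        byte[] latin1 = { 0x63, 0x61, 0x66, 0xE9 };

        Console.WriteLine(IsValidUtf8(utf8NoBom)); // True
        Console.WriteLine(IsValidUtf8(latin1));    // False
    }
}
```

Pure ASCII input also passes this check, since ASCII is a subset of UTF-8.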

Information forwarded to bug-diffutils <at> gnu.org:
bug#31185; Package diffutils. (Tue, 17 Apr 2018 20:28:02 GMT)

Message #11 received at 31185 <at> debbugs.gnu.org:

From: Keepun <keepun <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>, 31185 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for
 Unicode?
Date: Tue, 17 Apr 2018 23:27:36 +0300

UTF-8 does not require a BOM, but for UTF-16 and UTF-32 a BOM is always 
present. UTF-16 and UTF-32 files without a BOM should be 
identified as binary.

But why are there no plans to support UTF-16 and UTF-32? Diff is part of 
Git and is used all over the world. It is now 2018, and Unicode has solved 
the problems with encodings.

17.04.2018 10:37, Paul Eggert:
> Keepun wrote:
>> Files with encoding greater than 8 bits without BOM at the beginning 
>> can be immediately identified as binary.
>
> No, the BOM is not required or recommended in UTF-8, so it would be a 
> mistake to identify GNU/Linux text files as binary merely because they 
> lack a BOM. Typically these files do not have a BOM, and when they do 
> one of the first things many users do is remove the BOM because it can 
> cause trouble in practice.
>
> Diffutils does not support UTF-16, where a BOM would make more sense, 
> and there are no plans to add support for UTF-16 (or for UTF-32, for 
> that matter).

Information forwarded to bug-diffutils <at> gnu.org:
bug#31185; Package diffutils. (Tue, 17 Apr 2018 20:46:01 GMT)

Message #14 received at 31185 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Keepun <keepun <at> gmail.com>, 31185 <at> debbugs.gnu.org
Subject: Re: [bug-diffutils] bug#31185: Why is there no full support for
 Unicode?
Date: Tue, 17 Apr 2018 13:45:40 -0700

On 04/17/2018 01:27 PM, Keepun wrote:
> why there are no plans to support UTF-16 and UTF-32?

Nobody has volunteered to do it, and there hasn't been a pressing need. 
UTF-16 and UTF-32 are primarily used for internal representation, not 
for text files. For more on the subject, please see:

http://utf8everywhere.org/

This bug report was last modified 7 years and 60 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
