GNU bug report logs - #31185
Why is there no full support for Unicode?
To reply to this bug, email your comments to 31185 AT debbugs.gnu.org.
Report forwarded to bug-diffutils <at> gnu.org: bug#31185; Package diffutils. (Mon, 16 Apr 2018 22:02:01 GMT)
Acknowledgement sent to Keepun <keepun <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-diffutils <at> gnu.org. (Mon, 16 Apr 2018 22:02:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Why is there no full support for Unicode?
The encoding should be determined from the BOM.
A file should be reported as binary only after checking it for 0x00
bytes.
The BOM is part of the Unicode standard.
http://www.unicode.org/faq/utf_bom.html#bom4
Files with encoding greater than 8 bits without BOM at the beginning can
be immediately identified as binary.
My function in C#:
using System.IO;
using System.Text;

/// <summary>
/// Detects the encoding of a seekable stream from its BOM. Without a BOM,
/// scans up to the first 100000 bytes for 0x00 to decide whether it is binary.
/// </summary>
/// <param name="stream">Seekable stream to inspect.</param>
/// <returns>The detected encoding, or null if the data is binary.</returns>
public static Encoding GetEncodingStream(Stream stream)
{
    // Read up to four bytes; a shorter stream leaves the rest as 0x00.
    byte[] bom = new byte[4];
    stream.Seek(0, SeekOrigin.Begin);
    stream.Read(bom, 0, bom.Length);
    stream.Seek(0, SeekOrigin.Begin);
    if (bom[0] == 0x00 && bom[1] == 0x00 && bom[2] == 0xFE && bom[3] == 0xFF) {
        return new UTF32Encoding(true, true); // UTF-32, big-endian
    } else if (bom[0] == 0xFE && bom[1] == 0xFF) {
        return new UnicodeEncoding(true, true); // UTF-16, big-endian
    } else if (bom[0] == 0xFF && bom[1] == 0xFE) {
        // FF FE 00 00 is the UTF-32 LE BOM; FF FE alone is UTF-16 LE.
        if (bom[2] == 0x00 && bom[3] == 0x00) {
            return new UTF32Encoding(false, true); // UTF-32, little-endian
        } else {
            return new UnicodeEncoding(false, true); // UTF-16, little-endian
        }
    } else if (bom[0] == 0xEF && bom[1] == 0xBB && bom[2] == 0xBF) {
        return new UTF8Encoding(true); // UTF-8 with BOM
    } else {
        // No BOM: sample the start of the stream and look for 0x00 bytes.
        long fsize = stream.Length;
        if (fsize > 100000) {
            fsize = 100000;
        }
        byte[] bts = new byte[fsize];
        stream.Seek(0, SeekOrigin.Begin);
        int read = stream.Read(bts, 0, (int)fsize);
        stream.Seek(0, SeekOrigin.Begin);
        for (int x = 0; x < read; x++) {
            if (bts[x] == 0) {
                return null; // binary
            }
        }
        return Encoding.Default; // no BOM and no 0x00: system default encoding
    }
}
Information forwarded to bug-diffutils <at> gnu.org: bug#31185; Package diffutils. (Tue, 17 Apr 2018 07:38:01 GMT)
Message #8 received at 31185 <at> debbugs.gnu.org (full text, mbox):
Keepun wrote:
> Files with encoding greater than 8 bits without BOM at the beginning can be
> immediately identified as binary.
No, the BOM is not required or recommended in UTF-8, so it would be a mistake to
identify GNU/Linux text files as binary merely because they lack a BOM.
Typically these files do not have a BOM, and when they do one of the first
things many users do is remove the BOM because it can cause trouble in practice.
Diffutils does not support UTF-16, where a BOM would make more sense, and there
are no plans to add support for UTF-16 (or for UTF-32, for that matter).
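The point about BOM-less UTF-8 can be illustrated with a minimal Python sketch. The helper names `strip_utf8_bom` and `is_utf8_text` are hypothetical and are not part of diffutils; the sketch only shows that a UTF-8 file is recognizable as text with or without a BOM, and that removing the BOM is a three-byte operation. Note that it checks UTF-8 validity only and is not a binary-file test.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def is_utf8_text(data: bytes) -> bool:
    """Report whether data decodes as UTF-8, with or without a BOM."""
    try:
        strip_utf8_bom(data).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Under this assumption, a file that lacks a BOM is still classified as text as long as its bytes form valid UTF-8.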
Information forwarded to bug-diffutils <at> gnu.org: bug#31185; Package diffutils. (Tue, 17 Apr 2018 20:28:02 GMT)
Message #11 received at 31185 <at> debbugs.gnu.org (full text, mbox):
UTF-8 does not require a BOM, but for UTF-16 and UTF-32 the BOM is always
present. UTF-16 and UTF-32 files without a BOM should be
identified as binary.
But why are there no plans to support UTF-16 and UTF-32? Diff is part of
Git and is used all over the world. It is now 2018, and Unicode has solved
the problems with encodings.
17.04.2018 10:37, Paul Eggert:
> Keepun wrote:
>> Files with encoding greater than 8 bits without BOM at the beginning
>> can be immediately identified as binary.
>
> No, the BOM is not required or recommended in UTF-8, so it would be a
> mistake to identify GNU/Linux text files as binary merely because they
> lack a BOM. Typically these files do not have a BOM, and when they do
> one of the first things many users do is remove the BOM because it can
> cause trouble in practice.
>
> Diffutils does not support UTF-16, where a BOM would make more sense,
> and there are no plans to add support for UTF-16 (or for UTF-32, for
> that matter).
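To make the disagreement concrete, here is a small Python sketch of a 0x00-byte check like the one in the C# function from the first message (the 100000-byte limit is taken from there). The name `has_nul_byte` is hypothetical, and this is an assumption about how such a heuristic behaves, not a description of the diffutils source.

```python
def has_nul_byte(data: bytes, limit: int = 100000) -> bool:
    """Report whether a 0x00 byte occurs within the first `limit` bytes."""
    return b"\x00" in data[:limit]


line = "hello\n"
utf8 = line.encode("utf-8")        # one byte per ASCII character
utf16 = line.encode("utf-16-le")   # each ASCII character gets a 0x00 high byte
utf32 = line.encode("utf-32-be")   # each ASCII character gets three 0x00 bytes

# Non-ASCII text in UTF-16 need not contain any 0x00 byte at all.
cjk_utf16 = "日本".encode("utf-16-le")
```

With these inputs, `has_nul_byte` is false for `utf8` and true for `utf16` and `utf32`, which is why a tool that understands only 8-bit encodings tends to treat UTF-16 and UTF-32 text as binary. It is false for `cjk_utf16`, so the 0x00 check alone does not reliably identify every UTF-16 file.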
Information forwarded to bug-diffutils <at> gnu.org: bug#31185; Package diffutils. (Tue, 17 Apr 2018 20:46:01 GMT)
Message #14 received at 31185 <at> debbugs.gnu.org (full text, mbox):
On 04/17/2018 01:27 PM, Keepun wrote:
> why there are no plans to support UTF-16 and UTF-32?
Nobody has volunteered to do it, and there hasn't been a pressing need.
UTF-16 and UTF-32 are primarily used for internal representation, not
for text files. For more on the subject, please see:
http://utf8everywhere.org/
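For files that do arrive as UTF-16 or UTF-32, one possible workaround in the spirit of the advice above is to transcode them to UTF-8 before comparing. The following Python sketch is an illustration only: `to_utf8` and `_BOMS` are hypothetical names, nothing here is a diffutils feature, and data without a BOM is passed through unchanged because its encoding cannot be inferred from a BOM.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM (FF FE 00 00) starts with
# the UTF-16 LE BOM (FF FE), so checking UTF-16 first would misdetect it.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def to_utf8(data: bytes) -> bytes:
    """Transcode BOM-marked data to BOM-less UTF-8; return other data unchanged.

    Raises UnicodeDecodeError if the bytes after the BOM are not valid
    in the encoding that the BOM announces.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding).encode("utf-8")
    return data
```

The transcoded bytes could then be written to temporary files and compared with diff as ordinary UTF-8 text.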
This bug report was last modified 7 years and 60 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.