#48324 - 27.2; hexl-mode duplicates the UTF-8 BOM

GNU bug report logs - #48324
27.2; hexl-mode duplicates the UTF-8 BOM

Package: emacs;

Reported by: "R. Diez" <rdiezmail-emacs <at> yahoo.de>

Date: Sun, 9 May 2021 21:39:02 UTC

Severity: normal

Found in version 27.2

Fixed in version 29.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
Subject: bug#48324: 27.2; hexl-mode duplicates the UTF-8 BOM
Date: Mon, 04 Jul 2022 14:31:01 +0300

> From: Lars Ingebrigtsen <larsi <at> gnus.org>
> Cc: rgm <at> gnu.org, schwab <at> linux-m68k.org, 48324 <at> debbugs.gnu.org
> Date: Mon, 04 Jul 2022 12:34:29 +0200
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > I see that it's actually 6 bytes _including_ the BOM. So I think this
> > is confusing: if we are going to return a string with the BOM, we
> > should not count the BOM as part of the LENGTH bytes. Because if I
> > requested to get characters which fit into N bytes, I should get those
> > N bytes of payload. Or maybe we should have an optional argument to
> > control whether LENGTH includes or excludes the BOM.
>
> If the caller has asked for a max number of bytes in a coding system
> that includes a BOM, then the BOM has to be counted -- otherwise the
> bytes won't fit into whatever field the protocol they're using limits
> the string to.

You obviously have a very specific use case in mind. But there are
others.

Moreover, UTF and BOM is a special case, where the prefix is known in
advance. Other encodings, notably from the ISO-2022 family, are
harder because the exact shift-in sequence is not always easy to
guess. Which is why I thought a way to control this aspect could be
needed. But we could just document the subtlety and wait for someone
to come up with a practical scenario where it would be needed.

> (And we don't have a -without-signature variant, do we?)

We do: utf-16le and utf-16be.

> > In any case, we should mention this aspect in the doc string, I think.
>
> Yes. But should we have -without-signature variants for utf-16? Then
> the doc string could recommend using that if the caller wants BOM-less
> bytes.

See above.
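
The byte-budget question discussed above can be illustrated outside Emacs. The following is only a rough sketch in Python, not the Emacs implementation: `limit_to_bytes` is a hypothetical helper, and Python's `utf-16`, `utf-16-le` and `utf-8-sig` codecs are assumed here as stand-ins for Emacs coding systems with and without a signature. It counts the BOM against the limit, which is the behavior Lars argues for.

```python
def limit_to_bytes(text: str, max_bytes: int, encoding: str) -> bytes:
    """Encode the longest prefix of `text` that fits in `max_bytes`.

    Hypothetical helper for illustration only. Any prefix the codec
    emits (such as a BOM) is counted against `max_bytes`.
    """
    # Try progressively shorter prefixes until the encoded form fits.
    for end in range(len(text), -1, -1):
        encoded = text[:end].encode(encoding)
        if len(encoded) <= max_bytes:
            return encoded
    # Even the empty string did not fit (the BOM alone exceeds the limit).
    return b""


if __name__ == "__main__":
    # "utf-16" emits a 2-byte BOM, so only 4 of the 6 bytes are payload.
    with_bom = limit_to_bytes("foobar", 6, "utf-16")
    print(len(with_bom), with_bom.decode("utf-16"))

    # "utf-16-le" has no BOM, so all 6 bytes carry characters.
    without_bom = limit_to_bytes("foobar", 6, "utf-16-le")
    print(len(without_bom), without_bom.decode("utf-16-le"))
```

With a limit of 6 bytes, the BOM-carrying codec leaves room for two UTF-16 characters, while the BOM-less variant fits three. A caller who wants N bytes of payload can therefore pick the BOM-less variant, which is the option the reply points to with utf-16le and utf-16be.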

This bug report was last modified 3 years and 13 days ago.

