GNU bug report logs - #9747
M-x untabify with "ZERO WIDTH NO-BREAK SPACE" (aka "BYTE ORDER MARK")

Package: emacs

Date: Thu, 13 Oct 2011 23:33:01 UTC

Severity: normal

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 9747 in the body.
You can then email your comments to 9747 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#9747; Package emacs. (Thu, 13 Oct 2011 23:33:01 GMT)

Acknowledgement sent to noloader <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 13 Oct 2011 23:33:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Jeffrey Walton <noloader <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: C-x h TAB and M-x untabify
Date: Thu, 13 Oct 2011 19:27:54 -0400

I often use C-x h TAB and M-x untabify to format C, C++, and Java code.

If a document has an errant UTF-8 byte order mark (a UTF-8 BOM is EF
BB BF), Emacs cannot always format the source file.

For example, the attached Java file (JavaEncryptor.java-backup) has
1845 BOMs sprinkled throughout. I'm not sure what editor put them in,
but Emacs does not properly handle some operations with them present.
If I strip the errant BOMs with the attached program
(efbbbf-strip.cpp), Emacs will properly format the file.

[JavaEncryptor.java-backup (application/octet-stream, attachment)]

[efbbbf-strip.cpp (text/x-c++src, attachment)]
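
The attached efbbbf-strip.cpp is not reproduced in this log. As a rough illustration of the workaround described above, here is a minimal sketch of what such a tool might do, assuming it simply drops every EF BB BF byte sequence from the input. The function name `strip_utf8_bom` is hypothetical and is not taken from the attachment.

```cpp
#include <string>

// Hypothetical helper (NOT the attached efbbbf-strip.cpp): returns a copy of
// `input` with every UTF-8 encoded U+FEFF (bytes EF BB BF) removed.
std::string strip_utf8_bom(const std::string& input) {
    static const std::string bom = "\xEF\xBB\xBF";
    std::string output;
    output.reserve(input.size());

    std::string::size_type pos = 0;
    while (pos < input.size()) {
        // Locate the next BOM at or after `pos`.
        std::string::size_type hit = input.find(bom, pos);
        if (hit == std::string::npos) {
            // No more BOMs: copy the remainder and stop.
            output.append(input, pos, std::string::npos);
            break;
        }
        // Copy the text before the BOM, then skip over its three bytes.
        output.append(input, pos, hit - pos);
        pos = hit + bom.size();
    }
    return output;
}
```

A caller would read the file in binary mode into a std::string, pass it through the function, and write the result back out. Note that this sketch also removes a BOM at the very start of the file, which may or may not be wanted.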

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#9747; Package emacs. (Wed, 19 Oct 2011 23:57:01 GMT)

Message #8 received at 9747 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: noloader <at> gmail.com
Cc: 9747 <at> debbugs.gnu.org
Subject: Re: bug#9747: C-x h TAB and M-x untabify
Date: Thu, 20 Oct 2011 02:32:14 +0300

> I often use C-x h TAB and M-x untabify to format C, C++, and Java code.
>
> If a document has an errant UTF-8 byte order mark (a UTF-8 BOM is EF
> BB BF), Emacs cannot always format the source file.
>
> For example, the attached Java file (JavaEncryptor.java-backup) has
> 1845 BOMs sprinkled throughout. I'm not sure what editor put them in,
> but Emacs does not properly handle some operations with them present.
> If I strip the errant BOMs with the attached program
> (efbbbf-strip.cpp), Emacs will properly format the file.

"BYTE ORDER MARK" is the old name of the U+FEFF character.
The new name is "ZERO WIDTH NO-BREAK SPACE".

You can add to your .emacs something like:

;; Once cc-mode is loaded, give U+FEFF whitespace syntax in Java buffers.
(eval-after-load "cc-mode"
  '(progn (modify-syntax-entry ?\uFEFF " " java-mode-syntax-table)))

and most of the indentation code will work correctly.

However, in some places in core packages we need to replace code such as

  (skip-chars-forward " \t")

with

  (skip-chars-forward " \t\uFEFF")

to take into account other whitespace characters.

Changed bug title to 'M-x untabify with "ZERO WIDTH NO-BREAK SPACE" (aka "BYTE ORDER MARK")' from 'C-x h TAB and M-x untabify'. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Sat, 25 Mar 2017 01:21:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#9747; Package emacs. (Fri, 16 Jul 2021 13:59:01 GMT)

Message #13 received at 9747 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: noloader <at> gmail.com, 9747 <at> debbugs.gnu.org
Subject: Re: bug#9747: M-x untabify with "ZERO WIDTH NO-BREAK SPACE" (aka
 "BYTE ORDER MARK")
Date: Fri, 16 Jul 2021 15:57:52 +0200

Juri Linkov <juri <at> jurta.org> writes:

>> I often use C-x h TAB and M-x untabify to format C, C++, and Java code.
>>
>> If a document has an errant UTF-8 byte order mark (a UTF-8 BOM is EF
>> BB BF), Emacs cannot always format the source file.
>>
>> For example, the attached Java file (JavaEncryptor.java-backup) has
>> 1845 BOMs sprinkled throughout. I'm not sure what editor put them in,
>> but Emacs does not properly handle some operations with them present.
>> If I strip the errant BOMs with the attached program
>> (efbbbf-strip.cpp), Emacs will properly format the file.
>
> "BYTE ORDER MARK" is the old name of the U+FEFF character.
> The new name is "ZERO WIDTH NO-BREAK SPACE".

So I don't think there's anything here to fix on the Emacs side --
zero-width spaces aren't necessarily supposed to be handled identically
to other white space here.  So I'm closing this bug report.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Bug closed; send any further explanations to 9747 <at> debbugs.gnu.org and noloader <at> gmail.com. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 16 Jul 2021 13:59:02 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 14 Aug 2021 11:24:05 GMT)

This bug report was last modified 4 years and 4 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.