GNU bug report logs - #12054
24.1; regression? font-lock no-break-space with nil nobreak-char-display

Package: emacs;

Reported by: "Drew Adams" <drew.adams <at> oracle.com>

Date: Thu, 26 Jul 2012 05:51:02 UTC

Severity: normal

Found in version 24.1

Done: Chong Yidong <cyd <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 12054 in the body.
You can then email your comments to 12054 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Thu, 26 Jul 2012 05:51:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Drew Adams" <drew.adams <at> oracle.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 26 Jul 2012 05:51:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: <bug-gnu-emacs <at> gnu.org>
Subject: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Wed, 25 Jul 2012 22:43:29 -0700

emacs -Q
 
(defface foo '((t (:background "Yellow"))) "" :group 'faces)

(setq nobreak-char-display nil)

(font-lock-add-keywords nil '(("[\240]+" (0 'foo t))) 'APPEND)
 
Insert a no-break space:
C-x 8 RET no-break-space (or C-q 240 RET)
 
Turn font-lock-mode off, then back on.
 
With point before the no-break-space, C-u C-x =.  That shows that the
character is indeed a no-break-space, and there is no face on it.
 
In Emacs 22, the char is shown clearly in face foo.  Am I missing
something?
 
The same recipe with non-breaking-hyphen highlights that character fine.
What is different about no-break-space?  Shouldn't it be treated
similarly?  This works in Emacs 22 but stops working in Emacs 23.
Normal?  Regression?

In GNU Emacs 24.1.1 (i386-mingw-nt5.1.2600)
 of 2012-06-10 on MARVIN
Windowing system distributor `Microsoft Corp.', version 5.1.2600
Configured using:
 `configure --with-gcc (4.6) --cflags
 -ID:/devel/emacs/libs/libXpm-3.5.8/include
 -ID:/devel/emacs/libs/libXpm-3.5.8/src
 -ID:/devel/emacs/libs/libpng-dev_1.4.3-1/include
 -ID:/devel/emacs/libs/zlib-dev_1.2.5-2/include
 -ID:/devel/emacs/libs/giflib-4.1.4-1/include
 -ID:/devel/emacs/libs/jpeg-6b-4/include
 -ID:/devel/emacs/libs/tiff-3.8.2-1/include
 -ID:/devel/emacs/libs/gnutls-3.0.9/include'

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sun, 16 Sep 2012 23:42:02 GMT) Full text and rfc822 format available.

Message #8 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: <12054 <at> debbugs.gnu.org>
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sun, 16 Sep 2012 16:40:25 -0700

ping

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 10:54:02 GMT) Full text and rfc822 format available.

Message #11 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 18:50:53 +0800

"Drew Adams" <drew.adams <at> oracle.com> writes:

> (defface foo '((t (:background "Yellow"))) "" :group 'faces)
> (setq nobreak-char-display nil)
> (font-lock-add-keywords nil '(("[\240]+" (0 'foo t))) 'APPEND)
>  
> Insert a no-break space:
> C-x 8 RET no-break-space (or C-q 240 RET)
>  
> Turn font-lock-mode off, then back on.
>  
> With point before the no-break-space, C-u C-x =.  That shows that the
> character is indeed a no-break-space, and there is no face on it.

"[\240]+" doesn't do what you want.  Octal 240 is a unibyte character,
so that string constant specifies a unibyte string.  When this unibyte
string is converted to multibyte, the raw byte becomes codepoint
#x3fffa0.

You should use either of these instead:

(font-lock-add-keywords nil '(("[\u00a0]+" (0 'foo t))) 'APPEND)
(font-lock-add-keywords nil '(("[ ]+" (0 'foo t))) 'APPEND)
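
As a rough illustration of the explanation above (not part of the original message, and assuming Emacs 23 or later), the following sketch compares how the two spellings are read and whether each one matches a no-break space. The variable names are chosen here for illustration only.

```elisp
;; Sketch: how the reader treats the two spellings (Emacs 23 or later assumed).
(let ((octal "\240")      ; read as a unibyte string holding one raw byte
      (nbsp  "\u00a0"))   ; read as a multibyte string holding U+00A0
  (list (multibyte-string-p octal)            ; nil
        (multibyte-string-p nbsp)             ; t
        (string-match-p "[\240]+" nbsp)       ; nil: raw byte, not NBSP
        (string-match-p "[\u00a0]+" nbsp)))   ; 0: matches at index 0
```

The third result is the behavior reported in this bug: the octal pattern names a raw byte, so it never matches the NBSP character.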

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 11:07:01 GMT) Full text and rfc822 format available.

Message #14 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 19:03:48 +0800

Chong Yidong <cyd <at> gnu.org> writes:

> "[\240]+" doesn't do what you want.  Octal 240 is a unibyte character,
> so that string constant specifies a unibyte string.  When this unibyte
> string is converted to multibyte, the raw byte becomes codepoint
> #x3fffa0.

I've updated the docs to clarify this situation.  Closing the bug.

bug closed, send any further explanations to 12054 <at> debbugs.gnu.org and "Drew Adams" <drew.adams <at> oracle.com> Request was from Chong Yidong <cyd <at> gnu.org> to control <at> debbugs.gnu.org. (Sat, 03 Nov 2012 11:07:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 16:29:01 GMT) Full text and rfc822 format available.

Message #19 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Chong Yidong'" <cyd <at> gnu.org>
Cc: 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 09:25:35 -0700

> > With point before the no-break-space, C-u C-x =.  That 
> > shows that the character is indeed a no-break-space,
> > and there is no face on it.
> 
> "[\240]+" doesn't do what you want.  Octal 240 is a unibyte character,
> so that string constant specifies a unibyte string.  When this unibyte
> string is converted to multibyte, the raw byte becomes codepoint
> #x3fffa0.
> 
> You should use either of these instead:
> (font-lock-add-keywords nil '(("[\u00a0]+" (0 'foo t))) 'APPEND)
> (font-lock-add-keywords nil '(("[ ]+" (0 'foo t))) 'APPEND)

I still have some questions.

`C-q 240' and `C-x 8 RET no-break space' insert the same char.
C-u C-x = says this about it: (codepoint 160, #o240, #xa0)
And with your font-lock sexp that char is indeed highlighted
as expected (yellow bg).  Emacs says the char is octal 240.

Just why is it that the regexp "[\240]+" does not match this char?  Why should a
character-alternative expression care whether the representation is unibyte or
multibyte?  Isn't that a bug?

How to use octal syntax to match that char?  The Elisp manual says clearly that
"The most general read syntax for a character represents the character code in
either octal or hex."  MOST GENERAL, not most limited and partial.

Are you saying that for regexps octal and hex are no longer "the most general
syntax", and that to represent (at least some) unicode chars in a regexp we must
use the \u... syntax?  Is there no way for the `font-lock-add-keywords' sexp to
use either octal or hex here?

With the current state of affairs, which you say is not bugged, how can an Emacs
version < 23 (i.e., without \u... syntax) be used to highlight the char?
Shouldn't it be possible in Emacs 22 to pick up a file that has Unicode chars
and highlight them using font-lock, even if you cannot use Emacs 22 to insert
such chars?

And for Emacs 20 there is not even hex syntax - shouldn't we be able to do
everything using just octal syntax, since it is supposedly "the most general
syntax"?

I haven't seen your doc clarification yet, but given the questions above I would
imagine that things need to be clarified in several places of the manual.

But isn't treating this as a doc bug a bit of a cop-out?  Shouldn't it be
possible to use octal syntax to match Unicode chars?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 16:41:02 GMT) Full text and rfc822 format available.

Message #22 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Chong Yidong <cyd <at> gnu.org>
Cc: Drew Adams <drew.adams <at> oracle.com>, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 17:37:36 +0100

Chong Yidong <cyd <at> gnu.org> writes:

> (font-lock-add-keywords nil '(("[\u00a0]+" (0 'foo t))) 'APPEND)
> (font-lock-add-keywords nil '(("[ ]+" (0 'foo t))) 'APPEND)

None of these need bracket expressions.

(font-lock-add-keywords nil '(("\u00a0+" (0 'foo t))) 'append)
(font-lock-add-keywords nil '((" +" (0 'foo t))) 'append)

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 17:01:02 GMT) Full text and rfc822 format available.

Message #25 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 18:56:49 +0200

> From: "Drew Adams" <drew.adams <at> oracle.com>
> Date: Sat, 3 Nov 2012 09:25:35 -0700
> Cc: 12054 <at> debbugs.gnu.org
> 
> Just why is it that the regexp "[\240]+" does not match this char?

Because, for histerical reasons, 'insert' treats strings such as
"\nnn" as unibyte strings.

> Why should a character-alternative expression care whether the
> representation is unibyte or multibyte?  Isn't that a bug?

It's an unfortunate dark corner, due to the ambiguity of what \240
really means in a string.

> How to use octal syntax to match that char?

Why do you need the octal syntax?  Why not just use a literal  ?  Is
that only for the sake of old Emacs versions, or for some other
reason?

> The Elisp manual says clearly that
> "The most general read syntax for a character represents the character code in
> either octal or hex."  MOST GENERAL, not most limited and partial.

I see no contradiction or incorrect information in this cited text.
The octal notation does work in your example, it's just that its
semantics is not what you expected.  Or am I missing something?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 17:09:02 GMT) Full text and rfc822 format available.

Message #28 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Andreas Schwab'" <schwab <at> linux-m68k.org>, "'Chong Yidong'" <cyd <at> gnu.org>
Cc: 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 10:05:36 -0700

> None of these need bracket expressions.
> (font-lock-add-keywords nil '(("\u00a0+" (0 'foo t))) 'append)
> (font-lock-add-keywords nil '((" +" (0 'foo t))) 'append)

Good point.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 17:10:01 GMT) Full text and rfc822 format available.

Message #31 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sun, 04 Nov 2012 01:06:28 +0800

"Drew Adams" <drew.adams <at> oracle.com> writes:

> Just why is it that the regexp "[\240]+" does not match this char?
> Why should a character-alternative expression care whether the
> representation is unibyte or multibyte?  Isn't that a bug?

When \240 occurs in a unibyte string, Emacs recognizes it as an
eight-bit raw byte.  When converting unibyte strings to multibyte, Emacs
does not "unify" eight-bit raw bytes with Unicode characters #x80-#xff;
they get their own code points, in this case #x3fffa0.  (One reason for
doing this is to allow unibyte strings to be specified using string
constants in Emacs Lisp source code.)

> How to use octal syntax to match that char?  The Elisp manual says
> clearly that "The most general read syntax for a character represents
> the character code in either octal or hex."  MOST GENERAL, not most
> limited and partial.

I've already edited the documentation to take out this sentence.  It is
incorrect anyway, for the reason that octal escapes are limited to three
digits.
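
The mapping described above can be checked with a small sketch (not part of the original message; Emacs 23 or later assumed). `my-raw-byte-code-point` is a hypothetical helper introduced only for this illustration.

```elisp
;; Sketch: where raw bytes land when a unibyte string becomes multibyte.
(defun my-raw-byte-code-point (byte)
  "Return the code point BYTE gets after `string-to-multibyte'.
`my-raw-byte-code-point' is a hypothetical helper for illustration."
  (aref (string-to-multibyte (unibyte-string byte)) 0))

(format "#x%x" (my-raw-byte-code-point #xa0))   ; "#x3fffa0"
(= (my-raw-byte-code-point #xa0) #xa0)          ; nil: not unified with U+00A0
(my-raw-byte-code-point ?a)                     ; 97: ASCII is unchanged
```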

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 17:27:01 GMT) Full text and rfc822 format available.

Message #34 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Eli Zaretskii'" <eliz <at> gnu.org>
Cc: cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 10:22:59 -0700

> > Just why is it that the regexp "[\240]+" does not match this char?
> 
> Because, for histerical reasons, 'insert' treats strings such as
> "\nnn" as unibyte strings.

Sorry, I don't understand your point.  My question was about the regexp (not)
matching, not about (not) being able to insert the char.

I don't see a problem with inserting the char.  As I said, the correct char gets
inserted AFAICT, as shown both by `C-u C-x =' and by Yidong's correction of the
font-lock regexp.

You can insert the _same_ char using either `C-q 240' or `C-x 8 RET no-break
space', at least AFAICT (via Yidong's highlighting and via `C-u C-x =').

> > Why should a character-alternative expression care whether the
> > representation is unibyte or multibyte?  Isn't that a bug?
> 
> It's an unfortunate dark corner, due to the ambiguity of what \240
> really means in a string.

That just makes it darker for me.  Can you please elaborate?

> > How to use octal syntax to match that char?
> 
> Why do you need the octal syntax?  Why not just use a literal  ?  Is
> that only for the sake of old Emacs versions, or for some other
> reason?

1. Yes, for the sake of older Emacs versions.

2. The manual says that octal syntax is the most general syntax.
So one would expect that one can use it more, not less. ;-)

3. Why not?  Why turn it around and speak of "need" to use it?
The real question is why _not_ be able to use octal syntax here?

> > The Elisp manual says clearly that
> > "The most general read syntax for a character represents 
> > the character code in either octal or hex."
> >
> > MOST GENERAL, not most limited and partial.
> 
> I see no contradiction or incorrect information in this cited text.
> The octal notation does work in your example, it's just that its
> semantics is not what you expected.  Or am I missing something?

Dunno whether you are missing something.  I am missing how the octal notation
"works" in my example.  It certainly does not highlight the char I want to
highlight, i.e., does not do what I intended.  How to do that?

I'm missing how to use octal notation in such a font-lock-add-keywords sexp to
match that char.  IOW, my incorrect use of it doesn't do the job.  Please show
me how to use octal notation to get that char highlighted.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 17:36:01 GMT) Full text and rfc822 format available.

Message #37 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Chong Yidong'" <cyd <at> gnu.org>
Cc: 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 10:32:36 -0700

> > Just why is it that the regexp "[\240]+" does not match this char?
> > Why should a character-alternative expression care whether the
> > representation is unibyte or multibyte?  Isn't that a bug?
> 
> When \240 occurs in a unibyte string, Emacs recognizes it as an
> eight-bit raw byte.  When converting unibyte strings to 
> multibyte, Emacs does not "unify" eight-bit raw bytes with
> Unicode characters #x80-#xff; they get their own code points,
> in this case #x3fffa0.  (One reason for doing this is to allow
> unibyte strings to be specified using string constants in Emacs
> Lisp source code.)
> 
> > How to use octal syntax to match that char?  The Elisp manual says
> > clearly that "The most general read syntax for a character 
> > represents the character code in either octal or hex."
> > MOST GENERAL, not most limited and partial.
> 
> I've already edited the documentation to take out this 
> sentence.  It is incorrect anyway, for the reason that
> octal escapes are limited to three digits.

Hm.  I admit that I do not have a grasp of this yet.  I will read the updated
doc when I get hold of it.  You didn't answer the question "How to use..."  I
guess that silence indicates that it is impossible (?).

Anyway, trying to put together your statement that the old text was incorrect
with Eli's claim that it is still correct has me perplexed.

So just what is the "most general read syntax for a char" now?

And what is a general read syntax that will work also for older Emacs versions
when reading Unicode chars present in a file?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 18:04:02 GMT) Full text and rfc822 format available.

Message #40 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sun, 04 Nov 2012 02:00:05 +0800

"Drew Adams" <drew.adams <at> oracle.com> writes:

> So just what is the "most general read syntax for a char" now?

The literal representation of the character.  This should work on older
Emacsen too, I think.  And on Emacs >= 22, you can use \uNNNN and
\U00NNNNNN escape sequences if you like.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 18:08:02 GMT) Full text and rfc822 format available.

Message #43 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Chong Yidong'" <cyd <at> gnu.org>
Cc: 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 11:04:47 -0700

> > So just what is the "most general read syntax for a char" now?
> 
> The literal representation of the character.  This should 
> work on older Emacsen too, I think.  And on Emacs >= 22, you
> can use \uNNNN and \U00NNNNNN escape sequences if you like.

Got it.  So I guess there is no escape syntax that will work with older Emacs
versions also.  (You didn't say that, but I'm guessing.)

One problem with using a literal char is when you need the Lisp code to be
digestible by applications that choke on such chars.  That's one reason we
_have_ an escape syntax.

For example, uploading files containing certain control chars to certain sites
can result in them being filtered out.  Using escape syntax allows the actual
chars in the file to be ascii.

I understand that the \u and \U escape syntax fits the bill here, but not for
older Emacs versions.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 19:05:02 GMT) Full text and rfc822 format available.

Message #46 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Chong Yidong'" <cyd <at> gnu.org>
Cc: 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 12:01:29 -0700

> > Just why is it that the regexp "[\240]+" does not match this char?
> > Why should a character-alternative expression care whether the
> > representation is unibyte or multibyte?  Isn't that a bug?
> 
> When \240 occurs in a unibyte string, Emacs recognizes it as an
> eight-bit raw byte.  When converting unibyte strings to 
> multibyte, Emacs does not "unify" eight-bit raw bytes with
> Unicode characters #x80-#xff; they get their own code points,
> in this case #x3fffa0.

I think I understand this (but I might be misunderstanding).  The \240 in the
4-char ASCII regexp string "\240" is interpreted (read?) as a raw byte, not as
the char I wanted.

That is, the literal string in my code is read as a string that contains only a
single raw byte of octal 240 in place of the 4 chars \240 (and instead of as a
string with the multibyte char no-break space).  Is that right?

And putting that together with Eli's statement about insertion ("'insert' treats
strings such as "\nnn" as unibyte strings"), I understand that the buffer text
after I type `C-q 240' contains a unibyte raw byte, and not the multibyte char
no-break space.

But in that case I do not understand why `C-u C-x =' says that it _is_ the
Unicode no-break space char.  And I do not understand why Yidong's font-lock
correction also shows that it is a no-break space char.

So I'm confused about what is actually in the buffer.  From the doc and from
Eli's statement, I gather that there is a unibyte raw byte (octal 240) at that
position.  But `C-u C-x =' and font-lock seem to tell me that there is a
(multibyte) no-break space char there.

If there is in fact a multibyte char there and the literal "\240" in my
font-lock sexp results in a unibyte raw byte search, that would explain the
mismatch.

But I still wonder about this motivation for the treatment of \nnn in literal
strings in Lisp code:

> (One reason for doing this is to allow unibyte strings to
> be specified using string constants in Emacs Lisp source code.)

I can see how that can be useful.  But I can also see how it would be useful to
have some way of using octal syntax to match multibyte chars.  Isn't there some
reasonable way to allow for both?

E.g. can I specify a multibyte string somehow, starting with octal syntax?  Is
there a way, for example, to use octal syntax to provide octal codes 0302 and
0240, which together define U+00A0 for UTF8?  [See below.]

Is there, for example, (or could there be added) a function that one can apply
to the unibyte string for \240 that would convert it to a string that DTRT wrt
multibyte?

So I could do something like this (assuming the function is available for older
Emacs versions too), where `foo' is the function:

(font-lock-add-keywords nil `((,(foo "\240+") (0 'foo t))) 'APPEND)

From the doc, I was thinking that perhaps `string-to-multibyte' would do the
trick, i.e., (string-to-multibyte "\240+") would return "\u00a0+" or the literal
Unicode char in a multibyte string.  But it returns "\240+".

I can understand that the actual chars in that input string are all ASCII, so
that makes sense, I guess.  But I was thinking from Yidong's statement above
that such a literal string in Lisp code gets read as a unibyte, raw-byte string.

Since that doesn't seem to be the case here (?), is there a function that will
convert "\240" (4 chars) to a string with just that one "eight-bit raw byte"
char?  I tried `read', but that didn't help.

I hope I'm just missing something, and that there is a function (or combination
of functions) to which I can pass the 4-char ASCII string "\240" (or the 8-char
string "\302\240") and that will return the proper multibyte string containing
the Unicode no-break space char.

Ideal would be such a function that works also in older Emacs versions.

...

OK, digging some more, it seems that this will do the trick:

(decode-coding-string "\302\240" 'utf-8)

That allows use of only octal syntax - good.  But it still doesn't solve the
problem for older Emacs versions - they raise the error (coding-system-error
utf-8).

Is there a way to use only octal syntax with older Emacs versions, so the
font-locking code highlights such a Unicode char in a file/buffer?

Judging by my current confusion, I am sure that my statements above must be full
of misconceptions.  I will be glad to be shown my misunderstanding and a simple
solution.
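
A minimal sketch of the `decode-coding-string' approach mentioned above, kept to ASCII-only source text (not part of the original message; it assumes an Emacs that knows the utf-8 coding system). The names `my-nbsp-regexp` and `my-highlight-nbsp` are hypothetical, and the face `foo' is the one defined in the original recipe.

```elisp
;; Sketch: build the regexp from ASCII-only source by decoding UTF-8 bytes.
(defconst my-nbsp-regexp
  (concat (decode-coding-string "\302\240" 'utf-8) "+")
  "Regexp matching one or more no-break spaces (hypothetical name).")

(defun my-highlight-nbsp ()
  "Highlight no-break spaces in the current buffer with face `foo'.
Hypothetical helper; assumes the face `foo' from the original recipe."
  (font-lock-add-keywords nil `((,my-nbsp-regexp (0 'foo t))) 'append))
```

Calling `my-highlight-nbsp` from a mode hook, then toggling font-lock-mode, would presumably reproduce the highlighting from the corrected recipe.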

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 19:54:01 GMT) Full text and rfc822 format available.

Message #49 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: cyd <at> gnu.org, Drew Adams <drew.adams <at> oracle.com>, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 15:50:19 -0400

> Because, for histerical reasons, 'insert' treats strings such as
> "\nnn" as unibyte strings.

Actually, this has nothing to do with `insert', right?
It's the reader that interprets the \240 in "[\240]+" as a byte rather
than a char.

>> Why should a character-alternative expression care whether the
>> representation is unibyte or multibyte?  Isn't that a bug?

There are many different ways to interpret this, and I can give you one
where the behavior is explained without paying attention to
multibyte/unibyte differences.

\240 in your string means "the byte with octal number 0240".


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 20:06:01 GMT) Full text and rfc822 format available.

Message #52 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Stefan Monnier'" <monnier <at> iro.umontreal.ca>,
	"'Eli Zaretskii'" <eliz <at> gnu.org>
Cc: cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 13:02:23 -0700

> >> Why should a character-alternative expression care whether the
> >> representation is unibyte or multibyte?  Isn't that a bug?
> 
> There are many different ways to interpret this, and I can 
> give you one where the behavior is explained without paying
> attention to multibyte/unibyte differences.
> 
> \240 in your string means "the byte with octal number 0240".

OK, so then do you think this should DTRT? 

(font-lock-add-keywords nil '(("\\(\302\240\\)+" (0 'foo t))) 'APPEND)

I'm guessing it shouldn't, because IIUC the buffer in fact contains only the
single raw byte \240 and not the multibyte sequence of two raw bytes \302 and
\240.

But I barely understand this stuff at all; I mostly misunderstand it still.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 20:40:02 GMT) Full text and rfc822 format available.

Message #55 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 'Eli Zaretskii' <eliz <at> gnu.org>, cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 16:36:42 -0400

> OK, so then do you think this should DTRT?
> (font-lock-add-keywords nil '(("\\(\302\240\\)+" (0 'foo t))) 'APPEND)

That will match if your buffer contains a \302 byte or a \240 byte.
"contains" is different from "is represented internally".

The internal representation should normally stay hidden and only appear
if you use dangerous things like string-as-multibyte or call
set-buffer-multibyte in a non-empty buffer.

> I'm guessing it shouldn't, because IIUC the buffer in fact contains only the
> single raw byte \240 and not the multibyte sequence of two raw bytes \302 and
> \240.

AFAIK your buffer contains none of that.  It contains a NBSP character,
which is not a byte.


        Stefan
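
The distinction between what the buffer contains and how it is represented internally can be sketched as follows (not part of the original message; Emacs 23 or later and a multibyte temporary buffer are assumed).

```elisp
;; Sketch: the buffer holds a character; its internal bytes stay hidden.
(with-temp-buffer
  (insert "a\u00a0b")
  (goto-char (point-min))
  (list (char-after 2)                             ; 160, i.e. U+00A0
        (re-search-forward "\u00a0" nil t)         ; 3: position after the NBSP
        (progn (goto-char (point-min))
               (re-search-forward "\240" nil t)))) ; nil: no raw byte in buffer
```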

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 20:46:02 GMT) Full text and rfc822 format available.

Message #58 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Stefan Monnier'" <monnier <at> iro.umontreal.ca>
Cc: 'Eli Zaretskii' <eliz <at> gnu.org>, cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 3 Nov 2012 13:42:40 -0700

> That will match if your buffer contains a \302 byte or a \240 byte.
> "contains" is different from "is represented internally".
> 
> The internal representation should normally stay hidden

So much the better.  So that was a red herring, I guess.

> AFAIK your buffer contains none of that.  It contains a NBSP 
> character, which is not a byte.

Yes.  I was thinking about internal representation, which I'm glad I don't have
to worry about.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 21:02:01 GMT) Full text and rfc822 format available.

Message #61 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 22:57:36 +0200

> From: "Drew Adams" <drew.adams <at> oracle.com>
> Cc: <cyd <at> gnu.org>, <12054 <at> debbugs.gnu.org>
> Date: Sat, 3 Nov 2012 10:22:59 -0700
> 
> > > Just why is it that the regexp "[\240]+" does not match this char?
> > 
> > Because, for histerical reasons, 'insert' treats strings such as
> > "\nnn" as unibyte strings.
> 
> Sorry, I don't understand your point.  My question was about the regexp (not)
> matching, not about (not) being able to insert the char.

It doesn't matter.  "\nnn" in a string is still interpreted as unibyte.

> I don't see a problem with inserting the char.  As I said, the correct char gets
> inserted AFAICT, as shown both by `C-u C-x =' and by Yidong's correction of the
> font-lock regexp.

Insertion with C-q does something different.

> > It's an unfortunate dark corner, due to the ambiguity of what \240
> > really means in a string.
> 
> That just makes it darker for me.  Can you please elaborate?

\240 could be taken as NBSP or as a literal byte.  They have different
representations in Emacs and are treated differently, but are
identical numerically outside of Emacs.

> 3. Why not?  Why turn it around and speak of "need" to use it?
> The real question is why _not_ be able to use octal syntax here?

For the same reason you'd use ?a and not \141: it's more clear to the
human reader.

Using octal escapes for non-ASCII characters in Emacs is deprecated
and dangerous.  You just bumped into one danger; there are more.  I
suggest you avoid this notation as much as you can.
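
The ambiguity described above can be seen by comparing the same octal digits in character syntax and in string syntax. This is only a sketch (not part of the original message), assuming Emacs 23 or later; earlier versions treat code 160 differently.

```elisp
;; Sketch: the same octal digits mean different things in different syntaxes.
(list (= ?\240 #xa0)                              ; t: character literal, code 160
      (multibyte-string-p (string ?\240))         ; t: a one-character NBSP string
      (multibyte-string-p "\240")                 ; nil: a one-byte unibyte string
      (string-match-p (string ?\240) "x\u00a0"))  ; 1: matches the NBSP
```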

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 21:05:02 GMT) Full text and rfc822 format available.

Message #64 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Chong Yidong <cyd <at> gnu.org>
Cc: drew.adams <at> oracle.com, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 23:00:35 +0200

> From: Chong Yidong <cyd <at> gnu.org>
> Date: Sun, 04 Nov 2012 02:00:05 +0800
> Cc: 12054 <at> debbugs.gnu.org
> 
> "Drew Adams" <drew.adams <at> oracle.com> writes:
> 
> > So just what is the "most general read syntax for a char" now?
> 
> The literal representation of the character.  This should work on older
> Emacsen too, I think.

It doesn't, AFAIR: in Emacs before v23, an NBSP would be decoded into
a different internal representation depending on the encoding of the
file from which it is read.  That encoding could be explicit, using
the coding: cookie, or implicit, based on the current locale.  But in
any case, the result will only match NBSP in the same charset.  E.g.,
if \240 was decoded into a Latin-2 NBSP, it will not match a Latin-1
NBSP.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sat, 03 Nov 2012 21:17:02 GMT) Full text and rfc822 format available.

Message #67 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: Re: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sat, 03 Nov 2012 23:13:40 +0200

> From: "Drew Adams" <drew.adams <at> oracle.com>
> Date: Sat, 3 Nov 2012 12:01:29 -0700
> Cc: 12054 <at> debbugs.gnu.org
> 
> I think I understand this (but I might be misunderstanding).  The \240 in the
> 4-char ASCII regexp string "\240" is interpreted (read?) as a raw byte, not as
> the char I wanted.

Yes.

> That is, the literal string in my code is read as a string that contains only a
> single raw byte of octal 240 in place of the 4 chars \240 (and instead of as a
> string with the multibyte char no-break space).  Is that right?

Yes.

> And putting that together with Eli's statement about insertion ("'insert' treats
> strings such as "\nnn" as unibyte strings"), I understand that the buffer text
> after I type `C-q 240' contains a unibyte raw byte, and not the multibyte char
> no-break space.

No.  It contains the NBSP.  Try it.  C-q inserts a multibyte
character, unlike '(insert "\240")', for example.

> But in that case I do not understand why `C-u C-x =' says that it _is_ the
> Unicode no-break space char.

Because it is.

> And I do not understand why Yidong's font-lock correction also shows
> that it is a no-break space char.

Chong didn't use "\240".

> So I'm confused about what is actually in the buffer.  From the doc and from
> Eli's statement, I gather that there is a unibyte raw byte (octal 240) at that
> position.  But `C-u C-x =' and font-lock seem to tell me that there is a
> (multibyte) no-break space char there.

Try '(insert "\240")' and then "C-x =" will show a unibyte byte.

> > (One reason for doing this is to allow unibyte strings to
> > be specified using string constants in Emacs Lisp source code.)
> 
> I can see how that can be useful.  But I can also see how it would be useful to
> have some way of using octal syntax to match multibyte chars.  Isn't there some
> reasonable way to allow for both?

Maybe, but we didn't find one, at least not one that would be
backward-compatible.

> Is there, for example, (or could there be added) a function that one can apply
> to the unibyte string for \240 that would convert it to a string that DTRT wrt
> multibyte?

Such functions do exist, see the "Converting Representations" node in
the ELisp manual.

> (decode-coding-string "\302\240" 'utf-8)
> 
> That allows use of only octal syntax - good.  But it still doesn't solve the
> problem for older Emacs versions - they raise the error (coding-system-error
> utf-8).

You don't want this, because even if you succeed in producing a NBSP
in Emacs 22 and older, the result will not match NBSP in other
charsets.  It's simply impossible with those versions of Emacs.
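
[Editorial sketch, not part of the original message.  For Emacs 23 and
later only, the conversion discussed above might look as follows.  The
variable name `my-nbsp-regexp' is hypothetical and chosen purely for
illustration; `foo' is the face defined in the original report.]

```elisp
;; Decoding the UTF-8 bytes of U+00A0 yields a multibyte NBSP string.
(string= (decode-coding-string "\302\240" 'utf-8) "\u00A0")     ; => t

;; Decoding the single byte as Latin-1 yields the same character.
(string= (decode-coding-string "\240" 'iso-latin-1) "\u00A0")   ; => t

;; Hypothetical helper variable: a regexp built from the decoded
;; character, so the source file itself stays pure ASCII.
(defvar my-nbsp-regexp
  (concat "[" (decode-coding-string "\302\240" 'utf-8) "]+")
  "Regexp matching one or more no-break spaces (illustrative only).")

;; Usage sketch, mirroring the recipe from the original report.
(font-lock-add-keywords nil `((,my-nbsp-regexp (0 'foo t))) 'append)
```

As the message above points out, this does not help with Emacs 22 and
older, where an NBSP matches only within the same charset.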

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#12054; Package emacs. (Sun, 04 Nov 2012 23:38:01 GMT) Full text and rfc822 format available.

Message #70 received at 12054 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Eli Zaretskii'" <eliz <at> gnu.org>
Cc: cyd <at> gnu.org, 12054 <at> debbugs.gnu.org
Subject: RE: bug#12054: 24.1;
	regression? font-lock no-break-space with nil nobreak-char-display
Date: Sun, 4 Nov 2012 15:34:20 -0800

> > That is, the literal string in my code is read as a string 
> > that contains only a single raw byte of octal 240 in place
> > of the 4 chars \240 (and instead of as a string with the
> > multibyte char no-break space).  Is that right?
> 
> Yes.
> 
> > And putting that together with Eli's statement about 
> > insertion ("'insert' treats strings such as "\nnn" as
> > unibyte strings"), I understand that the buffer text
> > after I type `C-q 240' contains a unibyte raw byte, and
> > not the multibyte char no-break space.
> 
> No.  It contains the NBSP.  Try it.

Well, I was saying since the beginning that that appeared to be the case.  But
you replied that insertion inserted a raw \240 byte.  That red herring threw me
off.

> C-q inserts a multibyte character, unlike '(insert "\240")', for example.

Thanks, I finally got that from what Stefan said.  It would have been clearer if
you had said that from the beginning, since I mentioned `C-q' and you replied
instead about "insert".  Anyway, I understand now.

> Try '(insert "\240")' and then "C-x =" will show a unibyte byte.

Yes, I got it (from Stefan's reply).  But no one mentioned using `insert' or
insertion, except you.  I know you were trying to help, but that just confused
things, for me.

> > I can see how that can be useful.  But I can also see how 
> > it would be useful to have some way of using octal syntax to
> > match multibyte chars.  Isn't there some reasonable way to
> > allow for both?
> 
> Maybe, but we didn't find one, at least not one that would be
> backward-compatible.

OK, that was my question.  Thx.

> > (decode-coding-string "\302\240" 'utf-8)
> > 
> > That allows use of only octal syntax - good.  But it still 
> > doesn't solve the problem for older Emacs versions - they
> > raise the error (coding-system-error utf-8).
> 
> You don't want this, because even if you succeed in producing a NBSP
> in Emacs 22 and older, the result will not match NBSP in other
> charsets.  It's simply impossible with those versions of Emacs.

Got it.  That is the bottom line - the answer to my question.

Thx to all who took the time to help me understand better.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 03 Dec 2012 12:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 12 years and 258 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
