GNU bug report logs - #4848
23.1.50; \u and \x in string

Package: emacs;

Reported by: rms <at> gnu.org

Date: Mon, 2 Nov 2009 05:35:06 UTC

Severity: wishlist

Done: Noam Postavsky <npostavs <at> users.sourceforge.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 4848 in the body.
You can then email your comments to 4848 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Mon, 02 Nov 2009 05:35:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to rms <at> gnu.org:
New bug report received and forwarded. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 02 Nov 2009 05:35:07 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Richard Stallman <rms <at> gnu.org>
To: emacs-pretest-bug <at> gnu.org
Subject: 23.1.50; \u and \x in string
Date: Mon, 02 Nov 2009 00:31:17 -0500
"\ue1" gives the error "Non-hex digit used for Unicode escape".
Why doesn't it work to give the Unicode character á?

Note that \xe1 does not work for this any more.
It gives a different character, which displays as \341 and
is described as follows by C-x =.

  Char: \341 (4194273, #o17777741, #x3fffe1, raw-byte) point=442 of 2980 (15%) column=0

That too is confusing, and certainly not documented clearly where \x
is explained.  Is there any way to specify unicode e1 with \x?


In GNU Emacs 23.1.50.4 (mipsel-unknown-linux-gnu, GTK+ Version 2.12.12)
 of 2009-08-11 on theobromine2
configured using `configure  'CFLAGS=-O0 -g -Wno-pointer-sign' 'mipsel-unknown-linux-gnu' 'build_alias=mipsel-unknown-linux-gnu' 'host_alias=mipsel-unknown-linux-gnu' 'target_alias=mipsel-unknown-linux-gnu''

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: en_US.UTF-8
  value of $XMODIFIERS: nil
  locale-coding-system: utf-8-unix
  default-enable-multibyte-characters: t

Major mode: RMAIL Edit

Minor modes in effect:
  shell-dirtrack-mode: t
  diff-auto-refine-mode: t
  gpm-mouse-mode: t
  display-battery-mode: t
  tooltip-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  global-auto-composition-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t
  abbrev-mode: t

Recent input:
b R TAB RET ESC < C-u C-n C-u C-u C-n C-u C-n C-n C-n 
C-n C-f 4 b o u t C-_ C-x b o u t - 2 2 RET C-a C-p 
C-x 4 b R TAB RET C-u ESC x c o m p a r e RET C-x o 
C-x o C-x b RET C-b C-b C-b C-b | ESC C-x C-x C-s C-x 
b RET C-x o C-b C-b C-x ESC ESC ESC p ESC p RET C-x 
o C-x o C-x o C-x C-g C-x 4 b RET C-a ESC f C-f C-@ 
ESC C-f ESC w ESC : C-y RET C-x o ESC : ( l o o k i 
n g - a t SPC C-y ) RET C-x o C-e ESC b ESC d 2 4 0 
ESC C-x C-x o ESC : ESC p RET C-x = C-x o o C-_ C-x 
o ESC : ESC p C-e ESC DEL ESC DEL ESC DEL " \ 2 4 0 
DEL DEL DEL x a 0 " ) RET C-u C-x = C-\ a ' C-g e C-x 
= C-f a ' C-b C-x = ESC : ESC p C-e C-b C-b ESC DEL 
DEL C-\ a ' C-e RET C-x = ESC : ESC p C-e C-b C-b DEL 
\ 3 4 1 RET C-x = ESC : ESC p C-e C-b C-b DEL DEL DEL 
x e 1 RET C-x = ESC : ESC p C-e C-b C-b C-b C-b DEL 
u C-e RET ESC : ESC p C-e C-b C-b C-b C-b ESC u C-e 
RET ESC : ESC p C-e C-b C-b C-b C-b 0 0 C-e RET ESC 
x r e p o r t SPC e m a c s SPC b u g RET

Recent messages:
Char: á (225, #o341, #xe1) point=1382 of 28873 (5%) column=57
t
Char: á (225, #o341, #xe1) point=1382 of 28873 (5%) column=57
nil
Char: á (225, #o341, #xe1) point=1382 of 28873 (5%) column=57
nil
Char: á (225, #o341, #xe1) point=1382 of 28873 (5%) column=57
let: Non-hex digit used for Unicode escape [2 times]
t
Source file `/home/rms/emacs-cvs/lisp/mail/emacsbug.el' newer than byte-compiled file

Load-path shadows:
None found.



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Mon, 02 Nov 2009 07:25:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 02 Nov 2009 07:25:06 GMT) Full text and rfc822 format available.

Message #10 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: rms <at> gnu.org
Cc: 4848 <at> debbugs.gnu.org, emacs-pretest-bug <at> gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Mon, 02 Nov 2009 02:17:10 -0500
> "\ue1" gives the error "Non-hex digit used for Unicode escape".
> Why doesn't it work to give the Unicode character á?

I think you mean \u00e1

> Note that \xe1 does not work for this any more.

Indeed, this refers to the byte 225 rather than to the char 225.

> That too is confusing, and certainly not documented clearly where \x
> is explained.  Is there any way to specify unicode e1 with \x?

\x00e1 also works like \u00e1.


        Stefan
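
The distinction drawn in this reply can be checked directly. What follows is a minimal sketch, assuming a recent Emacs; `bug4848-describe' is a hypothetical helper introduced only for illustration and is not part of Emacs. It reports whether a string literal is multibyte and what its first element is.

```elisp
;; Illustrative sketch, not part of the original thread.
;; `bug4848-describe' is a hypothetical helper, not an Emacs function.
(defun bug4848-describe (string)
  "Return a list (MULTIBYTE-P CODE) for the first element of STRING."
  (list (multibyte-string-p string) (aref string 0)))

;; \uNNNN names a Unicode character, so the string is multibyte.
(bug4848-describe "\u00e1")

;; A two-digit \x escape of 128 or above names a raw byte, so the
;; string is unibyte and its first element is the byte 225.
(bug4848-describe "\xe1")

;; Converting that unibyte string to multibyte exposes the raw-byte
;; character code #x3fffe1 that C-x = showed in the original report.
(aref (string-to-multibyte "\xe1") 0)

;; Per the message above, this form matched \u00e1 in Emacs 23.
;; Later releases are not checked here, so no result is claimed.
(bug4848-describe "\x00e1")
```

Both of the first two literals have 225 as their first element. The difference is that one is the character 225 in a multibyte string and the other is the byte 225 in a unibyte string, which becomes a raw-byte character once it is placed in a multibyte context.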



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Mon, 02 Nov 2009 07:40:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Jason Rumney <jasonr <at> gnu.org>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Mon, 02 Nov 2009 07:40:05 GMT) Full text and rfc822 format available.

Message #20 received at 4848 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Jason Rumney <jasonr <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>, 4848 <at> debbugs.gnu.org
Cc: rms <at> gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Mon, 02 Nov 2009 15:33:49 +0800
Stefan Monnier wrote:

>> "\ue1" gives the error "Non-hex digit used for Unicode escape".
>> Why doesn't it work to give the Unicode character á?
>>     
>
> I think you mean \u00e1
>   

I think the error message means "Insufficient hex digits used for 
Unicode escape".

>> Note that \xe1 does not work for this any more.
>>     
>
> Indeed, this refers to the byte 225 rather than to the char 225.
>   
>
> \x00e1 also works like \u00e1.
>   

That is definitely confusing.




Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Tue, 03 Nov 2009 13:45:04 GMT) Full text and rfc822 format available.

Acknowledgement sent to rms <at> gnu.org:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Tue, 03 Nov 2009 13:45:04 GMT) Full text and rfc822 format available.

Message #25 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Richard Stallman <rms <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 4848 <at> debbugs.gnu.org, emacs-pretest-bug <at> gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Tue, 03 Nov 2009 08:39:00 -0500
    > "\ue1" gives the error "Non-hex digit used for Unicode escape".
    > Why doesn't it work to give the Unicode character á?

    I think you mean \u00e1

Why shouldn't \ue1 work?

    > Note that \xe1 does not work for this any more.

    Indeed, this refers to the byte 225 rather than to the char 225.

This needs to be documented.  But is it a good meaning for \x?  It
will rarely be useful this way.  Also, is it an incompatible change?



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Tue, 03 Nov 2009 14:55:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Tue, 03 Nov 2009 14:55:05 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: rms <at> gnu.org
Cc: 4848 <at> debbugs.gnu.org, emacs-pretest-bug <at> gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Tue, 03 Nov 2009 09:49:54 -0500
>> "\ue1" gives the error "Non-hex digit used for Unicode escape".
>> Why doesn't it work to give the Unicode character á?
>     I think you mean \u00e1
> Why shouldn't \ue1 work?

Because the \u format is \uNNNN with exactly 4 hex digits.

>> Note that \xe1 does not work for this any more.
>     Indeed, this refers to the byte 225 rather than to the char 225.
> This needs to be documented.  But is it a good meaning for \x?  It
> will rarely be useful this way.  Also, is it an incompatible change?

I haven't managed to keep track of all the changes w.r.t how we treat
\NNN vs \xMM vs \xMMMMM and how it impacts whether the resulting string
is unibyte or multibyte.  My understanding is that there have been
several incompatible changes in this area (and some of those were
inevitable).  E.g. in Emacs-22:

   ELISP> "\222"
   "\222"
   ELISP> "\xa4"
   "\xa4"
   ELISP> (multibyte-string-p "\222")
   nil
   ELISP> (multibyte-string-p "\xa4")
   t
   ELISP> (multibyte-string-p "\xa45")
   t
   ELISP> 

whereas in Emacs-23.1:

   ELISP> "\222"
   "\222"
   ELISP> "\xa4"
   "\244"
   ELISP> (multibyte-string-p "\222")
   nil
   ELISP> (multibyte-string-p "\xa4")
   nil
   ELISP> (multibyte-string-p "\xa45")
   t
   ELISP> 

Of course, given the fact that char-numbers have changed, the
backward compatibility of \xNNNN is irrelevant since they do not
represent the same char any more.


        Stefan



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Tue, 03 Nov 2009 18:45:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Eli Zaretskii <eliz <at> gnu.org>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Tue, 03 Nov 2009 18:45:04 GMT) Full text and rfc822 format available.

Message #45 received at 4848 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: rms <at> gnu.org, 4848 <at> debbugs.gnu.org
Cc: monnier <at> iro.umontreal.ca
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Tue, 03 Nov 2009 20:35:40 +0200
> From: Richard Stallman <rms <at> gnu.org>
> Date: Tue, 03 Nov 2009 08:39:00 -0500
> Cc: emacs-pretest-bug <at> gnu.org, 4848 <at> emacsbugs.donarmstrong.com
> 
> This needs to be documented.

I'm not sure what you wanted to be documented.  Is the description in
"(elisp)General Escape Syntax" what you were looking for?



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Thu, 05 Nov 2009 02:05:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to rms <at> gnu.org:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Thu, 05 Nov 2009 02:05:05 GMT) Full text and rfc822 format available.

Message #50 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Richard Stallman <rms <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 4848 <at> debbugs.gnu.org, emacs-pretest-bug <at> gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Wed, 04 Nov 2009 20:57:07 -0500
    > Why shouldn't \ue1 work?

    Because the \u format is \uNNNN with exactly 4 hex digits.

In other words, "it doesn't work because we decided it shouldn't work".
But why shouldn't it work?  Why shouldn't two digits be allowed?

Is there a good reason not to allow that?



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Thu, 05 Nov 2009 02:05:08 GMT) Full text and rfc822 format available.

Acknowledgement sent to rms <at> gnu.org:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Thu, 05 Nov 2009 02:05:08 GMT) Full text and rfc822 format available.

Message #55 received at 4848 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Richard Stallman <rms <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 4848 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Wed, 04 Nov 2009 20:56:45 -0500
    I'm not sure what you wanted to be documented.  Is the description in
    "(elisp)General Escape Syntax" what you were looking for?

The version I have is from August.  If it has been substantially
improved since then, maybe it is good.  The text from August was
inadequate and even wrong:

      To use hex, write a question mark followed by a backslash, @samp{x},
    and the hexadecimal character code.  You can use any number of hex
    digits, so you can represent any character code in this way.
    Thus, @samp{?\x41} for the character @kbd{A}, @samp{?\x1} for the
    character @kbd{C-a}, and @code{?\x8e0} for the Latin-1 character
    @iftex
    @samp{@`a}.
    @end iftex
    @ifnottex
    @samp{a} with grave accent.
    @end ifnottex

And here is something from Non-ASCII In Strings:

      You can also represent a multibyte non-@acronym{ASCII} character with its
    character code: use a hex escape, @samp{\x@var{nnnnnnn}}, with as many
    digits as necessary.  (Multibyte non-@acronym{ASCII} character codes are all
    greater than 256.)  Any character which is not a valid hex digit
    terminates this construct.  If the next character in the string could be
    interpreted as a hex digit, write @w{@samp{\ }} (backslash and space) to
    terminate the hex escape---for example, @w{@samp{\x8e0\ }} represents
    one character, @samp{a} with grave accent.  @w{@samp{\ }} in a string
    constant is just like backslash-newline; it does not contribute any
    character to the string, but it does terminate the preceding hex escape.



Information forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>:
bug#4848; Package emacs. (Thu, 05 Nov 2009 02:55:09 GMT) Full text and rfc822 format available.

Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>:
Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Thu, 05 Nov 2009 02:55:10 GMT) Full text and rfc822 format available.

Message #65 received at 4848 <at> emacsbugs.donarmstrong.com (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: rms <at> gnu.org
Cc: 4848 <at> debbugs.gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Wed, 04 Nov 2009 21:48:04 -0500
>> Why shouldn't \ue1 work?
>     Because the \u format is \uNNNN with exactly 4 hex digits.

> In other words, "it doesn't work because we decided it shouldn't work".
> But why shouldn't it work?  Why shouldn't two digits be allowed?
> Is there a good reason not to allow that?

I think the \u format is taken from C and it doesn't have an "end" like
our \x format has.  So for example "\u11111" means (concat "\u1111" "1").


        Stefan
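
The fixed-width rule for \u and the open-ended rule for \x can be compared side by side. This is a minimal sketch, assuming a recent Emacs; the `bug4848-' variable names are hypothetical and exist only for this example. The "\ " terminator used here is the one described in the "Non-ASCII In Strings" text quoted earlier in the thread.

```elisp
;; Illustrative sketch, not part of the original thread.
;; The `bug4848-' names are hypothetical and exist only for this example.

;; \u consumes exactly four hex digits; the fifth "1" is ordinary text.
(defvar bug4848-u-fixed-width
  (equal "\u11111" (concat "\u1111" "1")))

;; \x consumes every hex digit that follows, so "\ " (backslash-space)
;; is needed to end the escape before a literal "1".
(defvar bug4848-x-terminated-length
  (length "\xe1\ 1"))

;; Fewer than four digits after \u is a read error.  `read' is applied
;; to a string so the error is signaled at run time and can be caught.
(defvar bug4848-short-u-signals
  (condition-case nil
      (progn (read "\"\\ue1\"") nil)
    (error t)))
```

With these definitions, `bug4848-u-fixed-width' is t, `bug4848-x-terminated-length' is 2, and `bug4848-short-u-signals' is t. The exact wording of the read error may differ between Emacs versions, which is why the sketch only checks that some error is signaled.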




Severity set to 'wishlist' from 'normal' Request was from Chong Yidong <cyd <at> stupidchicken.com> to control <at> debbugs.gnu.org. (Sun, 24 Jul 2011 04:52:01 GMT) Full text and rfc822 format available.

Reply sent to Noam Postavsky <npostavs <at> users.sourceforge.net>:
You have taken responsibility. (Tue, 14 Jun 2016 02:46:02 GMT) Full text and rfc822 format available.

Notification sent to rms <at> gnu.org:
bug acknowledged by developer. (Tue, 14 Jun 2016 02:46:02 GMT) Full text and rfc822 format available.

Message #72 received at 4848-done <at> debbugs.gnu.org (full text, mbox):

From: Noam Postavsky <npostavs <at> users.sourceforge.net>
To: 4848-done <at> debbugs.gnu.org
Subject: Re: bug#4848: 23.1.50; \u and \x in string
Date: Mon, 13 Jun 2016 22:45:33 -0400
"Non-ASCII In Strings" now (24.5) says the following, which explains
that "\xN" produces unibyte characters.

   You can also use hexadecimal escape sequences (‘\xN’) and octal
escape sequences (‘\N’) in string constants.  *But beware:* If a string
constant contains hexadecimal or octal escape sequences, and these
escape sequences all specify unibyte characters (i.e., less than 256),
and there are no other literal non-ASCII characters or Unicode-style
escape sequences in the string, then Emacs automatically assumes that it
is a unibyte string.  That is to say, it assumes that all non-ASCII
characters occurring in the string are 8-bit raw bytes.
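
The rule in this excerpt can be observed with `multibyte-string-p'. This is a minimal sketch, assuming a recent Emacs; `bug4848-literal-kinds' is a hypothetical name used only here.

```elisp
;; Illustrative sketch, not part of the original thread.
;; `bug4848-literal-kinds' is a hypothetical name used only here.
(defvar bug4848-literal-kinds
  (list
   ;; Only a hex escape below 256: the reader assumes a unibyte string.
   (multibyte-string-p "\xe1")
   ;; Only an octal escape below 256: likewise unibyte.
   (multibyte-string-p "\341")
   ;; A Unicode-style escape in the same literal makes it multibyte.
   (multibyte-string-p "\xe1\u00e1")))
```

Under the rule quoted above, the list should be (nil nil t): the first two literals contain only escapes below 256, and the third also contains a Unicode-style escape.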




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 12 Jul 2016 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 35 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.