GNU bug report logs - #16448
24.3; Messages from (error "...") with UTF-8 chars are printed wrongly in Emacs Lisp scripts

Package: emacs;

Reported by: Sergey Tselikh <stselikh <at> gmail.com>

Date: Wed, 15 Jan 2014 00:19:01 UTC

Severity: normal

Found in version 24.3

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#16448; Package emacs. (Wed, 15 Jan 2014 00:19:01 GMT)

Acknowledgement sent to Sergey Tselikh <stselikh <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 15 Jan 2014 00:19:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Sergey Tselikh <stselikh <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.3; Messages from (error "...") with UTF-8 chars are printed
 wrongly in Emacs Lisp scripts
Date: Wed, 15 Jan 2014 11:10:09 +1100

Hello.

In a script, when an (error "...") form is evaluated with UTF-8 characters
in its message text, the message is not printed correctly.

The LANG environment variable is set to en_US.UTF-8 for all programs, my terminal
is x11-terms/rxvt-unicode with adequate UTF-8 support, and the Emacs version is
GNU Emacs 24.3.1.


Examples (all of them are with LANG=en_US.UTF-8 in environment):

$ cat error.el 
(message "hello привет")
(message "привет hello")
(error "hello привет")

$ emacs --script error.el 
hello привет
привет hello
hello ?@825B

But: 
$ emacs -nw --eval '(error "hello привет")'  
^^^ successfully prints "hello привет" in minibuffer.


This ?@825B is not random garbage.  I made a small table showing where it
comes from (the left column is ``echo hello привет | print-bits | cat -t'',
the middle one is ``echo hello привет | high-bits-01 | print-bits | cat -t''):

h    01101000  |   h  01101000  |
e    01100101  |   e  01100101  |
l    01101100  |   l  01101100  |
l    01101100  |   l  01101100  |
o    01101111  |   o  01101111  |
     00100000  |      00100000  |
M-P  11010000  |   P  01010000  |
M-?  10111111  |   ?  00111111  |   ?
M-Q  11010001  |   Q  01010001  |
M-^@ 10000000  |   @  01000000  |   @
M-P  11010000  |   P  01010000  |
M-8  10111000  |   8  00111000  |   8
M-P  11010000  |   P  01010000  |
M-2  10110010  |   2  00110010  |   2
M-P  11010000  |   P  01010000  |
M-5  10110101  |   5  00110101  |   5
M-Q  11010001  |   Q  01010001  |
M-^B 10000010  |   B  01000010  |   B



More examples:

$ cat any-other.el 
(error "cons:%s list:%s string:%s" (cons 'на 'речке) '(на речке на том бере) "be Быть beat Бить become Становиться begin Начинать bleed Кровоточить stung Жалить sweep Выметать swell Разбухать swim Плавать swing Качать take Брать, взять")

$ emacs --script any-other.el 
cons:(=0 . @5G:5) list:(=0 @5G:5 =0 B>< 15@5) string:be KBL beat 8BL become !B0=>28BLAO begin 0G8=0BL bleed @>2>B>G8BL stung 0;8BL sweep K<5B0BL swell  071CE0BL swim ;020BL swing 0G0BL take @0BL, 27OBL

$ cat ja.el 
(setq jstr "案ずるより産むが易し。 Anzuru yori umu ga yasushi. 出る杭は打たれる。 Deru kui wa utareru.")
(message "%s" jstr)
(error "%s" jstr)

$ emacs --script ja.el 
案ずるより産むが易し。 Anzuru yori umu ga yasushi. 出る杭は打たれる。 Deru kui wa utareru.
HZ???#?LW Anzuru yori umu ga yasushi. ?moS_?? Deru kui wa utareru.



In GNU Emacs 24.3.1 (x86_64-pc-linux-gnu, GTK+ Version 2.24.17)
 of 2013-10-10 on laptop
Windowing system distributor `The X.Org Foundation', version 11.0.11403000
Configured using:
 `configure '--prefix=/usr' '--build=x86_64-pc-linux-gnu'
 '--host=x86_64-pc-linux-gnu' '--mandir=/usr/share/man'
 '--infodir=/usr/share/info' '--datadir=/usr/share' '--sysconfdir=/etc'
 '--localstatedir=/var/lib' '--libdir=/usr/lib64'
 '--disable-silent-rules' '--disable-dependency-tracking'
 '--program-suffix=-emacs-24' '--infodir=/usr/share/info/emacs-24'
 '--enable-locallisppath=/etc/emacs:/usr/share/emacs/site-lisp'
 '--with-crt-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.3/../../../../lib64'
 '--with-gameuser=games' '--without-compress-info' '--without-hesiod'
 '--without-kerberos' '--without-kerberos5' '--with-gpm' '--with-dbus'
 '--with-gnutls' '--with-xml2' '--without-selinux' '--without-wide-int'
 '--with-sound' '--with-x' '--without-ns' '--with-gconf'
 '--without-gsettings' '--with-toolkit-scroll-bars' '--with-gif'
 '--with-jpeg' '--with-png' '--with-rsvg' '--with-tiff' '--with-xpm'
 '--with-imagemagick' '--with-xft' '--with-libotf' '--with-m17n-flt'
 '--with-x-toolkit=gtk2' 'GENTOO_PACKAGE=app-editors/emacs-24.3-r2'
 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-pc-linux-gnu'
 'CFLAGS=-pipe -march=corei7-avx -mno-aes -O2' 'LDFLAGS=-Wl,-O1
 -Wl,--as-needed' 'CPPFLAGS=''

Important settings:
  value of $LC_COLLATE: C
  value of $LANG: en_US.UTF-8
  locale-coding-system: utf-8-unix
  default enable-multibyte-characters: t


-- 
Sergey Tselikh <stselikh <at> gmail.com>

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16448; Package emacs. (Wed, 15 Jan 2014 04:03:01 GMT)

Message #8 received at 16448 <at> debbugs.gnu.org:

From: Dmitry Antipov <dmantipov <at> yandex.ru>
To: Sergey Tselikh <stselikh <at> gmail.com>
Cc: 16448 <at> debbugs.gnu.org
Subject: Re: bug#16448: 24.3; Messages from (error "...") with UTF-8 chars
 are printed wrongly in Emacs Lisp scripts
Date: Wed, 15 Jan 2014 08:02:49 +0400

On 01/15/2014 04:10 AM, Sergey Tselikh wrote:

> In a script, when (error "...") instruction is executed with some UTF-8
> characters in its text, the message is not printed correctly.

In batch mode, (error ...) is handled by external-debugging-output, and the
latter just does:

putc (XINT (character) & 0xFF, stderr);
                       ^^^^^^
To allow multibyte sequences here, we should use something like:

=== modified file 'src/print.c'
--- src/print.c	2014-01-01 07:43:34 +0000
+++ src/print.c	2014-01-15 03:55:39 +0000
@@ -709,8 +709,14 @@
 to make it write to the debugging output.  */)
   (Lisp_Object character)
 {
+  unsigned char str[MAX_MULTIBYTE_LENGTH];
+  unsigned int ch;
+  ptrdiff_t len;
+
   CHECK_NUMBER (character);
-  putc (XINT (character) & 0xFF, stderr);
+  ch = XINT (character);
+  len = CHAR_STRING (ch, str);
+  fwrite (str, len, 1, stderr);

 #ifdef WINDOWSNT
   /* Send the output to a debugger (nothing happens if there isn't one).  */

Dmitry

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#16448; Package emacs. (Wed, 15 Jan 2014 15:36:01 GMT)

Message #11 received at 16448 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Dmitry Antipov <dmantipov <at> yandex.ru>
Cc: 16448 <at> debbugs.gnu.org, stselikh <at> gmail.com
Subject: Re: bug#16448: 24.3; Messages from (error "...") with UTF-8 chars
 are printed wrongly in Emacs Lisp scripts
Date: Wed, 15 Jan 2014 17:35:43 +0200

> Date: Wed, 15 Jan 2014 08:02:49 +0400
> From: Dmitry Antipov <dmantipov <at> yandex.ru>
> Cc: 16448 <at> debbugs.gnu.org
> 
> On 01/15/2014 04:10 AM, Sergey Tselikh wrote:
> 
> > In a script, when (error "...") instruction is executed with some UTF-8
> > characters in its text, the message is not printed correctly.
> 
> In batch mode, (error ...) is handled by external-debugging-output, and the
> latter just does:
> 
> putc (XINT (character) & 0xFF, stderr);
>                         ^^^^^^
> To allow multibyte sequences here, we should use something like:
> 
> === modified file 'src/print.c'
> --- src/print.c	2014-01-01 07:43:34 +0000
> +++ src/print.c	2014-01-15 03:55:39 +0000
> @@ -709,8 +709,14 @@
>   to make it write to the debugging output.  */)
>     (Lisp_Object character)
>   {
> +  unsigned char str[MAX_MULTIBYTE_LENGTH];
> +  unsigned int ch;
> +  ptrdiff_t len;
> +
>     CHECK_NUMBER (character);
> -  putc (XINT (character) & 0xFF, stderr);
> +  ch = XINT (character);
> +  len = CHAR_STRING (ch, str);
> +  fwrite (str, len, 1, stderr);

This will only work correctly in a UTF-8 locale.  In the general case,
we need to run the resulting multibyte sequence through ENCODE_SYSTEM,
before writing it to stderr.

Btw, the way we output text in this case cries for refactoring: we
first assemble individual characters from their multibyte sequences,
then pass those characters one by one to external-debugging-output,
which will now have to unroll each character back into its multibyte
sequence, and encode each character individually.  Something for after
the branch, I guess.

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 01 Feb 2014 12:01:02 GMT)

Notification sent to Sergey Tselikh <stselikh <at> gmail.com>:
bug acknowledged by developer. (Sat, 01 Feb 2014 12:01:02 GMT)

Message #16 received at 16448-done <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: stselikh <at> gmail.com
Cc: 16448-done <at> debbugs.gnu.org, dmantipov <at> yandex.ru
Subject: Re: bug#16448: 24.3; Messages from (error "...") with UTF-8 chars
 are printed wrongly in Emacs Lisp scripts
Date: Sat, 01 Feb 2014 14:00:04 +0200

> Date: Wed, 15 Jan 2014 17:35:43 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 16448 <at> debbugs.gnu.org, stselikh <at> gmail.com
> 
> This will only work correctly in a UTF-8 locale.  In the general case,
> we need to run the resulting multibyte sequence through ENCODE_SYSTEM,
> before writing it to stderr.

Done in trunk revision 116232.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 01 Mar 2014 12:24:06 GMT)
