GNU bug report logs - #25288
25.1; term, ansi-term, broken output of utf8 text

Package: emacs

Reported by: Vjacheslav <fvamail <at> gmail.com>

Date: Wed, 28 Dec 2016 16:58:02 UTC

Severity: normal

Tags: confirmed, fixed, patch

Found in versions 24.5, 25.1

Fixed in version 26.1

Done: npostavs <at> users.sourceforge.net

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 25288 in the body.
You can then email your comments to 25288 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#25288; Package emacs. (Wed, 28 Dec 2016 16:58:02 GMT)

Acknowledgement sent to Vjacheslav <fvamail <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 28 Dec 2016 16:58:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Vjacheslav <fvamail <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 25.1; term, ansi-term, broken output of utf8 text
Date: Wed, 28 Dec 2016 13:41:55 +0300

Running this command from a terminal running bash:

[fva <at> localhost ~]$ python -c 'print "ш"*5000'

produces garbage (шшш\321\210шшш) in the output, and the terminal needs a reset.
This may be a bug of the kind seen on very old Linux systems, where multibyte
characters break at buffer boundaries.

default-process-coding-system is OK:

default-process-coding-system is a variable defined in ‘C source code’.
Its value is (utf-8-unix . utf-8-unix)




In GNU Emacs 25.1.1 (x86_64-redhat-linux-gnu, GTK+ Version 3.22.4)
 of 2016-12-15 built on buildvm-30.phx2.fedoraproject.org
Windowing system distributor 'Fedora Project', version 11.0.11900000
Configured using:
 'configure --build=x86_64-redhat-linux-gnu
 --host=x86_64-redhat-linux-gnu --program-prefix=
 --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr
 --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc
 --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64
 --libexecdir=/usr/libexec --localstatedir=/var
 --sharedstatedir=/var/lib --mandir=/usr/share/man
 --infodir=/usr/share/info --with-dbus --with-gif --with-jpeg --with-png
 --with-rsvg --with-tiff --with-xft --with-xpm --with-x-toolkit=gtk3
 --with-gpm=no --with-xwidgets build_alias=x86_64-redhat-linux-gnu
 host_alias=x86_64-redhat-linux-gnu 'CFLAGS=-DMAIL_USE_LOCKF -O2 -g
 -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2
 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4
 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1
 -m64 -mtune=generic' LDFLAGS=-Wl,-z,relro
 PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig'

Configured features:
XPM JPEG TIFF GIF PNG RSVG IMAGEMAGICK SOUND DBUS GCONF GSETTINGS NOTIFY
ACL LIBSELINUX GNUTLS LIBXML2 FREETYPE M17N_FLT LIBOTF XFT ZLIB
TOOLKIT_SCROLL_BARS GTK3 X11 XWIDGETS

Important settings:
  value of $LANG: ru_RU.UTF-8
  value of $XMODIFIERS: @im=ibus
  locale-coding-system: utf-8-unix

Major mode: Term

Minor modes in effect:
  show-paren-mode: t
  recentf-mode: t
  delete-selection-mode: t
  global-auto-complete-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  line-number-mode: t
  transient-mark-mode: t

Recent messages:
Checking 120 files in /usr/share/emacs/25.1/lisp/obsolete...
Checking for load-path shadows...done
Auto-saving...
next-line: End of buffer [2 times]
previous-line: Beginning of buffer [7 times]
Quit
funcall-interactively: End of buffer [4 times]
previous-line: Beginning of buffer [2 times]
mwheel-scroll: Beginning of buffer [2 times]
Making completion list... [2 times]

Load-path shadows:
None found.

Features:
(pp shadow sort mail-extr emacsbug message idna dired format-spec rfc822
mml mml-sec password-cache epg epg-config gnus-util mm-decode mm-bodies
mm-encode mail-parse rfc2231 mailabbrev gmm-utils mailheader sendmail
rfc2047 rfc2045 ietf-drums mm-util mail-prsvr mail-utils thingatpt
help-fns help-mode term disp-table ehelp easy-mmode ropemacs ring pymacs
advice paren recentf tree-widget wid-edit easymenu delsel cus-start
cus-load erlang-start auto-complete-config auto-complete edmacro kmacro
cl-loaddefs pcase cl-lib popup time-date mule-util cyril-util tooltip
eldoc electric uniquify ediff-hook vc-hooks lisp-float-type mwheel x-win
term/common-win x-dnd tool-bar dnd fontset image regexp-opt fringe
tabulated-list newcomment elisp-mode lisp-mode prog-mode register page
menu-bar rfn-eshadow timer select scroll-bar mouse jit-lock font-lock
syntax facemenu font-core frame cl-generic cham georgian utf-8-lang
misc-lang vietnamese tibetan thai tai-viet lao korean japanese eucjp-ms
cp51932 hebrew greek romanian slovak czech european ethiopic indian
cyrillic chinese charscript case-table epa-hook jka-cmpr-hook help
simple abbrev minibuffer cl-preloaded nadvice loaddefs button faces
cus-face macroexp files text-properties overlay sha1 md5 base64 format
env code-pages mule custom widget hashtable-print-readable backquote
dbusbind inotify dynamic-setting system-font-setting font-render-setting
xwidget-internal move-toolbar gtk x-toolkit x multi-tty
make-network-process emacs)

Memory information:
((conses 16 118333 17341)
 (symbols 48 23114 0)
 (miscs 40 145 285)
 (strings 32 22117 5473)
 (string-bytes 1 586321)
 (vectors 16 15669)
 (vector-slots 8 490744 11337)
 (floats 8 203 310)
 (intervals 56 965 1)
 (buffers 976 25))




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#25288; Package emacs. (Wed, 28 Dec 2016 19:10:02 GMT)

Message #8 received at 25288 <at> debbugs.gnu.org:

From: npostavs <at> users.sourceforge.net
To: Vjacheslav <fvamail <at> gmail.com>
Cc: 25288 <at> debbugs.gnu.org
Subject: Re: bug#25288: 25.1; term, ansi-term, broken output of utf8 text
Date: Wed, 28 Dec 2016 14:10:30 -0500

found 25288 24.5
tags 25288 confirmed
quit

Vjacheslav <fvamail <at> gmail.com> writes:

> Trying to use this command from terminal running bash:
>
> [fva <at> localhost ~]$ python -c 'print "ш"*5000'
>
> produces garbage (шшш\321\210шшш) in output. Terminal needs
> reset. Possibly this is a bug which seen in very old linux, (breaks
> multibyte characters on buffer borders).
>
> default-process-coding-system is OK:
>
> default-process-coding-system is a variable defined in ‘C source code’.
> Its value is (utf-8-unix . utf-8-unix)

It looks like the problem is that the process filter function,
term-emulate-terminal, receives the output in chunks of 4096 bytes[1].  The
ш character is encoded in 2 bytes, which means it can be split across
chunks.

Is there a way to recognize incomplete decoding from lisp?  I can't see
any.


[1]: It's getting bytes rather than characters because in term-exec-1 we
have:

	;; The process's output contains not just chars but also binary
	;; escape codes, so we need to see the raw output.  We will have to
	;; do the decoding by hand on the parts that are made of chars.
	(coding-system-for-read 'binary))
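
The split described above can be reproduced outside Emacs. The following is a Python 3 sketch, not term.el code: the helper names are made up for illustration, and the single leading ASCII byte is an assumption standing in for whatever output precedes the text and shifts the 2-byte characters off the chunk boundary. It decodes each 4096-byte chunk on its own, the way a filter that ignores chunk boundaries would.

```python
# Sketch only: decode each 4096-byte chunk independently and look at the seam.
CHUNK_SIZE = 4096  # chunk size mentioned in the report


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Yield consecutive slices of at most `size` bytes."""
    for start in range(0, len(data), size):
        yield data[start:start + size]


def decode_chunks_independently(data: bytes) -> str:
    """Decode every chunk separately; stray bytes appear as \\xNN escapes."""
    return "".join(
        chunk.decode("utf-8", errors="backslashreplace")
        for chunk in chunks(data)
    )


# One ASCII byte in front (an assumption) misaligns the 2-byte characters.
raw = ("x" + "ш" * 5000).encode("utf-8")
first, second = list(chunks(raw))[:2]

print(first[-1:], second[:1])   # b'\xd1' b'\x88': the two halves of one "ш"
broken = decode_chunks_independently(raw)
print(broken[2045:2059])        # шшш\xd1\x88шшш
```

The bytes 0xD1 and 0x88 are the same values as the octal \321 and \210 shown in the original report.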





bug Marked as found in versions 24.5. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Wed, 28 Dec 2016 19:10:02 GMT)

Added tag(s) confirmed. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Wed, 28 Dec 2016 19:10:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#25288; Package emacs. (Wed, 28 Dec 2016 19:32:01 GMT)

Message #15 received at 25288 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: npostavs <at> users.sourceforge.net
Cc: 25288 <at> debbugs.gnu.org, fvamail <at> gmail.com
Subject: Re: bug#25288: 25.1; term, ansi-term, broken output of utf8 text
Date: Wed, 28 Dec 2016 21:31:14 +0200

> From: npostavs <at> users.sourceforge.net
> Date: Wed, 28 Dec 2016 14:10:30 -0500
> Cc: 25288 <at> debbugs.gnu.org
> 
> Is there a way to recognize incomplete decoding from lisp?  I can't see
> any.

If you know the encoding of the byte stream (and term.el must, since
it evidently decodes it later on), then you could probably use
char-charset after decoding: if you get 'eight-bit, then you've got
an incomplete byte sequence.  But I didn't try that.
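
For readers more familiar with Python, a loose analogue of this check is sketched below. It is not Emacs Lisp and does not use char-charset: Python 3's "surrogateescape" error handler plays the role of the 'eight-bit charset by mapping each undecodable byte to a lone surrogate (U+DC80..U+DCFF). The function name is hypothetical, and the sketch assumes an encoding such as UTF-8 in which undecoded bytes are never ASCII.

```python
def split_incomplete_tail(chunk: bytes, encoding: str = "utf-8"):
    """Return (decoded_text, leftover_bytes) for one chunk.

    Bytes that could not be decoded come back as lone surrogates
    (U+DC80..U+DCFF).  A run of them at the very end of the chunk is
    treated as the start of a character whose remaining bytes have not
    arrived yet, and is handed back as raw bytes.
    """
    text = chunk.decode(encoding, errors="surrogateescape")
    partial = 0
    # Count undecoded bytes at the end of the decoded text.
    while partial < len(text) and "\udc80" <= text[-1 - partial] <= "\udcff":
        partial += 1
    if partial == 0:
        return text, b""
    leftover = text[-partial:].encode(encoding, errors="surrogateescape")
    return text[:-partial], leftover


# A chunk that ends in the first byte of "ш" (0xD1 0x88 in UTF-8).
text, leftover = split_incomplete_tail("шш".encode("utf-8") + b"\xd1")
print(text, leftover)   # шш b'\xd1'
```

Note that this treats any undecodable trailing bytes as "incomplete", including bytes that are simply invalid; telling the two cases apart would need knowledge of the encoding's structure.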




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#25288; Package emacs. (Thu, 29 Dec 2016 02:37:01 GMT)

Message #18 received at 25288 <at> debbugs.gnu.org:

From: npostavs <at> users.sourceforge.net
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 25288 <at> debbugs.gnu.org, fvamail <at> gmail.com
Subject: Re: bug#25288: 25.1; term, ansi-term, broken output of utf8 text
Date: Wed, 28 Dec 2016 21:37:19 -0500

tags 25288 patch
quit

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: npostavs <at> users.sourceforge.net
>> Date: Wed, 28 Dec 2016 14:10:30 -0500
>> Cc: 25288 <at> debbugs.gnu.org
>> 
>> Is there a way to recognize incomplete decoding from lisp?  I can't see
>> any.
>
> If you know the encoding of the byte stream (and term.el must, since
> it evidently decodes it later on), then you could probably use
> char-charset, after decoding: if you get 'eight-bit, then you've got
> incomplete byte sequence.  But I didn't try that.

That should work at least for encodings like utf-8 for which undecoded
bytes are not ascii.  I guess parsing of escape codes would only work on
such encodings anyway, so it should be fine.  Patch attached.
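
The attached patch is not reproduced in this log. As a general illustration of handling characters that span chunks (not the patch itself, and in Python 3 rather than Emacs Lisp), the standard library's incremental decoder keeps the undecoded tail of one chunk and prepends it to the next. The generator name below is hypothetical.

```python
import codecs


def decode_stream(chunks, encoding: str = "utf-8"):
    """Yield decoded text for each chunk, carrying split characters over.

    The incremental decoder buffers an incomplete trailing sequence and
    completes it when the next chunk arrives.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in chunks:
        text = decoder.decode(chunk)  # final=False: keep any partial tail
        if text:
            yield text
    # Flush: a sequence still incomplete at end of stream becomes U+FFFD.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


raw = ("x" + "ш" * 5000).encode("utf-8")
pieces = [raw[i:i + 4096] for i in range(0, len(raw), 4096)]
result = "".join(decode_stream(pieces))
print(result == "x" + "ш" * 5000)   # True
```

The design choice here is to keep state between calls instead of inspecting each chunk in isolation, which is what a process filter receiving fixed-size chunks would need.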

[v1-0001-Handle-multibyte-chars-spanning-chunks-in-term.el.patch (text/plain, attachment)]

Added tag(s) patch. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Thu, 29 Dec 2016 02:37:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#25288; Package emacs. (Thu, 29 Dec 2016 16:07:02 GMT)

Message #23 received at 25288 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: npostavs <at> users.sourceforge.net
Cc: 25288 <at> debbugs.gnu.org, fvamail <at> gmail.com
Subject: Re: bug#25288: 25.1; term, ansi-term, broken output of utf8 text
Date: Thu, 29 Dec 2016 18:06:27 +0200

> From: npostavs <at> users.sourceforge.net
> Cc: 25288 <at> debbugs.gnu.org,  fvamail <at> gmail.com
> Date: Wed, 28 Dec 2016 21:37:19 -0500
> 
> > If you know the encoding of the byte stream (and term.el must, since
> > it evidently decodes it later on), then you could probably use
> > char-charset, after decoding: if you get 'eight-bit, then you've got
> > incomplete byte sequence.  But I didn't try that.
> 
> That should work at least for encodings like utf-8 for which undecoded
> bytes are not ascii.  I guess parsing of escape codes would only work on
> such encodings anyway, so it should be fine.  Patch attached.

LGTM, thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#25288; Package emacs. (Tue, 03 Jan 2017 14:06:01 GMT)

Message #26 received at 25288 <at> debbugs.gnu.org:

From: npostavs <at> users.sourceforge.net
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 25288 <at> debbugs.gnu.org, fvamail <at> gmail.com
Subject: Re: bug#25288: 25.1; term, ansi-term, broken output of utf8 text
Date: Tue, 03 Jan 2017 09:05:57 -0500

tags 25288 fixed
close 25288 26.1
quit

Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: npostavs <at> users.sourceforge.net
>> Cc: 25288 <at> debbugs.gnu.org,  fvamail <at> gmail.com
>> Date: Wed, 28 Dec 2016 21:37:19 -0500
>> 
>> > If you know the encoding of the byte stream (and term.el must, since
>> > it evidently decodes it later on), then you could probably use
>> > char-charset, after decoding: if you get 'eight-bit, then you've got
>> > incomplete byte sequence.  But I didn't try that.
>> 
>> That should work at least for encodings like utf-8 for which undecoded
>> bytes are not ascii.  I guess parsing of escape codes would only work on
>> such encodings anyway, so it should be fine.  Patch attached.
>
> LGTM, thanks.

Pushed as 134e86b360ca.




Added tag(s) fixed. Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Tue, 03 Jan 2017 14:06:02 GMT)

bug marked as fixed in version 26.1, send any further explanations to 25288 <at> debbugs.gnu.org and Vjacheslav <fvamail <at> gmail.com> Request was from npostavs <at> users.sourceforge.net to control <at> debbugs.gnu.org. (Tue, 03 Jan 2017 14:06:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 01 Feb 2017 12:24:04 GMT)


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.