GNU bug report logs -
#1215
23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
Report forwarded to bug-submit-list <at> lists.donarmstrong.com, Emacs Bugs <bug-gnu-emacs <at> gnu.org>: bug#1215; Package emacs.
Acknowledgement sent to "Eduardo Ochs" <eduardoochs <at> gmail.com>: New bug report received and forwarded. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>.
Message #5 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):
Hello,
this may not be exactly a bug, I'm just struggling with an obscure
part of Emacs... anyway, I did my best to make this look like a nice
bug report, and to make the tests clear enough to help other people
who also find unibyte<->multibyte conversions obscure...
The short story
===============
Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
for guillemets, i.e., the characters that we type with `C-x 8 <' and
`C-x 8 >' - as "anchors". So: if I produce an anchor string in a
unibyte buffer and then I search for an occurrence of that string in
a multibyte buffer, the search fails.
The two small blocks below illustrate this. Instructions: save the
first one to "/tmp/1.txt", the second one to "/tmp/2.txt", and then
run:
(load-file "/tmp/1.txt")
It will show "uni" in the "*Messages*" buffer, and the search will
fail. The detailed message about the failure of the search will be
like this:
progn: Search failed: "\302\253foo\302\273"
meaning the anchor string has been incorrectly converted.
;;--------snip,snip--------
;; -*- coding: raw-text-unix -*-
;; (save-this-block-as "/tmp/1.txt")
(progn
  (find-file "/tmp/2.txt")
  (goto-char (point-min))
  (setq anchorstr "«foo»")
  (message (if (multibyte-string-p anchorstr) "multi" "uni"))
  (search-forward anchorstr))
;;--------snip,snip--------
;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/2.txt")
(search-forward "«foo»")
;; «foo»
;;--------snip,snip--------
The long story
==============
Save the block below as "/tmp/3.txt" and follow the instructions in
it. Note that it doesn't have any non-ascii characters - the anchors
are produced by running the "(insert ...)" sexps.
;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/3.txt")
;; Run the "progn" below with C-x C-e.
;; It will create a line like this:
;; <<anchor>>\253anchor\273\253anchor\273\253anchor\273
;; (but the "<<", ">>", "\253", "\273" are single characters).
;; Don't delete that line, it will be used later.
;;
(progn
  (defun mmb (str) (string-make-multibyte str))
  (defun mub (str) (string-make-unibyte str))
  (insert 171 "anchor" 187)
  (insert "\253anchor\273")
  (insert (mub "\253anchor\273"))
  (insert (mmb (mub "\253anchor\273")))
  )
;; Now try to save this file.
;; Emacs will complain about the "\253"s and "\273"s - it will
;; say that iso-latin-1-unix and utf-8-unix cannot encode them.
;; The "<<" and ">>" are ok, though...
;;
;; So: leave the "<<anchor>>" above, delete the "\253anchor\273"s,
;; save this file, and reload it. DON'T SKIP THIS STEP - the
;; charset properties mentioned below behave differently before
;; and after reloads, and I don't know exactly the mechanics of
;; this... 8-\
;;
;; If we inspect the "<<", ">>", "\253", "\273" with `C-x ='
;; we see this:
;; Char: << (171, #o253, #xab, file #xAB)
;; Char: >> (187, #o273, #xbb, file #xBB)
;; Char: \253 (4194219, #o17777653, #x3fffab, raw-byte)
;; Char: \273 (4194235, #o17777673, #x3fffbb, raw-byte)
;;
;; Now mark the "<<anchor>>" above and copy it to the top of
;; the kill ring with `M-w'. Let's examine the results of
;; several obvious ways to (re)create the "<<anchor>>"
;; above as a string...
;; Here are some of the results:
;;
;; "\253anchor\273" ==> "<<anchor>>"
;; (mub "\253anchor\273") ==> "<<anchor>>"
;; (mmb (mub "\253anchor\273")) ==> "\253anchor\273"
;; (car kill-ring) ==>
;; #("<<anchor>>" 0 8 (charset iso-8859-1))
;; (mub (car kill-ring)) ==> "<<anchor>>"
;; (mmb (mub (car kill-ring))) ==> "\253anchor\273"
"\253anchor\273"
(mub "\253anchor\273")
(mmb (mub "\253anchor\273"))
(mub (mmb (mub "\253anchor\273")))
(mapcar 'identity "\253anchor\273")
(mapcar 'identity (mub "\253anchor\273"))
(mapcar 'identity (mmb (mub "\253anchor\273")))
(car kill-ring)
(mub (car kill-ring))
(mmb (mub (car kill-ring)))
(mapcar 'identity (car kill-ring))
(mapcar 'identity (mub (car kill-ring)))
(mapcar 'identity (mmb (mub (car kill-ring))))
;; This is the weird part.
;; Let's insert another "<<anchor>>"/"\253anchor\273" pair, and
;; let's try to jump to its "anchors" with `search-backward'.
(insert 171 "anchor" 187 "\n\253anchor\273")
(search-backward "\253anchor\273")
(search-backward (mub "\253anchor\273"))
(search-backward (mmb (mub "\253anchor\273")))
(search-backward (car kill-ring))
(search-backward (mub (car kill-ring)))
(search-backward (mmb (mub (car kill-ring))))
;; Only "(search-backward (car kill-ring))" jumps to
;; "<<anchor>>" - all the others jump to "\253anchor\273".
;; The trick - aha! - is that "(car kill-ring)" holds this
;; string,
;;
;; (car kill-ring) ==>
;; #("<<anchor>>" 0 8 (charset iso-8859-1))
;;
;; and the "(charset iso-8859-1)" property is essential...
;;--------snip,snip--------
What is the standard way to convert unibyte strings (for example
anchor strings, generated from code in raw-text-unix ".el" files) to
strings with the right charset property (if needed) and the right
encoding? I couldn't find the functions for that...
Cheers, thanks in advance,
Eduardo Ochs
eduardoochs at gmail.com
http://angg.twu.net/
P.S.: (emacs-version) ==>
"GNU Emacs 23.0.60.1 (i686-pc-linux-gnu, GTK+ Version 2.8.20)
of 2008-10-11 on dekooning"
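One possible answer to the question that closes the report, offered as a sketch rather than as the thread's conclusion: if the unibyte anchor is known to hold Latin-1 bytes, `decode-coding-string` turns it into a multibyte string whose characters are the guillemets themselves. The names `my-anchor-from-bytes`, `my-anchor` and `my-anchor-search-demo`, and the choice of `latin-1`, are assumptions made for illustration.

```elisp
;; Hypothetical helper, not from the report: decode a unibyte anchor
;; string, assuming its bytes are Latin-1 text.
(defun my-anchor-from-bytes (bytes)
  "Return BYTES decoded as Latin-1, as a multibyte string."
  (decode-coding-string bytes 'latin-1))

;; "\253anchor\273" is read as a unibyte string (bytes #xAB ... #xBB).
(defvar my-anchor (my-anchor-from-bytes "\253anchor\273"))

;; Search for the decoded anchor in a multibyte buffer that contains
;; real guillemets (characters 171 and 187, as in the report).
(defun my-anchor-search-demo ()
  "Return the position after the anchor, or nil if it is not found."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert 171 "anchor" 187)
    (goto-char (point-min))
    (search-forward my-anchor nil t)))
```

Evaluating `(my-anchor-search-demo)` should return a non-nil position when the decoded anchor matches the guillemet version in the buffer.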
Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>.
Message #10 received at submit <at> emacsbugs.donarmstrong.com (full text, mbox):
> Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
> for guillemets, i.e., the characters that we type with `C-x 8 <' and
> `C-x 8 >' - as "anchors". So: if I produce an anchor string in a
> unibyte buffer and then I search for an occurrence of that string in
> multibyte buffer, the search fails.
There are no guillemets in unibyte buffers.
> ;;--------snip,snip--------
> ;; -*- coding: raw-text-unix -*-
> ;; (save-this-block-as "/tmp/1.txt")
> (progn
> (find-file "/tmp/2.txt")
> (goto-char (point-min))
> (setq anchorstr "«foo»")
> (message (if (multibyte-string-p anchorstr) "multi" "uni"))
> (search-forward anchorstr))
There's a bug here, indeed: Emacs should refuse to save such a file,
because raw-text-unix (to which I prefer to refer as `binary') cannot
encode « and ».
Stefan
Acknowledgement sent to Juanma Barranquero <lekktu <at> gmail.com>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 16 Jan 2009 00:25:06 GMT)
Message #20 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
On Wed, Oct 22, 2008 at 15:51, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:
> There's a bug here, indeed: Emacs should refuse to save such a file,
> because raw-text-unix (to which I prefer to refer as `binary') cannot
> encode « and ».
Why not? « is U+00AB and » is U+00BB.
(with-temp-file "/temp/guillemets.txt"
  (set-buffer-multibyte nil)
  (setq buffer-file-coding-system 'raw-text-unix)
  (insert ?« "Test" ?» ?\n))
=>
0000 0000 ab 54 65 73 74 bb 0a ½Test╗.
Juanma
Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 16 Jan 2009 02:50:04 GMT)
Message #25 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
>> There's a bug here, indeed: Emacs should refuse to save such a file,
>> because raw-text-unix (to which I prefer to refer as `binary') cannot
>> encode « and ».
> Why not? « is U+00AB and » is U+00BB.
Neither of which is a byte. The byte 0xAB is the Emacs character
#x3fffab, as shown by (unibyte-char-to-multibyte #xab).
If you save that file and read it back in, you'll see that its content
has changed. `save-buffer' should not silently save if it will
lose information.
Stefan
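The distinction drawn in the message above can be checked directly. The sketch below compares the raw byte #xAB, promoted to a character, with the character obtained by decoding that same byte as Latin-1. The variable names are made up for illustration.

```elisp
;; The byte #xAB promoted to Emacs's internal "raw byte" character.
(defvar my-raw-byte-char (unibyte-char-to-multibyte #xab))

;; The same byte interpreted as Latin-1 text: U+00AB, the left guillemet.
(defvar my-guillemet-char
  (aref (decode-coding-string "\253" 'latin-1) 0))

;; Show both character codes and whether they are the same character.
(message "raw byte: #x%x  guillemet: #x%x  same? %s"
         my-raw-byte-char my-guillemet-char
         (= my-raw-byte-char my-guillemet-char))
```

The two values are different characters, which is why a string built from raw bytes does not match guillemets in a multibyte buffer.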
Acknowledgement sent to Juanma Barranquero <lekktu <at> gmail.com>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 16 Jan 2009 03:05:06 GMT)
Message #30 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
> If you save that file and read it back in, you'll see that its content
> has changed.
Sorry, but I don't see that.
emacs -Q
then I evaluate this:
(with-temp-file "/temp/guillemets.txt"
  (set-buffer-multibyte nil)
  (setq buffer-file-coding-system 'raw-text-unix)
  (insert ?« "Test" ?» ?\n))
then
C-x C-f /temp/guillemets.txt
I get a buffer guillemets.txt with
«Test»
as a multibyte file in iso-latin-1-unix. I can modify it and save it,
and still the guillemets are bytes 0xab and 0xbb in the resulting
file.
Juanma
Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 16 Jan 2009 03:45:04 GMT)
Message #35 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
> Sorry, but I don't see that.
> emacs -Q
> then I evaluate this:
> (with-temp-file "/temp/guillemets.txt"
> (set-buffer-multibyte nil)
> (setq buffer-file-coding-system 'raw-text-unix)
> (insert ?« "Test" ?» ?\n))
You're cheating: remove the (set-buffer-multibyte nil).
Otherwise you're not actually inserting the ?« char but the #xAB
byte instead.
Stefan
Acknowledgement sent to Juanma Barranquero <lekktu <at> gmail.com>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 16 Jan 2009 11:15:03 GMT)
Message #40 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
On Fri, Jan 16, 2009 at 04:37, Stefan Monnier <monnier <at> iro.umontreal.ca> wrote:
> You're cheating: remove the (set-buffer-multibyte nil).
> Otherwise you're not actually inserting the ?« char but the #xAB
> byte instead.
OK, I see.
You said:
"There's a bug here, indeed: Emacs should refuse to save such a file,
because raw-text-unix (to which I prefer to refer as `binary') cannot
encode « and »."
but according to raw-text-unix's description:
t -- raw-text-unix
Raw text, which means text contains random 8-bit codes.
Encoding text with this coding system produces the actual byte
sequence of the text in buffers and strings. An exception is made for
eight-bit-control characters. Each of them is encoded into a single
byte.
you can save (almost) anything with it. What is the bug?
Juanma
Acknowledgement sent to Stefan Monnier <monnier <at> iro.umontreal.ca>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Fri, 16 Jan 2009 21:05:06 GMT)
Message #45 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
> but according to raw-text-unix's description:
> t -- raw-text-unix
> Raw text, which means text contains random 8-bit codes.
> Encoding text with this coding system produces the actual byte
> sequence of the text in buffers and strings. An exception is made for
> eight-bit-control characters. Each of them is encoded into a single
> byte.
> you can save (almost) anything with it. What is the bug?
The bug is that you can currently save (almost) anything with it. This is
due to historical reasons, where different notions of "no encoding" were
mixed up. So on save, raw-text-unix behaves pretty much like
utf-8-mule under Emacs-23 and emacs-mule under Emacs-22. On load, it
behaves pretty much like `binary'.
Stefan
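The asymmetry described in the message above can be probed without touching any file, using the string counterparts of saving and loading. This is only a sketch: `my-raw-text-round-trip` is a hypothetical name, and the exact bytes produced may vary between Emacs versions, so no output is claimed here.

```elisp
(defun my-raw-text-round-trip (text)
  "Encode TEXT with `raw-text-unix', then decode the result the same way.
Return a list (TEXT ENCODED DECODED) for inspection."
  (let* ((encoded (encode-coding-string text 'raw-text-unix))
         (decoded (decode-coding-string encoded 'raw-text-unix)))
    (list text encoded decoded)))

;; Build the guillemet string from character codes so that this example
;; does not depend on the coding system of the file it is stored in.
(my-raw-text-round-trip (string 171 ?f ?o ?o 187))
```

If the first and third elements of the returned list differ, the round trip has lost information, which is the situation in which, according to the thread, `save-buffer` should not save silently.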
Acknowledgement sent to Eli Zaretskii <eliz <at> gnu.org>: Extra info received and forwarded to list. Copy sent to Emacs Bugs <bug-gnu-emacs <at> gnu.org>. (Sat, 17 Jan 2009 10:20:04 GMT)
Message #50 received at 1215 <at> emacsbugs.donarmstrong.com (full text, mbox):
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Date: Fri, 16 Jan 2009 15:56:44 -0500
> Cc: 1215 <at> emacsbugs.donarmstrong.com
>
> > but according to raw-text-unix's description:
>
> > t -- raw-text-unix
>
> > Raw text, which means text contains random 8-bit codes.
> > Encoding text with this coding system produces the actual byte
> > sequence of the text in buffers and strings. An exception is made for
> > eight-bit-control characters. Each of them is encoded into a single
> > byte.
>
> > you can save (almost) anything with it. What is the bug?
>
> The bug is that you can currently save (almost) anything with it. This is
> due to historical reasons, where different notions of "no encoding" were
> mixed up. So on save, raw-text-unix behaves pretty much like
> utf-8-mule under Emacs-23 and emacs-mule under Emacs-22. On load, it
> behaves pretty much like `binary'.
I documented this in the ELisp manual.
bug closed, send any further explanations to "Eduardo Ochs" <eduardoochs <at> gmail.com>
Request was from Chong Yidong <cyd <at> stupidchicken.com> to control <at> emacsbugs.donarmstrong.com. (Wed, 08 Jul 2009 14:10:06 GMT)
bug archived.
Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> emacsbugs.donarmstrong.com. (Sun, 09 Aug 2009 14:24:12 GMT)
This bug report was last modified 15 years and 183 days ago.