Package: emacs;
Reported by: "Eduardo Ochs" <eduardoochs <at> gmail.com>
Date: Tue, 21 Oct 2008 16:10:03 UTC
Severity: normal
Done: Chong Yidong <cyd <at> stupidchicken.com>
Bug is archived. No further changes may be made.
From: "Eduardo Ochs" <eduardoochs <at> gmail.com>
To: emacs-pretest-bug <at> gnu.org
Subject: bug#1215: 23.0.60; unibyte->multibyte conversion problem (in search-forward and friends)
Date: Tue, 21 Oct 2008 12:00:58 -0400
Hello,

this may not be exactly a bug, I'm just struggling with an obscure
part of Emacs... anyway, I did my best to make this look like a nice
bug report, and to make the tests clear enough to help other people
who also find unibyte<->multibyte conversions obscure...


The short story
===============
Let me refer to strings like "<<tag>>" - where the "<<" and ">>" stand
for guillemets, i.e., the characters that we type with `C-x 8 <' and
`C-x 8 >' - as "anchors". So: if I produce an anchor string in a
unibyte buffer and then search for an occurrence of that string in a
multibyte buffer, the search fails.

The two small blocks below illustrate this. Instructions: save the
first one to "/tmp/1.txt", the second one to "/tmp/2.txt", and then
run:

  (load-file "/tmp/1.txt")

It will show "uni" in the "*Messages*" buffer, and the search will
fail. The detailed message about the failure of the search will be
like this:

  progn: Search failed: "\302\253foo\302\273"

meaning the anchor string has been incorrectly converted.

;;--------snip,snip--------
;; -*- coding: raw-text-unix -*-
;; (save-this-block-as "/tmp/1.txt")
(progn
  (find-file "/tmp/2.txt")
  (goto-char (point-min))
  (setq anchorstr "«foo»")
  (message (if (multibyte-string-p anchorstr) "multi" "uni"))
  (search-forward anchorstr))
;;--------snip,snip--------

;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/2.txt")
(search-forward "«foo»")
;; «foo»
;;--------snip,snip--------


The long story
==============
Save the block below as "/tmp/3.txt" and follow the instructions in
it. Note that it doesn't have any non-ascii characters - the anchors
are produced by running the "(insert ...)" sexps.

;;--------snip,snip--------
;; -*- coding: latin-1 -*-
;; (save-this-block-as "/tmp/3.txt")
;; Run the "progn" below with C-x C-e.
;; It will create a line like this:
;; <<anchor>>\253anchor\273\253anchor\273\253anchor\273
;; (but the "<<", ">>", "\253", "\273" are single characters).
;; Don't delete that line, it will be used later.
;;
(progn
  (defun mmb (str) (string-make-multibyte str))
  (defun mub (str) (string-make-unibyte str))
  (insert 171 "anchor" 187)
  (insert "\253anchor\273")
  (insert (mub "\253anchor\273"))
  (insert (mmb (mub "\253anchor\273")))
  )

;; Now try to save this file.
;; Emacs will complain about the "\253"s and "\273"s - it will
;; say that iso-latin-1-unix and utf-8-unix cannot encode them.
;; The "<<" and ">>" are ok, though...
;;
;; So: leave the "<<anchor>>" above, delete the "\253anchor\273"s,
;; save this file, and reload it. DON'T SKIP THIS STEP - the
;; charset properties mentioned below behave differently before
;; and after reloads, and I don't know exactly the mechanics of
;; this... 8-\
;;
;; If we inspect the "<<", ">>", "\253", "\273" with `C-x ='
;; we see this:
;;   Char: << (171, #o253, #xab, file #xAB)
;;   Char: >> (187, #o273, #xbb, file #xBB)
;;   Char: \253 (4194219, #o17777653, #x3fffab, raw-byte)
;;   Char: \273 (4194235, #o17777673, #x3fffbb, raw-byte)
;;
;; Now mark the "<<anchor>>" above and copy it to the top of
;; the kill ring with `M-w'. Let's examine the results of
;; several obvious ways to (re)create the "<<anchor>>"
;; above as a string...
;; Here are some of the results:
;;
;;   "\253anchor\273"              ==> "<<anchor>>"
;;   (mub "\253anchor\273")        ==> "<<anchor>>"
;;   (mmb (mub "\253anchor\273"))  ==> "\253anchor\273"
;;   (car kill-ring)               ==>
;;     #("<<anchor>>" 0 8 (charset iso-8859-1))
;;   (mub (car kill-ring))         ==> "<<anchor>>"
;;   (mmb (mub (car kill-ring)))   ==> "\253anchor\273"

"\253anchor\273"
(mub "\253anchor\273")
(mmb (mub "\253anchor\273"))
(mub (mmb (mub "\253anchor\273")))
(mapcar 'identity "\253anchor\273")
(mapcar 'identity (mub "\253anchor\273"))
(mapcar 'identity (mmb (mub "\253anchor\273")))

(car kill-ring)
(mub (car kill-ring))
(mmb (mub (car kill-ring)))
(mapcar 'identity (car kill-ring))
(mapcar 'identity (mub (car kill-ring)))
(mapcar 'identity (mmb (mub (car kill-ring))))

;; This is the weird part.
;; Let's insert another "<<anchor>>"/"\253anchor\273" pair, and
;; let's try to jump to its "anchors" with `search-backward'.

(insert 171 "anchor" 187 "\n\253anchor\273")

(search-backward "\253anchor\273")
(search-backward (mub "\253anchor\273"))
(search-backward (mmb (mub "\253anchor\273")))
(search-backward (car kill-ring))
(search-backward (mub (car kill-ring)))
(search-backward (mmb (mub (car kill-ring))))

;; Only "(search-backward (car kill-ring))" jumps to
;; "<<anchor>>" - all the others jump to "\253anchor\273".
;; The trick - aha! - is that "(car kill-ring)" holds this
;; string,
;;
;;   (car kill-ring) ==>
;;     #("<<anchor>>" 0 8 (charset iso-8859-1))
;;
;; and the "(charset iso-8859-1)" property is essential...
;;--------snip,snip--------


What is the standard way to convert unibyte strings (for example
anchor strings, generated from code in raw-text-unix ".el" files) to
strings with the right charset property (if needed) and the right
encoding? I couldn't find the functions for that...

Cheers, thanks in advance,
Eduardo Ochs
eduardoochs at gmail.com
http://angg.twu.net/

P.S.: (emacs-version) ==>
"GNU Emacs 23.0.60.1 (i686-pc-linux-gnu, GTK+ Version 2.8.20)
 of 2008-10-11 on dekooning"
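
One possible way to approach the conversion question above, as a minimal sketch and not a confirmed fix for this bug: `decode-coding-string` turns a unibyte string into a multibyte one when it is given the coding system that matches the bytes. The helper name `my-decode-anchor` is hypothetical. Which coding system to pass (`latin-1` or `utf-8`) is an assumption about the bytes actually stored in the file. The sketch does not settle whether a `charset` text property is also needed.

```elisp
;; Hypothetical helper, not part of Emacs: decode a unibyte anchor
;; string with an explicitly chosen coding system.
(defun my-decode-anchor (str coding)
  "Return STR decoded with CODING if STR is unibyte, else STR unchanged."
  (if (multibyte-string-p str)
      str
    (decode-coding-string str coding)))

;; Octal bytes 253 and 273 are the Latin-1 guillemets.
(my-decode-anchor "\253foo\273" 'latin-1)        ; ==> "«foo»"

;; The same anchor as UTF-8 bytes, which is the byte sequence shown
;; in the "Search failed" message of the short story.
(my-decode-anchor "\302\253foo\302\273" 'utf-8)  ; ==> "«foo»"
```

With such a helper, the search in "/tmp/1.txt" could be tried as `(search-forward (my-decode-anchor anchorstr 'utf-8))`, assuming the anchor was saved as UTF-8 bytes.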