GNU bug report logs - #20704
info.el bug fix; Interprets Info format wrongly

Package: emacs;

Reported by: Teddy Hogeborn <teddy <at> recompile.se>

Date: Sun, 31 May 2015 17:53:03 UTC

Severity: normal

Tags: patch

Merged with 13431

Found in version 24.2

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20704 in the body.
You can then email your comments to 20704 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Sun, 31 May 2015 17:53:03 GMT)

Acknowledgement sent to Teddy Hogeborn <teddy <at> recompile.se>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 31 May 2015 17:53:04 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Teddy Hogeborn <teddy <at> recompile.se>
To: bug-gnu-emacs <at> gnu.org
Subject: info.el bug fix; Interprets Info format wrongly
Date: Sun, 31 May 2015 16:54:05 +0200

The Info file format (see (texinfo)Info Format Tag Table) is
documented as giving reference positions in bytes.  However, the
info.el functions "Info-find-in-tag-table-1", "Info-read-subfile", and
"Info-search" read the byte value and add it to (point-min), which is
a character position, not a byte position.  This causes the Emacs Info
reader to jump to the wrong position in Info files with a lot of
non-ASCII characters.  Solution: Convert the value read to a character
position using byte-to-position:

diff --git a/lisp/info.el b/lisp/info.el
index 80428e7..b179510 100644
--- a/lisp/info.el
+++ b/lisp/info.el
@@ -1020,7 +1020,8 @@ which the match was found."
       (beginning-of-line)
       (when (re-search-forward regexp nil t)
 	(list (string-equal "Ref:" (match-string 1))
-	      (+ (point-min) (read (current-buffer)))
+	      (+ (point-min) (byte-to-position
+                              (read (current-buffer))))
 	      major-mode)))))
 
 (defun Info-find-in-tag-table (marker regexp &optional strict-case)
@@ -1523,7 +1524,9 @@ is non-nil)."
 			thisfilepos thisfilename)
 		    (search-forward ": ")
 		    (setq thisfilename  (buffer-substring beg (- (point) 2)))
-		    (setq thisfilepos (+ (point-min) (read (current-buffer))))
+		    (setq thisfilepos (+ (point-min)
+                                         (byte-to-position
+                                          (read (current-buffer)))))
 		    ;; read in version 19 stops at the end of number.
 		    ;; Advance to the next line.
 		    (forward-line 1)
@@ -2013,9 +2016,11 @@ If DIRECTION is `backward', search in the reverse direction."
 		        (re-search-backward "\\(^.*\\): [0-9]+$")
 		      (re-search-forward "\\(^.*\\): [0-9]+$"))
 		    (goto-char (+ (match-end 1) 2))
-		    (setq list (cons (cons (+ (point-min)
-					      (read (current-buffer)))
-					   (match-string-no-properties 1))
+		    (setq list (cons (cons
+                                      (+ (point-min)
+                                         (byte-to-position
+                                          (read (current-buffer))))
+                                      (match-string-no-properties 1))
 				     list))
 		    (goto-char (if backward
                                    (1- (match-beginning 0))

Suggested ChangeLog:

----
Convert reference byte positions from Info file to character position.

* lisp/info.el (Info-find-in-tag-table-1, Info-read-subfile)
(Info-search): Convert position read from Info file from bytes to
character position.  Patch by Teddy Hogeborn <teddy <at> recompile.se>.
----

/Teddy Hogeborn
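
To illustrate the mismatch described in this report, here is a minimal Emacs Lisp sketch, not taken from info.el. The helper names `demo-utf8-bytes-before` and `demo-offset-mismatch` are hypothetical, and the sketch assumes a UTF-8 file with LF line endings, no BOM, and 0-based byte offsets in the tag table.

```elisp
;; Hypothetical helpers, for illustration only.
(defun demo-utf8-bytes-before (pos)
  "Return how many UTF-8 bytes precede buffer position POS on disk.
Assumes the file is UTF-8 with LF line endings and no BOM."
  (length (encode-coding-string
           (buffer-substring-no-properties (point-min) pos)
           'utf-8-unix)))

(defun demo-offset-mismatch ()
  "Return (FILE-OFFSET NAIVE-POS CONVERTED-POS) for the `b' in \"a\\u00e9b\"."
  (with-temp-buffer
    (insert (string ?a #xe9 ?b))          ; #xe9 is e-acute, 2 bytes in UTF-8
    (let ((offset (demo-utf8-bytes-before 3))) ; `b' is the 3rd character
      (list offset
            ;; Treating the byte offset as a character count.
            (+ (point-min) offset)
            ;; Buffer byte positions are 1-based, hence the 1+.
            (byte-to-position (1+ offset))))))
```

Evaluating `(demo-offset-mismatch)` should return `(3 4 3)`: the byte offset is 3, adding it to `(point-min)` overshoots to position 4, and converting through `byte-to-position` lands on position 3, where the `b' actually is.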

Forcibly Merged 13431 20704. Request was from Teddy Hogeborn <teddy <at> recompile.se> to control <at> debbugs.gnu.org. (Sun, 31 May 2015 18:36:03 GMT)

Added tag(s) patch. Request was from Teddy Hogeborn <teddy <at> recompile.se> to control <at> debbugs.gnu.org. (Sun, 31 May 2015 18:39:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Mon, 01 Jun 2015 14:03:02 GMT)

Message #12 received at 20704 <at> debbugs.gnu.org:

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Teddy Hogeborn <teddy <at> recompile.se>
Cc: 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Mon, 01 Jun 2015 10:01:59 -0400

Thanks,

> +	      (+ (point-min) (byte-to-position
> +                              (read (current-buffer))))

Hmm... this only works if the Info file is encoded in UTF-8.
I guess in the case of Info, 99% of the files are just ASCII and there's
a chance that the vast majority of the rest is (or will be) UTF-8,
so maybe this hack works well in practice.

But I think we should define an `Info-bytepos-to-charpos' function for that.
It can be defined as an alias for byte-to-position, but at least it
concentrates this utf-8 assumption at a single place where we can place
a clear comment.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Mon, 01 Jun 2015 15:13:02 GMT)

Message #15 received at 20704 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Mon, 01 Jun 2015 18:12:35 +0300

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Date: Mon, 01 Jun 2015 10:01:59 -0400
> Cc: 20704 <at> debbugs.gnu.org
> 
> Thanks,
> 
> > +	      (+ (point-min) (byte-to-position
> > +                              (read (current-buffer))))
> 
> Hmm... this only works if the Info file is encoded in UTF-8.
> I guess in the case of Info, 99% of the files are just ASCII and there's
> a chance that the vast majority of the rest is (or will be) UTF-8,
> so maybe this hack works well in practice.

Using byte-to-position would make things worse for Latin-1 and the
likes.

But it shouldn't be hard to add a simple test of
buffer-file-coding-system: if it states fixed-size encoding, like any
of the 8-bit encodings, or UTF-16, the conversion to character
position is trivial.  AFAIR, the only problems will be with ISO-2022
derived encodings, and those are really rare in Info.  So IMO adding
such a simple test would go a long way towards making the solution
almost perfect.

> But I think we should define an `Info-bytepos-to-charpos' function for that.
> It can be defined as an alias for byte-to-position, but at least it
> concentrates this utf-8 assumption at a single place where we can place
> a clear comment.

Right.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Tue, 09 Jun 2015 11:10:02 GMT)

Message #18 received at 20704 <at> debbugs.gnu.org:

From: Teddy Hogeborn <teddy <at> recompile.se>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Tue, 09 Jun 2015 13:09:08 +0200

Eli Zaretskii <eliz <at> gnu.org> writes:

> > > +	      (+ (point-min) (byte-to-position
> > > +                              (read (current-buffer))))
> > 
> > Hmm... this only works if the Info file is encoded in UTF-8.  I
> > guess in the case of Info, 99% of the files are just ASCII and
> > there's a chance that the vast majority of the rest is (or will be)
> > UTF-8, so maybe this hack works well in practice.
>
> Using byte-to-position would make things worse for Latin-1 and the
> likes.

No, byte-to-position already checks for that:

---- src/marker.c, line 302
  /* If this buffer has as many characters as bytes,
     each character must be one byte.
     This takes care of the case where enable-multibyte-characters is nil.  */
  if (best_above == best_above_byte)
    return bytepos;
----

Therefore, an Info file in Latin-1 should work just fine.

> But it shouldn't be hard to add a simple test of
> buffer-file-coding-system: if it states fixed-size encoding, like any
> of the 8-bit encodings, or UTF-16,
> the conversion to character position is trivial.

I think you mean UTF-32 instead of UTF-16, since UTF-16 is
variable-length.

/Teddy Hogeborn

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Tue, 09 Jun 2015 14:30:06 GMT)

Message #21 received at 20704 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Teddy Hogeborn <teddy <at> recompile.se>
Cc: monnier <at> iro.umontreal.ca, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Tue, 09 Jun 2015 17:29:09 +0300

> From: Teddy Hogeborn <teddy <at> recompile.se>
> Cc: Stefan Monnier <monnier <at> iro.umontreal.ca>,  20704 <at> debbugs.gnu.org
> Date: Tue, 09 Jun 2015 13:09:08 +0200
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > > > +	      (+ (point-min) (byte-to-position
> > > > +                              (read (current-buffer))))
> > > 
> > > Hmm... this only works if the Info file is encoded in UTF-8.  I
> > > guess in the case of Info, 99% of the files are just ASCII and
> > > there's a chance that the vast majority of the rest is (or will be)
> > > UTF-8, so maybe this hack works well in practice.
> >
> > Using byte-to-position would make things worse for Latin-1 and the
> > likes.
> 
> No, byte-to-position already checks for that:
> 
> ---- src/marker.c, line 302
>   /* If this buffer has as many characters as bytes,
>      each character must be one byte.
>      This takes care of the case where enable-multibyte-characters is nil.  */
>   if (best_above == best_above_byte)
>     return bytepos;
> ----

I think you are misreading the code: the above snippet is for unibyte
buffers, whereas a Latin-1 encoded Info file will be read into a
multibyte buffer (and decoded into the internal Emacs representation
of characters during the read).  So this optimization is not going to
work in that case.

IOW, what matters for byte-to-position is the encoding used in
representing characters in Emacs buffers, not the one used externally
by the Info file on disk.

> Therefore, an Info file in Latin-1 should work just fine.
> 
> > But it shouldn't be hard to add a simple test of
> > buffer-file-coding-system: if it states fixed-size encoding, like any
> > of the 8-bit encodings, or UTF-16,
> > the conversion to character position is trivial.
> 
> I think you mean UTF-32 instead of UTF-16, since UTF-16 is
> variable-length.

UTF-16 is fixed length for characters in the BMP.
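
The distinction drawn in this message, between the bytes a character occupies inside a multibyte buffer and the bytes it occupies in the file on disk, can be checked with a small Emacs Lisp sketch. The helper names `demo-internal-width` and `demo-disk-width` are hypothetical; the coding systems used are standard ones.

```elisp
;; Hypothetical helpers, for illustration only.
(defun demo-internal-width (char)
  "Return the number of bytes CHAR occupies in a multibyte buffer."
  (with-temp-buffer
    (insert char)
    ;; Byte positions are 1-based, so subtract 1 to get a byte count.
    (1- (position-bytes (point-max)))))

(defun demo-disk-width (char coding)
  "Return the number of bytes CHAR occupies when encoded with CODING."
  (length (encode-coding-string (string char) coding)))

(defun demo-width-report ()
  "Return a list of byte widths for a few characters and encodings."
  (list (demo-internal-width #xe9)                 ; e-acute in the buffer
        (demo-disk-width #xe9 'iso-latin-1-unix)   ; e-acute in a Latin-1 file
        (demo-disk-width #xe9 'utf-16le)           ; BMP character in UTF-16
        (demo-disk-width #x1F600 'utf-16le)))      ; non-BMP character in UTF-16
```

Evaluating `(demo-width-report)` should return `(2 1 2 4)`. The first two values differ, which is why the equal-counts shortcut in marker.c does not apply to a decoded Latin-1 file. The last two show that UTF-16 uses two bytes per character only within the BMP.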

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Tue, 09 Jun 2015 16:02:02 GMT)

Message #24 received at 20704 <at> debbugs.gnu.org:

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Teddy Hogeborn <teddy <at> recompile.se>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Tue, 09 Jun 2015 12:01:18 -0400

>> Using byte-to-position would make things worse for Latin-1 and the
>> likes.

> No, byte-to-position already checks for that:

> ---- src/marker.c, line 302
>   /* If this buffer has as many characters as bytes,
>      each character must be one byte.
>      This takes care of the case where enable-multibyte-characters is nil.  */
>   if (best_above == best_above_byte)
>     return bytepos;
> ----

> Therefore, an Info file in Latin-1 should work just fine.

No, because the representation in the buffer will still be a utf-8
derivative, so best_above will generally not be equal to best_above_byte.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Wed, 10 Jun 2015 17:51:02 GMT)

Message #27 received at 20704 <at> debbugs.gnu.org:

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Wed, 10 Jun 2015 13:50:29 -0400

> Using byte-to-position would make things worse for Latin-1 and the likes.

There's also the problem of EOL encoding, but I'll just ignore it for now.
Could someone test the patch below?


        Stefan


diff --git a/lisp/info.el b/lisp/info.el
index 9602337..0de7f1e 100644
--- a/lisp/info.el
+++ b/lisp/info.el
@@ -1020,7 +1020,7 @@ which the match was found."
       (beginning-of-line)
       (when (re-search-forward regexp nil t)
 	(list (string-equal "Ref:" (match-string 1))
-	      (+ (point-min) (read (current-buffer)))
+              (filepos-to-bufferpos (read (current-buffer)) 'approximate)
 	      major-mode)))))
 
 (defun Info-find-in-tag-table (marker regexp &optional strict-case)
@@ -1187,7 +1187,8 @@ is non-nil)."
 
 		  (when found
 		    ;; FOUND is (ANCHOR POS MODE).
-		    (setq guesspos (nth 1 found))
+		    (setq guesspos (filepos-to-bufferpos (nth 1 found)
+                                                         'approximate))
 
 		    ;; If this is an indirect file, determine which
 		    ;; file really holds this node and read it in.
@@ -1203,8 +1204,7 @@ is non-nil)."
 		      (throw 'foo t)))))
 
 	      ;; Else we may have a node, which we search for:
-	      (goto-char (max (point-min)
-			      (- (byte-to-position guesspos) 1000)))
+	      (goto-char (max (point-min) (- guesspos 1000)))
 
 	      ;; Now search from our advised position (or from beg of
 	      ;; buffer) to find the actual node.  First, check
@@ -1523,7 +1523,9 @@ is non-nil)."
 			thisfilepos thisfilename)
 		    (search-forward ": ")
 		    (setq thisfilename  (buffer-substring beg (- (point) 2)))
-		    (setq thisfilepos (+ (point-min) (read (current-buffer))))
+		    (setq thisfilepos
+                          (filepos-to-bufferpos (read (current-buffer))
+                                                'approximate))
 		    ;; read in version 19 stops at the end of number.
 		    ;; Advance to the next line.
 		    (forward-line 1)
@@ -1554,7 +1556,7 @@ is non-nil)."
     ;; Don't add the length of the skipped summary segment to
     ;; the value returned to `Info-find-node-2'.  (Bug#14125)
     (if (numberp nodepos)
-	(+ (- nodepos lastfilepos) (point-min)))))
+	(- nodepos lastfilepos))))
 
 (defun Info-unescape-quotes (value)
   "Unescape double quotes and backslashes in VALUE."
@@ -2013,8 +2015,9 @@ If DIRECTION is `backward', search in the reverse direction."
 		        (re-search-backward "\\(^.*\\): [0-9]+$")
 		      (re-search-forward "\\(^.*\\): [0-9]+$"))
 		    (goto-char (+ (match-end 1) 2))
-		    (setq list (cons (cons (+ (point-min)
-					      (read (current-buffer)))
+		    (setq list (cons (cons (filepos-to-bufferpos
+                                            (read (current-buffer))
+                                            'approximate)
 					   (match-string-no-properties 1))
 				     list))
 		    (goto-char (if backward
diff --git a/lisp/international/mule-util.el b/lisp/international/mule-util.el
index eae787b..1f7df0b 100644
--- a/lisp/international/mule-util.el
+++ b/lisp/international/mule-util.el
@@ -313,6 +313,35 @@ per-character basis, this may not be accurate."
 				  (throw 'tag3 charset)))
 			  charset-list)
 		    nil)))))))))
+
+;;;###autoload
+(defun filepos-to-bufferpos (byte &optional quality coding-system)
+  "Try to return the buffer position corresponding to a particular file position.
+The file position is given as a BYTE count.
+The function presumes the file is encoded with CODING-SYSTEM, which defaults
+to `buffer-file-coding-system'.
+QUALITY can be:
+  `approximate', in which case we may cut some corners to avoid
+    excessive work.
+  nil, in which case we may return nil rather than an approximation."
+  ;; `exact', in which case we may end up re-(en|de)coding a large
+  ;;   part of the file.
+  (unless coding-system (setq coding-system buffer-file-coding-system))
+  (let ((eol (coding-system-eol-type coding-system))
+        (type (coding-system-type coding-system))
+        (pm (save-restriction (widen) (point-min))))
+    (pcase (cons type eol)
+      (`(utf-8 . ,(or 0 2))
+       (let ((bom-offset (coding-system-get coding-system :bom)))
+         (byte-to-position
+          (+ pm (max 0 (- byte (if bom-offset 3 0)))))))
+      ;; FIXME: What if it's a 2-byte charset?  Are there such beasts?
+      (`(charset . ,(or 0 2)) (+ pm byte))
+      (_
+       (pcase quality
+         (`approximate (+ pm (byte-to-position byte)))
+         ;; (`exact ...)
+         )))))
 
 (provide 'mule-util)
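
The patch above leaves the `exact' quality as a comment. As a rough idea of what an exact conversion could look like, here is a minimal sketch that is not the patch's code and not necessarily what Emacs ended up with. The name `demo-filepos-to-bufpos-exact` is hypothetical. It assumes a stateless encoding (no BOM, no ISO-2022 escape sequences) and a 0-based byte offset, and it trades speed for simplicity by encoding one character at a time.

```elisp
;; Hypothetical helper, for illustration only; not the code from the patch.
(defun demo-filepos-to-bufpos-exact (offset coding)
  "Return the buffer position of 0-based file byte OFFSET under CODING.
Walks the buffer one character at a time, so the cost grows with
OFFSET.  Assumes CODING is stateless (no BOM, no escape sequences)."
  (save-excursion
    (save-restriction
      (widen)
      (goto-char (point-min))
      (let ((bytes 0))
        ;; Accumulate the encoded width of each character until the
        ;; running total reaches the requested file offset.
        (while (and (< bytes offset) (not (eobp)))
          (setq bytes (+ bytes
                         (length (encode-coding-string
                                  (string (char-after)) coding))))
          (forward-char 1))
        (point)))))
```

In a buffer holding "a", e-acute, "b", the `b' sits at file offset 3 in UTF-8 and at file offset 2 in Latin-1. Both `(demo-filepos-to-bufpos-exact 3 'utf-8-unix)` and `(demo-filepos-to-bufpos-exact 2 'iso-latin-1-unix)` should return 3.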

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Wed, 10 Jun 2015 18:22:01 GMT)

Message #30 received at 20704 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Wed, 10 Jun 2015 21:21:25 +0300

> From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
> Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
> Date: Wed, 10 Jun 2015 13:50:29 -0400
> 
> > Using byte-to-position would make things worse for Latin-1 and the likes.
> 
> There's also the problem of EOL encoding, but I'll just ignore it for now.

That was never a problem before Texinfo 5.x: makeinfo didn't count
the CR characters in the CRLF EOLs, and the Info readers removed the
CR characters when reading the Info files.

But Texinfo 5.x and later does count the CR characters, so the
stand-alone Info reader was recently changed to account for that.
Which means that Emacs will now have a problem, whereby the byte
counts in the tag tables will be inaccurate, and our only hope is the
1000-character tolerance we use to look for the node around the
position stated in the tag table will be large enough.

Read the gory details about that in this thread:

  http://lists.gnu.org/archive/html/bug-texinfo/2014-12/msg00068.html

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Thu, 11 Jun 2015 03:03:02 GMT)

Message #33 received at 20704 <at> debbugs.gnu.org:

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Wed, 10 Jun 2015 23:02:16 -0400

> But Texinfo 5.x and later does count the CR characters, so the
> stand-alone Info reader was recently changed to account for that.
> Which means that Emacs will now have a problem, whereby the byte
> counts in the tag tables will be inaccurate, and our only hope is the
> 1000-character tolerance we use to look for the node around the
> position stated in the tag table will be large enough.

If needed, I think we could make it work reasonably cheaply with
something along the lines of (100% guaranteed untested code):

       (let (pos lines (eol-offset 0))
         (while
             (progn
               (setq pos (byte-to-position (+ pm byte (- eol-offset))))
               (setq lines (1- (line-number-at-pos pos)))
               (not (= lines eol-offset)))
           (setq eol-offset (+ eol-offset lines)))
         pos))


-- Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20704; Package emacs. (Thu, 11 Jun 2015 13:12:03 GMT)

Message #36 received at 20704 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
Subject: Re: bug#20704: info.el bug fix; Interprets Info format wrongly
Date: Thu, 11 Jun 2015 16:11:16 +0300

> From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
> Cc: teddy <at> recompile.se, 20704 <at> debbugs.gnu.org
> Date: Wed, 10 Jun 2015 23:02:16 -0400
> 
> > But Texinfo 5.x and later does count the CR characters, so the
> > stand-alone Info reader was recently changed to account for that.
> > Which means that Emacs will now have a problem, whereby the byte
> > counts in the tag tables will be inaccurate, and our only hope is the
> > 1000-character tolerance we use to look for the node around the
> > position stated in the tag table will be large enough.
> 
> If needed, I think we could make it work reasonably cheaply with
> something along the lines of (100% guaranteed untested code):

Sure, but this needs to be conditioned on the EOL encoding we actually
found when we read the file.
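
One way to condition the adjustment on the EOL type is sketched below. It uses a plain linear scan rather than the iterative loop quoted above, so it is a different technique, shown only to make the idea concrete. The names `demo-newline-width` and `demo-ascii-filepos-to-bufpos` are hypothetical. The sketch assumes pure ASCII text and a 0-based byte offset that includes CR bytes. In real use the coding system would presumably be the one detected when the file was read, for example `buffer-file-coding-system`.

```elisp
;; Hypothetical helpers, for illustration only.
(defun demo-newline-width (coding)
  "Return the on-disk width in bytes of a newline under CODING.
Only a DOS end-of-line type (CRLF) counts as 2; anything else,
including an undetermined type, is treated as 1."
  (if (eq (coding-system-eol-type coding) 1) 2 1))

(defun demo-ascii-filepos-to-bufpos (offset coding)
  "Return the buffer position of 0-based file byte OFFSET.
Assumes the buffer text is pure ASCII and that OFFSET was computed
with any CR bytes included."
  (save-excursion
    (goto-char (point-min))
    (let ((bytes 0)
          (nl (demo-newline-width coding)))
      ;; Each buffer newline stands for NL bytes on disk; every other
      ;; character stands for exactly one byte under the ASCII assumption.
      (while (and (< bytes offset) (not (eobp)))
        (setq bytes (+ bytes (if (eq (char-after) ?\n) nl 1)))
        (forward-char 1))
      (point))))
```

For a buffer containing "ab\ncd\nef", the `e' is at file offset 8 when the file uses CRLF and at offset 6 when it uses LF. `(demo-ascii-filepos-to-bufpos 8 'utf-8-dos)` and `(demo-ascii-filepos-to-bufpos 6 'utf-8-unix)` should both return 7.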

bug closed, send any further explanations to 13431 <at> debbugs.gnu.org and Joseph Oswald <josephoswald <at> gmail.com>. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Thu, 27 Jun 2019 11:45:04 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 26 Jul 2019 11:24:05 GMT)

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
