GNU bug report logs - #20783
25.0.50; [PATCH] byte-to-position has internal off-by-one bug

Previous Next

Package: emacs;

Reported by: Wolfgang Jenkner <wjenkner <at> inode.at>

Date: Wed, 10 Jun 2015 15:20:05 UTC

Severity: normal

Tags: patch

Found in version 25.0.50

Fixed in version 25.1

Done: Wolfgang Jenkner <wjenkner <at> inode.at>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20783 in the body.
You can then email your comments to 20783 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Wed, 10 Jun 2015 15:20:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to Wolfgang Jenkner <wjenkner <at> inode.at>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 10 Jun 2015 15:20:06 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Wolfgang Jenkner <wjenkner <at> inode.at>
To: bug-gnu-emacs <at> gnu.org
Subject: 25.0.50; [PATCH] byte-to-position has internal off-by-one bug
Date: Wed, 10 Jun 2015 17:13:30 +0200
Here's a test case for the bug:

(with-temp-buffer
  (insert "éé")
  (let ((i 1) pos res)
    (while (setq pos (byte-to-position i))
      (push pos res)
      (setq i (1+ i)))
    (nreverse res)))

=> (1 2 2 2 3)

while the correct result is

=> (1 1 2 2 3)

I found that this bug had been noticed before in

http://stackoverflow.com/questions/17588117/emacs-byte-to-position-function-is-not-consistent-with-document

Here's a patch.  The fix may look a bit clumsy but it's actually meant
to avoid pessimizing the presumably common case where the initial
bytepos is at a character boundary.

-- >8 --
Subject: [PATCH] * src/marker.c (buf_bytepos_to_charpos): Fix best_below_byte
 count.

If bytepos is not after a character boundary the preceding loop
overshoots by one character position.
---
 src/marker.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/src/marker.c b/src/marker.c
index 73928ba..94d676b 100644
--- a/src/marker.c
+++ b/src/marker.c
@@ -341,6 +341,12 @@ buf_bytepos_to_charpos (struct buffer *b, ptrdiff_t bytepos)
 	  BUF_INC_POS (b, best_below_byte);
 	}
 
+      if (best_below_byte != bytepos)
+	{
+	  best_below--;
+	  BUF_DEC_POS (b, best_below_byte);
+	}
+
       /* If this position is quite far from the nearest known position,
 	 cache the correspondence by creating a marker here.
 	 It will last until the next GC.
-- 
2.4.2





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Wed, 10 Jun 2015 17:18:02 GMT) Full text and rfc822 format available.

Message #8 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Wolfgang Jenkner <wjenkner <at> inode.at>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Wed, 10 Jun 2015 20:17:20 +0300
> From: Wolfgang Jenkner <wjenkner <at> inode.at>
> Date: Wed, 10 Jun 2015 17:13:30 +0200
> 
> Here's a test case for the bug:
> 
> (with-temp-buffer
>   (insert "éé")
>   (let ((i 1) pos res)
>     (while (setq pos (byte-to-position i))
>       (push pos res)
>       (setq i (1+ i)))
>     (nreverse res)))
> 
> => (1 2 2 2 3)
> 
> while the correct result is
> 
> => (1 1 2 2 3)
> 
> I found that this bug had been noticed before in
> 
> http://stackoverflow.com/questions/17588117/emacs-byte-to-position-function-is-not-consistent-with-document
> 
> Here's a patch.  The fix may look a bit clumsy but it's actually meant
> to avoid pessimizing the presumably common case where the initial
> bytepos is at a character boundary.

Wouldn't it be better to handle this use case in Fbyte_to_position?
The BYTE_TO_CHAR macro is called an awful lot in the Emacs innermost
loops, and it's _always_ called with a byte position that's on a
character boundary.  So punishing all that code with even a single
comparison, for the benefit of a use case whose importance is unclear
to me is not necessarily TRT.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Thu, 11 Jun 2015 15:49:02 GMT) Full text and rfc822 format available.

Message #11 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Wolfgang Jenkner <wjenkner <at> inode.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Thu, 11 Jun 2015 17:24:42 +0200
On Wed, Jun 10 2015, Eli Zaretskii wrote:

> Wouldn't it be better to handle this use case in Fbyte_to_position?
> The BYTE_TO_CHAR macro is called an awful lot in the Emacs innermost
> loops, and it's _always_ called with a byte position that's on a
> character boundary.

I see.  How about something like the patch below?  The loop could be
improved a bit by doing pointer arithmetic like in DEC_POS but it's
perhaps not worth complicating things for the case where bytepos is not
at a character boundary.

-- >8 --
Subject: [PATCH] * editfns.c (Fbyte_to_position): Fix bytepos not at char
 boundary.

The behavior now matches the description in the manual.  (Bug#20783)
---
 src/editfns.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/editfns.c b/src/editfns.c
index cddb0d4..94715fe 100644
--- a/src/editfns.c
+++ b/src/editfns.c
@@ -1025,10 +1025,18 @@ DEFUN ("byte-to-position", Fbyte_to_position, Sbyte_to_position, 1, 1, 0,
 If BYTEPOS is out of range, the value is nil.  */)
   (Lisp_Object bytepos)
 {
+  ptrdiff_t pos_byte;
+
   CHECK_NUMBER (bytepos);
-  if (XINT (bytepos) < BEG_BYTE || XINT (bytepos) > Z_BYTE)
+  pos_byte = XINT (bytepos);
+  if (pos_byte < BEG_BYTE || pos_byte > Z_BYTE)
     return Qnil;
-  return make_number (BYTE_TO_CHAR (XINT (bytepos)));
+  if (Z != Z_BYTE)
+    /* There are multibyte characters in the buffer.
+       Search for the start of the current character. */
+    while (!CHAR_HEAD_P (FETCH_BYTE (pos_byte)))
+      pos_byte--;
+  return make_number (BYTE_TO_CHAR (pos_byte));
 }
 
 DEFUN ("following-char", Ffollowing_char, Sfollowing_char, 0, 0, 0,
-- 
2.4.2





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Thu, 11 Jun 2015 16:05:03 GMT) Full text and rfc822 format available.

Message #14 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Wolfgang Jenkner <wjenkner <at> inode.at>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Thu, 11 Jun 2015 19:04:24 +0300
> From: Wolfgang Jenkner <wjenkner <at> inode.at>
> Cc: 20783 <at> debbugs.gnu.org
> Date: Thu, 11 Jun 2015 17:24:42 +0200
> 
> I see.  How about something like the patch below?  The loop could be
> improved a bit by doing pointer arithmetic like in DEC_POS but it's
> perhaps not worth complicating things for the case where bytepos is not
> at a character boundary.
> 
> -- >8 --
> Subject: [PATCH] * editfns.c (Fbyte_to_position): Fix bytepos not at char
>  boundary.
> 
> The behavior now matches the description in the manual.  (Bug#20783)
> ---
>  src/editfns.c | 12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/src/editfns.c b/src/editfns.c
> index cddb0d4..94715fe 100644
> --- a/src/editfns.c
> +++ b/src/editfns.c
> @@ -1025,10 +1025,18 @@ DEFUN ("byte-to-position", Fbyte_to_position, Sbyte_to_position, 1, 1, 0,
>  If BYTEPOS is out of range, the value is nil.  */)
>    (Lisp_Object bytepos)
>  {
> +  ptrdiff_t pos_byte;
> +
>    CHECK_NUMBER (bytepos);
> -  if (XINT (bytepos) < BEG_BYTE || XINT (bytepos) > Z_BYTE)
> +  pos_byte = XINT (bytepos);
> +  if (pos_byte < BEG_BYTE || pos_byte > Z_BYTE)
>      return Qnil;
> -  return make_number (BYTE_TO_CHAR (XINT (bytepos)));
> +  if (Z != Z_BYTE)
> +    /* There are multibyte characters in the buffer.
> +       Search for the start of the current character. */
> +    while (!CHAR_HEAD_P (FETCH_BYTE (pos_byte)))
> +      pos_byte--;
> +  return make_number (BYTE_TO_CHAR (pos_byte));
>  }

Works for me, thanks.  But please add a comment there about
BYTE_TO_CHAR expecting byte positions that are on a character
boundary, so that the reason for the loop is clear.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Thu, 11 Jun 2015 16:43:01 GMT) Full text and rfc822 format available.

Message #17 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Wolfgang Jenkner <wjenkner <at> inode.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Thu, 11 Jun 2015 18:41:35 +0200
On Thu, Jun 11 2015, Eli Zaretskii wrote:

>  But please add a comment there about
> BYTE_TO_CHAR expecting byte positions that are on a character
> boundary,

Wouldn't it make sense to add this to the comment before the definition
of BYTE_TO_CHAR instead (or to both)?






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Thu, 11 Jun 2015 19:10:03 GMT) Full text and rfc822 format available.

Message #20 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Wolfgang Jenkner <wjenkner <at> inode.at>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Thu, 11 Jun 2015 22:08:16 +0300
> From: Wolfgang Jenkner <wjenkner <at> inode.at>
> Cc: 20783 <at> debbugs.gnu.org
> Date: Thu, 11 Jun 2015 18:41:35 +0200
> 
> On Thu, Jun 11 2015, Eli Zaretskii wrote:
> 
> >  But please add a comment there about
> > BYTE_TO_CHAR expecting byte positions that are on a character
> > boundary,
> 
> Wouldn't it make sense to add this to the comment before the definition
> of BYTE_TO_CHAR instead (or to both)?

Both, I'd say.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Tue, 16 Jun 2015 15:53:02 GMT) Full text and rfc822 format available.

Message #23 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Wolfgang Jenkner <wjenkner <at> inode.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Tue, 16 Jun 2015 17:40:38 +0200
[Message part 1 (text/plain, inline)]
On Thu, Jun 11 2015, Wolfgang Jenkner wrote:

> The loop could be improved a bit by doing pointer arithmetic like in
> DEC_POS

I wondered whether such a change (to avoid unnecessary buffer gap
considerations while in the middle of a multibyte character) would
actually make a measurable difference, so, silly as that may be, I wrote
a simple benchmark for byte-to-position, using the tutorials as data
samples, and passed the results to ministat(1)[*], please see the
attached btp-ministat.el and ministat.sh for details.

[*] https://www.freebsd.org/cgi/man.cgi?query=ministat&sektion=1&manpath=FreeBSD+10.1-RELEASE

The result is that ministat reports statistical differences for several
of the tutorials (but not generally for the same languages at each run,
system load apparently generating too much statistical noise) and I find
that the version with the DEC_POS like loop is _always_ faster in those
cases (judging from the average values or just by taking a quick look at
the histograms).

So, while this is not really very important, it seems that it would be
safe to use the following patch with the improved loop instead:

[0001-src-editfns.c-Fbyte_to_position-Fix-bytepos-not-at-c.patch (text/x-diff, inline)]
From be2adf5b7b427ee5d84c9ae011d8d11d452c2f4e Mon Sep 17 00:00:00 2001
From: Wolfgang Jenkner <wjenkner <at> inode.at>
Date: Thu, 11 Jun 2015 16:21:21 +0200
Subject: [PATCH] * src/editfns.c (Fbyte_to_position): Fix bytepos not at char
 boundary.

The behavior now matches the description in the manual.  (Bug#20783)
---
 src/editfns.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/src/editfns.c b/src/editfns.c
index cddb0d4..ff54e73 100644
--- a/src/editfns.c
+++ b/src/editfns.c
@@ -1025,10 +1025,28 @@ DEFUN ("byte-to-position", Fbyte_to_position, Sbyte_to_position, 1, 1, 0,
 If BYTEPOS is out of range, the value is nil.  */)
   (Lisp_Object bytepos)
 {
+  ptrdiff_t pos_byte;
+
   CHECK_NUMBER (bytepos);
-  if (XINT (bytepos) < BEG_BYTE || XINT (bytepos) > Z_BYTE)
+  pos_byte = XINT (bytepos);
+  if (pos_byte < BEG_BYTE || pos_byte > Z_BYTE)
     return Qnil;
-  return make_number (BYTE_TO_CHAR (XINT (bytepos)));
+  if (Z != Z_BYTE)
+    /* There are multibyte characters in the buffer.
+       The argument of BYTE_TO_CHAR must be a byte position at
+       a character boundary, so search for the start of the current
+       character.  */
+    {
+      unsigned char *chp = BYTE_POS_ADDR (pos_byte);
+
+      while (!CHAR_HEAD_P (*chp))
+	{
+	  pos_byte--;
+	  /* There's no buffer gap in the middle of a character.  */
+	  chp--;
+	}
+    }
+  return make_number (BYTE_TO_CHAR (pos_byte));
 }
 
 DEFUN ("following-char", Ffollowing_char, Sfollowing_char, 0, 0, 0,
-- 
2.4.2

[btp-ministat.el (application/emacs-lisp, attachment)]
[ministat.sh (text/x-sh, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Tue, 16 Jun 2015 16:10:03 GMT) Full text and rfc822 format available.

Message #26 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Wolfgang Jenkner <wjenkner <at> inode.at>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Tue, 16 Jun 2015 19:08:49 +0300
> From: Wolfgang Jenkner <wjenkner <at> inode.at>
> Cc: 20783 <at> debbugs.gnu.org
> Date: Tue, 16 Jun 2015 17:40:38 +0200
> 
> +      while (!CHAR_HEAD_P (*chp))
> +	{
> +	  pos_byte--;
> +	  /* There's no buffer gap in the middle of a character.  */
> +	  chp--;
> +	}

Thanks, but I'd prefer we didn't have code that manipulated pointers
to buffer text directly.  E.g., if we ever have some kind of
multi-threading, or even if at some point someone adds a non-trivial
function call to this loop, this code will be a subtle bug waiting to
bite.  It's fundamentally not safe to do this, and not only due to the
gap considerations, but also because in general BEG_ADDR might change
under certain circumstances behind your back.  (Buffer text and string
data are implemented with double indirection for good reasons.)

For some very tight loops, it might be justified to take these
shortcuts (with WARNING COMMENTS CRYING BLOODY MURDER all around), but
this function doesn't belong to those cases.

So I prefer the previous variant, even though it will lose that
benchmark.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Tue, 16 Jun 2015 16:32:02 GMT) Full text and rfc822 format available.

Message #29 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Wolfgang Jenkner <wjenkner <at> inode.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Tue, 16 Jun 2015 18:31:08 +0200
On Tue, Jun 16 2015, Eli Zaretskii wrote:

> So I prefer the previous variant, even though it will lose that
> benchmark.

Neither do I care about winning the benchmark.  It just seemed the
better code from a micro-optimization point of view (in
a single-threaded emacs).  But I can't argue against your global
arguments, of course.  Thanks for explaining things.





Reply sent to Wolfgang Jenkner <wjenkner <at> inode.at>:
You have taken responsibility. (Wed, 17 Jun 2015 12:30:04 GMT) Full text and rfc822 format available.

Notification sent to Wolfgang Jenkner <wjenkner <at> inode.at>:
bug acknowledged by developer. (Wed, 17 Jun 2015 12:30:07 GMT) Full text and rfc822 format available.

Message #34 received at 20783-done <at> debbugs.gnu.org (full text, mbox):

From: Wolfgang Jenkner <wjenkner <at> inode.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 20783-done <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Wed, 17 Jun 2015 14:19:10 +0200
Version: 25.1

On Thu, Jun 11 2015, Eli Zaretskii wrote:

> Works for me, thanks.  But please add a comment there about
> BYTE_TO_CHAR expecting byte positions that are on a character
> boundary, so that the reason for the loop is clear.

Done and pushed.

However, I didn't follow my own suggestion of adding a remark to the
comment above BYTE_TO_CHAR, after all, as it is true for most macros or
functions with a byte position argument, so adding such a comment to
just one of them could be confusing, I think.

Thank you for steering this change in the right direction.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#20783; Package emacs. (Wed, 17 Jun 2015 16:58:02 GMT) Full text and rfc822 format available.

Message #37 received at 20783 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Wolfgang Jenkner <wjenkner <at> inode.at>
Cc: 20783 <at> debbugs.gnu.org
Subject: Re: bug#20783: 25.0.50;
 [PATCH] byte-to-position has internal off-by-one bug
Date: Wed, 17 Jun 2015 19:57:08 +0300
> From: Wolfgang Jenkner <wjenkner <at> inode.at>
> Cc: 20783-done <at> debbugs.gnu.org
> Date: Wed, 17 Jun 2015 14:19:10 +0200
> 
> Version: 25.1
> 
> On Thu, Jun 11 2015, Eli Zaretskii wrote:
> 
> > Works for me, thanks.  But please add a comment there about
> > BYTE_TO_CHAR expecting byte positions that are on a character
> > boundary, so that the reason for the loop is clear.
> 
> Done and pushed.

Thanks.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 16 Jul 2015 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 345 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.