GNU bug report logs - #11073
24.0.94; BIDI-related crash in redisplay with certain byte sequences

Previous Next

Package: emacs;

Reported by: Eli Zaretskii <eliz <at> gnu.org>

Date: Fri, 23 Mar 2012 11:27:02 UTC

Severity: normal

Found in version 24.0.94

Done: Glenn Morris <rgm <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 11073 <at> debbugs.gnu.org
Subject: bug#11073: 24.0.94; BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 17:58:25 +0200
> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 11073 <at> debbugs.gnu.org
> Date: Fri, 23 Mar 2012 10:27:39 -0400
> 
> > (Repeat after me: FETCH_MULTIBYTE_CHAR followed by CHAR_BYTES is not
> > always equivalent to STRING_CHAR_AND_LENGTH.)
> 
> Do we really absolutely have to have such a trap?
> I mean: is there a good reason why they're not always equivalent?

They are not equivalent when conversion of the multibyte form into a
character unifies a CJK character that is represented by a codepoint
from one of the private use areas.  This unification is done in
char_string, via a call to MAYBE_UNIFY_CHAR, which converts the
private codepoint into the equivalent codepoint in one of the "normal"
planes.  The UTF-8 encoding of the unified character can be shorter or
longer than the original multibyte sequence.  The problem with the
code I had in bidi.c, viz.:

   character = FETCH_MULTIBYTE_CHAR (bytepos);
   char_len = CHAR_BYTES (character);

is that the value in `character' is not guaranteed to correspond to
the multibyte sequence consumed by FETCH_MULTIBYTE_CHAR, and therefore
that character's length as returned by CHAR_BYTES is not the right
instrument to advance to the next character.

So, I'd say that FETCH_MULTIBYTE_CHAR should only be used for fetching
a single character; if one wants to advance, one should either use
FETCH_CHAR_ADVANCE or (if they are paranoiac about speed, like I am)
use 

   character = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (bytepos), length);

which returns the length of the consumed sequence, and use that to
advance to the next character position.

And note the other gotcha: that the length returned by
STRING_CHAR_AND_LENGTH is not necessarily the length of the UTF-8
encoding of the character it returns, but rather the length of the
multibyte sequence which was converted to the character.




This bug report was last modified 12 years and 95 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.