#11073 - 24.0.94; BIDI-related crash in redisplay with certain byte sequences

GNU bug report logs - #11073
24.0.94; BIDI-related crash in redisplay with certain byte sequences

Package: emacs;

Reported by: Eli Zaretskii <eliz <at> gnu.org>

Date: Fri, 23 Mar 2012 11:27:02 UTC

Severity: normal

Found in version 24.0.94

Done: Glenn Morris <rgm <at> gnu.org>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 11073 <at> debbugs.gnu.org
Subject: bug#11073: 24.0.94; BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 17:58:25 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 11073 <at> debbugs.gnu.org
> Date: Fri, 23 Mar 2012 10:27:39 -0400
>
> > (Repeat after me: FETCH_MULTIBYTE_CHAR followed by CHAR_BYTES is not
> > always equivalent to STRING_CHAR_AND_LENGTH.)
>
> Do we really absolutely have to have such a trap?
> I mean: is there a good reason why they're not always equivalent?

They are not equivalent when conversion of the multibyte form into a
character unifies a CJK character that is represented by a codepoint
from one of the private use areas.  This unification is done in
char_string, via a call to MAYBE_UNIFY_CHAR, which converts the
private codepoint into the equivalent codepoint in one of the "normal"
planes.  The UTF-8 encoding of the unified character can be shorter or
longer than the original multibyte sequence.

The problem with the code I had in bidi.c, viz.:

    character = FETCH_MULTIBYTE_CHAR (bytepos);
    char_len = CHAR_BYTES (character);

is that the value in `character' is not guaranteed to correspond to
the multibyte sequence consumed by FETCH_MULTIBYTE_CHAR, and therefore
that character's length as returned by CHAR_BYTES is not the right
instrument to advance to the next character.

So, I'd say that FETCH_MULTIBYTE_CHAR should only be used for fetching
a single character; if one wants to advance, one should either use
FETCH_CHAR_ADVANCE or (if they are paranoiac about speed, like I am)
use

    character = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (bytepos), length);

which returns the length of the consumed sequence, and use that to
advance to the next character position.

And note the other gotcha: that the length returned by
STRING_CHAR_AND_LENGTH is not necessarily the length of the UTF-8
encoding of the character it returns, but rather the length of the
multibyte sequence which was converted to the character.
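
The mismatch described above can be reproduced outside Emacs with a toy model. The sketch below is not Emacs code: every `toy_` name is hypothetical, the decoder handles only well-formed UTF-8, and the "unification" is a single made-up mapping (U+F0000, four bytes, to U+4E00, three bytes) chosen only so that the decoded character re-encodes to a different length than the sequence that was consumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for unification: one made-up private-use
   codepoint maps to one made-up "normal" codepoint.  This is NOT
   Emacs's real unification table.  */
static unsigned toy_unify (unsigned c)
{
  return c == 0xF0000 ? 0x4E00 : c;
}

/* Toy analogue of CHAR_BYTES: length of the UTF-8 encoding of C.  */
static int toy_char_bytes (unsigned c)
{
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

/* Toy analogue of STRING_CHAR_AND_LENGTH: decode one well-formed UTF-8
   sequence at P, store the number of bytes consumed in *LEN, and
   return the (possibly unified) character.  */
static unsigned toy_string_char_and_length (const unsigned char *p, int *len)
{
  unsigned c;
  int n;

  if (p[0] < 0x80)
    { c = p[0]; n = 1; }
  else if ((p[0] & 0xE0) == 0xC0)
    { c = p[0] & 0x1F; n = 2; }
  else if ((p[0] & 0xF0) == 0xE0)
    { c = p[0] & 0x0F; n = 3; }
  else
    {
      /* Anything else must be a 4-byte lead byte in this toy model.  */
      assert ((p[0] & 0xF8) == 0xF0);
      c = p[0] & 0x07; n = 4;
    }
  for (int i = 1; i < n; i++)
    c = (c << 6) | (p[i] & 0x3F);
  *len = n;
  return toy_unify (c);
}

/* Safe pattern: step over the first character by the length of the
   byte sequence that was actually consumed.  */
size_t toy_advance_safe (const unsigned char *buf)
{
  int len;
  (void) toy_string_char_and_length (buf, &len);
  return (size_t) len;
}

/* Problematic pattern: step by the encoded length of the character
   that came back, which may differ after unification.  */
size_t toy_advance_buggy (const unsigned char *buf)
{
  int len;
  unsigned c = toy_string_char_and_length (buf, &len);
  return (size_t) toy_char_bytes (c);
}

/* U+F0000 encoded as UTF-8 (F3 B0 80 80), followed by 'a'.  */
const unsigned char toy_sample[] = { 0xF3, 0xB0, 0x80, 0x80, 'a', 0 };
```

With `toy_sample`, `toy_advance_safe` lands on the `'a'` at offset 4, while `toy_advance_buggy` stops at offset 3, in the middle of the first sequence, on a continuation byte. Continuing to decode from such an offset is the kind of misalignment the message warns about.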

This bug report was last modified 12 years and 154 days ago.

