GNU bug report logs - #11073
24.0.94; BIDI-related crash in redisplay with certain byte sequences

Previous Next

Package: emacs;

Reported by: Eli Zaretskii <eliz <at> gnu.org>

Date: Fri, 23 Mar 2012 11:27:02 UTC

Severity: normal

Found in version 24.0.94

Done: Glenn Morris <rgm <at> gnu.org>

Bug is archived. No further changes may be made.

Full log


View this message in rfc822 format

From: Kenichi Handa <handa <at> m17n.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 11073 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: bug#11073: 24.0.94; BIDI-related crash in redisplay with certain byte sequences
Date: Mon, 26 Mar 2012 16:45:56 +0900
In article <837gybupdf.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> > Why do we need this unification?  Or rather, why do we need multiple
> > codepoints, which then forces us to unify them?

> That's something Handa-san (CC'ed) will be able to explain much better
> than I ever could.

It's a long story.  When I designed emacs-unicode (the
version before merged to the trunk, more than 10 years ago),
the unification maps of CJK charsets to Unicode were not
stable.  In addtion, there were various conflicting policies
on which character to unify to which character.  One reason
of this confusion was that Unicode itself didn't define
mapping to/from such CJK charsets (JIS, GB, KSC).

The unification problem is not only for Ideographic
characters.  Many CJK charsets contain, for instance,
full-width version of Greek characters, but Unicode doesn't
distinguish them from single-width versions (though Unicode
has full-width version of 'A'..'Z', etc).  There were people
who wanted to distinguish full-width Greek chars from
single-width chars.

There also were people who have a text of iso-2022-7bit file
which distinguishes characters of GB charset and JIS
charset.  To edit such a file and write it back as the
original one, one has to disable unification of one of GB
and JIS (or both of them).

So, I decided at that time to give each CJK charset unique
code space (above #x110000) in Emacs, and allow users to
freely unify/disunify them to Unicode code space (below
#x110000) by giving the function unify-charset.

FYI, http://www.unicode.org/reports/tr38/ tells some
difficulty of mappings.

> AFAIU, there are good reasons to have some CJK
> characters on separate codepoints, because they need to be treated
> differently from their Unicode codepoints (perhaps a different choice
> of font to display them?)

That was one reaons, but the current code pay attention to
`charset' text property of each character to select a proper
font.

---
Kenichi Handa
handa <at> m17n.org




This bug report was last modified 12 years and 94 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.