GNU bug report logs -
#15984
24.3; Problem with combining characters in attachment filename
Reported by: nisse <at> lysator.liu.se (Niels Möller)
Date: Thu, 28 Nov 2013 08:33:01 UTC
Severity: normal
Found in version 24.3
Fixed in version 24.4
Done: Glenn Morris <rgm <at> gnu.org>
Bug is archived. No further changes may be made.
Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
>> What I think is the right thing, is to allow a sequence of unicode
>> values, e.g., "A" + combining character, or "A" + any random sequence of
>> combining characters, intern this string, and treat this as a single
>> "character".
>
> For the Lisp-level notion of "character", I think this would require too
> many deep changes.
I can understand that. I'm actually impressed by the move from MULE
encodings to unicode, which to a user appeared to be very smooth.
But I still think that type of "character" abstraction is the right thing
for unicode text processing in general.
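To make the proposed abstraction concrete, here is a minimal Python sketch of treating a base code point plus its combining sequence as one unit. It is an illustration only, not how Emacs works: the helper name `clusters` is hypothetical, and the rule it uses (attach every code point with a non-zero canonical combining class to the preceding unit) is a simplification of the grapheme cluster rules in Unicode UAX #29.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplified assumption: a code point belongs to the previous unit
    whenever its canonical combining class is non-zero.
    """
    result = []
    for ch in text:
        if result and unicodedata.combining(ch) != 0:
            result[-1] += ch      # extend the current unit
        else:
            result.append(ch)     # start a new unit
    return result

# "A" + COMBINING RING ABOVE, then "b", then "a" + COMBINING DIAERESIS
sample = "A\u030a" + "b" + "a\u0308"
groups = clusters(sample)

# Deleting "one character" at this level removes the base and its marks together.
without_last = "".join(groups[:-1])
```

With this grouping, `sample` has five code points but three units, and dropping the last unit removes both the "a" and its diaeresis.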
> For forward-char, we do try to fake that behavior (e.g. a `forward-char'
> command will skip over the whole A+ring combo) but not faithfully
> (e.g. `C-u 2 forward-char' will also just skip that combo, and not the
> subsequent char). It's not perfect, but it seems "close enough" that it
> hasn't proved problematic.
Didn't know; that's a bit weird. I just tried, as Eli suggested, editing
text with "ä" represented as "a" followed by a combining character. In
emacs-23.4, pressing DEL after the "ä" deletes the dots only. I now
understand why, but it's not what I had expected, and I think deleting
the entire A + dots would be preferable. Plain C-x = on the "a" shows
just "Char: a (97, #o141, #x61) point=443 of 455 (97%) column=1", but
C-u C-x = also shows the combining char.
However, emacs-24.3 behaves differently: the 'a' and the '"' get
displayed differently, and are not combined at all for display.
The buffer shows 'a"', and according to C-u C-x = the '"' is a
"COMBINING DIAERESIS". These tests were done in an X11 frame, so maybe
they're just picking up different fonts?
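The decomposed representation described above can be inspected outside Emacs as well. The following is a small Python sketch, not part of the original report, that lists the code points of "a" followed by U+0308 and shows why removing only the last code point leaves a bare "a". The variable names are illustrative.

```python
import unicodedata

# "ä" stored as two code points: base letter plus combining mark
decomposed = "a\u0308"

# Roughly the information C-u C-x = reports for each code point
names = [unicodedata.name(c) for c in decomposed]

# Removing a single code point strips only the diaeresis
after_delete = decomposed[:-1]

# NFC normalization folds the pair into the precomposed U+00E4
composed = unicodedata.normalize("NFC", decomposed)
```

Here `names` contains "LATIN SMALL LETTER A" and "COMBINING DIAERESIS", while `composed` is a single code point. Whether the two forms render identically depends on the font and display engine, which may be relevant to the difference seen between the two Emacs versions.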
>> E.g, there could be a mode which makes each and every unicode value a
>> single character, which will then be displayed as separate glyphs,
>> separate characters for regexp matching, etc.
>
> I think we wouldn't want to use different modes (too coarse) but
> different commands instead.
I didn't mean an emacs major or minor mode. It would be more like a
special coding system, applied when reading the text from file.
> In any case, a first step would be to find a name for that notion of "multi
> character character". "Grapheme cluster" doesn't sound too good if we
> want to expose the concept to the end user.
I think "character" is the right word; the main source of confusion is
that unicode code points are often referred to as "characters".
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
This bug report was last modified 11 years and 103 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.