GNU bug report logs - #15984
24.3; Problem with combining characters in attachment filename


Package: emacs;

Reported by: nisse <at> lysator.liu.se (Niels Möller)

Date: Thu, 28 Nov 2013 08:33:01 UTC

Severity: normal

Found in version 24.3

Fixed in version 24.4

Done: Glenn Morris <rgm <at> gnu.org>

Bug is archived. No further changes may be made.



Message #32 received at 15984 <at> debbugs.gnu.org (full text, mbox):

From: nisse <at> lysator.liu.se (Niels Möller)
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 15984 <at> debbugs.gnu.org
Subject: Re: bug#15984: 24.3;
 Problem with combining characters in attachment filename
Date: Fri, 29 Nov 2013 13:41:01 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

> However, we do want to give the user a way to
> delete only one or more of the combining characters, so forcing the
> entire combination to be a single indivisible entity would not be TRT
> for users.

Good question, how to handle this.

Today, to remove the dots from an "ä" character, I have to delete the
complete "ä" character and insert a new "a" character, and similarly for
the reverse edit. I think this "atomic" handling is the desired
behaviour in many cases, and I don't think it should behave differently
depending on the representation of "ä" in the original file. But if you
have a complex sequence of Unicode combining characters, I agree there's
some need to be able to edit it. Maybe put point on the character and
invoke edit-char to enter some special mode which explodes the usually
"atomic" character into smaller pieces.

And such a character edit mode might be useful for more than Unicode
combining characters, e.g., manipulating the different sub-parts
of a Chinese character. Anyway, this user interface is not intimately
tied to the internal character representation; its overall effect on the
buffer will be the same as replacing any substring.
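
To make the point about the representation of "ä" in the original file concrete, here is a minimal sketch in Python (not Emacs Lisp, and not how Emacs handles this). It shows the two encodings of "ä" and that they only compare equal after normalization. The helper name same_text is made up for illustration.

```python
import unicodedata

PRECOMPOSED = "\u00e4"   # "ä" as one code point
DECOMPOSED = "a\u0308"   # "a" followed by COMBINING DIAERESIS

def same_text(a, b):
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(PRECOMPOSED), len(DECOMPOSED))   # 1 2
print(PRECOMPOSED == DECOMPOSED)           # False
print(same_text(PRECOMPOSED, DECOMPOSED))  # True
```

Both strings display as the same letter, which is why editing behaviour that depends on which one happens to be in the file would be surprising.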

>> When reading text files, the character boundaries may be configurable.
>
> The important question is what to do by default,

I'm pretty sure the default should be that a sequence of one Unicode
base char and all following Unicode combining chars is interned as a
single "emacs character". (I think the detailed rules for this are
spelled out in the Unicode book.) With some arbitrary limit, say at most
10 combining characters, to prevent a GByte file containing only Unicode
combining characters from being read as a single emacs character.
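
The grouping rule described above can be sketched roughly as follows, in Python rather than in Emacs internals. Two assumptions: a "combining char" is taken to be any character whose Unicode general category starts with "M", which is a simplification of the full grapheme cluster rules in Unicode Standard Annex #29, and the names clusters and MAX_COMBINING are made up for illustration.

```python
import unicodedata

MAX_COMBINING = 10  # the "arbitrary limit" from the text above

def clusters(text, max_combining=MAX_COMBINING):
    """Split text into units of one base char plus following combining chars.

    Assumption: any character in general category M* (Mn, Mc, Me) counts
    as combining. Once a unit holds max_combining marks, the next mark
    starts a new unit of its own.
    """
    units = []
    count = 0  # marks attached to the current unit so far
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if units and is_mark and count < max_combining:
            units[-1] += ch
            count += 1
        else:
            units.append(ch)
            count = 0
    return units

print(clusters("a\u0308b"))                     # two units: "ä" and "b"
print(len(clusters("a" + "\u0301" * 25)))       # 3
```

With the cap in place, an input consisting only of combining characters is split into units of at most 11 code points instead of one enormous unit.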

> You are mixing display issues with editing issues and with how
> characters are represented internally in an Emacs buffer.

I think it's confusing for users if the units of text which forward-char
skips over do not correspond to the units matched by "." in
isearch-forward-regexp.
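
As an analogy only, the same kind of mismatch can be shown with Python's re module, which is not the Emacs regexp engine. Here "." matches one code point at a time, while a hypothetical cursor motion that keeps marks with their base character sees fewer units. The names regexp_units and cursor_units are made up for illustration.

```python
import re
import unicodedata

def regexp_units(text):
    # What "." sees in Python's re: one match per code point.
    return re.findall(".", text, flags=re.DOTALL)

def cursor_units(text):
    # Hypothetical cursor motion: marks (category M*) stay with the
    # preceding character.
    units = []
    for ch in text:
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch
        else:
            units.append(ch)
    return units

text = "a\u0308b"  # decomposed "ä" followed by "b"
print(len(regexp_units(text)), len(cursor_units(text)))  # 3 2
```

A user stepping over this text twice but needing three "." matches to cover it is the kind of inconsistency meant here.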

My suggested internal representation seems to be a natural way to get
this correspondence right, at the cost of some memory (or lots of
complexity in reducing memory usage). I'm sure there are other ways, and
maybe also a lot better ways, to implement the same thing.

> Thanks, I will try that.

Now I've also reproduced it on the same machine, without my normal Gnus
setup getting in the way. I start emacs with

  $ rm -rf ~/tmp/home/ && mkdir ~/tmp/home/ && HOME=$HOME/tmp/home emacs -nw -Q -l bug.el

where bug.el contains

  (setq gnus-init-file nil)
  (setq gnus-nntp-server nil)
  (gnus-no-server)

Then create the group with G d, pointing out the spool-like directory,
enter the group (RET), view the message (RET), try to write out the
attachment ("o" on the attachment button). Still crashes for me.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.




This bug report was last modified 11 years and 103 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.