GNU bug report logs - #7781
23.2.91; ispell problem with hunspell and UTF-8 file

Package: emacs;

Reported by: Reuben Thomas <rrt <at> sc3d.org>

Date: Mon, 3 Jan 2011 23:08:01 UTC

Severity: normal

Tags: notabug

Found in version 23.2.91

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

Forwarded to https://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395

From: Agustin Martin <agustin.martin <at> hispalinux.es>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 7781 <at> debbugs.gnu.org
Subject: bug#7781: 23.2.91; ispell problem with hunspell and UTF-8 file
Date: Fri, 7 Jan 2011 14:14:03 +0100
2011/1/4 Reuben Thomas <rrt <at> sc3d.org>:
> With the following text, and using emacs -Q, I get the errors you can
> see in the messages log below when using hunspell to spell-check a UTF-8
> buffer with some extended characters in it.
>
> I did test this with emacs -Q, but the current session, in which I
> reproduced the problem and am now composing this bug report, was not
> started with -Q (this is so submitting the bug report works properly!).
>
> I am running a freshly bzr-pulled build of the emacs-23 branch.

Hi, Reuben,

I can also reproduce this with emacs23.2. I could locate problems in
two lines, after splitting the original lines:

-- Cut here -- 8< ----- minimal.txt: utf-8
of out-of-copyright works. The Kindle may be a loss leader, but at £109
it’s still not cheap. Feedbooks, rather than integrating easily into
-- Cut here -- 8< ----- End of minimal.txt

In the first line, the currency sign seems to cause conversion errors when
iso-8859-1 is used, when it should have been ignored by hunspell. I get
tons of

UTF-8 encoding error. Missing continuation byte in 0. character position:

for that line when using

$ cat minimal.txt | hunspell -d en_US -a -i iso-8859-1
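To make the encoding mismatch concrete, here is a minimal Python sketch of what happens at the byte level when the pound sign from minimal.txt is encoded one way and read another. It is only an illustration and assumes nothing about hunspell's internals; the variable names are made up for this example.

```python
# Illustration only: this does not model how hunspell itself converts input.
text = "at \u00a3109"  # "at £109", as in the first line of minimal.txt

# In UTF-8 the pound sign becomes two bytes, 0xC2 0xA3.
utf8_bytes = text.encode("utf-8")

# Reading those bytes as iso-8859-1 turns every byte into one character,
# so the pound sign shows up as two characters.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# The reverse mismatch: in iso-8859-1 the pound sign is the lone byte 0xA3,
# which is not valid UTF-8 on its own.
latin1_bytes = text.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

print(utf8_bytes, repr(as_latin1), latin1_is_valid_utf8)
```

Either direction of mismatch leaves a decoder with bytes it cannot interpret as intended, which is at least consistent with the kind of "Missing continuation byte" message quoted above.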

In the second line, the unusual apostrophe seems to confuse
hunspell when utf8 is used. Comparing what aspell and hunspell give for
similar text, I get

$ cat minimal.txt | aspell --encoding=utf-8 -d en_US -a
& Feedbooks 6 22: Feed books, Feed-books, Feedback's, Feedbags, ...

$ cat minimal.txt | hunspell -d en_US -i utf-8 -a
& Feedbooks 8 24: Feed books, Feed-books, Feedback, Feedbags, ...

Do not worry about the first number; it is the number of suggestions. However,
the position given by the second number differs. It seems that hunspell is not
counting that apostrophe as a single (multibyte) char,
but as three components.
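
The offsets 22 and 24 can be checked with a short Python sketch that computes the position of "Feedbooks" in the second line of minimal.txt, once in characters and once in UTF-8 bytes. It assumes the apostrophe is U+2019 (RIGHT SINGLE QUOTATION MARK), which is three bytes in UTF-8; the variable names are made up for this example.

```python
# Second line of minimal.txt; the apostrophe is assumed to be U+2019.
line = "it\u2019s still not cheap. Feedbooks, rather than integrating easily into"
word = "Feedbooks"

# Offset counted in characters (code points), 0-based.
char_offset = line.index(word)

# Offset counted in UTF-8 bytes, 0-based.
byte_offset = line.encode("utf-8").index(word.encode("utf-8"))

# Number of bytes the apostrophe occupies in UTF-8.
apostrophe_bytes = len("\u2019".encode("utf-8"))

print(char_offset, byte_offset, apostrophe_bytes)  # 22 24 3
```

The character offset matches the aspell output and the byte offset matches the hunspell output, which supports the reading that hunspell is reporting a byte position here rather than a character position.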

This looks to me like a hunspell bug. I found no reference to this problem on
the hunspell sf site, but noticed that Hunspell 1.2.14 was released
yesterday. I need to check whether it has any related news.

-- 
Agustin




This bug report was last modified 4 years and 323 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.