#7781 - 23.2.91; ispell problem with hunspell and UTF-8 file

GNU bug report logs - #7781
23.2.91; ispell problem with hunspell and UTF-8 file

Package: emacs;

Reported by: Reuben Thomas <rrt <at> sc3d.org>

Date: Mon, 3 Jan 2011 23:08:01 UTC

Severity: normal

Tags: notabug

Found in version 23.2.91

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

Forwarded to https://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395


From: Agustin Martin <agustin.martin <at> hispalinux.es>
To: Reuben Thomas <rrt <at> sc3d.org>
Cc: 7781 <at> debbugs.gnu.org
Subject: bug#7781: 23.2.91; ispell problem with hunspell and UTF-8 file
Date: Fri, 7 Jan 2011 14:14:03 +0100

2011/1/4 Reuben Thomas <rrt <at> sc3d.org>:
> With the following text, and using emacs -Q, I get the errors you can
> see in the messages log below when using hunspell to spell-check a UTF-8
> buffer with some extended characters in it.
>
> I did test this with emacs -Q, but the current session, in which I
> reproduced the problem and am now composing this bug report, was not
> started with -Q (this is so submitting the bug report works properly!).
>
> I am running a freshly bzr-pulled build of the emacs-23 branch.

Hi, Reuben,

I can also reproduce this with emacs23.2. After splitting the original
lines, I could locate problems in two of them:

-- Cut here -- 8< ----- minimal.txt: utf-8
of out-of-copyright works. The Kindle may be a loss leader, but at £109
it’s still not cheap. Feedbooks, rather than integrating easily into
-- Cut here -- 8< ----- End of minimal.txt

In the first line, the currency symbol seems to cause conversion errors
when iso-8859-1 is used, although hunspell should have ignored it. I get
tons of

UTF-8 encoding error. Missing continuation byte in 0. character position:

for that line when using

$ cat minimal.txt | hunspell -d en_US -a -i iso-8859-1

In the second line, the unusual apostrophe seems to confuse hunspell
when utf8 is used. Comparing what aspell and hunspell give for the same
text, I get

$ cat minimal.txt | aspell --encoding=utf-8 -d en_US -a
& Feedbooks 6 22: Feed books, Feed-books, Feedback's, Feedbags, ...

$ cat minimal.txt | hunspell -d en_US -i utf-8 -a
& Feedbooks 8 24: Feed books, Feed-books, Feedback, Feedbags, ...

Do not worry about the first number; it is the number of suggestions.
The second number, the position of the word, is what differs. It seems
that hunspell is not counting that apostrophe as a single (multibyte)
character, but as three components.

This looks to me like a hunspell bug. I found no reference to this
problem on the hunspell SF site, but noticed that Hunspell 1.2.14 was
released yesterday. I need to check whether it has any related news.

--
Agustin
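
The offset difference reported above (22 from aspell, 24 from hunspell) matches what one would expect if one tool counts characters and the other counts UTF-8 bytes. The following is only an illustrative sketch, not code from either spell checker: the helper names `char_offset` and `byte_offset` are made up for this example, and it assumes the second line of minimal.txt starts at "it’s", with U+2019 as the apostrophe.

```python
# Illustrative sketch only: compares a character-based offset with a
# UTF-8 byte-based offset for the same word in the same line.
# The helper names are hypothetical and not part of aspell or hunspell.

def char_offset(line: str, word: str) -> int:
    """Offset of `word` in `line`, counted in Unicode characters."""
    return line.index(word)

def byte_offset(line: str, word: str) -> int:
    """Offset of `word` in `line`, counted in UTF-8 encoded bytes."""
    return line.encode("utf-8").index(word.encode("utf-8"))

# Assumed second line of minimal.txt; the apostrophe is U+2019,
# which occupies three bytes in UTF-8.
line = "it\u2019s still not cheap. Feedbooks, rather than integrating easily into"

print(len("\u2019".encode("utf-8")))   # bytes used by the apostrophe
print(char_offset(line, "Feedbooks"))  # character-based position
print(byte_offset(line, "Feedbooks"))  # byte-based position
```

The two positions differ by exactly the two extra bytes the apostrophe takes up in UTF-8, which is consistent with the "three components" observation in the message.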

This bug report was last modified 4 years and 323 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
