#7781 - 23.2.91; ispell problem with hunspell and UTF-8 file

GNU bug report logs - #7781
23.2.91; ispell problem with hunspell and UTF-8 file

Package: emacs;

Reported by: Reuben Thomas <rrt <at> sc3d.org>

Date: Mon, 3 Jan 2011 23:08:01 UTC

Severity: normal

Tags: notabug

Found in version 23.2.91

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

Forwarded to https://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395


From: Eli Zaretskii <eliz <at> gnu.org>
To: Николай Сущенко <sckol <at> yandex.ru>
Cc: 7781 <at> debbugs.gnu.org
Subject: bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date: Sun, 14 Apr 2013 10:08:42 +0300

> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко <sckol <at> yandex.ru>
> CC: 7781 <at> debbugs.gnu.org
>
> Please send me this patch, I'll ask the hunspell developers to include it.

Attached.  This is a small part of a much larger patch, most of it for
Windows-specific problems.  If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.

> Could you also recall which concrete problems produces this workaround?
> For me it works fine, but I haven't tested it in different languages and
> encodings.

One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results).  But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally).  The "usual" choice of the encoding is the one
used by the dictionary.  Not every dictionary out there is in UTF-8.

> If it is some problems, I could try to fix it

I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.  When I discovered this problem, I also
tried fixing it on the Emacs side first, but then I realized that this
kind of solution has too many problems, and instead fixed it in
hunspell.
--- src/tools/hunspell.cxx~0	2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx	2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
 			fflush(stdout);
 		} else {
 			char ** wlst = NULL;
-			int ns = pMS[d]->suggest(&wlst, token);
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
+			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
-				fprintf(stdout,"# %s %d", token,
-				    parser->get_tokenpos() + pos);
+				fprintf(stdout,"# %s %d", token, char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", token, ns,
-				    parser->get_tokenpos() + pos);
+				    char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
 			}
 			for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
 			if (root) free(root);
 		} else {
 			char ** wlst = NULL;
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
 			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
 				fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
-				    parser->get_tokenpos() + pos);
+				    char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
-				    parser->get_tokenpos() + pos);
+				    char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
 			}
 			for (int j = 1; j < ns; j++) {
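
For readers who want to try the offset conversion outside hunspell, the following is a minimal standalone sketch of the counting loop used in the patch. The function name utf8_char_offset and the std::string interface are illustrative assumptions, not part of hunspell. Like the patch, it assumes the buffer holds valid UTF-8 and simply skips continuation bytes (those of the form 10xxxxxx).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not hunspell API): return the number of characters
// that begin before byte_offset in a UTF-8 encoded buffer.
// Assumption: buf contains valid UTF-8.
std::size_t utf8_char_offset(const std::string& buf, std::size_t byte_offset) {
    std::size_t char_offset = 0;
    for (std::size_t i = 0; i < byte_offset && i < buf.size(); ++i) {
        // Continuation bytes match the bit pattern 10xxxxxx; every other
        // byte starts a new character, so only those are counted.
        if ((static_cast<unsigned char>(buf[i]) & 0xC0) != 0x80)
            ++char_offset;
    }
    return char_offset;
}
```

For example, in the UTF-8 line "Привет мир" each Cyrillic letter occupies two bytes, so the second word starts at byte offset 13 but at character offset 7. For any other encoding the patch passes the byte offset through unchanged, which is only correct when that encoding uses one byte per character.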

This bug report was last modified 4 years and 323 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
