GNU bug report logs - #7781
23.2.91; ispell problem with hunspell and UTF-8 file


Package: emacs

Reported by: Reuben Thomas <rrt <at> sc3d.org>

Date: Mon, 3 Jan 2011 23:08:01 UTC

Severity: normal

Tags: notabug

Found in version 23.2.91

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

Forwarded to https://sourceforge.net/tracker/?func=detail&aid=3178449&group_id=143754&atid=756395


From: Eli Zaretskii <eliz <at> gnu.org>
To: Николай Сущенко <sckol <at> yandex.ru>
Cc: 7781 <at> debbugs.gnu.org
Subject: bug#7781: [PATCH] Fix ispell problem with hunspell and UTF-8 file
Date: Sun, 14 Apr 2013 10:08:42 +0300

> Date: Sun, 14 Apr 2013 10:33:39 +0400
> From: Николай Сущенко
>  <sckol <at> yandex.ru>
> CC: 7781 <at> debbugs.gnu.org
> 
> Please send me this patch, I'll ask the hunspell developers to include it.

Attached.  This is a small part of a much larger patch, most of it for
Windows-specific problems.  If you have problems compiling the patched
hunspell, let me know: it could be that I omitted some hunk that is
needed for this part.

> Could you also recall which concrete problems this workaround produces?
> For me it works fine, but I haven't tested it with different languages and
> encodings.

One problem is that you assume the encoding of the communications with
hunspell is UTF-8, and thus matches the internal representation of
text in Emacs buffers and strings (only then will byte-to-position
give correct results).  But that assumption is false: hunspell
supports any encoding that it can convert to/from UTF-8 (it uses
libiconv internally).  The "usual" choice of the encoding is the one
used by the dictionary.  Not every dictionary out there is in UTF-8.
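
To make the byte/character mismatch concrete, here is a minimal sketch
of the UTF-8 case only.  The helper name utf8_char_offset and the
sample line are made up for illustration; they are not part of
hunspell or Emacs.  It counts every byte that is not a UTF-8
continuation byte, so it gives a character offset only when the
buffer really is UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not hunspell code: number of characters that
// start before byte_offset in a UTF-8 encoded buffer.  Continuation
// bytes have the bit pattern 10xxxxxx; every other byte starts a
// character.
std::size_t utf8_char_offset(const std::string &buf, std::size_t byte_offset)
{
    std::size_t chars = 0;
    for (std::size_t i = 0; i < byte_offset && i < buf.size(); ++i) {
        if ((static_cast<unsigned char>(buf[i]) & 0xC0) != 0x80)
            ++chars;
    }
    return chars;
}

// Illustration: two Cyrillic letters (two bytes each), a space, then
// "xyz".  The word "xyz" starts at byte 5 but at character 3.
void example()
{
    const std::string line = "\xD0\xB4\xD0\xB0 xyz";
    assert(utf8_char_offset(line, 5) == 3);
    // For pure ASCII text the two offsets coincide.
    assert(utf8_char_offset("abc xyz", 4) == 4);
}
```

For any other multibyte encoding this count is meaningless, which is
why the offset cannot simply be treated as UTF-8 on the receiving
side.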

> If there are problems, I could try to fix them

I don't think you can fix this on the Emacs side, because Emacs cannot
easily and/or quickly convert between bytes and characters in an
arbitrary multibyte encoding.

When I discovered this problem, I also tried fixing it on the Emacs
side first, but then I realized that this kind of solution has too
many problems, and instead fixed it in hunspell.

--- src/tools/hunspell.cxx~0	2011-01-21 19:01:29.000000000 +0200
+++ src/tools/hunspell.cxx	2013-02-07 10:11:54.443610900 +0200
@@ -710,13 +748,22 @@ if (pos >= 0) {
 			fflush(stdout);
 		} else {
 			char ** wlst = NULL;
-			int ns = pMS[d]->suggest(&wlst, token);
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
+			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
-		    		fprintf(stdout,"# %s %d", token,
-		    		    parser->get_tokenpos() + pos);
+		    		fprintf(stdout,"# %s %d", token, char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", token, ns,
-				    parser->get_tokenpos() + pos);
+					char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], io_enc));
 			}
 			for (int j = 1; j < ns; j++) {
@@ -745,13 +792,23 @@ if (pos >= 0) {
 			if (root) free(root);
 		} else {
 			char ** wlst = NULL;
+			int byte_offset = parser->get_tokenpos() + pos;
+			int char_offset = 0;
+			if (strcmp(io_enc, "UTF-8") == 0) {
+				for (int i = 0; i < byte_offset; i++) {
+					if ((buf[i] & 0xc0) != 0x80)
+						char_offset++;
+				}
+			} else {
+				char_offset = byte_offset;
+			}
 			int ns = pMS[d]->suggest(&wlst, chenc(token, io_enc, dic_enc[d]));
 			if (ns == 0) {
 		    		fprintf(stdout,"# %s %d", chenc(token, io_enc, ui_enc),
-		    		    parser->get_tokenpos() + pos);
+		    		    char_offset);
 			} else {
 				fprintf(stdout,"& %s %d %d: ", chenc(token, io_enc, ui_enc), ns,
-				    parser->get_tokenpos() + pos);
+				    char_offset);
 				fprintf(stdout,"%s", chenc(wlst[0], dic_enc[d], ui_enc));
 			}
 			for (int j = 1; j < ns; j++) {





This bug report was last modified 4 years and 323 days ago.


