GNU bug report logs - #19653
ispell misalignment with hunspell when Unicode apostrophe is used


Package: emacs

Reported by: Tobias Getzner <tobias.getzner <at> gmx.de>

Date: Thu, 22 Jan 2015 14:41:02 UTC

Severity: normal

Tags: moreinfo

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 19653 in the body.
You can then email your comments to 19653 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Thu, 22 Jan 2015 14:41:03 GMT)

Acknowledgement sent to Tobias Getzner <tobias.getzner <at> gmx.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 22 Jan 2015 14:41:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Tobias Getzner <tobias.getzner <at> gmx.de>
To: bug-gnu-emacs <at> gnu.org
Subject: ispell misalignment with hunspell when Unicode apostrophe is used
Date: Thu, 22 Jan 2015 15:40:05 +0100
Hello,

I’ve noticed that when ispell.el (Emacs 24.4.1) is using hunspell (v.
1.3.3) to spell-check a buffer containing the typographically correct
apostrophe («’»; U+2019), ispell will error out with the message
«ispell misalignment».

The problem can be reproduced by setting ispell-program-name to
«hunspell», and spell-checking a buffer containing the string «abc’s
zzz». This yields the following error:

> ispell-process-line: Ispell misalignment: word `zzz' point 9; probably incompatible versions

This seems to be a regression from 24.3, where hunspell support was
working (with the caveat that the apostrophe had to be manually added
to the dictionary’s «OTHERCHARS»).

Best regards,
Tobias






Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Thu, 22 Jan 2015 17:43:02 GMT)

Message #8 received at 19653 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Tobias Getzner <tobias.getzner <at> gmx.de>
Cc: 19653 <at> debbugs.gnu.org
Subject: Re: bug#19653: ispell misalignment with hunspell when Unicode
 apostrophe is used
Date: Thu, 22 Jan 2015 19:41:54 +0200
> From: Tobias Getzner <tobias.getzner <at> gmx.de>
> Date: Thu, 22 Jan 2015 15:40:05 +0100
> 
> I’ve noticed that when ispell.el (Emacs 24.4.1) is using hunspell (v.
> 1.3.3) to spell-check a buffer containing the typographically correct
> apostrophe («’»; U+2019), ispell will error out with the message
> «ispell misalignment».
> 
> The problem can be reproduced by setting ispell-program-name to
> «hunspell», and spell-checking a buffer containing the string «abc’s
> zzz». This yields the following error:
> 
> > ispell-process-line: Ispell misalignment: word `zzz' point 9; probably incompatible versions

I cannot reproduce this with Emacs 24.4 and Hunspell 1.3.2 (heavily
patched to fix known problems in Hunspell).  You didn't provide enough
information for me to be sure I did the same as you, so here are the
possible explanations for the different experience:

 . I use a different version of Hunspell, and yours has a bug.
   Hunspell is known to have a problem with reporting mis-spelled
   words with byte offsets, whereas Emacs expects character offsets,
   so dictionaries encoded in UTF-8 cause symptoms similar to those
   you report.  My Hunspell is patched to avoid this problem.

 . I didn't change OTHERCHARS.  Frankly, I think doing this asks for
   trouble, since the speller still uses the characters recorded in
   the .aff file.

 . You didn't tell which dictionary you used.  I tried en_US and
   de_DE, and none of them produced these problems.  Maybe this is
   specific to some dictionary you used.  In particular, the encoding
   of that dictionary is important vs the encoding you tell ispell.el
   to use (if you customized that part).

> This seems to be a regression from 24.3, where hunspell support was
> working (with the caveat that the apostrophe had to be manually added
> to the dictionary’s «OTHERCHARS»).

Are you saying that the same version of Hunspell with the same
dictionary worked in Emacs 24.3, where Emacs 24.4 doesn't?  If so,
please try to eliminate or at least minimize your ispell-related
customizations, and try again.  If the problem persists, please show
the minimal set of customizations to reproduce the problem.
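
The byte-offset versus character-offset distinction raised in the first point can be illustrated with the reporter's test string. The following is a minimal Python sketch, not code from ispell.el or Hunspell; the helper names `char_offset` and `byte_offset` are made up for illustration, and the line is assumed to be encoded as UTF-8.

```python
def char_offset(line: str, word: str) -> int:
    """Offset of `word` in `line`, counted in characters."""
    return line.index(word)


def byte_offset(line: str, word: str, encoding: str = "utf-8") -> int:
    """Offset of `word` in `line`, counted in bytes of the encoded line."""
    return line.encode(encoding).index(word.encode(encoding))


# The string from the report; \u2019 is the typographic apostrophe.
line = "abc\u2019s zzz"
print(char_offset(line, "zzz"))  # 6
print(byte_offset(line, "zzz"))  # 8
```

The two counts differ by 2 because U+2019 is one character but three bytes in UTF-8. For an ASCII-only line the two offsets coincide, which would explain why such a discrepancy only shows up with non-ASCII text.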




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Sat, 26 Dec 2015 16:54:01 GMT)

Message #11 received at 19653 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Tobias Getzner <tobias.getzner <at> gmx.de>
Cc: 19653 <at> debbugs.gnu.org
Subject: Re: bug#19653: ispell misalignment with hunspell when Unicode
 apostrophe is used
Date: Sat, 26 Dec 2015 17:53:28 +0100
Tobias Getzner <tobias.getzner <at> gmx.de> writes:

> I’ve noticed that when ispell.el (Emacs 24.4.1) is using hunspell (v.
> 1.3.3) to spell-check a buffer containing the typographically correct
> apostrophe («’»; U+2019), ispell will error out with the message
> «ispell misalignment».

There was an earlier similar report where the conclusion was that
hunspell was buggy, but a new version of hunspell fixed the problem...

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




bug closed, send any further explanations to 19653 <at> debbugs.gnu.org and Tobias Getzner <tobias.getzner <at> gmx.de>. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 26 Dec 2015 16:55:01 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 24 Jan 2016 12:24:08 GMT)

bug unarchived. Request was from Joseph Mingrone <jrm <at> ftfl.ca> to control <at> debbugs.gnu.org. (Fri, 21 Oct 2016 05:04:01 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Fri, 21 Oct 2016 05:06:02 GMT)

Message #20 received at 19653 <at> debbugs.gnu.org:

From: Joseph Mingrone <jrm <at> ftfl.ca>
To: 19653 <at> debbugs.gnu.org
Subject: Re: bug#19653: ispell misalignment with hunspell when Unicode
 apostrophe is used
Date: Fri, 21 Oct 2016 02:04:58 -0300
This still seems to be a problem with hunspell version 1.3.3.

The problem can be reproduced by spell checking a file with this one line.

alsdk ✅ sdfkjdsf sldksdfkjsfd

During spell checking, the process list shows:

ispell run -- -- /usr/local/bin/hunspell -a -d en_CA -i UTF-8

The error Emacs (version 25.1.1) reports is:

ispell-process-line: Ispell misalignment: word ‘sdfkjdsf’ point 11; probably incompatible versions

Hunspell skips over the special character when it is run at a terminal prompt.  This is the initial output.

### begin hunspell output ###
        alsdk           File: test.txt

alsdk \~E sdfkjdsf sldksdfkjsfd

 0: Alaska

[SPACE] R)epl A)ccept I)nsert U)ncap S)tem Q)uit e(X)it or ? for help
### end hunspell output ###

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Fri, 21 Oct 2016 07:34:02 GMT)

Message #23 received at 19653 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Joseph Mingrone <jrm <at> ftfl.ca>
Cc: 19653 <at> debbugs.gnu.org
Subject: Re: bug#19653: ispell misalignment with hunspell when Unicode
 apostrophe is used
Date: Fri, 21 Oct 2016 10:33:10 +0300
> From: Joseph Mingrone <jrm <at> ftfl.ca>
> Date: Fri, 21 Oct 2016 02:04:58 -0300
> 
> This still seems to be a problem with hunspell version 1.3.3.
> 
> The problem can be reproduced by spell checking a file with this one line.
> 
> alsdk ✅ sdfkjdsf sldksdfkjsfd
> 
> During spell checking, the process list shows:
> 
> ispell run -- -- /usr/local/bin/hunspell -a -d en_CA -i UTF-8
> 
> The error Emacs (version 25.1.1) reports is:
> 
> ispell-process-line: Ispell misalignment: word ‘sdfkjdsf’ point 11; probably incompatible versions

Did Hunspell ever fix the problem whereby it reported byte offsets of
the misspelled words, as opposed to character offsets?  If not, that
is your problem, and Hunspell should finally get its act together.

To see whether this is the problem, invoke Hunspell like this:

  /usr/local/bin/hunspell -a -d en_CA -i UTF-8 < test.txt

and see what Hunspell emits.  It should emit something like this (the
below is taken from my system, and I don't have the en_CA dictionary,
so your output might be slightly different):

  @(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
  & alsdk 3 0: Alaska, elastic, Alston
  & sdfkjdsf 2 8: artefact's, postfix
  & sldksdfkjsfd 2 17: justification, staphylococcus

The second number after each misspelled word is the offset of that
word's beginning, measured in characters, from the start of the line.
Hunspell used to report this in bytes instead of characters; if it
still does, you will have to patch it to fix that bug.  AFAIR, the
Hunspell issue tracker includes several patches for this bug.  Or
maybe the latest Hunspell 1.4.1 already fixes this, in which case
please upgrade.
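
To check which unit a given Hunspell build uses, the `&` lines shown above can be compared against the input line mechanically. Below is a rough Python sketch under stated assumptions: the layout `& WORD COUNT OFFSET: SUGGESTIONS` is read off the sample output above, the names `Miss`, `parse_miss` and `offset_unit` are hypothetical, and the input is assumed to be UTF-8.

```python
import re
from typing import List, NamedTuple, Optional


class Miss(NamedTuple):
    word: str
    count: int            # number of suggestions offered
    offset: int           # offset of the word as reported by the speller
    suggestions: List[str]


# Layout assumed from the sample output: "& WORD COUNT OFFSET: SUGGESTIONS"
MISS_RE = re.compile(r"^& (\S+) (\d+) (\d+): (.*)$")


def parse_miss(line: str) -> Optional[Miss]:
    """Parse one '&' line; return None for any other kind of line."""
    m = MISS_RE.match(line.strip())
    if m is None:
        return None
    word, count, offset, rest = m.groups()
    return Miss(word, int(count), int(offset), rest.split(", "))


def offset_unit(text: str, miss: Miss, encoding: str = "utf-8") -> str:
    """Say whether miss.offset locates miss.word in chars, bytes, both or neither."""
    raw = text.encode(encoding)
    raw_word = miss.word.encode(encoding)
    as_chars = text[miss.offset:miss.offset + len(miss.word)] == miss.word
    as_bytes = raw[miss.offset:miss.offset + len(raw_word)] == raw_word
    if as_chars and as_bytes:
        return "both"
    if as_chars:
        return "chars"
    if as_bytes:
        return "bytes"
    return "neither"


text = "alsdk \u2705 sdfkjdsf sldksdfkjsfd"  # the test line, with U+2705
miss = parse_miss("  & sdfkjdsf 2 8: artefact's, postfix")
print(offset_unit(text, miss))  # chars
```

A word that precedes the first non-ASCII character (such as `alsdk` at offset 0) comes out as "both", so only the words after the special character are informative.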




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Fri, 21 Oct 2016 13:01:02 GMT)

Message #26 received at 19653 <at> debbugs.gnu.org:

From: Joseph Mingrone <jrm <at> ftfl.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 19653 <at> debbugs.gnu.org
Subject: Re: bug#19653: ispell misalignment with hunspell when Unicode
 apostrophe is used
Date: Fri, 21 Oct 2016 09:59:57 -0300
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Joseph Mingrone <jrm <at> ftfl.ca>
>> Date: Fri, 21 Oct 2016 02:04:58 -0300

>> This still seems to be a problem with hunspell version 1.3.3.

>> The problem can be reproduced by spell checking a file with this one line.

>> alsdk ✅ sdfkjdsf sldksdfkjsfd

>> During spell checking, the process list shows:

>> ispell run -- -- /usr/local/bin/hunspell -a -d en_CA -i UTF-8

>> The error Emacs (version 25.1.1) reports is:

>> ispell-process-line: Ispell misalignment: word ‘sdfkjdsf’ point 11; probably incompatible versions

> Did Hunspell ever fix the problem whereby it reported byte offsets of
> the misspelled words, as opposed to character offsets?  If not, that
> is your problem, and Hunspell should finally get its act together.

> To see whether this is the problem, invoke Hunspell like this:

>   /usr/local/bin/hunspell -a -d en_CA -i UTF-8 < test.txt

> and see what Hunspell emits.  It should emit something like this (the
> below is taken from my system, and I don't have the en_CA dictionary,
> so your output might be slightly different):

>   @(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
>   & alsdk 3 0: Alaska, elastic, Alston
>   & sdfkjdsf 2 8: artefact's, postfix
>   & sldksdfkjsfd 2 17: justification, staphylococcus

> The second number after each misspelled word is the offset of that
> word's beginning, measured in characters, from the start of the line.
> Hunspell used to report this in bytes instead of characters; if it
> still does, you will have to patch it to fix that bug.  AFAIR, the
> Hunspell issue tracker includes several patches for this bug.  Or
> maybe the latest Hunspell 1.4.1 already fixes this, in which case
> please upgrade.

It's still a problem with hunspell.

% echo "é startingCharTwo" | hunspell -a -d en_CA -i UTF-8
@(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.3)
& é 15 0: e, s, i, a, n, r, t, o, l, c, d, u, g, m, p
& startingCharTwo 1 3: nonparticipating

https://github.com/hunspell/hunspell/issues/418
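
In the output above, `é` takes two bytes in UTF-8, so `startingCharTwo` begins at character 2 but at byte 3, and 3 is what Hunspell 1.3.3 reports. As an illustration only (this is not how ispell.el or any Hunspell patch handles it), a reported byte offset could be mapped back to a character offset as in this Python sketch; `byte_to_char_offset` is a hypothetical helper and UTF-8 input is assumed.

```python
def byte_to_char_offset(line: str, offset_in_bytes: int, encoding: str = "utf-8") -> int:
    """Map a byte offset into the encoded line to a character offset.

    Raises UnicodeDecodeError if the byte offset falls inside a
    multi-byte character, i.e. if it cannot be the start of a word.
    """
    prefix = line.encode(encoding)[:offset_in_bytes]
    return len(prefix.decode(encoding))


line = "\u00e9 startingCharTwo"  # the echoed test input; \u00e9 is "é"
print(byte_to_char_offset(line, 3))  # 2
```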

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#19653; Package emacs. (Fri, 21 Oct 2016 14:53:02 GMT)

Message #29 received at 19653 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Joseph Mingrone <jrm <at> ftfl.ca>
Cc: 19653 <at> debbugs.gnu.org
Subject: Re: bug#19653: ispell misalignment with hunspell when Unicode
 apostrophe is used
Date: Fri, 21 Oct 2016 17:52:09 +0300
> From: Joseph Mingrone <jrm <at> ftfl.ca>
> Cc: 19653 <at> debbugs.gnu.org
> Date: Fri, 21 Oct 2016 09:59:57 -0300
> 
> >   @(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.2)
> >   & alsdk 3 0: Alaska, elastic, Alston
> >   & sdfkjdsf 2 8: artefact's, postfix
> >   & sldksdfkjsfd 2 17: justification, staphylococcus
> 
> > The second number after each misspelled word is the offset of that
> > word's beginning, measured in characters, from the start of the line.
> > Hunspell used to report this in bytes instead of characters; if it
> > still does, you will have to patch it to fix that bug.  AFAIR, the
> > Hunspell issue tracker includes several patches for this bug.  Or
> > maybe the latest Hunspell 1.4.1 already fixes this, in which case
> > please upgrade.
> 
> It's still a problem with hunspell.
> 
> % echo "é startingCharTwo" | hunspell -a -d en_CA -i UTF-8
> @(#) International Ispell Version 3.2.06 (but really Hunspell 1.3.3)
> & é 15 0: e, s, i, a, n, r, t, o, l, c, d, u, g, m, p
> & startingCharTwo 1 3: nonparticipating
> 
> https://github.com/hunspell/hunspell/issues/418

Thanks for checking.





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 19 Nov 2016 12:24:03 GMT)

This bug report was last modified 8 years and 215 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.