GNU bug report logs - #23097
24.5; ispell.el: lines with both CASECHARS and NOT-CASECHARS get sent to the spell checker

Package: emacs;

Reported by: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>

Date: Wed, 23 Mar 2016 18:12:01 UTC

Severity: normal

Tags: moreinfo, notabug

Found in version 24.5

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Wed, 23 Mar 2016 18:12:02 GMT)

Acknowledgement sent to Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 23 Mar 2016 18:12:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.5; ispell.el: lines with both CASECHARS and NOT-CASECHARS get sent
 to the spell checker
Date: Wed, 23 Mar 2016 21:11:19 +0300
Each entry in ispell-dictionary-alist has elements called CASECHARS and
NOT-CASECHARS. They define what gets sent to the spell checker and what
does not.

One use case for them is that, if you have two dictionaries for 
languages with totally different alphabets, you can spellcheck a file 
where both languages are mixed together. In theory.

Here's what happens in practice:
If a line contains only CASECHARS, it gets sent to the spell checker.
If a line contains only NOT-CASECHARS, it does not get sent to the spell
checker.
If a line contains both CASECHARS and NOT-CASECHARS, the whole line gets
sent to the spell checker.

Sending the whole line makes NOT-CASECHARS pretty useless. I think the 
reasonable behavior in this case would be sending the line word by word.

Here's how to repeat this with aspell.
1. Starting from emacs -Q eval this:
(setq ispell-program-name "aspell")
(defun ispell-set-my-dictionaries ()
  (setq ispell-dictionary-alist
        (delq (assoc "english" ispell-dictionary-alist)
              ispell-dictionary-alist))
  (add-to-list 'ispell-dictionary-alist
               '("english" "[kcat]" "[dogh]" "[']" nil ("-B") nil iso-8859-1)))
(advice-add 'ispell-set-spellchecker-params :after
            #'ispell-set-my-dictionaries)
2. ispell-change-dictionary to english.
3. ispell-buffer a buffer containing this:
kat
doh
kat doh

"Kat" at the first line would get sent to aspell, since it passes 
CASECHARS. This is fine. "Doh" at the second line would be ignored, 
since it's not in CASECHARS. This is fine too. At the line with both 
words, not only "kat" would get sent, but also "doh" and that's what we 
don't want to happen.
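
To make the difference concrete, here is a rough sketch of the two
behaviors. Both helpers (my-line-sent-p and my-words-to-send) are
hypothetical names invented for illustration and are not part of
ispell.el. The first models what is observed above, the second models
the word-by-word filtering being asked for, assuming words are separated
by spaces or tabs.

```elisp
;; Hypothetical helpers for illustration only; not ispell.el functions.

(defun my-line-sent-p (line casechars)
  "Return t if LINE has at least one character matching CASECHARS.
This models the observed behavior: one match sends the whole LINE."
  (and (string-match-p casechars line) t))

(defun my-words-to-send (line casechars)
  "Return the words of LINE that consist only of CASECHARS characters.
This models the proposed behavior.  Words are assumed to be
separated by spaces or tabs."
  (let ((word-re (concat "\\`\\(?:" casechars "\\)+\\'")))
    (delq nil
          (mapcar (lambda (word)
                    (and (string-match-p word-re word) word))
                  (split-string line "[ \t]+" t)))))

(my-line-sent-p "kat" "[kcat]")        ; => t
(my-line-sent-p "doh" "[kcat]")        ; => nil
(my-line-sent-p "kat doh" "[kcat]")    ; => t, so "doh" goes along too
(my-words-to-send "kat doh" "[kcat]")  ; => ("kat")
```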

-- 
Best Regards,
Nikolay Kudryavtsev





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Wed, 23 Mar 2016 18:24:01 GMT)

Message #8 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
Cc: 23097 <at> debbugs.gnu.org
Subject: Re: bug#23097: 24.5;
 ispell.el: lines with both CASECHARS and NOT-CASECHARS get sent to
 the spell checker
Date: Wed, 23 Mar 2016 20:22:42 +0200
> From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
> Date: Wed, 23 Mar 2016 21:11:19 +0300
> 
> Each entry in ispell-dictionary-alist has elements called CASECHARS and
> NOT-CASECHARS. They are used for defining what gets sent to the spell 
> checker and what does not.
> 
> One use case for them is that, if you have two dictionaries for 
> languages with totally different alphabets, you can spellcheck a file 
> where both languages are mixed together. In theory.

Don't you need to restart the spell-checker each time you switch the
dictionaries?  AFAIK, only Hunspell supports such mixed
spell-checking, and with Hunspell you don't need to break the line
into separate words in that case.  With any other spell-checker, you
need to restart it whenever you switch languages.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Wed, 23 Mar 2016 20:14:01 GMT)

Message #11 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 23097 <at> debbugs.gnu.org
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Wed, 23 Mar 2016 23:12:44 +0300
Yes, you do need to restart the spell checker when you switch 
dictionaries, but it's not too inconvenient in practice.

As you know, I've run into issues with Hunspell, which I described in
this thread:
<http://lists.gnu.org/archive/html/help-gnu-emacs/2016-03/msg00107.html>.

-- 
Best Regards,
Nikolay Kudryavtsev

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Sat, 15 Aug 2020 04:23:02 GMT)

Message #14 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefan <at> marxist.se>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 23097 <at> debbugs.gnu.org, Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Fri, 14 Aug 2020 21:22:24 -0700
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
>> Date: Wed, 23 Mar 2016 21:11:19 +0300
>>
>> Each entry in ispell-dictionary-alist has elements called CASECHARS and
>> NOT-CASECHARS. They are used for defining what gets sent to the spell
>> checker and what does not.
>>
>> One use case for them is that, if you have two dictionaries for
>> languages with totally different alphabets, you can spellcheck a file
>> where both languages are mixed together. In theory.
>
> Don't you need to restart the spell-checker each time you switch the
> dictionaries?  AFAIK, only Hunspell supports such mixed
> spell-checking, and with Hunspell you don't need to break the line
> into separate words in that case.  With any other spell-checker, you
> need to restart it whenever you switch languages.

It seems like this is a limitation of external software then, and not in
Emacs?  Should this therefore be closed, or is there anything more to do
here?

Best regards,
Stefan Kangas




Added tag(s) moreinfo. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Sat, 15 Aug 2020 04:23:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Sat, 15 Aug 2020 16:16:01 GMT)

Message #19 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Kangas <stefan <at> marxist.se>
Cc: 23097 <at> debbugs.gnu.org, nikolay.kudryavtsev <at> gmail.com
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Sat, 15 Aug 2020 19:15:14 +0300
> From: Stefan Kangas <stefan <at> marxist.se>
> Date: Fri, 14 Aug 2020 21:22:24 -0700
> Cc: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>, 23097 <at> debbugs.gnu.org
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
> >> Date: Wed, 23 Mar 2016 21:11:19 +0300
> >>
> >> Each entry in ispell-dictionary-alist has elements called CASECHARS and
> >> NOT-CASECHARS. They are used for defining what gets sent to the spell
> >> checker and what does not.
> >>
> >> One use case for them is that, if you have two dictionaries for
> >> languages with totally different alphabets, you can spellcheck a file
> >> where both languages are mixed together. In theory.
> >
> > Don't you need to restart the spell-checker each time you switch the
> > dictionaries?  AFAIK, only Hunspell supports such mixed
> > spell-checking, and with Hunspell you don't need to break the line
> > into separate words in that case.  With any other spell-checker, you
> > need to restart it whenever you switch languages.
> 
> It seems like this is a limitation of external software then, and not in
> Emacs?  Should this therefore be closed, or is there anything more to do
> here?

Yes, I think we should close this issue.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Sat, 15 Aug 2020 16:41:01 GMT)

Message #22 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefan <at> marxist.se>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 23097 <at> debbugs.gnu.org, nikolay.kudryavtsev <at> gmail.com
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Sat, 15 Aug 2020 09:40:37 -0700
tags 23097 + notabug
close 23097
thanks

Eli Zaretskii <eliz <at> gnu.org> writes:

> Yes, I think we should close this issue.

Thanks, done.




Added tag(s) notabug. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Sat, 15 Aug 2020 16:41:02 GMT)

bug closed, send any further explanations to 23097 <at> debbugs.gnu.org and Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com> Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Sat, 15 Aug 2020 16:41:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Mon, 17 Aug 2020 09:21:01 GMT)

Message #29 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
To: Stefan Kangas <stefan <at> marxist.se>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 23097 <at> debbugs.gnu.org
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Mon, 17 Aug 2020 12:20:08 +0300
This is not an external software bug, but very much an Emacs bug.

I'm not sure what the initial design idea for CASECHARS and
NOT-CASECHARS was, but whatever it was, it cannot work effectively
because the entire line is fed to the spell checker. The most obvious
practical use for them (being able to spellcheck languages with
completely different alphabets without the spellchecker misfiring on
either pass) does not work either.

The ideal practical fix would be to spellcheck such lines word by word.

-- 
Best Regards,
Nikolay Kudryavtsev





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Mon, 17 Aug 2020 12:49:02 GMT)

Message #32 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Kangas <stefan <at> marxist.se>
To: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>,
 Eli Zaretskii <eliz <at> gnu.org>
Cc: 23097 <at> debbugs.gnu.org
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Mon, 17 Aug 2020 12:48:41 +0000
Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com> writes:

> This is not an external software bug, but very much an Emacs bug.
>
> I'm not sure what was the initial design idea for CASECHARS and
> NOT-CASECHARS, but whatever it was, it would not work effectively due to
> feeding the entire line. The most obvious practical use for them (being
> able to spellcheck languages with completely different alphabets without
> the spellchecker misfiring on either pass) would not work either.
>
> The ideal practical fix for this should spellcheck such lines word by word.

Okay, but that's not a documented use-case, so I'm not sure that it's a
bug.  The thing you suggest may be possible, but would require
developing a new feature, for example to run two instances of the same
spell checker at once.

AFAIU, the best solution is to use an external spell checker that has
support for using two languages at once.  Why not use that?

Best regards,
Stefan Kangas




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Mon, 17 Aug 2020 16:42:02 GMT)

Message #35 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
Cc: 23097 <at> debbugs.gnu.org, stefan <at> marxist.se
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Mon, 17 Aug 2020 19:40:58 +0300
> From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
> Cc: 23097 <at> debbugs.gnu.org
> Date: Mon, 17 Aug 2020 12:20:08 +0300
> 
> This is not an external software bug, but very much an Emacs bug.
> 
> I'm not sure what was the initial design idea for CASECHARS and 
> NOT-CASECHARS, but whatever it was, it would not work effectively due to 
> feeding the entire line. The most obvious practical use for them (being
> able to spellcheck languages with completely different alphabets without 
> the spellchecker misfiring on either pass) would not work either.

The original design was that a spell-checker supports a single
language, and any text in other languages is a spelling mistake.  This
is still true for Ispell and for Aspell; only Hunspell (and Enchant,
when it uses Hunspell as its back-end) supports multiple languages.
With Hunspell, ispell.el effectively ignores CASECHARS and
NOT-CASECHARS, and instead uses the character set specified by the
dictionary file itself.

This is the only multi-dictionary spell-checking configuration that
ispell.el currently supports.  Which is why, when you first reported
this, I asked you why you couldn't use Hunspell; your answer, which
described some kind of failure related to encoding, I couldn't
understand then and I don't understand now (primarily because that
feature works for me).

Instead, you seem to insist on using Aspell in a way that to me sounds
like a kludge: spell-check the region with one dictionary, then
restart ispell.el with another dictionary and spell-check the same
region again.  AFAIU, you'd like ispell.el to support this kind of
workaround OOTB.  Is that correct, or did I miss something?

If my understanding is correct, then, apart from being a kludgy
solution for a problem that has a much cleaner one, I don't think I
understand how this could work well in general.  Suppose you have in
your buffer a mis-spelled word such as this:

   fooЫbar

with the Cyrillic letter being there by accident: perhaps you
unintentionally pressed a key when you shouldn't have.  Or imagine the
following typo:

   fooбар

which could happen if you forgot to switch the input method.

With your proposed mode of operation, the spell-checker will check
partial words and decide that in both cases there are no spelling
mistakes here, because each partial word is spelled correctly in its
language.  But clearly these are typos that need to be flagged.

Thus, just using 2 sets of characters is not enough to handle these
typos intelligently, as you'd get a lot of false negatives.
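
A rough sketch of that failure mode, assuming words are split into
maximal runs of Latin and Cyrillic letters and each run is then checked
against its own dictionary. The helper name my-split-by-script and the
two character ranges are illustrative assumptions, not anything
ispell.el provides:

```elisp
;; Hypothetical helper for illustration only; not part of ispell.el.
(defun my-split-by-script (text)
  "Split TEXT into maximal runs of ASCII letters or Cyrillic letters."
  (let ((pos 0)
        (runs nil))
    ;; Collect each run in turn, resuming the search after the last match.
    (while (string-match "[a-zA-Z]+\\|[а-яА-ЯёЁ]+" text pos)
      (push (match-string 0 text) runs)
      (setq pos (match-end 0)))
    (nreverse runs)))

(my-split-by-script "fooбар")   ; => ("foo" "бар")
(my-split-by-script "fooЫbar")  ; => ("foo" "Ы" "bar")
```

Each run would be handed to a dictionary on its own, so the check never
sees that the runs were typed as a single word.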

So even if we consider your report as a feature request, it is not
entirely clear to me how to implement such a feature.  And frankly,
since at least one spell-checker exists which supports multiple
dictionaries, it is not clear to me why we should try so hard to force
Aspell to look as if it did, too.

> The ideal practical fix for this should spellcheck such lines word by word.

I think I showed above why such a simplistic strategy will backfire by
leaving some typos undetected.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 15 Sep 2020 11:24:08 GMT)

bug unarchived. Request was from Eli Zaretskii <eliz <at> gnu.org> to control <at> debbugs.gnu.org. (Tue, 13 Oct 2020 16:52:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Tue, 13 Oct 2020 17:01:02 GMT)

Message #42 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
Cc: 23097 <at> debbugs.gnu.org, stefan <at> marxist.se
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Tue, 13 Oct 2020 20:00:31 +0300
> From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
> Cc: stefan <at> marxist.se, 23097 <at> debbugs.gnu.org
> Date: Tue, 13 Oct 2020 16:19:10 +0300
>
> Anyway, Hunspell IMHO is sort of beside the point for this discussion.
> This bug is about ispell.el not performing in a way a user would
> realistically expect from its public facing API.

Which expectations from what public API are being violated here?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#23097; Package emacs. (Wed, 14 Oct 2020 19:21:02 GMT)

Message #45 received at 23097 <at> debbugs.gnu.org (full text, mbox):

From: Nikolay Kudryavtsev <nikolay.kudryavtsev <at> gmail.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 23097 <at> debbugs.gnu.org, stefan <at> marxist.se
Subject: Re: bug#23097: 24.5; ispell.el: lines with both CASECHARS and
 NOT-CASECHARS get sent to the spell checker
Date: Wed, 14 Oct 2020 22:20:08 +0300
The whole ispell-dictionary-alist structure implies that matching would
be done word by word. And looking into the dictionary setup is the first
thing an ispell.el user would do. Apart from NOT-CASECHARS, it also has
this element:

> OTHERCHARS is a regexp of characters in the NOT-CASECHARS set but which can be
> used to construct words in some special way.  If OTHERCHARS characters follow
> and precede characters from CASECHARS, they are parsed as part of a word,
> otherwise they become word-breaks...

Basically, the presence of both NOT-CASECHARS and OTHERCHARS implies that
ispell.el does strict word-by-word matching. If we're just sending any
line that contains a CASECHARS match, we don't really need either of
them, since we can just match by CASECHARS alone and then send the line.
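
As a sketch of what that docstring describes, a word could be matched
with a regexp built from CASECHARS and OTHERCHARS as below. The function
names my-word-regexp and my-first-word are hypothetical, and the sketch
is a simplification: it ignores the MANY-OTHERCHARS-P element and is not
taken from ispell.el.

```elisp
;; Hypothetical helpers for illustration only; not taken from ispell.el.
(defun my-word-regexp (casechars otherchars)
  "Return a regexp for one word built from CASECHARS and OTHERCHARS.
An OTHERCHARS character counts as part of the word only when
CASECHARS characters both precede and follow it."
  (concat "\\(?:" casechars "\\)+"
          "\\(?:\\(?:" otherchars "\\)\\(?:" casechars "\\)+\\)*"))

(defun my-first-word (text casechars otherchars)
  "Return the first word in TEXT, or nil if there is none."
  (and (string-match (my-word-regexp casechars otherchars) text)
       (match-string 0 text)))

(my-first-word "kat'cat doh" "[kcat]" "[']")  ; => "kat'cat"
(my-first-word "kat' doh" "[kcat]" "[']")     ; => "kat"
(my-first-word "doh" "[kcat]" "[']")          ; => nil
```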

Oh, and there's another thing. ispell.el actually does a word-by-word
search, but only on resume. Try my recipe again, but change the last line
of the spellchecked buffer to "doh kat". Then suspend the spellcheck
after the first line and resume it with C-u M-$. You'd see that it skips
the "doh" on the last line fine in this scenario. But then it suffers
from the word mix problem described by Eli: spellchecking "dohkat" and
"katdoh" results in "kat" alone being sent.

Thinking a bit more about this word mix problem, it seems it's not as
simple to fix as I thought in my previous letter, since we need some
list of legitimate word separators for each language.

-- 
Best Regards,
Nikolay Kudryavtsev





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 12 Nov 2020 12:24:09 GMT)

This bug report was last modified 4 years and 223 days ago.
