GNU bug report logs - #37036
[PATCH] Inconsistent ASCII and Latin char categories

Package: emacs

Reported by: Mattias Engdegård <mattiase <at> acm.org>

Date: Thu, 15 Aug 2019 12:18:02 UTC

Severity: normal

Tags: patch, wontfix

Done: Mattias Engdegård <mattiase <at> acm.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 37036 in the body.
You can then email your comments to 37036 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 12:18:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Mattias Engdegård <mattiase <at> acm.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 15 Aug 2019 12:18:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: bug-gnu-emacs <at> gnu.org
Subject: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 14:17:15 +0200

The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.

It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).

The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.

Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.
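
As a rough illustration of what the 00-ff range contains, the sketch below tallies code points by Unicode major class (the first letter of the general category) using Python's standard `unicodedata` module. It says nothing about Emacs's own category tables, and the helper name `major_classes` is made up for this example.

```python
import unicodedata
from collections import Counter


def major_classes(start: int, end: int) -> Counter:
    """Count code points in [start, end] by Unicode major class.

    The major class is the first letter of the general category,
    e.g. "L" for letters, "C" for controls, "P" for punctuation.
    """
    return Counter(unicodedata.category(chr(cp))[0] for cp in range(start, end + 1))


if __name__ == "__main__":
    blocks = {"ASCII": (0x00, 0x7F), "Latin-1 Supplement": (0x80, 0xFF)}
    for name, (lo, hi) in blocks.items():
        # Print one tally per block so letters can be compared with everything else.
        print(name, dict(major_classes(lo, hi)))
```

In both blocks, letters are only one of several classes; controls, digits, punctuation and symbols make up the rest.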

[0001-Fix-ASCII-and-Latin-character-categories.patch (application/octet-stream, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 15:28:01 GMT) Full text and rfc822 format available.

Message #8 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 18:27:28 +0300

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 15 Aug 2019 14:17:15 +0200
> 
> The ASCII (a) and Latin (l) character categories are inconsistent in what characters they contain.
> 
> It should be clear what the ASCII category means, but it omits 00-1f (contrary to a comment in the code).
> 
> The Latin category isn't exactly defined anywhere but should reasonably comprise letters from Latin-based scripts. Currently, it also includes many control characters and symbols from the ASCII and Latin-1 Supplement blocks, which seems hard to justify.
> 
> Other changes to Latin could be argued: what modifiers/combining chars should be included? What about IPA and non-IPA phonetics? Ligatures? What about Latin-derived forms such as circled letters? &c. The attached patch does not go there but only fixes the glaring errors in the 00-ff range.

Did you try moving by words after these changes?  What happens in
words that consist of ASCII and non-ASCII Latin characters, for
example?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 15:47:02 GMT) Full text and rfc822 format available.

Message #11 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 17:46:35 +0200

On 15 Aug 2019, at 17:27, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> Did you try moving by words after these changes?  What happens in
> words that consist of ASCII and non-ASCII Latin characters, for
> example?

No change in behaviour observed in any such case.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 16:24:01 GMT) Full text and rfc822 format available.

Message #14 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 19:23:01 +0300

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 15 Aug 2019 17:46:35 +0200
> Cc: 37036 <at> debbugs.gnu.org
> 
> On 15 Aug 2019, at 17:27, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > 
> > Did you try moving by words after these changes?  What happens in
> > words that consist of ASCII and non-ASCII Latin characters, for
> > example?
> 
> No change in behaviour observed in any such case.

In any case, how to justify the fact that, say, "naïve", has
characters from different scripts?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 16:31:01 GMT) Full text and rfc822 format available.

Message #17 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 18:30:47 +0200

On 15 Aug 2019, at 18:23, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> In any case, how to justify the fact that, say, "naïve", has
> characters from different scripts?

The proposed change does not change the categories of any character in that string.
Or did you mean something else?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 17:01:02 GMT) Full text and rfc822 format available.

Message #20 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 19:59:53 +0300

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 15 Aug 2019 18:30:47 +0200
> Cc: 37036 <at> debbugs.gnu.org
> 
> On 15 Aug 2019, at 18:23, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > 
> > In any case, how to justify the fact that, say, "naïve", has
> > characters from different scripts?
> 
> The proposed change does not change the categories of any character in that string.

What about "abcdef^A^B"?  Does M-f stop before the control characters?

I guess I don't understand the rationale for the change.  Categories
are Emacs's invention, and their purpose is mostly to allow us to use
regexps for searching certain characters, and other similar
subtleties.  Your rationale seems to be some attempt to be formally
"consistent".  But this is not a formal attribute, it is entirely
ad-hoc, as can be easily seen by just looking at the list of the
categories.

So I wonder why we would want to rock that particular boat.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 17:39:01 GMT) Full text and rfc822 format available.

Message #23 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 19:37:49 +0200

On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
> What about "abcdef^A^B"?  Does M-f stop before the control characters?

Yes. Does forward-word use categories?

> I guess I don't understand the rationale for the change.  Categories
> are Emacs's invention, and their purpose is mostly to allow us to use
> regexps for searching certain characters, and other similar
> subtleties.  Your rationale seems to be some attempt to be formally
> "consistent".  But this is not a formal attribute, it is entirely
> ad-hoc, as can be easily seen by just looking at the list of the
> categories.

The more categories are arbitrary, the less useful they are. Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure? If 'Latin' means 'Latin letters, some symbols, some whitespace, some control chars, Indo-Arabic digits and the occasional Greek letter', which it does today, then who can use it correctly?

Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that. Those who reviewed that function thought it looked reasonable, as did I when I read it.

It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 19:24:02 GMT) Full text and rfc822 format available.

Message #26 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 22:23:00 +0300

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Thu, 15 Aug 2019 19:37:49 +0200
> Cc: 37036 <at> debbugs.gnu.org
> 
> On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > 
> > What about "abcdef^A^B"?  Does M-f stop before the control characters?
> 
> Yes. Does forward-word use categories?

No.  Sorry, it was my faulty memory.  It uses char-script-table
instead.

> The more categories are arbitrary, the less useful they are.

I think they should become entirely useless, i.e. we should stop using
them.  We have the entire Unicode database with all the character
properties for quite some time now, and should favor using that
instead.  Categories are an old kludgey hack, which goes back to
pre-Unicode Emacs; it can never be anything but arbitrary, and we will
never be able to fix that anywhere near completely.

> Why would anyone use categories to discriminate characters if they do not have a sensible, useful and predictable structure?

I don't know why anyone should.  My recommendation is to just say no.

> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.

Can you tell the details of where this function doesn't work?  I'd
like to understand why fixing it needs to change the categories.

> It is perfectly clear that categories have been introduced in an ad-hoc way to solve problems as they arose, but that doesn't mean that no mistakes were made even for those narrow purposes.

I don't think we should fix those mistakes, because that's an
impossible goal.  We should instead gradually stop using categories
for anything serious, certainly for any new code.  We should use the
UCD properties and the various char-tables built upon that instead.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 19:48:01 GMT) Full text and rfc822 format available.

Message #29 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: mattiase <at> acm.org
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Thu, 15 Aug 2019 22:46:52 +0300

> Date: Thu, 15 Aug 2019 22:23:00 +0300
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 37036 <at> debbugs.gnu.org
> 
> > From: Mattias Engdegård <mattiase <at> acm.org>
> > Date: Thu, 15 Aug 2019 19:37:49 +0200
> > Cc: 37036 <at> debbugs.gnu.org
> > 
> > On 15 Aug 2019, at 18:59, Eli Zaretskii <eliz <at> gnu.org> wrote:
> > > 
> > > What about "abcdef^A^B"?  Does M-f stop before the control characters?
> > 
> > Yes. Does forward-word use categories?
> 
> No.  Sorry, it was my faulty memory.  It uses char-script-table
> instead.

Actually, it uses categories indirectly, via word-combining-categories
and word-separating-categories.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Thu, 15 Aug 2019 22:20:02 GMT) Full text and rfc822 format available.

Message #32 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 00:19:43 +0200

On 15 Aug 2019, at 21:23, Eli Zaretskii <eliz <at> gnu.org> wrote:

> I think they should become entirely useless, i.e. we should stop using
> them.  We have the entire Unicode database with all the character
> properties for quite some time now, and should favor using that
> instead.  Categories are an old kludgey hack, which goes back to
> pre-Unicode Emacs; it can never be anything but arbitrary, and we will
> never be able to fix that anywhere near completely.

Thank you, I see what you mean, and I agree that Unicode properties probably are better for most purposes.
In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

>> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> 
> Can you tell the details of where this function doesn't work?  I'd
> like to understand why fixing it needs to change the categories.

Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .
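
The false positives can be reproduced with a crude model in Python. This is not the Emacs implementation: `LATIN_CATEGORY` is an assumed stand-in for what `\cl` matches (modeled here as printable ASCII plus the printable part of the Latin-1 Supplement), and `looks_like_one_letter_word` is a hypothetical name for a check of the form "the character before point is in the category and the one before it is not a letter".

```python
# Assumed stand-in for the Latin category as described in this thread:
# it covers printable ASCII and Latin-1 Supplement characters, not only letters.
LATIN_CATEGORY = {chr(cp) for cp in range(0x20, 0x7F)} | {
    chr(cp) for cp in range(0xA0, 0x100)
}


def looks_like_one_letter_word(text_before_point: str) -> bool:
    """Hypothetical check for a single-character word before point.

    Trailing blanks are skipped; the result is True when the last remaining
    character is in LATIN_CATEGORY and the one before it is not a letter.
    """
    stripped = text_before_point.rstrip(" \t")
    if len(stripped) < 2:
        return False
    before, last = stripped[-2], stripped[-1]
    return last in LATIN_CATEGORY and not before.isalpha()


if __name__ == "__main__":
    for sample in ["w domu i ", "In my dreams... ", "(She smiles!) ", "He died in 1951. "]:
        print(repr(sample), looks_like_one_letter_word(sample))
```

Under this model the genuine one-letter word in "w domu i " is detected, but so are the trailing ".", ")" and "." in the three English examples, because the category is not restricted to letters.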

Of course it doesn't require the categories to be fixed. The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

> I don't think we should fix those mistakes, because that's an
> impossible goal.  We should instead gradually stop using categories
> for anything serious, certainly for any new code.  We should use the
> UCD properties and the various char-tables built upon that instead.

Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Fri, 16 Aug 2019 09:34:01 GMT) Full text and rfc822 format available.

Message #35 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Mattias Engdegård <mattiase <at> acm.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 12:33:08 +0300

> From: Mattias Engdegård <mattiase <at> acm.org>
> Date: Fri, 16 Aug 2019 00:19:43 +0200
> Cc: 37036 <at> debbugs.gnu.org
> 
> In any case, I wasn't aiming for perfection; that is indeed a fool's errand. It was just a discovery of a rather obvious mistake, and evidence of code that doesn't work properly because of it. I thought the patch would be rather uncontroversial.

AFAIU, the patch made all the non-letter characters excluded from the
Latin category, is that right?  If so, it's a pretty significant
change IMO; who knows what it could break, including outside of the
core Emacs.  The fact that the Latin category is not well defined
doesn't yet mean we are at liberty to change that (implied)
definition at will.  Categories are currently used for a small number
of core Emacs features, and AFAIR were created incrementally as the
ad-hoc need for each one of them arose, so we also risk breaking our
own code.  Do we really have a good reason to wake those sleeping
dogs?

> >> Consider the function fill-polish-nobreak-p. It is clearly written with the assumption of a reasonable definition of the Latin category, and it doesn't work as expected because of that.
> > 
> > Can you tell the details of where this function doesn't work?  I'd
> > like to understand why fixing it needs to change the categories.
> 
> Right: it attempts to match a single-character word before point, with the assumption that \cl would match any Latin(-script) letter. However, since that expression matches most of ASCII as well, the function incorrectly says that line-breaking would be disallowed after "In my dreams..." or "(She smiles!)" or "He died in 1951." (well, the equivalents in Polish).
> Some details are in https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20871 .

So you are saying that function fails to consider punctuation and
symbols that are part of the Latin blocks?  That just means it
shouldn't use \cl in the first place (and yes, my suggestion to use
that in the bug discussion was wrong, sorry), it should use the
general-category Unicode property to filter out punctuation
characters.  Or it could use explicit ranges of codepoints.  Or we
could extend [:punct:] to support non-ASCII punctuation in a more
meaningful way.  Either way, that's not a reason good enough to make
significant changes in how the categories are defined.  If any
extensions are needed, I'd rather we made it in more modern and less
ad-hoc features.
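
A minimal sketch of the general-category idea, written in Python with the standard `unicodedata` module rather than in Emacs Lisp. The function name `is_one_letter_word` and the exact shape of the check are assumptions made for illustration; the only point is that testing for a letter class ("L…") excludes punctuation, digits and symbols without touching any category definition.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """True when the Unicode general category of ch is a letter class (Lu, Ll, Lt, Lm, Lo)."""
    return unicodedata.category(ch).startswith("L")


def is_one_letter_word(text_before_point: str) -> bool:
    """Hypothetical check: a single letter before point, preceded by a non-letter.

    Trailing spaces and tabs are ignored, mirroring a fill-style predicate
    that is evaluated at a candidate line break.
    """
    stripped = text_before_point.rstrip(" \t")
    if len(stripped) < 2:
        return False
    before, last = stripped[-2], stripped[-1]
    return is_letter(last) and not is_letter(before)


if __name__ == "__main__":
    for sample in ["w domu i ", "(a ", "In my dreams... ", "(She smiles!) ", "He died in 1951. "]:
        print(repr(sample), is_one_letter_word(sample))
```

With this filter only the first two samples, which end in a genuine one-letter word, are reported; the strings ending in punctuation are not.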

> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.

This argument goes both ways: there could be code out there which
relies on the current "broken" definition of the Latin category.

> > I don't think we should fix those mistakes, because that's an
> > impossible goal.  We should instead gradually stop using categories
> > for anything serious, certainly for any new code.  We should use the
> > UCD properties and the various char-tables built upon that instead.
> 
> Perhaps, but categories still have one thing going for them: they have fairly good regexp support.

I think this is in many cases an illusory advantage: specifying \cFOO
in a regexp just makes the code access some char-table.  But the same
is true for get-char-code-property and for accessing char-script-table
from Lisp, to mention just two alternatives.  And we all know that
using regular expressions for solving a problem sometimes _adds_ a
problem instead of solving one.

If we have some functionality in regular expressions that's supported
by categories, but is unavailable or inconvenient with Unicode
properties, I'd rather we extended our regex engine to support the
likes of \p{Po} and \p{script=greek}, see
http://unicode.org/reports/tr18/, instead of wasting our resources on
"fixing" the categories.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#37036; Package emacs. (Fri, 16 Aug 2019 10:49:01 GMT) Full text and rfc822 format available.

Message #38 received at 37036 <at> debbugs.gnu.org (full text, mbox):

From: Mattias Engdegård <mattiase <at> acm.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 37036 <at> debbugs.gnu.org
Subject: Re: bug#37036: [PATCH] Inconsistent ASCII and Latin char categories
Date: Fri, 16 Aug 2019 12:48:34 +0200

tags 37036 wontfix
close 37036
stop

On 16 Aug 2019, at 11:33, Eli Zaretskii <eliz <at> gnu.org> wrote:
> 
>> The point is that if there is some code that doesn't work because of the broken categories, there may very well be more.
> 
> This argument goes both ways: there could be code out there which
> relies on the current "broken" definition of the Latin category.

Well, that's an argument against fixing any bug. In general, code is more likely to depend on correctness than on errors.

That said, this is nothing I feel strongly about; let's not waste any more time. Maybe the manual section about categories should be amended to discourage would-be users.

Added tag(s) wontfix. Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Fri, 16 Aug 2019 10:49:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 37036 <at> debbugs.gnu.org and Mattias Engdegård <mattiase <at> acm.org> Request was from Mattias Engdegård <mattiase <at> acm.org> to control <at> debbugs.gnu.org. (Fri, 16 Aug 2019 10:49:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 13 Sep 2019 11:24:08 GMT) Full text and rfc822 format available.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
