GNU bug report logs - #18150
24.3.92; Uppercase umlauts and case-fold-search t
Reported by: michael_heerdegen <at> web.de
Date: Wed, 30 Jul 2014 15:12:01 UTC
Severity: normal
Found in version 24.3.92
Done: Eli Zaretskii <eliz <at> gnu.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 18150 in the body.
You can then email your comments to 18150 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#18150; Package emacs. (Wed, 30 Jul 2014 15:12:02 GMT)
Acknowledgement sent to michael_heerdegen <at> web.de: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 30 Jul 2014 15:12:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hello,
sorry if this is just a unibyte/multibyte thing I don't understand, but
it makes no sense to me:
(let ((str "École")
      (case-fold-search t))
  (when (string-match "[[:upper:]]" str)
    (match-string 0 str)))
==> "c"
However,
(let ((str "École")
      (case-fold-search nil))
  (when (string-match "[[:upper:]]" str)
    (match-string 0 str)))
==> "É"
I would expect "É" in both examples.
Thanks,
Michael.
In GNU Emacs 24.3.92.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.12.2)
of 2014-07-17 on drachen
Windowing system distributor `The X.Org Foundation', version 11.0.11600000
System Description: Debian GNU/Linux testing (jessie)
Important settings:
value of $LC_ALL: de_DE.utf8
value of $LC_COLLATE: C
value of $LC_TIME: C
value of $LANG: de_DE.utf8
locale-coding-system: utf-8-unix
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#18150; Package emacs. (Tue, 16 Feb 2016 14:54:01 GMT)
Message #8 received at 18150 <at> debbugs.gnu.org:
Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).
Best,
mb
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#18150; Package emacs. (Tue, 16 Feb 2016 18:10:01 GMT)
Message #11 received at 18150 <at> debbugs.gnu.org:
> From: Marcin Borkowski <mbork <at> mbork.pl>
> Date: Tue, 16 Feb 2016 15:53:41 +0100
> Cc: 18150 <at> debbugs.gnu.org
>
> Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).
>
> > I would expect "É" in both examples.
What do we expect the result to be in the variant below?
(let ((str "ecole")
      (case-fold-search t))
  (when (string-match "[[:upper:]]" str)
    (match-string 0 str)))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#18150; Package emacs. (Tue, 16 Feb 2016 18:39:02 GMT)
Message #14 received at 18150 <at> debbugs.gnu.org:
Eli Zaretskii <eliz <at> gnu.org> writes:
> What do we expect the result to be in the variant below?
>
> (let ((str "ecole")
>       (case-fold-search t))
>   (when (string-match "[[:upper:]]" str)
>     (match-string 0 str)))
According to the docstring of `case-fold-search', I would expect "e"
(which the expression returns here).
Before having thought about it, 70% of me expected `nil'.
Michael.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#18150; Package emacs. (Tue, 16 Feb 2016 18:58:01 GMT)
Message #17 received at 18150 <at> debbugs.gnu.org:
> From: Michael Heerdegen <michael_heerdegen <at> web.de>
> Cc: Marcin Borkowski <mbork <at> mbork.pl>, 18150 <at> debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > What do we expect the result to be in the variant below?
> >
> > (let ((str "ecole")
> >       (case-fold-search t))
> >   (when (string-match "[[:upper:]]" str)
> >     (match-string 0 str)))
>
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
>
> Before having thought about it, 70% of me expected `nil'.
That's exactly the point.
If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job. Does anyone see a problem with it?
The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters. It could fold
them, for example, or, more generally, transform them in any way the
caller wants. The patch below does the right thing when TRANSLATE
downcases; when it does something else, the question is: do we want to
test the match only on the result of TRANSLATE (which is what the
original code does), or do we want something else?
For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected. We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises. As far as I can see, it has existed since Emacs 20.
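To make the unibyte remark concrete, here is a minimal sketch of the "bitmap after TRANSLATE" idea. It is a toy model, not the regex.c code: `bm_translate`, `bm_bitmap`, `bm_compile_upper` and `bm_match_upper` are hypothetical names, and plain ASCII `tolower` is assumed as a stand-in for TRANSLATE.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for TRANSLATE: plain ASCII downcasing.  */
static unsigned char
bm_translate (unsigned char ch)
{
  return (unsigned char) tolower (ch);
}

/* One bit per possible byte value.  */
typedef struct
{
  unsigned char bits[(UCHAR_MAX + 1) / CHAR_BIT];
} bm_bitmap;

static void
bm_set_bit (bm_bitmap *map, unsigned char ch)
{
  map->bits[ch / CHAR_BIT] |= (unsigned char) (1u << (ch % CHAR_BIT));
}

static bool
bm_test_bit (const bm_bitmap *map, unsigned char ch)
{
  return ((map->bits[ch / CHAR_BIT] >> (ch % CHAR_BIT)) & 1u) != 0;
}

/* "Compile" [[:upper:]]: for every byte that is in the class, record
   the translated byte rather than the byte itself.  */
static void
bm_compile_upper (bm_bitmap *map)
{
  memset (map, 0, sizeof *map);
  for (int ch = 0; ch <= UCHAR_MAX; ch++)
    if (isupper (ch))
      bm_set_bit (map, bm_translate ((unsigned char) ch));
}

/* Match time: the input byte is translated too, then looked up.  */
static bool
bm_match_upper (const bm_bitmap *map, unsigned char input)
{
  return bm_test_bit (map, bm_translate (input));
}
```

Because both the compiled set and the input go through the same translation, `bm_match_upper` accepts 'E' as well as 'e' in this model. A table of this kind is only practical when the character space is small, which is the limitation mentioned above for multibyte characters.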
diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
         case charset:
         case charset_not:
           {
-            register unsigned int c;
+            register unsigned int c, corig;
             boolean not = (re_opcode_t) *(p - 1) == charset_not;
             int len;

@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               }

             PREFETCH ();
-            c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+            corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
             if (target_multibyte)
               {
                 int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
               {
                 int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);

-                if (  (class_bits & BIT_LOWER && ISLOWER (c))
+                if (  (class_bits & BIT_LOWER
+                       && (ISLOWER (c) || (corig != c && ISUPPER(c))))
                     | (class_bits & BIT_MULTIBYTE)
                     | (class_bits & BIT_PUNCT && ISPUNCT (c))
                     | (class_bits & BIT_SPACE && ISSPACE (c))
-                    | (class_bits & BIT_UPPER && ISUPPER (c))
+                    | (class_bits & BIT_UPPER
+                       && (ISUPPER (c) || (corig != c && ISLOWER (c))))
                     | (class_bits & BIT_WORD && ISWORD (c))
                     | (class_bits & BIT_ALPHA && ISALPHA (c))
                     | (class_bits & BIT_ALNUM && ISALNUM (c))
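The class test changed by the last hunk can be modeled in isolation. The sketch below is only an illustration under stated assumptions: `cls_translate` (ASCII `tolower`) stands in for a downcasing TRANSLATE, the `TOY_BIT_*` constants and both function names are hypothetical, and ASCII letters stand in for the multibyte characters that actually reach this code path in Emacs.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical class flags, loosely mirroring BIT_LOWER / BIT_UPPER.  */
enum { TOY_BIT_LOWER = 1, TOY_BIT_UPPER = 2 };

/* Assumed stand-in for a TRANSLATE that downcases.  */
static int
cls_translate (int ch)
{
  return tolower (ch);
}

/* Check as in the removed lines: only the translated character C
   is tested against the class.  */
static bool
cls_match_old (int class_bits, int input)
{
  int c = cls_translate (input);
  return ((class_bits & TOY_BIT_LOWER) && islower (c))
         || ((class_bits & TOY_BIT_UPPER) && isupper (c));
}

/* Check as in the added lines: if translation changed the character
   (CORIG != C), a result of the opposite case also counts.  */
static bool
cls_match_patched (int class_bits, int input)
{
  int corig = input;
  int c = cls_translate (input);
  return ((class_bits & TOY_BIT_LOWER)
          && (islower (c) || (corig != c && isupper (c))))
         || ((class_bits & TOY_BIT_UPPER)
             && (isupper (c) || (corig != c && islower (c))));
}
```

In this model, `cls_match_old (TOY_BIT_UPPER, 'E')` is false because 'E' has already been downcased before the class test, which mirrors why "É" was skipped in the original report. `cls_match_patched (TOY_BIT_UPPER, 'E')` is true because the translation changed the character and the result is lowercase.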
Reply sent to Eli Zaretskii <eliz <at> gnu.org>: You have taken responsibility. (Sat, 20 Feb 2016 11:07:01 GMT)
Notification sent to michael_heerdegen <at> web.de: bug acknowledged by developer. (Sat, 20 Feb 2016 11:07:02 GMT)
Message #22 received at 18150-done <at> debbugs.gnu.org:
> Date: Tue, 16 Feb 2016 20:57:41 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 18150 <at> debbugs.gnu.org, mbork <at> mbork.pl
>
> If, when case-fold-search is non-nil, we want both [:upper:] and
> [:lower:] to match any letter that has a case variant, then the patch
> below seems to do the job. Does anyone see a problem with it?
No further comment, so I pushed a slightly safer change to emacs-25
branch, and I'm marking this bug done.
Thanks.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#18150; Package emacs. (Sat, 20 Feb 2016 12:10:02 GMT)
Message #25 received at 18150 <at> debbugs.gnu.org:
Eli Zaretskii <eliz <at> gnu.org> writes:
> No further comment, so I pushed a slightly safer change to emacs-25
> branch, and I'm marking this bug done.
Thanks, Eli. I'm too ignorant to evaluate your C-level patch, but things
behave as I expect now.
Regards,
Michael.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 20 Mar 2016 11:24:04 GMT)
This bug report was last modified 9 years and 153 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.