GNU bug report logs - #18150
24.3.92; Uppercase umlauts and case-fold-search t

Package: emacs;

Reported by: michael_heerdegen <at> web.de

Date: Wed, 30 Jul 2014 15:12:01 UTC

Severity: normal

Found in version 24.3.92

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 18150 in the body.
You can then email your comments to 18150 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#18150; Package emacs. (Wed, 30 Jul 2014 15:12:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to michael_heerdegen <at> web.de:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 30 Jul 2014 15:12:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Michael Heerdegen <michael_heerdegen <at> web.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Wed, 30 Jul 2014 17:11:01 +0200

Hello,


sorry if this is just a unibyte/multibyte thing I don't understand, but
it makes no sense to me:

  (let ((str "École")
        (case-fold-search t))
    (when (string-match "[[:upper:]]" str)
      (match-string 0 str)))

==> "c"

However,

  (let ((str "École")
        (case-fold-search nil))
    (when (string-match "[[:upper:]]" str)
      (match-string 0 str)))

==> "É"

I would expect "É" in both examples.


Thanks,

Michael.




In GNU Emacs 24.3.92.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.12.2)
 of 2014-07-17 on drachen
Windowing system distributor `The X.Org Foundation', version 11.0.11600000
System Description:	Debian GNU/Linux testing (jessie)

Important settings:
  value of $LC_ALL: de_DE.utf8
  value of $LC_COLLATE: C
  value of $LC_TIME: C
  value of $LANG: de_DE.utf8
  locale-coding-system: utf-8-unix

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#18150; Package emacs. (Tue, 16 Feb 2016 14:54:01 GMT) Full text and rfc822 format available.

Message #8 received at 18150 <at> debbugs.gnu.org (full text, mbox):

From: Marcin Borkowski <mbork <at> mbork.pl>
To: Michael Heerdegen <michael_heerdegen <at> web.de>
Cc: 18150 <at> debbugs.gnu.org
Subject: Re: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 15:53:41 +0100

Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).

Best,
mb



On 2014-07-30, at 18:11, Michael Heerdegen <michael_heerdegen <at> web.de> wrote:

> Hello,
>
>
> sorry if this is just a unibyte/multibyte thing I don't understand, but
> it makes no sense to me:
>
>   (let ((str "École")
>         (case-fold-search t))
>     (when (string-match "[[:upper:]]" str)
>       (match-string 0 str)))
>
> ==> "c"
>
> However,
>
>   (let ((str "École")
>         (case-fold-search nil))
>     (when (string-match "[[:upper:]]" str)
>       (match-string 0 str)))
>
> ==> "É"
>
> I would expect "É" in both examples.
>
>
> Thanks,
>
> Michael.
>
>
>
>
> In GNU Emacs 24.3.92.1 (x86_64-unknown-linux-gnu, GTK+ Version 3.12.2)
>  of 2014-07-17 on drachen
> Windowing system distributor `The X.Org Foundation', version 11.0.11600000
> System Description:	Debian GNU/Linux testing (jessie)
>
> Important settings:
>   value of $LC_ALL: de_DE.utf8
>   value of $LC_COLLATE: C
>   value of $LC_TIME: C
>   value of $LANG: de_DE.utf8
>   locale-coding-system: utf-8-unix

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#18150; Package emacs. (Tue, 16 Feb 2016 18:10:01 GMT) Full text and rfc822 format available.

Message #11 received at 18150 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Marcin Borkowski <mbork <at> mbork.pl>
Cc: michael_heerdegen <at> web.de, 18150 <at> debbugs.gnu.org
Subject: Re: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 20:09:02 +0200

> From: Marcin Borkowski <mbork <at> mbork.pl>
> Date: Tue, 16 Feb 2016 15:53:41 +0100
> Cc: 18150 <at> debbugs.gnu.org
> 
> Confirmed on emacs -Q (GNU Emacs 25.1.50.2, commit 4ccd268).
> 
> Best,
> mb
> 
> On 2014-07-30, at 18:11, Michael Heerdegen <michael_heerdegen <at> web.de> wrote:
> 
> > Hello,
> >
> >
> > sorry if this is just a unibyte/multibyte thing I don't understand, but
> > it makes no sense to me:
> >
> >   (let ((str "École")
> >         (case-fold-search t))
> >     (when (string-match "[[:upper:]]" str)
> >       (match-string 0 str)))
> >
> > ==> "c"
> >
> > However,
> >
> >   (let ((str "École")
> >         (case-fold-search nil))
> >     (when (string-match "[[:upper:]]" str)
> >       (match-string 0 str)))
> >
> > ==> "É"
> >
> > I would expect "É" in both examples.

What do we expect the result to be in the variant below?

   (let ((str "ecole")
         (case-fold-search t))
     (when (string-match "[[:upper:]]" str)
       (match-string 0 str)))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#18150; Package emacs. (Tue, 16 Feb 2016 18:39:02 GMT) Full text and rfc822 format available.

Message #14 received at 18150 <at> debbugs.gnu.org (full text, mbox):

From: Michael Heerdegen <michael_heerdegen <at> web.de>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 18150 <at> debbugs.gnu.org, Marcin Borkowski <mbork <at> mbork.pl>
Subject: Re: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 19:38:21 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

> What do we expect the result to be in the variant below?
>
>    (let ((str "ecole")
>          (case-fold-search t))
>      (when (string-match "[[:upper:]]" str)
>        (match-string 0 str)))

According to the docstring of `case-fold-search', I would expect "e"
(which the expression returns here).

Before having thought about it, 70% of me expected `nil'.

Michael.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#18150; Package emacs. (Tue, 16 Feb 2016 18:58:01 GMT) Full text and rfc822 format available.

Message #17 received at 18150 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Michael Heerdegen <michael_heerdegen <at> web.de>
Cc: 18150 <at> debbugs.gnu.org, mbork <at> mbork.pl
Subject: Re: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 20:57:41 +0200

> From: Michael Heerdegen <michael_heerdegen <at> web.de>
> Cc: Marcin Borkowski <mbork <at> mbork.pl>,  18150 <at> debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > What do we expect the result to be in the variant below?
> >
> >    (let ((str "ecole")
> >          (case-fold-search t))
> >      (when (string-match "[[:upper:]]" str)
> >        (match-string 0 str)))
> 
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
> 
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below does the right thing when TRANSLATE downcases; when
it does something else, the question is: do we want to test the match
only on the result of TRANSLATE (which is what the original code
does), or do we want something else?

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it existed since Emacs 20.
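
To illustrate the unibyte point, here is a minimal sketch of a class bitmap built from characters after translation. It is not the code in regex.c: the names class_bitmap, translate, compile_upper_class and match_upper_class are hypothetical, and translate is assumed to downcase when folding, which (as noted above) nobody promises about the real TRANSLATE.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* One bit per possible unibyte character (hypothetical type). */
typedef struct { unsigned char bits[256 / 8]; } class_bitmap;

static void bitmap_set(class_bitmap *bm, unsigned char ch)
{
    bm->bits[ch / 8] |= (unsigned char)(1u << (ch % 8));
}

static bool bitmap_test(const class_bitmap *bm, unsigned char ch)
{
    return ((bm->bits[ch / 8] >> (ch % 8)) & 1u) != 0;
}

/* Assumption: the translation simply downcases when folding is on. */
static unsigned char translate(unsigned char ch, bool fold)
{
    return fold ? (unsigned char)tolower(ch) : ch;
}

/* "Compile" [[:upper:]]: record each member of the class under its
   translated code, so the bitmap describes post-translation characters. */
static void compile_upper_class(class_bitmap *bm, bool fold)
{
    memset(bm, 0, sizeof *bm);
    for (int ch = 0; ch < 256; ch++)
        if (isupper(ch))
            bitmap_set(bm, translate((unsigned char)ch, fold));
}

/* Match time: translate the input the same way, then look it up. */
static bool match_upper_class(const class_bitmap *bm, unsigned char ch, bool fold)
{
    return bitmap_test(bm, translate(ch, fold));
}
```

In this sketch, with folding on, both 'E' and 'e' match the class because they translate to the same bitmap entry; with folding off only 'E' does. Enumerating every character this way is only practical for a 256-entry table, which is the limitation described above for multibyte characters.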

diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset:
 	case charset_not:
 	  {
-	    register unsigned int c;
+	    register unsigned int c, corig;
 	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;

@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      }

 	    PREFETCH ();
-	    c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
 	    if (target_multibyte)
 	      {
 		int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      {
 		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);

-		if (  (class_bits & BIT_LOWER && ISLOWER (c))
+		if (  (class_bits & BIT_LOWER
+		       && (ISLOWER (c) || (corig != c && ISUPPER(c))))
 		    | (class_bits & BIT_MULTIBYTE)
 		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
 		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER && ISUPPER (c))
+		    | (class_bits & BIT_UPPER
+		       && (ISUPPER (c) || (corig != c && ISLOWER (c))))
 		    | (class_bits & BIT_WORD  && ISWORD  (c))
 		    | (class_bits & BIT_ALPHA && ISALPHA (c))
 		    | (class_bits & BIT_ALNUM && ISALNUM (c))
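
The effect of the two changed conditions in the patch above can be sketched in isolation. This is only an illustration under stated assumptions, not the change that was later committed: ASCII letters and the <ctype.h> predicates stand in for multibyte characters and for the ISUPPER/ISLOWER macros, translate is a hypothetical stand-in for TRANSLATE that downcases when folding, and the bit values are made up.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical bit values; the real ones are defined in regex.c. */
enum { BIT_LOWER = 1, BIT_UPPER = 2 };

/* Assumption: the translation downcases when case folding is on. */
static int translate(int ch, bool fold)
{
    return fold ? tolower(ch) : ch;
}

/* Before the patch: the class tests see only the translated character. */
static bool class_match_old(int class_bits, int corig, bool fold)
{
    int c = translate(corig, fold);
    return ((class_bits & BIT_LOWER) && islower(c))
        || ((class_bits & BIT_UPPER) && isupper(c));
}

/* With the patch: if translation changed the character (corig != c),
   the opposite case of the translated character also counts. */
static bool class_match_new(int class_bits, int corig, bool fold)
{
    int c = translate(corig, fold);
    return ((class_bits & BIT_LOWER)
            && (islower(c) || (corig != c && isupper(c))))
        || ((class_bits & BIT_UPPER)
            && (isupper(c) || (corig != c && islower(c))));
}
```

With folding on, class_match_old(BIT_UPPER, 'E', true) is false, because the test only sees the downcased 'e'; class_match_new(BIT_UPPER, 'E', true) is true. A character that the translation leaves unchanged, such as 'e', is treated identically by both functions in this sketch.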

Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 20 Feb 2016 11:07:01 GMT) Full text and rfc822 format available.

Notification sent to michael_heerdegen <at> web.de:
bug acknowledged by developer. (Sat, 20 Feb 2016 11:07:02 GMT) Full text and rfc822 format available.

Message #22 received at 18150-done <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: michael_heerdegen <at> web.de
Cc: 18150-done <at> debbugs.gnu.org, mbork <at> mbork.pl
Subject: Re: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Sat, 20 Feb 2016 13:06:01 +0200

> Date: Tue, 16 Feb 2016 20:57:41 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: 18150 <at> debbugs.gnu.org, mbork <at> mbork.pl
> 
> If, when case-fold-search is non-nil, we want both [:upper:] and
> [:lower:] to match any letter that has a case variant, then the patch
> below seems to do the job.  Does anyone see a problem with it?

No further comment, so I pushed a slightly safer change to emacs-25
branch, and I'm marking this bug done.

Thanks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#18150; Package emacs. (Sat, 20 Feb 2016 12:10:02 GMT) Full text and rfc822 format available.

Message #25 received at 18150 <at> debbugs.gnu.org (full text, mbox):

From: Michael Heerdegen <michael_heerdegen <at> web.de>
To: 18150 <at> debbugs.gnu.org
Cc: eliz <at> gnu.org
Subject: Re: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Sat, 20 Feb 2016 13:09:06 +0100

Eli Zaretskii <eliz <at> gnu.org> writes:

> No further comment, so I pushed a slightly safer change to emacs-25
> branch, and I'm marking this bug done.

Thanks, Eli.  I'm too ignorant to evaluate your C-level patch, but things
behave as I expect now.


Regards,

Michael.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 20 Mar 2016 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 9 years and 153 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
