GNU bug report logs - #18150
24.3.92; Uppercase umlauts and case-fold-search t

Package: emacs

Reported by: michael_heerdegen <at> web.de

Date: Wed, 30 Jul 2014 15:12:01 UTC

Severity: normal

Found in version 24.3.92

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

From: Eli Zaretskii <eliz <at> gnu.org>
To: Michael Heerdegen <michael_heerdegen <at> web.de>
Cc: 18150 <at> debbugs.gnu.org, mbork <at> mbork.pl
Subject: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 20:57:41 +0200
> From: Michael Heerdegen <michael_heerdegen <at> web.de>
> Cc: Marcin Borkowski <mbork <at> mbork.pl>,  18150 <at> debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > What do we expect the result to be in the variant below?
> >
> >    (let ((str "ecole")
> >          (case-fold-search t))
> >      (when (string-match "[[:upper:]]" str)
> >        (match-string 0 str)))
> 
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
> 
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below is TRT when TRANSLATE downcases; when
it does something else, the question is: do we want to test the match
only on the result of TRANSLATE (which is what the original code
does), or do we want something else?
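
To make the question concrete, here is a minimal standalone sketch of the two class tests, not the regex.c code itself.  It assumes that `c` holds the character after TRANSLATE and `corig` the character as read from the text, as the variable names in the patch suggest.  The helper `translate_downcase` is a hypothetical stand-in for TRANSLATE that only downcases ASCII; Emacs uses case tables instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-in for TRANSLATE with case-fold-search non-nil:
   downcases ASCII letters only.  */
static int
translate_downcase (int c)
{
  return tolower (c);
}

/* Original test: only the translated character is examined.  */
static bool
upper_matches_old (int corig)
{
  int c = translate_downcase (corig);
  return isupper (c);
}

/* Test in the shape used by the patch: also accept a character whose
   translation differs from it and is lowercase.  */
static bool
upper_matches_new (int corig)
{
  int c = translate_downcase (corig);
  return isupper (c) || (corig != c && islower (c));
}

/* Same shape for the [:lower:] class.  */
static bool
lower_matches_new (int corig)
{
  int c = translate_downcase (corig);
  return islower (c) || (corig != c && isupper (c));
}

void
demo_class_test (void)
{
  assert (!upper_matches_old ('E'));  /* 'E' was downcased before the test */
  assert (upper_matches_new ('E'));   /* the original character is consulted */
  assert (lower_matches_new ('E'));
  assert (lower_matches_new ('e'));
}
```

Under these assumptions an uppercase letter fails the old [:upper:] test, because it has already been downcased by the time the class is checked, and passes the new one.  If TRANSLATE does something other than downcasing, the `corig != c` comparison no longer says anything about case, which is the open question above.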

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it has existed since Emacs 20.
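
The unibyte approach can be sketched roughly as follows.  This illustrates the idea only and is not the code in re_compile_pattern: the 256-entry bitmap layout, the function names, and the ASCII-only `fold_byte` helper standing in for TRANSLATE are all assumptions made for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

enum { NBYTES = 256 };

/* Hypothetical stand-in for TRANSLATE: downcases ASCII letters.  */
static int
fold_byte (int c)
{
  return tolower (c);
}

/* At pattern compile time: for every byte in the [:upper:] class, set
   the bit of its *translated* value.  */
static void
build_upper_bitmap (unsigned char bitmap[NBYTES / 8])
{
  memset (bitmap, 0, NBYTES / 8);
  for (int ch = 0; ch < NBYTES; ch++)
    if (isupper (ch))
      {
        int t = fold_byte (ch);
        bitmap[t / 8] |= (unsigned char) (1u << (t % 8));
      }
}

/* At match time: translate the text byte, then look up its bit.  */
static bool
bitmap_matches (const unsigned char bitmap[NBYTES / 8], unsigned char ch)
{
  int t = fold_byte (ch);
  return (bitmap[t / 8] >> (t % 8)) & 1;
}

void
demo_bitmap (void)
{
  unsigned char bitmap[NBYTES / 8];
  build_upper_bitmap (bitmap);
  assert (bitmap_matches (bitmap, 'E'));
  assert (bitmap_matches (bitmap, 'e'));   /* matches via the folded bit */
  assert (!bitmap_matches (bitmap, '1'));
}
```

Because both the bitmap and the text byte go through the same translation, a letter of either case matches the class.  Enumerating every character this way is feasible for 256 bytes but not for the full multibyte range, hence the per-character test at match time.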

diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset:
 	case charset_not:
 	  {
-	    register unsigned int c;
+	    register unsigned int c, corig;
 	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;
 
@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      }
 
 	    PREFETCH ();
-	    c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
 	    if (target_multibyte)
 	      {
 		int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      {
 		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
 
-		if (  (class_bits & BIT_LOWER && ISLOWER (c))
+		if (  (class_bits & BIT_LOWER
+		       && (ISLOWER (c) || (corig != c && ISUPPER(c))))
 		    | (class_bits & BIT_MULTIBYTE)
 		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
 		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER && ISUPPER (c))
+		    | (class_bits & BIT_UPPER
+		       && (ISUPPER (c) || (corig != c && ISLOWER (c))))
 		    | (class_bits & BIT_WORD  && ISWORD  (c))
 		    | (class_bits & BIT_ALPHA && ISALPHA (c))
 		    | (class_bits & BIT_ALNUM && ISALNUM (c))




This bug report was last modified 9 years and 154 days ago.