#18150 - 24.3.92; Uppercase umlauts and case-fold-search t


Package: emacs;

Reported by: michael_heerdegen <at> web.de

Date: Wed, 30 Jul 2014 15:12:01 UTC

Severity: normal

Found in version 24.3.92

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.


From: Eli Zaretskii <eliz <at> gnu.org>
To: Michael Heerdegen <michael_heerdegen <at> web.de>
Cc: 18150 <at> debbugs.gnu.org, mbork <at> mbork.pl
Subject: bug#18150: 24.3.92; Uppercase umlauts and case-fold-search t
Date: Tue, 16 Feb 2016 20:57:41 +0200

> From: Michael Heerdegen <michael_heerdegen <at> web.de>
> Cc: Marcin Borkowski <mbork <at> mbork.pl>, 18150 <at> debbugs.gnu.org
> Date: Tue, 16 Feb 2016 19:38:21 +0100
>
> Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > What do we expect the result to be in the variant below?
> >
> >   (let ((str "ecole")
> >         (case-fold-search t))
> >     (when (string-match "[[:upper:]]" str)
> >       (match-string 0 str)))
>
> According to the docstring of `case-fold-search', I would expect "e"
> (which the expression returns here).
>
> Before having thought about it, 70% of me expected `nil'.

That's exactly the point.

If, when case-fold-search is non-nil, we want both [:upper:] and
[:lower:] to match any letter that has a case variant, then the patch
below seems to do the job.  Does anyone see a problem with it?

The gotcha here is that regex.c doesn't know what TRANSLATE does, and
no one promises that TRANSLATE downcases characters.  It could fold
them, for example, or, more generally, transform them in any way the
caller wants.  The patch below is TRT when TRANSLATE downcases; when
it does something else, the question is: do we want to test the match
only on the result of TRANSLATE (which is what the original code
does), or do we want something else?

For the unibyte case, re_compile_pattern sets up a bitmap for
characters _after_ TRANSLATE, so things work as expected.  We cannot
do that for multibyte characters -- there are too many of them -- so
this problem arises.  AFAICS, it existed since Emacs 20.

diff --git a/src/regex.c b/src/regex.c
index dd3f2b3..27dce8b 100644
--- a/src/regex.c
+++ b/src/regex.c
@@ -5444,7 +5444,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	case charset:
 	case charset_not:
 	  {
-	    register unsigned int c;
+	    register unsigned int c, corig;
 	    boolean not = (re_opcode_t) *(p - 1) == charset_not;
 	    int len;
 
@@ -5473,7 +5473,7 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      }
 
 	    PREFETCH ();
-	    c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
+	    corig = c = RE_STRING_CHAR_AND_LENGTH (d, len, target_multibyte);
 	    if (target_multibyte)
 	      {
 		int c1;
@@ -5517,11 +5517,13 @@ re_match_2_internal (struct re_pattern_buffer *bufp, const_re_char *string1,
 	      {
 		int class_bits = CHARSET_RANGE_TABLE_BITS (&p[-1]);
 
-		if (   (class_bits & BIT_LOWER && ISLOWER (c))
+		if (   (class_bits & BIT_LOWER
+			&& (ISLOWER (c) || (corig != c && ISUPPER(c))))
 		    | (class_bits & BIT_MULTIBYTE)
 		    | (class_bits & BIT_PUNCT && ISPUNCT (c))
 		    | (class_bits & BIT_SPACE && ISSPACE (c))
-		    | (class_bits & BIT_UPPER && ISUPPER (c))
+		    | (class_bits & BIT_UPPER
+			&& (ISUPPER (c) || (corig != c && ISLOWER (c))))
 		    | (class_bits & BIT_WORD && ISWORD (c))
 		    | (class_bits & BIT_ALPHA && ISALPHA (c))
 		    | (class_bits & BIT_ALNUM && ISALNUM (c))
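
To make the effect of the changed condition easier to follow, here is a
small stand-alone C sketch of just the [:lower:]/[:upper:] part of the
test.  It is only a toy model, not the regex.c code: it works on ASCII
with <ctype.h> instead of the real ISLOWER/ISUPPER macros, and the names
CLASS_LOWER, CLASS_UPPER, translate_fn, translate_downcase,
translate_upcase, class_match_old and class_match_new are all made up
for illustration.  A NULL translate function stands in for "no TRANSLATE
table".

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for BIT_LOWER / BIT_UPPER in regex.c.  */
enum { CLASS_LOWER = 1, CLASS_UPPER = 2 };

/* Hypothetical model of TRANSLATE: any char-to-char mapping,
   or NULL when no translation is in effect.  */
typedef int (*translate_fn) (int);

int translate_downcase (int c) { return tolower (c); }
int translate_upcase (int c) { return toupper (c); }

/* Model of the pre-patch test: the class is checked only against
   the translated character.  */
bool
class_match_old (int class_bits, int corig, translate_fn translate)
{
  int c = translate ? translate (corig) : corig;

  return ((class_bits & CLASS_LOWER) && islower (c))
	 || ((class_bits & CLASS_UPPER) && isupper (c));
}

/* Model of the patched test: if translation changed the character,
   the opposite case of the translated character is accepted too.  */
bool
class_match_new (int class_bits, int corig, translate_fn translate)
{
  int c = translate ? translate (corig) : corig;

  return ((class_bits & CLASS_LOWER)
	  && (islower (c) || (corig != c && isupper (c))))
	 || ((class_bits & CLASS_UPPER)
	     && (isupper (c) || (corig != c && islower (c))));
}
```

In this model, with translate_downcase, class_match_old (CLASS_UPPER,
'E', ...) is false because only the downcased 'e' is tested, while
class_match_new (CLASS_UPPER, 'E', ...) is true.  A character that the
translation leaves unchanged (for example 'e' under translate_downcase)
takes the same path in both versions, since the extra arm only fires
when corig != c.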

This bug report was last modified 9 years and 154 days ago.

