From unknown Sat Jun 21 10:17:43 2025 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.509 (Entity 5.509) Content-Type: text/plain; charset=utf-8 From: bug#16581 <16581@debbugs.gnu.org> To: bug#16581 <16581@debbugs.gnu.org> Subject: Status: suggested code simplification in dfa.c Reply-To: bug#16581 <16581@debbugs.gnu.org> Date: Sat, 21 Jun 2025 17:17:43 +0000 retitle 16581 suggested code simplification in dfa.c reassign 16581 grep submitter 16581 Aharon Robbins severity 16581 normal thanks From debbugs-submit-bounces@debbugs.gnu.org Tue Jan 28 15:11:47 2014 Received: (at submit) by debbugs.gnu.org; 28 Jan 2014 20:11:47 +0000 Received: from localhost ([127.0.0.1]:39575 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8F0d-0002k4-3u for submit@debbugs.gnu.org; Tue, 28 Jan 2014 15:11:47 -0500 Received: from eggs.gnu.org ([208.118.235.92]:52787) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8F0b-0002jw-CR for submit@debbugs.gnu.org; Tue, 28 Jan 2014 15:11:45 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1W8F0R-00083f-NV for submit@debbugs.gnu.org; Tue, 28 Jan 2014 15:11:45 -0500 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.5 required=5.0 tests=BAYES_05,T_MANY_HDRS_LCASE autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:33061) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1W8F0R-00083b-Kv for submit@debbugs.gnu.org; Tue, 28 Jan 2014 15:11:35 -0500 Received: from eggs.gnu.org ([2001:4830:134:3::10]:54436) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1W8F0K-0002sT-As for bug-grep@gnu.org; Tue, 28 Jan 2014 15:11:35 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1W8F0D-00081D-0J for bug-grep@gnu.org; Tue, 28 Jan 2014 
15:11:28 -0500 Received: from mxout4.netvision.net.il ([194.90.9.27]:43773) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1W8F0C-00080w-Je for bug-grep@gnu.org; Tue, 28 Jan 2014 15:11:20 -0500 MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; CHARSET=US-ASCII Received: from skeeve.com ([93.172.51.72]) by mxout4.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTP id <0N04004UFO2TOLI0@mxout4.netvision.net.il> for bug-grep@gnu.org; Tue, 28 Jan 2014 22:11:18 +0200 (IST) Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.14.4/8.14.4/Debian-2ubuntu2.1) with ESMTP id s0SKBFSU008494 for ; Tue, 28 Jan 2014 22:11:15 +0200 Received: (from arnold@localhost) by skeeve.com (8.14.4/8.14.4/Submit) id s0SKBE19008493 for bug-grep@gnu.org; Tue, 28 Jan 2014 22:11:14 +0200 From: Aharon Robbins Message-id: <201401282011.s0SKBE19008493@skeeve.com> Date: Tue, 28 Jan 2014 22:11:14 +0200 To: bug-grep@gnu.org Subject: suggested code simplification in dfa.c User-Agent: Heirloom mailx 12.5 6/20/10 X-detected-operating-system: by eggs.gnu.org: Solaris 10 X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address (bad octet value). X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) Hi. The code in atom() looks to me like it could use a little refactoring and simplification. I suggest the diff below. With it both grep and gawk still pass their tests. 
Thanks, Arnold diff --git a/src/dfa.c b/src/dfa.c index b79c604..d2916ee 100644 --- a/src/dfa.c +++ b/src/dfa.c @@ -1725,17 +1725,20 @@ add_utf8_anychar (void) static void atom (void) { - if (0) + if (MBS_SUPPORT && tok == WCHAR) { - /* empty */ - } - else if (MBS_SUPPORT && tok == WCHAR) - { - addtok_wc (case_fold ? towlower (wctok) : wctok); - if (case_fold && iswalpha (wctok)) + if (! case_fold) + { + addtok_wc (wctok); + } + else { - addtok_wc (towupper (wctok)); - addtok (OR); + addtok_wc (towlower (wctok)); + if (iswalpha (wctok)) + { + addtok_wc (towupper (wctok)); + addtok (OR); + } } tok = lex (); From debbugs-submit-bounces@debbugs.gnu.org Tue Jan 28 16:50:59 2014 Received: (at 16581) by debbugs.gnu.org; 28 Jan 2014 21:50:59 +0000 Received: from localhost ([127.0.0.1]:39609 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8GYc-0006P7-Ja for submit@debbugs.gnu.org; Tue, 28 Jan 2014 16:50:59 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:35866) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8GYY-0006Ot-Py for 16581@debbugs.gnu.org; Tue, 28 Jan 2014 16:50:56 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id D94ADA60001; Tue, 28 Jan 2014 13:50:53 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id xElzAecOEMfC; Tue, 28 Jan 2014 13:50:53 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 300B839E8015; Tue, 28 Jan 2014 13:50:53 -0800 (PST) Message-ID: <52E8263C.6050206@cs.ucla.edu> Date: Tue, 28 Jan 2014 13:50:52 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Aharon Robbins , 
16581@debbugs.gnu.org Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> In-Reply-To: <201401282011.s0SKBE19008493@skeeve.com> Content-Type: multipart/mixed; boundary="------------040900030507020406030300" X-Spam-Score: -2.8 (--) X-Debbugs-Envelope-To: 16581 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.8 (--) This is a multi-part message in MIME format. --------------040900030507020406030300 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit I like that as far as it goes, but it pulls loose a thread that has been nagging me for a while. How about the attached instead? It includes somewhat more simplification, entailing more-efficient handling of caseless letters when ignoring case. --------------040900030507020406030300 Content-Type: text/x-patch; name="0001-Simplify-handling-of-letter-case.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-Simplify-handling-of-letter-case.patch" >From 85efede266be9d2cda8d229c012828b6ae4574c5 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Tue, 28 Jan 2014 13:47:47 -0800 Subject: [PATCH] Simplify handling of letter case. * src/dfa.c (setbit_wc, setbit_case_fold_c, atom): Simplify. (setbit_case_fold_c, parse_bracket_exp, lex, atom): Invoke tolower and toupper instead of isalpha followed by one or the other, and similarly for towlower, towupper, iswalpha. This should lead to more-efficient handling of caseless letters, and it simplifies the code. 
--- src/dfa.c | 93 ++++++++++++++++++++++++++++++--------------------------------- 1 file changed, 44 insertions(+), 49 deletions(-) diff --git a/src/dfa.c b/src/dfa.c index b79c604..72beed0 100644 --- a/src/dfa.c +++ b/src/dfa.c @@ -693,39 +693,24 @@ dfasyntax (reg_syntax_t bits, int fold, unsigned char eol) this may happen when folding case in weird Turkish locales where dotless i/dotted I are not included in the chosen character set. Return whether a bit was set in the charclass. */ -#if MBS_SUPPORT static bool setbit_wc (wint_t wc, charclass c) { +#if MBS_SUPPORT int b = wctob (wc); if (b == EOF) return false; setbit (b, c); return true; -} - -/* Set a bit in the charclass for the given single byte character, - if it is valid in the current character set. */ -static void -setbit_c (int b, charclass c) -{ - /* Do nothing if b is invalid in this character set. */ - if (MB_CUR_MAX > 1 && btowc (b) == WEOF) - return; - setbit (b, c); -} #else -# define setbit_c setbit -static inline bool -setbit_wc (wint_t wc, charclass c) -{ abort (); /*NOTREACHED*/ return false; -} #endif +} -/* Like setbit_c, but if case is folded, set both cases of a letter. For +/* Like setbit_wc but for a single-byte character B; and if case is + folded, set both cases of a letter. For MB_CUR_MAX > 1, the resulting charset is only used as an optimization, and the caller takes care of setting the appropriate field of struct mb_char_classes. */ @@ -737,16 +722,16 @@ setbit_case_fold_c (int b, charclass c) wint_t wc = btowc (b); if (wc == WEOF) return; - setbit (b, c); - if (case_fold && iswalpha (wc)) - setbit_wc (iswupper (wc) ? towlower (wc) : towupper (wc), c); + if (case_fold) + setbit_wc (wc ^ towlower (wc) ^ towupper (wc), c); } else { - setbit (b, c); - if (case_fold && isalpha (b)) - setbit_c (isupper (b) ? 
tolower (b) : toupper (b), c); + if (case_fold) + setbit (b ^ tolower (b) ^ toupper (b), c); } + + setbit (b, c); } @@ -1085,23 +1070,30 @@ parse_bracket_exp (void) { /* When case folding map a range, say [m-z] (or even [M-z]) to the pair of ranges, [m-z] [M-Z]. */ + wchar_t lo1 = wc, hi1 = wc2, lo2 = wc, hi2 = wc2; + if (case_fold) + { + lo1 = towlower (lo1); + hi1 = towlower (hi1); + lo2 = towupper (lo2); + hi2 = towupper (hi2); + } + REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); - work_mbc->range_sts[work_mbc->nranges] = - case_fold ? towlower (wc) : (wchar_t) wc; - work_mbc->range_ends[work_mbc->nranges++] = - case_fold ? towlower (wc2) : (wchar_t) wc2; + work_mbc->range_sts[work_mbc->nranges] = lo1; + work_mbc->range_ends[work_mbc->nranges++] = hi1; - if (case_fold && (iswalpha (wc) || iswalpha (wc2))) + if (lo1 != lo2 || hi1 != hi2) { REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); - work_mbc->range_sts[work_mbc->nranges] = towupper (wc); + work_mbc->range_sts[work_mbc->nranges] = lo2; REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); - work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2); + work_mbc->range_ends[work_mbc->nranges++] = hi2; } } else @@ -1129,16 +1121,18 @@ parse_bracket_exp (void) continue; } - if (case_fold && iswalpha (wc)) + if (case_fold) { - wc = towlower (wc); - if (!setbit_wc (wc, ccl)) + wchar_t diff = towlower (wc) ^ towupper (wc); + if (diff) { - REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, - work_mbc->nchars + 1); - work_mbc->chars[work_mbc->nchars++] = wc; + if (!setbit_wc (wc ^ diff, ccl)) + { + REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, + work_mbc->nchars + 1); + work_mbc->chars[work_mbc->nchars++] = wc ^ diff; + } } - wc = towupper (wc); } if (!setbit_wc (wc, ccl)) { @@ -1481,7 +1475,7 @@ lex (void) if (MB_CUR_MAX > 1) return 
lasttok = WCHAR; - if (case_fold && isalpha (c)) + if (case_fold && tolower (c) != toupper (c)) { zeroset (ccl); setbit_case_fold_c (c, ccl); @@ -1725,17 +1719,18 @@ add_utf8_anychar (void) static void atom (void) { - if (0) - { - /* empty */ - } - else if (MBS_SUPPORT && tok == WCHAR) + if (MBS_SUPPORT && tok == WCHAR) { - addtok_wc (case_fold ? towlower (wctok) : wctok); - if (case_fold && iswalpha (wctok)) + wchar_t wc = wctok; + addtok_wc (wc); + if (case_fold) { - addtok_wc (towupper (wctok)); - addtok (OR); + wchar_t diff = towlower (wc) ^ towupper (wc); + if (diff) + { + addtok_wc (wc ^ diff); + addtok (OR); + } } tok = lex (); -- 1.8.5.3 --------------040900030507020406030300-- From debbugs-submit-bounces@debbugs.gnu.org Tue Jan 28 21:51:16 2014 Received: (at 16581) by debbugs.gnu.org; 29 Jan 2014 02:51:16 +0000 Received: from localhost ([127.0.0.1]:39724 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8LFD-0006Vv-Jq for submit@debbugs.gnu.org; Tue, 28 Jan 2014 21:51:16 -0500 Received: from mxout4.netvision.net.il ([194.90.9.27]:37533) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8LFB-0006Vk-GW for 16581@debbugs.gnu.org; Tue, 28 Jan 2014 21:51:14 -0500 MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; CHARSET=US-ASCII Received: from skeeve.com ([93.172.51.72]) by mxout4.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTP id <0N05008H76KP52B0@mxout4.netvision.net.il> for 16581@debbugs.gnu.org; Wed, 29 Jan 2014 04:50:50 +0200 (IST) Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.14.4/8.14.4/Debian-2ubuntu2.1) with ESMTP id s0T2omxD002396; Wed, 29 Jan 2014 04:50:48 +0200 Received: (from arnold@localhost) by skeeve.com (8.14.4/8.14.4/Submit) id s0T2olcN002395; Wed, 29 Jan 2014 04:50:47 +0200 From: Aharon Robbins Message-id: <201401290250.s0T2olcN002395@skeeve.com> Date: Wed, 29 Jan 2014 04:50:47 
+0200 To: eggert@cs.ucla.edu, arnold@skeeve.com, 16581@debbugs.gnu.org Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> In-reply-to: <52E8263C.6050206@cs.ucla.edu> User-Agent: Heirloom mailx 12.5 6/20/10 X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 16581 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/)

Hi Paul.

I skimmed the patch. All that exclusive-ORing looks a little scary to me. Will that work, for example, on EBCDIC systems? Gawk supports z/OS - a POSIX environment on top of OS/390. Will it work on systems using some of the older Far Eastern, non-Unicode locales?

What is it even doing? What do you expect to get from wc ^ towlower(wc) ^ towupper(wc) ?

I'm worried that you've embedded a deep assumption about how characters are encoded and how upper and lower case relate to each other in every possible character set we might be called upon to handle, and it feels really risky to me.

I think I'd be happier if you did the simplification in smaller, more comprehensible, steps.

My two cents, of course.
:-) Thanks, Arnold From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 29 01:48:29 2014 Received: (at 16581) by debbugs.gnu.org; 29 Jan 2014 06:48:29 +0000 Received: from localhost ([127.0.0.1]:39749 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8Owm-0003z5-Ik for submit@debbugs.gnu.org; Wed, 29 Jan 2014 01:48:29 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:57996) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8Owi-0003yq-PK for 16581@debbugs.gnu.org; Wed, 29 Jan 2014 01:48:26 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id B151C39E8016; Tue, 28 Jan 2014 22:48:18 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id KnURnV0OdLch; Tue, 28 Jan 2014 22:48:17 -0800 (PST) Received: from [192.168.1.9] (pool-108-0-233-62.lsanca.fios.verizon.net [108.0.233.62]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 5317939E8015; Tue, 28 Jan 2014 22:48:17 -0800 (PST) Message-ID: <52E8A430.9000506@cs.ucla.edu> Date: Tue, 28 Jan 2014 22:48:16 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Aharon Robbins , 16581@debbugs.gnu.org Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> In-Reply-To: <201401290250.s0T2olcN002395@skeeve.com> Content-Type: multipart/mixed; boundary="------------010000070302010909090205" X-Spam-Score: -2.8 (--) X-Debbugs-Envelope-To: 16581 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: 
debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.8 (--) This is a multi-part message in MIME format. --------------010000070302010909090205 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Aharon Robbins wrote: > that exclusive-ORing looks a little scary to me. It is derived from a coding hack that I picked up from Dijkstra back in the 1970s. He gave the most boring computer-science lecture I have ever attended -- so boring that Kit Fine walked out a few minutes into it -- but I stubbornly stayed through to the end and I've never forgotten the hack. The hack works everywhere, including platforms that use EBCDIC, shift-JIS, DBCS, etc., because it doesn't rely on the encoding scheme at all. Attached is a revised patch that adds some commentary and breaks the hack into some functions that I hope help explain things. Sorry, I don't know how to break this into smaller patches that would be easier to understand. --------------010000070302010909090205 Content-Type: text/x-patch; name="0001-Simplify-handling-of-letter-case.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-Simplify-handling-of-letter-case.patch" >From 859b9496860e67d01e32a58e9f2a098410775a22 Mon Sep 17 00:00:00 2001 From: Paul Eggert Date: Tue, 28 Jan 2014 13:47:47 -0800 Subject: [PATCH] Simplify handling of letter case. * src/dfa.c (setbit_wc, setbit_case_fold_c, atom): Simplify. (xor_other, xor_wother, to_other, to_wother): New functions. (setbit_case_fold_c, parse_bracket_exp, lex, atom): Use them to invoke tolower and toupper instead of isalpha followed by one or the other, and similarly for towlower, towupper, iswalpha. This should lead to more-efficient handling of caseless letters, and it simplifies the code. 
--- src/dfa.c | 130 +++++++++++++++++++++++++++++++++++++++----------------------- 1 file changed, 81 insertions(+), 49 deletions(-) diff --git a/src/dfa.c b/src/dfa.c index b79c604..af830a8 100644 --- a/src/dfa.c +++ b/src/dfa.c @@ -662,6 +662,42 @@ wchar_context (wint_t wc) return CTX_NONE; } +/* The following functions exploit the commutativity and associativity of ^, + and the fact that X ^ X is zero. POSIX requires that C equals + either tolower (C) or toupper (C); if the former, then C ^ tolower (C) + is zero so C ^ xor_other (C) equals toupper (C), and similarly + for the latter. */ + +/* Return the exclusive-OR of C and C's other case, or zero if C is + not a letter that changes case. */ + +static int +xor_other (int c) +{ + return tolower (c) ^ toupper (c); +} + +static wint_t +xor_wother (wint_t c) +{ + return towlower (c) ^ towupper (c); +} + +/* If C is a lowercase letter, return its uppercase version, and vice versa. + Return C if it's not a letter that changes case. */ + +static int +to_other (int c) +{ + return c ^ xor_other (c); +} + +static wint_t +to_wother (wint_t c) +{ + return c ^ xor_wother (c); +} + /* Entry point to set syntax options. */ void dfasyntax (reg_syntax_t bits, int fold, unsigned char eol) @@ -693,39 +729,24 @@ dfasyntax (reg_syntax_t bits, int fold, unsigned char eol) this may happen when folding case in weird Turkish locales where dotless i/dotted I are not included in the chosen character set. Return whether a bit was set in the charclass. */ -#if MBS_SUPPORT static bool setbit_wc (wint_t wc, charclass c) { +#if MBS_SUPPORT int b = wctob (wc); if (b == EOF) return false; setbit (b, c); return true; -} - -/* Set a bit in the charclass for the given single byte character, - if it is valid in the current character set. */ -static void -setbit_c (int b, charclass c) -{ - /* Do nothing if b is invalid in this character set. 
*/ - if (MB_CUR_MAX > 1 && btowc (b) == WEOF) - return; - setbit (b, c); -} #else -# define setbit_c setbit -static inline bool -setbit_wc (wint_t wc, charclass c) -{ abort (); /*NOTREACHED*/ return false; -} #endif +} -/* Like setbit_c, but if case is folded, set both cases of a letter. For +/* Like setbit_wc but for a single-byte character B; and if case is + folded, set both cases of a letter. For MB_CUR_MAX > 1, the resulting charset is only used as an optimization, and the caller takes care of setting the appropriate field of struct mb_char_classes. */ @@ -737,16 +758,16 @@ setbit_case_fold_c (int b, charclass c) wint_t wc = btowc (b); if (wc == WEOF) return; - setbit (b, c); - if (case_fold && iswalpha (wc)) - setbit_wc (iswupper (wc) ? towlower (wc) : towupper (wc), c); + if (case_fold) + setbit_wc (to_wother (wc), c); } else { - setbit (b, c); - if (case_fold && isalpha (b)) - setbit_c (isupper (b) ? tolower (b) : toupper (b), c); + if (case_fold) + setbit (to_other (b), c); } + + setbit (b, c); } @@ -1085,23 +1106,30 @@ parse_bracket_exp (void) { /* When case folding map a range, say [m-z] (or even [M-z]) to the pair of ranges, [m-z] [M-Z]. */ + wchar_t lo1 = wc, hi1 = wc2, lo2 = wc, hi2 = wc2; + if (case_fold) + { + lo1 = towlower (lo1); + hi1 = towlower (hi1); + lo2 = towupper (lo2); + hi2 = towupper (hi2); + } + REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); - work_mbc->range_sts[work_mbc->nranges] = - case_fold ? towlower (wc) : (wchar_t) wc; - work_mbc->range_ends[work_mbc->nranges++] = - case_fold ? 
towlower (wc2) : (wchar_t) wc2; + work_mbc->range_sts[work_mbc->nranges] = lo1; + work_mbc->range_ends[work_mbc->nranges++] = hi1; - if (case_fold && (iswalpha (wc) || iswalpha (wc2))) + if (lo1 != lo2 || hi1 != hi2) { REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); - work_mbc->range_sts[work_mbc->nranges] = towupper (wc); + work_mbc->range_sts[work_mbc->nranges] = lo2; REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); - work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2); + work_mbc->range_ends[work_mbc->nranges++] = hi2; } } else @@ -1129,16 +1157,19 @@ parse_bracket_exp (void) continue; } - if (case_fold && iswalpha (wc)) + if (case_fold) { - wc = towlower (wc); - if (!setbit_wc (wc, ccl)) + wchar_t xor = xor_wother (wc); + if (xor) { - REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, - work_mbc->nchars + 1); - work_mbc->chars[work_mbc->nchars++] = wc; + wchar_t other = wc ^ xor; + if (!setbit_wc (other, ccl)) + { + REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, + work_mbc->nchars + 1); + work_mbc->chars[work_mbc->nchars++] = other; + } } - wc = towupper (wc); } if (!setbit_wc (wc, ccl)) { @@ -1481,7 +1512,7 @@ lex (void) if (MB_CUR_MAX > 1) return lasttok = WCHAR; - if (case_fold && isalpha (c)) + if (case_fold && tolower (c) != toupper (c)) { zeroset (ccl); setbit_case_fold_c (c, ccl); @@ -1725,17 +1756,18 @@ add_utf8_anychar (void) static void atom (void) { - if (0) - { - /* empty */ - } - else if (MBS_SUPPORT && tok == WCHAR) + if (MBS_SUPPORT && tok == WCHAR) { - addtok_wc (case_fold ? 
towlower (wctok) : wctok); - if (case_fold && iswalpha (wctok)) + wchar_t wc = wctok; + addtok_wc (wc); + if (case_fold) { - addtok_wc (towupper (wctok)); - addtok (OR); + wchar_t xor = xor_wother (wc); + if (xor) + { + addtok_wc (wc ^ xor); + addtok (OR); + } } tok = lex (); -- 1.8.5.3 --------------010000070302010909090205-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 29 08:42:24 2014 Received: (at 16581) by debbugs.gnu.org; 29 Jan 2014 13:42:24 +0000 Received: from localhost ([127.0.0.1]:39986 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8VPL-0006xM-Q7 for submit@debbugs.gnu.org; Wed, 29 Jan 2014 08:42:24 -0500 Received: from mx1.redhat.com ([209.132.183.28]:30722) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8VPI-0006xC-IM for 16581@debbugs.gnu.org; Wed, 29 Jan 2014 08:42:22 -0500 Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s0TDgBfP003453 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 29 Jan 2014 08:42:12 -0500 Received: from [10.3.113.18] ([10.3.113.18]) by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id s0TDgA40030096; Wed, 29 Jan 2014 08:42:10 -0500 Message-ID: <52E90532.2090104@redhat.com> Date: Wed, 29 Jan 2014 06:42:10 -0700 From: Eric Blake Organization: Red Hat, Inc. 
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Paul Eggert , Aharon Robbins , 16581@debbugs.gnu.org Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> In-Reply-To: <52E8A430.9000506@cs.ucla.edu> X-Enigmail-Version: 1.6 OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="fnhVotKR9JvEQrJ8uskmIunU8He3t89hX" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22 X-Spam-Score: -5.5 (-----) X-Debbugs-Envelope-To: 16581 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.5 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --fnhVotKR9JvEQrJ8uskmIunU8He3t89hX Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

On 01/28/2014 11:48 PM, Paul Eggert wrote:
>
> +/* The following functions exploit the commutativity and associativity of ^,
> + and the fact that X ^ X is zero. POSIX requires that C equals
> + either tolower (C) or toupper (C);

Unfortunately, while this is true, I'm not sure if it accurately covers all possible case-folded comparisons outside of the C locale.

http://www.unicode.org/faq/casemap_charprop.html

Consider the Greek locale, el_GR.UTF-8, which has two lower-case sigma: L'\x3c3' and L'\x3c2', but only one upper-case: L'\x3a3'. As a result, all three wchar_t values must compare case-insensitively to one another.
Or consider titlecase characters, such as Unicode L'\x1c8' (Lj), which has both an uppercase mapping L'\x1c7' (LJ) and lowercase mapping L'\x1c9' (lj) - again, all three wchar_t values must compare case-insensitively to one another. Your hack is great at finding characters that have a case mapping, but not necessarily at finding all such characters that map to the same result when passed through towlower(towupper(c)). --=20 Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --fnhVotKR9JvEQrJ8uskmIunU8He3t89hX Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCAAGBQJS6QUyAAoJEKeha0olJ0NquMQIAJxgOLCMuIrSzq22AUoVHWoy SJVGyh/6CIqOi7//BZWY74+G++15OLvJTrD9SARcCi7G9dCsNws/z8xNfxuoQ/w+ FObMCTZM5FpqbQTrCCWL6VRvypBWKbjPOA+M2WMI7GOHmeB5ICjte3pFWUleIHsY bXeOWtnqFtGQz+Jq/X3KvOwd19E5pKNGijc/iPDj82HdJGSqL7TTLW6gXa3d3tOL u6lv9Ph+GR9qRnMq3WRcRLDCRReeoAXuZqPdOjSbapys9RfHjcWHWfXCCj6ZwfMP 4ZldheUcGv74yoBRNYaEBlmwXUoyccjoujdmjofSkx9Yyr1oDBxIkImVxLLIBcQ= =8QfX -----END PGP SIGNATURE----- --fnhVotKR9JvEQrJ8uskmIunU8He3t89hX-- From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 29 08:49:20 2014 Received: (at 16581) by debbugs.gnu.org; 29 Jan 2014 13:49:20 +0000 Received: from localhost ([127.0.0.1]:39990 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8VW3-000780-Jz for submit@debbugs.gnu.org; Wed, 29 Jan 2014 08:49:19 -0500 Received: from mx1.redhat.com ([209.132.183.28]:51173) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8VW2-00077r-0E for 16581@debbugs.gnu.org; Wed, 29 Jan 2014 08:49:18 -0500 Received: from int-mx10.intmail.prod.int.phx2.redhat.com 
(int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s0TDnEhT014331 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK); Wed, 29 Jan 2014 08:49:15 -0500 Received: from [10.3.113.18] ([10.3.113.18]) by int-mx10.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id s0TDnEbC005533; Wed, 29 Jan 2014 08:49:14 -0500 Message-ID: <52E906D9.7060801@redhat.com> Date: Wed, 29 Jan 2014 06:49:13 -0700 From: Eric Blake Organization: Red Hat, Inc. User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Paul Eggert , Aharon Robbins , 16581@debbugs.gnu.org Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> <52E90532.2090104@redhat.com> In-Reply-To: <52E90532.2090104@redhat.com> X-Enigmail-Version: 1.6 OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="DHtwK9C6L05Gj5L568pHMQlI1U995qqcX" X-Scanned-By: MIMEDefang 2.68 on 10.5.11.23 X-Spam-Score: -5.5 (-----) X-Debbugs-Envelope-To: 16581 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.5 (-----) This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --DHtwK9C6L05Gj5L568pHMQlI1U995qqcX Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 01/29/2014 06:42 AM, Eric Blake wrote: > Your hack is great at finding characters that have a case mapping, but > not necessarily at finding all such characters that map to the same > result when passed through towlower(towupper(c)). 
>

In particular, note that the Java language has formalized case-insensitive comparison as follows:

http://docs.oracle.com/javase/1.5.0/docs/api/java/lang/String.html#equalsIgnoreCase%28java.lang.String%29

Two characters c1 and c2 are considered the same, ignoring case if at least one of the following is true:

    The two characters are the same (as compared by the == operator).
    Applying the method Character.toUpperCase(char) to each character produces the same result.
    Applying the method Character.toLowerCase(char) to each character produces the same result.

and lower down, compareToIgnoreCase():

Compares two strings lexicographically, ignoring case differences. This method returns an integer whose sign is that of calling compareTo with normalized versions of the strings where case differences have been eliminated by calling Character.toLowerCase(Character.toUpperCase(character)) on each character. Note that this method does not take locale into account, and will result in an unsatisfactory ordering for certain locales. The java.text package provides collators to allow locale-sensitive ordering.

In particular, the specification was careful to require double-case conversion, with uppercase first, in order to normalize all single-character oddities, while still mentioning that true Unicode collation has even more special cases that can't be decided on a character-by-character basis.
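
To make the two approaches in this thread concrete, here is a minimal sketch in C. xor_other and to_other follow the definitions in the revised patch. same_ignoring_case is a hypothetical helper, not part of dfa.c, that only transcribes the three Java conditions quoted above onto towupper and towlower. Treat it as an illustration under those assumptions rather than a proposed change.

```c
#include <ctype.h>
#include <stdbool.h>
#include <wctype.h>

/* Exclusive-OR of C's two case mappings; zero when C has no other
   case.  Same definition as xor_other in the revised patch.  */
static int
xor_other (int c)
{
  return tolower (c) ^ toupper (c);
}

/* C's other case, or C itself if it has none.  This relies on C
   being equal to tolower (C) or to toupper (C), so that one term
   cancels and the other mapping is left over.  */
static int
to_other (int c)
{
  return c ^ xor_other (c);
}

/* Hypothetical helper (not in dfa.c): the three-way test from the
   Java documentation, expressed with the C wide-character mappings.
   A and B match if they are equal, or share an uppercase mapping,
   or share a lowercase mapping.  */
static bool
same_ignoring_case (wint_t a, wint_t b)
{
  return (a == b
          || towupper (a) == towupper (b)
          || towlower (a) == towlower (b));
}
```

In the default C locale, to_other ('a') yields 'A' and to_other ('1') yields '1', with no dependence on the encoding. The cancellation only works when the character equals one of its two mappings: for a titlecase character such as L'\x1c8', with the mappings L'\x1c9' and L'\x1c7' described earlier in this thread, the same exclusive-OR gives 0x1c8 ^ 0x1c9 ^ 0x1c7 = 0x1c6, which is neither mapping. The three-way test does not depend on that premise, but it still compares one character at a time.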
-- 
Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org --DHtwK9C6L05Gj5L568pHMQlI1U995qqcX Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1 Comment: Public key at http://people.redhat.com/eblake/eblake.gpg Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBCAAGBQJS6QbZAAoJEKeha0olJ0Nq+FMIAKIin0d/XlrKVXfKzh9baZv7 Xbigg3m9wsOnxzWmmI1KlwiZ+0JklDGee7YGAZAQTnpvs3tYbbgLobr44NhcQO4r A/GKwq6pCKttUclrFcYDaZhR5Vjf0h6SWSUEXNRbCkIBjgREN9MMT3aFr/0jvcJL soh5edsum2OZfKjOF+jA4OYUuV60M64gnGnY+wEn7VoGQobFctRtjyyKFX6jezTK yGdMDaP9Hkf4NOORqw1JerYLVzaSO9YyAvl2knjdHQopV6zKc/IFilAsOjDXt9CR J+gjky0scjTAN0M/1lKchVgkdCPKLDgmUtaNVZGel+JqoWV0jzWPr60hOJo/Ae8= =Kp0G -----END PGP SIGNATURE----- --DHtwK9C6L05Gj5L568pHMQlI1U995qqcX--
From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 29 10:29:36 2014 Received: (at 16581) by debbugs.gnu.org; 29 Jan 2014 15:29:36 +0000 Received: from localhost ([127.0.0.1]:40465 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8X55-0001DT-TM for submit@debbugs.gnu.org; Wed, 29 Jan 2014 10:29:36 -0500 Received: from frenzy.freefriends.org ([66.54.153.139]:50406 helo=freefriends.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8X54-0001DK-21 for 16581@debbugs.gnu.org; Wed, 29 Jan 2014 10:29:34 -0500 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (localhost [127.0.0.1]) by freefriends.org (8.14.6/8.14.6) with ESMTP id s0TFTQ5E012266; Wed, 29 Jan 2014 08:29:26 -0700 Received: (from arnold@localhost) by freefriends.org (8.14.6/8.14.6/submit) id s0TFTQgu012265; Wed, 29 Jan 2014 15:29:26 GMT From: arnold@skeeve.com Message-Id: <201401291529.s0TFTQgu012265@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f
Date: Wed, 29 Jan 2014 08:29:26 -0700 To: eggert@cs.ucla.edu Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> In-Reply-To: User-Agent: Heirloom mailx 12.4 7/29/08 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 16581 Cc: 16581@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Aaron Crane wrote: > I don't think this works for the wide-character case. For example, .... Maybe just use the xor stuff for the single byte case and the more straightforward code for the multibyte case? Otherwise it sounds like we're asking for trouble. Also, maybe name the routines ..._other_case instead of just _other ? 
Thanks, Arnold From debbugs-submit-bounces@debbugs.gnu.org Wed Jan 29 11:57:17 2014 Received: (at 16581) by debbugs.gnu.org; 29 Jan 2014 16:57:17 +0000 Received: from localhost ([127.0.0.1]:40485 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8YRw-0003e3-O7 for submit@debbugs.gnu.org; Wed, 29 Jan 2014 11:57:17 -0500 Received: from mail-wg0-f54.google.com ([74.125.82.54]:34922) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8W0A-0007sm-Ep for 16581@debbugs.gnu.org; Wed, 29 Jan 2014 09:20:27 -0500 Received: by mail-wg0-f54.google.com with SMTP id x13so3573420wgg.33 for <16581@debbugs.gnu.org>; Wed, 29 Jan 2014 06:20:25 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc:content-type :content-transfer-encoding; bh=Oxx5DF9F/srC9cYE8/GeL5qIhHdkgaVgl+9YVOrkVZA=; b=dSLdUWuPCQFWolpJBbIFL4OZHp2mtFtRJBFTS6fep5heqXeuk1bsNd3Rri0yS3wb4W a+d8jWnFKc9+7NsmHr0jKlLHv9j9VeIBPJiih6oiFcAc1wyLPPoyS0dRtRY6/d0mq0sV c9/i9lwgz11Nbl7cC1Wy/WtVHncPShAudRYndPlRWKq2xy2/AQzkeEXXONVMrrt1Gmya r0CTCFTZwS1Su8NTtIy6VfZDQABtrDCDK0fu+AeuoHbtwwH1MDUfx60xPQ3AO9WHxm0L NMkwFuZ9Ae90SVRmCohBQ+mVc2jXC2Vb7WCiyKNhWcfU5zKRfcc0nbiEz0s1k7I18H1S qpjQ== X-Gm-Message-State: ALoCoQn58TAiazEu7Pp+V/Br8IRJVPg/RVgmURxXaDKQjaCDQpK+k0JA5/TvUlrutaC9ozgZYQ6X X-Received: by 10.194.92.164 with SMTP id cn4mr363564wjb.74.1391005225488; Wed, 29 Jan 2014 06:20:25 -0800 (PST) MIME-Version: 1.0 Received: by 10.227.226.77 with HTTP; Wed, 29 Jan 2014 06:20:10 -0800 (PST) X-Originating-IP: [87.194.157.167] In-Reply-To: <52E8A430.9000506@cs.ucla.edu> References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> From: Aaron Crane Date: Wed, 29 Jan 2014 14:20:10 +0000 X-Google-Sender-Auth: AgBIZE0UQD5fJMi9_pg5ZxPclSY Message-ID: Subject: Re: 
bug#16581: suggested code simplification in dfa.c To: Paul Eggert Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -0.7 (/) X-Debbugs-Envelope-To: 16581 X-Mailman-Approved-At: Wed, 29 Jan 2014 11:57:15 -0500 Cc: Aharon Robbins , 16581@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.7 (/)

Paul Eggert wrote:
> +/* The following functions exploit the commutativity and associativity of ^,
> +   and the fact that X ^ X is zero.  POSIX requires that C equals
> +   either tolower (C) or toupper (C); if the former, then C ^ tolower (C)
> +   is zero so C ^ xor_other (C) equals toupper (C), and similarly
> +   for the latter.  */
> +
> +/* Return the exclusive-OR of C and C's other case, or zero if C is
> +   not a letter that changes case.  */
> +
> +static wint_t
> +xor_wother (wint_t c)
> +{
> +  return towlower (c) ^ towupper (c);
> +}
[…]
> +      if (case_fold)
>          {
> +          wchar_t xor = xor_wother (wc);
> +          if (xor)
> +            {
> +              addtok_wc (wc ^ xor);
> +              addtok (OR);
> +            }

I don't think this works for the wide-character case. For example, in
a suitable locale, I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL
LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ")
under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under
towlower(). This matches the behaviour I can observe with a simple
test program under the en_GB.UTF-8 locale on both Linux and Mac OS.

Since 0x1c7 ^ 0x1c9 == 14, and 0x1c8 ^ 14 == 0x1c6, this means we'd
call addtok_wc(0x1c6), and U+01C6 is LATIN SMALL LETTER DZ WITH CARON,
which isn't a desired character.
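The arithmetic in this message can be checked with a small C sketch. The helper xor_trick below is hypothetical: it is not the xor_wother from the quoted patch, and it takes the lower-case and upper-case mappings as explicit arguments (an assumption made here so that the result does not depend on whichever locale is in effect). The code points are the ones given in the message.

```c
#include <wchar.h>

/* Hypothetical helper, not part of dfa.c: apply the XOR trick from the
   quoted patch to C, given the values that towlower (C) and
   towupper (C) would return.  Returns C unchanged when the two
   mappings agree, i.e. when the XOR is zero.  */
static wint_t
xor_trick (wint_t c, wint_t lower, wint_t upper)
{
  wint_t x = lower ^ upper;
  return x ? c ^ x : c;
}
```

With an ordinary letter the trick behaves as intended: xor_trick (0x61, 0x61, 0x41) yields 0x41, so 'a' maps to 'A'. With the titlecase letter it does not: xor_trick (0x1c8, 0x1c9, 0x1c7) yields 0x1c6, which is neither U+01C7 nor U+01C9.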
-- 
Aaron Crane ** http://aaroncrane.co.uk/
From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 30 10:38:05 2014 Received: (at 16581) by debbugs.gnu.org; 30 Jan 2014 15:38:05 +0000 Received: from localhost ([127.0.0.1]:41597 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8tgr-0006Qg-8N for submit@debbugs.gnu.org; Thu, 30 Jan 2014 10:38:05 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:53413) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8tgo-0006QK-Vd for 16581@debbugs.gnu.org; Thu, 30 Jan 2014 10:38:03 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id CD156A60002; Thu, 30 Jan 2014 07:38:01 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 4+A4yiaw0--R; Thu, 30 Jan 2014 07:38:01 -0800 (PST) Received: from [192.168.1.9] (pool-108-0-233-62.lsanca.fios.verizon.net [108.0.233.62]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 4D65EA60001; Thu, 30 Jan 2014 07:38:01 -0800 (PST) Message-ID: <52EA71D1.4000204@cs.ucla.edu> Date: Thu, 30 Jan 2014 07:37:53 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: Aaron Crane Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.7 (--) X-Debbugs-Envelope-To: 16581 Cc: Aharon Robbins , 16581@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help:
List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.7 (--) Aaron Crane wrote: > I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL > LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ") > under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under > towlower(). Ouch, thanks, I hadn't considered that. So my idea was all wrong. But this means the current code is all wrong too. I'll take a look at it. I hope I don't regret picking up this thread.... From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 30 10:52:05 2014 Received: (at 16581) by debbugs.gnu.org; 30 Jan 2014 15:52:05 +0000 Received: from localhost ([127.0.0.1]:41613 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8tuO-0006lz-Ud for submit@debbugs.gnu.org; Thu, 30 Jan 2014 10:52:05 -0500 Received: from frenzy.freefriends.org ([66.54.153.139]:33501 helo=freefriends.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8tuM-0006lr-UM for 16581@debbugs.gnu.org; Thu, 30 Jan 2014 10:52:03 -0500 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (localhost [127.0.0.1]) by freefriends.org (8.14.6/8.14.6) with ESMTP id s0UFpYtJ012935; Thu, 30 Jan 2014 08:51:35 -0700 Received: (from arnold@localhost) by freefriends.org (8.14.6/8.14.6/submit) id s0UFpY2G012934; Thu, 30 Jan 2014 15:51:34 GMT From: arnold@skeeve.com Message-Id: <201401301551.s0UFpY2G012934@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f Date: Thu, 30 Jan 2014 08:51:34 -0700 To: grep@aaroncrane.co.uk, eggert@cs.ucla.edu Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> <52EA71D1.4000204@cs.ucla.edu> In-Reply-To: <52EA71D1.4000204@cs.ucla.edu> User-Agent: Heirloom mailx 12.4 
7/29/08 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 16581 Cc: arnold@skeeve.com, 16581@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) Paul Eggert wrote: > Aaron Crane wrote: > > I'd expect U+01C8 LATIN CAPITAL LETTER L WITH SMALL > > LETTER J ("Lj", roughly) to be U+01C7 LATIN CAPITAL LETTER LJ ("LJ") > > under towupper(), and U+01C9 LATIN SMALL LETTER LJ ("lj") under > > towlower(). > > Ouch, thanks, I hadn't considered that. So my idea was all wrong. But > this means the current code is all wrong too. I'll take a look at it. I > hope I don't regret picking up this thread.... This seems to be a weird (and very much corner) case: wc != towlower(wc) and wc != towupper(wc). It can only be an issue if doing case folding, and there are only a few spots in the code that deal with case folding when compiling the dfa. I suggest starting with the XOR changes for unibyte locales - they seem (to me) to be good no matter what. And then separately try to deal with the multibyte case. And just to increase the need for Aspirin, any idea how regex handles this case? I would not be surprised if the code there also doesn't catch this. Wheeeeeeeee! 
:-) Arnold From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 30 16:56:22 2014 Received: (at 16581) by debbugs.gnu.org; 30 Jan 2014 21:56:22 +0000 Received: from localhost ([127.0.0.1]:41811 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8zaw-0008Qd-6i for submit@debbugs.gnu.org; Thu, 30 Jan 2014 16:56:22 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:48034) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W8zau-0008QS-Ab for 16581@debbugs.gnu.org; Thu, 30 Jan 2014 16:56:21 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 6C93439E8019; Thu, 30 Jan 2014 13:56:19 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id oUjhVgc1Jcad; Thu, 30 Jan 2014 13:56:18 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 5B8C939E8013; Thu, 30 Jan 2014 13:56:18 -0800 (PST) Message-ID: <52EACA7A.6060004@cs.ucla.edu> Date: Thu, 30 Jan 2014 13:56:10 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0 MIME-Version: 1.0 To: arnold@skeeve.com, grep@aaroncrane.co.uk Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> <52EA71D1.4000204@cs.ucla.edu> <201401301551.s0UFpY2G012934@freefriends.org> In-Reply-To: <201401301551.s0UFpY2G012934@freefriends.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.7 (--) X-Debbugs-Envelope-To: 16581 Cc: 16581@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 
Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.7 (--) On 01/30/2014 07:51 AM, arnold@skeeve.com wrote: > I suggest starting with the XOR changes for unibyte locales - they seem > (to me) to be good no matter what. And then separately try to deal with > the multibyte case. Unfortunately the changes don't work even for unibyte locales, since unibyte locales can have the same problem, i.e., c != tolower (c) && c != toupper (c). Admittedly this is rare, but it's possible (users can define their own locales, after all), and fixing the code for the multibyte case will induce similar changes for unibyte, I hope. > > And just to increase the need for Aspirin, any idea how regex handles > this case? No idea, sorry. Rounds of aspirin for everybody! From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 31 04:20:58 2014 Received: (at 16581) by debbugs.gnu.org; 31 Jan 2014 09:20:58 +0000 Received: from localhost ([127.0.0.1]:42354 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W9AHR-0006VL-QZ for submit@debbugs.gnu.org; Fri, 31 Jan 2014 04:20:57 -0500 Received: from mxout4.netvision.net.il ([194.90.9.27]:52745) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1W9AHP-0006VA-W3 for 16581@debbugs.gnu.org; Fri, 31 Jan 2014 04:20:57 -0500 MIME-version: 1.0 Content-transfer-encoding: 7BIT Content-type: text/plain; CHARSET=US-ASCII Received: from skeeve.com ([93.172.51.72]) by mxout4.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTP id <0N09007GJDYU52J1@mxout4.netvision.net.il> for 16581@debbugs.gnu.org; Fri, 31 Jan 2014 11:20:54 +0200 (IST) Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.14.4/8.14.4/Debian-2ubuntu2.1) with ESMTP id s0V9KrV0004136; Fri, 31 Jan 2014 11:20:53 +0200 Received: (from 
arnold@localhost) by skeeve.com (8.14.4/8.14.4/Submit) id s0V9Kqlb004134; Fri, 31 Jan 2014 11:20:52 +0200 From: Aharon Robbins Message-id: <201401310920.s0V9Kqlb004134@skeeve.com> Date: Fri, 31 Jan 2014 11:20:52 +0200 To: grep@aaroncrane.co.uk, eggert@cs.ucla.edu Subject: Re: bug#16581: suggested code simplification in dfa.c References: <201401282011.s0SKBE19008493@skeeve.com> <52E8263C.6050206@cs.ucla.edu> <201401290250.s0T2olcN002395@skeeve.com> <52E8A430.9000506@cs.ucla.edu> <52EA71D1.4000204@cs.ucla.edu> <201401301551.s0UFpY2G012934@freefriends.org> <52EACA7A.6060004@cs.ucla.edu> In-reply-to: <52EACA7A.6060004@cs.ucla.edu> User-Agent: Heirloom mailx 12.5 6/20/10 X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 16581 Cc: 16581@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) > > I suggest starting with the XOR changes for unibyte locales - they seem > > (to me) to be good no matter what. And then separately try to deal with > > the multibyte case. > > Unfortunately the changes don't work even for unibyte locales, since > unibyte locales can have the same problem, i.e., c != tolower (c) && c > != toupper (c). Admittedly this is rare, but it's possible (users can > define their own locales, after all), and fixing the code for the > multibyte case will induce similar changes for unibyte, I hope. I see. OK. > > And just to increase the need for Aspirin, any idea how regex handles > > this case? > > No idea, sorry. Rounds of aspirin for everybody! Well, I am comforted that this has not been a big issue in practice (yet!). If I get ambitious I will try to look at it; chances are that you, Paul, will beat me to it. It's been an interesting discussion, anyway. 
:-) Thanks, Arnold From debbugs-submit-bounces@debbugs.gnu.org Sat Mar 08 13:16:49 2014 Received: (at 16581-done) by debbugs.gnu.org; 8 Mar 2014 18:16:49 +0000 Received: from localhost ([127.0.0.1]:56876 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WMLnl-0004T0-7p for submit@debbugs.gnu.org; Sat, 08 Mar 2014 13:16:49 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:39513) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1WMLnj-0004Sr-4w for 16581-done@debbugs.gnu.org; Sat, 08 Mar 2014 13:16:47 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 9F4C439E8018 for <16581-done@debbugs.gnu.org>; Sat, 8 Mar 2014 10:16:46 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id RjpSDAWSzeI2 for <16581-done@debbugs.gnu.org>; Sat, 8 Mar 2014 10:16:46 -0800 (PST) Received: from [192.168.1.9] (pool-108-0-233-62.lsanca.fios.verizon.net [108.0.233.62]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 50B6439E8013 for <16581-done@debbugs.gnu.org>; Sat, 8 Mar 2014 10:16:46 -0800 (PST) Message-ID: <531B5E8D.1060701@cs.ucla.edu> Date: Sat, 08 Mar 2014 10:16:45 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.3.0 MIME-Version: 1.0 To: 16581-done@debbugs.gnu.org Subject: Re: suggested code simplification in dfa.c Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 16581-done X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) This particular 
issue seems to have been put to bed in the savannah git master so I'm marking it as done. From unknown Sat Jun 21 10:17:43 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Sun, 06 Apr 2014 11:24:05 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator