From unknown Sun Jun 22 00:29:09 2025 X-Loop: help-debbugs@gnu.org Subject: bug#23358: merging byte to wide char caches in gawk Resent-From: Aharon Robbins Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Sun, 24 Apr 2016 16:42:02 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: report 23358 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 23358@debbugs.gnu.org X-Debbugs-Original-To: bug-grep@gnu.org Received: via spool by submit@debbugs.gnu.org id=B.14615160833457 (code B ref -1); Sun, 24 Apr 2016 16:42:02 +0000 Received: (at submit) by debbugs.gnu.org; 24 Apr 2016 16:41:23 +0000 Received: from localhost ([127.0.0.1]:45177 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1auN62-0000th-TF for submit@debbugs.gnu.org; Sun, 24 Apr 2016 12:41:23 -0400 Received: from eggs.gnu.org ([208.118.235.92]:45633) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1auN61-0000tU-C0 for submit@debbugs.gnu.org; Sun, 24 Apr 2016 12:41:21 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1auN5u-0007wP-VQ for submit@debbugs.gnu.org; Sun, 24 Apr 2016 12:41:16 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,T_MANY_HDRS_LCASE autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:59259) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1auN5u-0007wJ-SX for submit@debbugs.gnu.org; Sun, 24 Apr 2016 12:41:14 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36597) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1auN5t-0001uu-Gc for bug-grep@gnu.org; Sun, 24 Apr 2016 12:41:14 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1auN5q-0007vo-AG for bug-grep@gnu.org; Sun, 24 Apr 2016 12:41:13 -0400 Received: from mxout5.netvision.net.il 
([194.90.6.65]:40363) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1auN5p-0007vd-TA for bug-grep@gnu.org; Sun, 24 Apr 2016 12:41:10 -0400 MIME-version: 1.0 Content-type: multipart/mixed; boundary="Boundary_(ID_bJdeIFle2/7C+N6tbR1TXA)" Received: from skeeve.com ([93.173.176.204]) by mxout5.netvision.net.il (Oracle Communications Messaging Server 7u4-24.01(7.0.4.24.0) 64bit (built Nov 17 2011)) with ESMTPSA id <0O650014VD09AM00@mxout5.netvision.net.il> for bug-grep@gnu.org; Sun, 24 Apr 2016 19:40:58 +0300 (IDT) Received: from skeeve.com (skeeve.com [127.0.0.1]) by skeeve.com (8.15.2/8.15.2/Debian-3) with ESMTP id u3OGeu9E006002 for ; Sun, 24 Apr 2016 19:40:56 +0300 Received: (from arnold@localhost) by skeeve.com (8.15.2/8.15.2/Submit) id u3OGet68006001 for bug-grep@gnu.org; Sun, 24 Apr 2016 19:40:55 +0300 From: Aharon Robbins Message-id: <201604241640.u3OGet68006001@skeeve.com> Date: Sun, 24 Apr 2016 19:40:55 +0300 User-Agent: s-nail v14.8.6 X-detected-operating-system: by eggs.gnu.org: Solaris 10 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -3.4 (---) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.4 (---)

This is a multi-part message in MIME format.

--Boundary_(ID_bJdeIFle2/7C+N6tbR1TXA)
Content-type: text/plain; CHARSET=US-ASCII
Content-transfer-encoding: 7BIT
Content-disposition: inline

Hi.

Here is my proposed patch for merging the byte to w.c. caches in gawk
by using the one in dfa.

I renamed the one in dfa to 'btowc_cache' since it caches bytes,
not multibyte characters. This compiles and gets through the test
suite.

I also changed the check for the return of mbrtowc since it returns
unsigned.

Thanks,

Arnold

--Boundary_(ID_bJdeIFle2/7C+N6tbR1TXA)
Content-type: text/x-diff; NAME=dfa.diff; CHARSET=US-ASCII
Content-transfer-encoding: 7BIT
Content-disposition: attachment; filename=dfa.diff

diff --git a/awk.h b/awk.h
index 86c8883..636be96 100644
--- a/awk.h
+++ b/awk.h
@@ -1591,10 +1591,6 @@ extern const wchar_t *wcasestrstr(const wchar_t *haystack, size_t hs_len,
 		const wchar_t *needle, size_t needle_len);
 extern void r_free_wstr(NODE *n);
 #define free_wstr(n) do { if ((n)->flags & WSTRCUR) r_free_wstr(n); } while(0)
-extern wint_t btowc_cache[];
-#define btowc_cache(x) btowc_cache[(x)&0xFF]
-extern void init_btowc_cache();
-#define is_valid_character(b) (btowc_cache[(b)&0xFF] != WEOF)
 /* re.c */
 extern Regexp *make_regexp(const char *s, size_t len, bool ignorecase, bool dfa, bool canfatal);
 extern int research(Regexp *rp, char *str, int start, size_t len, int flags);
diff --git a/dfa.c b/dfa.c
index fff4599..a2c73b1 100644
--- a/dfa.c
+++ b/dfa.c
@@ -464,10 +464,10 @@ static void regexp (void);
 /* A table indexed by byte values that contains the corresponding wide
    character (if any) for that byte. WEOF means the byte is not a valid
    single-byte character. */
-static wint_t mbrtowc_cache[NOTCHAR];
+wint_t btowc_cache[NOTCHAR];
 
 /* Store into *PWC the result of converting the leading bytes of the
-   multibyte buffer S of length N bytes, using the mbrtowc_cache in *D
+   multibyte buffer S of length N bytes, using the btowc_cache in *D
    and updating the conversion state in *D. On conversion error,
    convert just a single byte, to WEOF. Return the number of bytes
    converted.
@@ -476,7 +476,7 @@ static wint_t mbrtowc_cache[NOTCHAR];
 
    * PWC points to wint_t, not to wchar_t.
    * The last arg is a dfa *D instead of merely a multibyte conversion
-     state D->mbs. D also contains an mbrtowc_cache for speed.
+     state D->mbs. D also contains an btowc_cache for speed.
    * N must be at least 1.
    * S[N - 1] must be a sentinel byte.
    * Shift encodings are not supported.
@@ -487,7 +487,7 @@ static size_t
 mbs_to_wchar (wint_t *pwc, char const *s, size_t n, struct dfa *d)
 {
   unsigned char uc = s[0];
-  wint_t wc = mbrtowc_cache[uc];
+  wint_t wc = btowc_cache[uc];
 
   if (wc == WEOF)
     {
@@ -695,7 +695,7 @@ static charclass newline;
 static bool
 unibyte_word_constituent (unsigned char c)
 {
-  return mbrtowc_cache[c] != WEOF && (isalnum (c) || (c) == '_');
+  return btowc_cache[c] != WEOF && (isalnum (c) || (c) == '_');
 }
 
 static int
@@ -718,25 +718,44 @@ wchar_context (wint_t wc)
   return CTX_NONE;
 }
 
+void init_btowc_cache(void)
+{
+  static bool inited = false;
+  int i;
+
+  if (inited)
+    return;
+
+  for (i = CHAR_MIN; i <= CHAR_MAX; ++i)
+    {
+      char c = i;
+      unsigned char uc = i;
+      mbstate_t s = { 0 };
+      wchar_t wc;
+      size_t ret = mbrtowc (&wc, &c, 1, &s);
+      btowc_cache[uc] = (ret == (size_t)-1 || ret == (size_t) -2) ? WEOF : wc;
+    }
+
+  inited = true;
+}
+
 /* Entry point to set syntax options. */
 void
 dfasyntax (reg_syntax_t bits, int fold, unsigned char eol)
 {
   int i;
+
   syntax_bits_set = 1;
   syntax_bits = bits;
   case_fold = fold != 0;
   eolbyte = eol;
 
+  init_btowc_cache();
+  /* Now that btowc_cache[uc] is set, use it to calculate sbit. */
   for (i = CHAR_MIN; i <= CHAR_MAX; ++i)
     {
-      char c = i;
       unsigned char uc = i;
-      mbstate_t s = { 0 };
-      wchar_t wc;
-      mbrtowc_cache[uc] = mbrtowc (&wc, &c, 1, &s) <= 1 ? wc : WEOF;
 
-      /* Now that mbrtowc_cache[uc] is set, use it to calculate sbit. */
       sbit[uc] = char_context (uc);
       switch (sbit[uc])
         {
diff --git a/dfa.h b/dfa.h
index 18be7f5..f2dd656 100644
--- a/dfa.h
+++ b/dfa.h
@@ -120,4 +120,15 @@ extern void dfawarn (const char *);
    The user must supply a dfaerror. */
 extern _Noreturn void dfaerror (const char *);
 
+/* General support routines. */
+
+/* using_utf8() lets us know if our locale is one based on UTF-8. */
 extern int using_utf8 (void);
+
+/* init_mbcache() initializes the cache that maps bytes to m.b. characters. */
+extern void init_btowc_cache(void);
+
+/* is_valid_character() tells us if a byte is also a valid m.b. character. */
+extern wint_t btowc_cache[];
+#define is_valid_character(byte) (btowc_cache[(byte)&0xFF] != WEOF)
+#define btowc_cache(x) btowc_cache[(x)&0xFF]
diff --git a/node.c b/node.c
index a7c19db..22119d2 100644
--- a/node.c
+++ b/node.c
@@ -949,19 +949,6 @@ get_ieee_magic_val(const char *val)
 	return v;
 }
 
-wint_t btowc_cache[256];
-
-/* init_btowc_cache --- initialize the cache */
-
-void init_btowc_cache()
-{
-	int i;
-
-	for (i = 0; i < 255; i++) {
-		btowc_cache[i] = btowc(i);
-	}
-}
-
 #define BLOCKCHUNK 100
 
 BLOCK nextfree[BLOCK_MAX] = {

--Boundary_(ID_bJdeIFle2/7C+N6tbR1TXA)--

From unknown Sun Jun 22 00:29:09 2025 X-Loop: help-debbugs@gnu.org Subject: bug#23358: merging byte to wide char caches in gawk Resent-From: arnold@skeeve.com Original-Sender: "Debbugs-submit" Resent-CC: bug-grep@gnu.org Resent-Date: Tue, 26 Apr 2016 07:18:01 +0000 Resent-Message-ID: Resent-Sender: help-debbugs@gnu.org X-GNU-PR-Message: followup 23358 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: arnold@skeeve.com, 23358@debbugs.gnu.org Received: via spool by 23358-submit@debbugs.gnu.org id=B23358.14616550396852 (code B ref 23358); Tue, 26 Apr 2016 07:18:01 +0000 Received: (at 23358) by debbugs.gnu.org; 26 Apr 2016 07:17:19 +0000 Received: from localhost ([127.0.0.1]:47415 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1auxFG-0001mR-V9 for submit@debbugs.gnu.org; Tue, 26 Apr 2016 03:17:19 -0400 Received: from freefriends.org ([96.88.95.60]:43974) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1auxFF-0001mK-Hs for 23358@debbugs.gnu.org; Tue, 26 Apr 2016 03:17:18 -0400 X-Envelope-From: arnold@skeeve.com Received: from freefriends.org (localhost [127.0.0.1]) by freefriends.org (8.14.9/8.14.9) with ESMTP id u3Q7CD7V009881; Tue, 26 Apr 2016 01:12:13 -0600 Received: (from arnold@localhost) by freefriends.org
(8.14.9/8.14.9/submit) id u3Q7CDuS009880; Tue, 26 Apr 2016 07:12:13 GMT From: arnold@skeeve.com Message-Id: <201604260712.u3Q7CDuS009880@freefriends.org> X-Authentication-Warning: frenzy.freefriends.org: arnold set sender to arnold@skeeve.com using -f Date: Tue, 26 Apr 2016 01:12:13 -0600 References: <201604241640.u3OGet68006001@skeeve.com> In-Reply-To: <201604241640.u3OGet68006001@skeeve.com> User-Agent: Heirloom mailx 12.4 7/29/08 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

The silence in response to this has been thundering. :-(

Ignoring the gawk bits, is the grep team willing to incorporate
the dfa.[ch] changes? Should this wait until after other pending
changes to dfa are applied?

Thanks,

Arnold

Aharon Robbins wrote:

> Hi.
>
> Here is my proposed patch for merging the byte to w.c. caches in gawk
> by using the one in dfa.
>
> I renamed the one in dfa to 'btowc_cache' since it caches bytes,
> not multibyte characters. This compiles and gets through the test
> suite.
>
> I also changed the check for the return of mbrtowc since it returns
> unsigned.
>
> Thanks,
>
> Arnold

From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 02 17:37:10 2016 Received: (at control) by debbugs.gnu.org; 2 Sep 2016 21:37:10 +0000 Received: from localhost ([127.0.0.1]:48050 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfw98-0006im-4b for submit@debbugs.gnu.org; Fri, 02 Sep 2016 17:37:10 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:58690) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfw96-0006iZ-Ok for control@debbugs.gnu.org; Fri, 02 Sep 2016 17:37:09 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1CE191613D8 for ; Fri, 2 Sep 2016 14:37:02 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 2bRjY1sIlU1E for ; Fri, 2 Sep 2016 14:37:01 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 7A0B81613E1 for ; Fri, 2 Sep 2016 14:37:01 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id gTuM_7W4pQiP for ; Fri, 2 Sep 2016 14:37:01 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 6026E1613D8 for ; Fri, 2 Sep 2016 14:37:01 -0700 (PDT) To: control@debbugs.gnu.org From: Paul Eggert Subject: 23358 has a patch Organization: UCLA Computer Science Department Message-ID: <4628530c-4ce2-1d27-804e-1c181be07f78@cs.ucla.edu> Date: Fri, 2 Sep 2016 14:37:01 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18
Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.5 (-) tags 23358 + patch From unknown Sun Jun 22 00:29:09 2025 MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) X-Loop: help-debbugs@gnu.org From: help-debbugs@gnu.org (GNU bug Tracking System) To: Aharon Robbins Subject: bug#23358: closed (Re: merging byte to wide char caches in gawk) Message-ID: References: <83a7725f-60b0-089a-5198-8f9712d745f0@cs.ucla.edu> <201604241640.u3OGet68006001@skeeve.com> X-Gnu-PR-Message: they-closed 23358 X-Gnu-PR-Package: grep X-Gnu-PR-Keywords: patch Reply-To: 23358@debbugs.gnu.org Date: Fri, 02 Sep 2016 22:48:01 +0000 Content-Type: multipart/mixed; boundary="----------=_1472856481-6759-1" This is a multi-part message in MIME format... ------------=_1472856481-6759-1 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Your bug report #23358: merging byte to wide char caches in gawk which was filed against the grep package, has been closed. The explanation is attached below, along with your original report. If you require more details, please reply to 23358@debbugs.gnu.org. 
--=20 23358: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D23358 GNU Bug Tracking System Contact help-debbugs@gnu.org with problems ------------=_1472856481-6759-1 Content-Type: message/rfc822 Content-Disposition: inline Content-Transfer-Encoding: 7bit Received: (at 23358-done) by debbugs.gnu.org; 2 Sep 2016 22:47:01 +0000 Received: from localhost ([127.0.0.1]:48064 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfxEj-0001jN-7h for submit@debbugs.gnu.org; Fri, 02 Sep 2016 18:47:01 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35079) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1bfxEh-0001j4-R4 for 23358-done@debbugs.gnu.org; Fri, 02 Sep 2016 18:47:00 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 555921613D8; Fri, 2 Sep 2016 15:46:54 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id z8wDR2g_52gJ; Fri, 2 Sep 2016 15:46:53 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id A974A1613DE; Fri, 2 Sep 2016 15:46:53 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id v9yTZ3WltULP; Fri, 2 Sep 2016 15:46:53 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 8D1941613D8; Fri, 2 Sep 2016 15:46:53 -0700 (PDT) To: Aharon Robbins From: Paul Eggert Subject: Re: merging byte to wide char caches in gawk Organization: UCLA Computer Science Department Message-ID: <83a7725f-60b0-089a-5198-8f9712d745f0@cs.ucla.edu> Date: Fri, 2 Sep 2016 15:46:53 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.2.0 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8; format=flowed 
Content-Transfer-Encoding: 7bit X-Spam-Score: -1.5 (-) X-Debbugs-Envelope-To: 23358-done Cc: 23358-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.5 (-)

> The silence in response to this has been thundering. :-(

Yes, it was pretty quiet.... I think we just now finally got around to
incorporating all the ideas behind the patch in Bug#23358, albeit in a
different way, so I'm boldly closing the bug report.

------------=_1472856481-6759-1--
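
For readers following the thread, here is a minimal sketch of the technique the patch describes: a table indexed by byte value whose entries hold the corresponding wide character, or WEOF when the byte is not a valid single-byte character in the current locale. It is a standalone illustration under stated assumptions, not the code that ended up in grep or gawk. The names byte_cache, byte_cache_init and byte_is_valid_char are hypothetical and were chosen for this example only. Because mbrtowc returns a size_t, the sketch tests for the two error values (size_t) -1 and (size_t) -2 explicitly, which appears to be the point of the changed check in the patch.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical names for illustration; none of these identifiers
   come from gawk or grep.  */
enum { NBYTES = UCHAR_MAX + 1 };

static wint_t byte_cache[NBYTES];

/* Fill the table once; later calls return immediately, mirroring the
   'inited' guard in the posted patch.  */
static void
byte_cache_init (void)
{
  static bool inited = false;

  if (inited)
    return;

  for (int i = 0; i < NBYTES; i++)
    {
      unsigned char uc = (unsigned char) i;
      mbstate_t s;
      wchar_t wc = 0;

      /* Start each conversion from the initial shift state.  */
      memset (&s, 0, sizeof s);
      size_t ret = mbrtowc (&wc, (char const *) &uc, 1, &s);

      /* mbrtowc reports an invalid sequence as (size_t) -1 and an
         incomplete one as (size_t) -2; both are large unsigned
         values, so they are compared against explicitly.  */
      byte_cache[i] = (ret == (size_t) -1 || ret == (size_t) -2
                       ? WEOF : (wint_t) wc);
    }

  inited = true;
}

/* True if byte B is by itself a complete character in the locale that
   was active when the table was filled.  */
static bool
byte_is_valid_char (unsigned char b)
{
  return byte_cache[b] != WEOF;
}
```

A caller would typically run setlocale (LC_ALL, "") first, call byte_cache_init () once, and then test bytes with byte_is_valid_char (). The table reflects whichever locale was active on the first call, so this sketch does not handle a later locale change.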