From debbugs-submit-bounces@debbugs.gnu.org Mon Oct 20 11:04:42 2014
Date: Tue, 21 Oct 2014 00:04:02 +0900
From: Norihiro Tanaka
To: bug-grep@gnu.org
Subject: [PATCH] dfa: improvement for checking of multibyte character boundary
Message-Id: <20141021000401.1015.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------_5445222A000000001031_MULTIPART_MIXED_"
Content-Transfer-Encoding: 7bit
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit"

--------_5445222A000000001031_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

This patch improves performance for input strings that don't match
even the first part of a pattern. Although it is less effective for
grep, which uses a superset of the DFA, gawk speeds up by about 40%.

$ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k

(before)
real 2.85
user 2.79
sys 0.05

(after)
real 1.70
user 1.64
sys 0.06

I think that this improvement should have been made in bug#17576.
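For illustration only, and not part of the attached patch: a minimal C
sketch of the fast path described above. The names toy_char_len and
skip_remains are hypothetical stand-ins (the real function is
skip_remains_mb in src/dfa.c), and the two-byte encoding is a made-up
simplification rather than real EUC-JP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real multibyte decoder: return the
   byte length of the character starting at S, in a toy encoding where
   any byte >= 0x80 starts a two-byte character.  */
static size_t
toy_char_len (unsigned char const *s, unsigned char const *end)
{
  if (*s >= 0x80 && end - s >= 2)
    return 2;
  return 1;
}

/* Return the first character boundary at or after P, scanning forward
   from the known boundary MBP.  EOLBYTE is assumed never to occur
   inside a multibyte character, so when *P is EOLBYTE the scan is
   skipped entirely.  */
static unsigned char const *
skip_remains (unsigned char const *p, unsigned char const *mbp,
              unsigned char const *end, unsigned char eolbyte)
{
  assert (mbp <= p && p < end);
  if (*p == eolbyte)
    return p;                   /* the fast path: no rescan needed */
  while (mbp < p)
    mbp += toy_char_len (mbp, end);
  return mbp;
}
```

Without the early return, every call walks the buffer from MBP up to P
one character at a time, even when P already points at an end-of-line
byte.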
--------_5445222A000000001031_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Disposition: attachment; filename="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Transfer-Encoding: base64 RnJvbSAyY2YyNGE0ZTA4NGM4NzNmN2FlM2YxODQyNTFiOGRjYTFhNTVlODUxIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDIwIE9jdCAyMDE0IDIzOjIwOjE1ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBpbXByb3ZlbWVudCBmb3IgY2hlY2tpbmcgb2YgbXVsdGlieXRlIGNoYXJhY3RlciBib3VuZGFy eQoKV2hlbiBmb3VuZCBuZXdsaW5lLCB3ZSBjYW4gc2tpcCBjaGVjayBvZiBhIG11bHRpYnl0ZSBj aGFyYWN0ZXIgYm91bmRhcnkKYmVmb3JlIHRoZSBjaGFyYWN0ZXIsIGFzIHdlIGFzc3VtZSBuZXds aW5lIGFzIGEgc2luZ2xlIGJ5dGUgY2hhcmFjdGVyLgpieSB0aGF0LgoKVGhlIGltcHJvdmVtZW50 IHNwZWVkcyB1cCBhYm91dCA0MCUgZm9yIGlucHV0IHN0cmluZyB3aGljaCBkb2Vzbid0IG1hdGNo CmV2ZW4gdGhlIGZpcnN0IHBhcnQgb2YgYSBwYXR0ZXJuLgoKKiBzcmMvZGZhLmMgKHNraXBfcmVt YWluc19tYik6IElmIGFuIGlucHV0IGNoYXJhY3RlciBpcyBuZXdsaW5lLCBza2lwCmNoZWNraW5n IGZvciBtdWx0aWJ5dGUgY2hhcmFjdGVyIGJvdW5kYXJ5IHVudGlsIHRoZXJlLgotLS0KIHNyYy9k ZmEuYyB8IDIgKysKIDEgZmlsZSBjaGFuZ2VkLCAyIGluc2VydGlvbnMoKykKCmRpZmYgLS1naXQg YS9zcmMvZGZhLmMgYi9zcmMvZGZhLmMKaW5kZXggNThhNGI4My4uYjlmMDY1ZiAxMDA2NDQKLS0t IGEvc3JjL2RmYS5jCisrKyBiL3NyYy9kZmEuYwpAQCAtMzI1Miw2ICszMjUyLDggQEAgc2tpcF9y ZW1haW5zX21iIChzdHJ1Y3QgZGZhICpkLCB1bnNpZ25lZCBjaGFyIGNvbnN0ICpwLAogICAgICAg ICAgICAgICAgICB1bnNpZ25lZCBjaGFyIGNvbnN0ICptYnAsIGNoYXIgY29uc3QgKmVuZCkKIHsK ICAgd2ludF90IHdjOworICBpZiAoKnAgPT0gZW9sYnl0ZSkKKyAgICByZXR1cm4gcDsKICAgd2hp bGUgKG1icCA8IHApCiAgICAgbWJwICs9IG1ic190b193Y2hhciAoJndjLCAoY2hhciBjb25zdCAq KSBtYnAsCiAgICAgICAgICAgICAgICAgICAgICAgICAgZW5kIC0gKGNoYXIgY29uc3QgKikgbWJw LCBkKTsKLS0gCjIuMS4xCgo= --------_5445222A000000001031_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Mon Oct 20 11:21:15 2014 Received: (at 18777) by debbugs.gnu.org; 20 Oct 2014 15:21:15 +0000 Received: from 
localhost ([127.0.0.1]:57408 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1XgElm-0005z3-7u for submit@debbugs.gnu.org; Mon, 20 Oct 2014 11:21:14 -0400
Date: Tue, 21 Oct 2014 00:21:02 +0900
From: Norihiro Tanaka
To: Norihiro Tanaka
Cc: 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
In-Reply-To: <20141021000401.1015.27F6AC2D@kcn.ne.jp>
References: <20141021000401.1015.27F6AC2D@kcn.ne.jp>
Message-Id: <20141021002102.BF94.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Norihiro Tanaka wrote:
> $ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k

The file `k' was created as follows.

$ yes `printf '%040d' 0` | head -10000000 >../k

From debbugs-submit-bounces@debbugs.gnu.org Mon Oct 20 12:07:27 2014
Message-ID: <54453338.6090406@redhat.com>
Date: Mon, 20 Oct 2014 10:07:20 -0600
From: Eric Blake
Organization: Red Hat, Inc.
MIME-Version: 1.0
To: Norihiro Tanaka, 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
References: <20141021000401.1015.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20141021000401.1015.27F6AC2D@kcn.ne.jp>
Content-Type: text/plain; charset=utf-8

On 10/20/2014 09:04 AM, Norihiro Tanaka wrote:
> This patch improves performance for input strings that don't match
> even the first part of a pattern. Although it is less effective for
> grep, which uses a superset of the DFA, gawk speeds up by about 40%.
>
>
> When found newline, we can skip check of a multibyte character boundary
> before the character, as we assume newline as a single byte character.
> by that.

POSIX requires that NUL, slash, dot, newline, and carriage return all be
single bytes that cannot occur inside a multibyte character (because
they have special meaning to file name resolution and/or terminal
interaction); it added this requirement fairly recently, but only after
confirming that common existing locales satisfy this constraint. (The
same is not true for most any other character; even though POSIX
requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
those characters from also being bytes embedded within multibyte
characters). Is it worth extending your optimization to all five of the
POSIX-guaranteed single byte characters?

-- 
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

From debbugs-submit-bounces@debbugs.gnu.org Mon Oct 20 19:09:38 2014
Date: Tue, 21 Oct 2014 08:09:24 +0900
From: Norihiro Tanaka
To: Eric Blake
Cc: 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
In-Reply-To: <54453338.6090406@redhat.com>
References: <20141021000401.1015.27F6AC2D@kcn.ne.jp> <54453338.6090406@redhat.com>
Message-Id: <20141021080924.B38E.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Eric Blake wrote:
> Is it worth extending your optimization to all five of the
> POSIX-guaranteed single byte characters?

Thanks, but I don't want to do that immediately. DFA already treats
newline as a single-byte character, but not yet the others. So we may
need to make many changes to handle invalid locales and sequences that
do not conform to the rule. If we omitted that, it might restrict the
locales to which DFA can be applied. Therefore, it should be done
carefully.
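To illustrate why only a handful of bytes can safely be treated this way
(a sketch, not code from grep or gawk): in Shift_JIS the second byte of
a two-byte character is commonly documented as lying in 0x40..0x7E or
0x80..0xFC. That range overlaps ASCII letters and the backslash, but
none of the five bytes Eric listed. The helper names below are
hypothetical, and the byte ranges are an assumption stated in the
comments.

```c
#include <stdbool.h>

/* Return true if byte B can appear as the second byte of a two-byte
   Shift_JIS character (assumed ranges: 0x40..0x7E and 0x80..0xFC).  */
static bool
sjis_can_be_trail (unsigned char b)
{
  return (0x40 <= b && b <= 0x7E) || (0x80 <= b && b <= 0xFC);
}

/* Return true if none of the five bytes that POSIX singles out can be
   a Shift_JIS trail byte under the assumption above.  */
static bool
posix_bytes_are_safe (void)
{
  static unsigned char const special[] = { '\0', '\n', '\r', '.', '/' };
  for (unsigned i = 0; i < sizeof special / sizeof special[0]; i++)
    if (sjis_can_be_trail (special[i]))
      return false;
  return true;
}
```

Under these assumptions a byte such as 'a' or '\\' found in the input
may be the tail of a two-byte character, so it cannot be used as a
shortcut for finding a character boundary.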
From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 21 02:23:12 2014
From: arnold@skeeve.com
Message-Id: <201410210623.s9L6N7QQ025901@freefriends.org>
Date: Tue, 21 Oct 2014 00:23:07 -0600
To: noritnk@kcn.ne.jp, eblake@redhat.com
Cc: 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
References: <20141021000401.1015.27F6AC2D@kcn.ne.jp> <54453338.6090406@redhat.com> <20141021080924.B38E.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20141021080924.B38E.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Norihiro Tanaka wrote:

> Eric Blake wrote:
> > Is it worth extending your optimization to all five of the
> > POSIX-guaranteed single byte characters?
>
> Thanks, but I don't want to do that immediately. DFA already treats
> newline as a single-byte character, but not yet the others. So we may
> need to make many changes to handle invalid locales and sequences that
> do not conform to the rule. If we omitted that, it might restrict the
> locales to which DFA can be applied. Therefore, it should be done
> carefully.

I would think adding a check for '\r' would be safe and would help
too; given that on Windows systems '\r' generally occurs just as
frequently as '\n', it should give a nice speedup for gawk on those
systems.

The other characters that Eric cited seem less like a big issue to me.

Thanks,

Arnold

From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 21 09:25:35 2014
Date: Tue, 21 Oct 2014 22:25:21 +0900
From: Norihiro Tanaka
To: arnold@skeeve.com
Cc: eblake@redhat.com, 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
In-Reply-To: <201410210623.s9L6N7QQ025901@freefriends.org>
References: <20141021080924.B38E.27F6AC2D@kcn.ne.jp> <201410210623.s9L6N7QQ025901@freefriends.org>
Message-Id: <20141021222519.74AB.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

arnold@skeeve.com wrote:
> I would think adding a check for '\r' would be safe and would help
> too; given that on Windows systems '\r' generally occurs just as
> frequently as '\n', it should give a nice speedup for gawk on those
> systems.

As far as I know, DFA and regex don't support multiple eolbytes such
as CR-LF, so I can't see where we could use the change. Grep converts
Windows text to Unix text by removing CR in advance.

BTW, although I say `newline', note that strictly speaking it is
`eolbyte', which may be either LF or NUL.
From debbugs-submit-bounces@debbugs.gnu.org Tue Oct 21 10:53:37 2014
From: arnold@skeeve.com
Message-Id: <201410211443.s9LEhlH7008690@freefriends.org>
Date: Tue, 21 Oct 2014 08:43:47 -0600
To: noritnk@kcn.ne.jp, arnold@skeeve.com
Cc: eblake@redhat.com, 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
References: <20141021080924.B38E.27F6AC2D@kcn.ne.jp> <201410210623.s9L6N7QQ025901@freefriends.org> <20141021222519.74AB.27F6AC2D@kcn.ne.jp>
In-Reply-To: <20141021222519.74AB.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi.

Norihiro Tanaka wrote:

> arnold@skeeve.com wrote:
> > I would think adding a check for '\r' would be safe and would help
> > too; given that on Windows systems '\r' generally occurs just as
> > frequently as '\n', it should give a nice speedup for gawk on those
> > systems.
>
> As far as I know, DFA and regex don't support multiple eolbytes such
> as CR-LF, so I can't see where we could use the change. Grep converts
> Windows text to Unix text by removing CR in advance.

Gawk does not remove CR in advance, unless someone specifically
set RS = "\r\n", in which case the full regex matcher is used
to first find \r\n in the raw input buffer.

So for gawk, adding a check for (c == eolbyte || c == '\r')
should produce more speedup on Windows.

(Hmm, on Windows the default is probably text mode which causes
the library/OS to hide the \r anyway. Harumph. But if binary mode
was requested then it could still make a difference.)

> BTW, although I say `newline', note that strictly speaking it is
> `eolbyte', which may be either LF or NUL.

Understood and agreed.

Adding a check for \r isn't a big deal in any case, but of the
5 characters Eric mentioned originally, that is the only one where
I see a potential for a check to really make a difference.

Thanks!

Arnold

From debbugs-submit-bounces@debbugs.gnu.org Wed Oct 22 11:28:49 2014
Date: Thu, 23 Oct 2014 00:28:35 +0900
From: Norihiro Tanaka
To: arnold@skeeve.com
Cc: eblake@redhat.com, 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
In-Reply-To: <201410211443.s9LEhlH7008690@freefriends.org>
References: <20141021222519.74AB.27F6AC2D@kcn.ne.jp> <201410211443.s9LEhlH7008690@freefriends.org>
Message-Id: <20141023002834.2860.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

arnold@skeeve.com wrote:
> Gawk does not remove CR in advance, unless someone specifically
> set RS = "\r\n", in which case the full regex matcher is used
> to first find \r\n in the raw input buffer.

Thanks, I also confirmed it in the gawk source code.

> So for gawk, adding a check for (c == eolbyte || c == '\r')
> should produce more speedup on Windows.
>
> (Hmm, on Windows the default is probably text mode which causes
> the library/OS to hide the \r anyway. Harumph. But if binary mode
> was requested then it could still make a difference.)

I think it's better to build a KWset than to rely on checking for '\r'
in the non-UTF-8 multibyte mode of DFA.

Furthermore, even if we add a check for '\r' to DFA, I think we can't
use it to speed things up on Windows, because DFA can't correctly
locate the matched position unless the pattern is a fixed string.
From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 15 09:59:42 2014
Date: Mon, 15 Dec 2014 23:59:32 +0900
From: Norihiro Tanaka
To: Eric Blake
Cc: 18777@debbugs.gnu.org
Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
In-Reply-To: <54453338.6090406@redhat.com>
References: <20141021000401.1015.27F6AC2D@kcn.ne.jp> <54453338.6090406@redhat.com>
Message-Id: <20141215235931.B2D8.27F6AC2D@kcn.ne.jp>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------_548EF5F900000000B2D6_MULTIPART_MIXED_"
Content-Transfer-Encoding: 7bit

--------_548EF5F900000000B2D6_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

On Mon, 20 Oct 2014 10:07:20 -0600
Eric Blake wrote:

> POSIX requires that NUL, slash, dot, newline, and carriage return all be
> single bytes that cannot occur inside a multibyte character (because
> they have special meaning to file name resolution and/or terminal
> interaction); it added this requirement fairly recently, but only after
> confirming that common existing locales satisfy this constraint. (The
> same is not true for most any other character; even though POSIX
> requires that a-z, A-Z, and 0-9 be single bytes, it does not forbid
> those characters from also being bytes embedded within multibyte
> characters). Is it worth extending your optimization to all five of the
> POSIX-guaranteed single byte characters?

I rewrote the patch so that NUL, slash, dot, and carriage return, as
well as newline, are also treated as special characters.
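A sketch of the table-driven idea, for readers who do not want to decode
the attachment. It is not the attached patch itself: NBYTES,
always_single, init_always_single, and is_safe_stop are hypothetical
names chosen for this illustration.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

enum { NBYTES = UCHAR_MAX + 1 };  /* one slot per possible byte value */

/* always_single[B] is true if byte B is assumed never to occur inside
   a multibyte character.  */
static bool always_single[NBYTES];

/* Mark the five bytes named in the thread.  */
static void
init_always_single (void)
{
  static unsigned char const special[] = { '\0', '\n', '\r', '.', '/' };
  for (size_t i = 0; i < sizeof special / sizeof special[0]; i++)
    always_single[special[i]] = true;
}

/* Return true if a boundary scan that lands on *P may stop at once.  */
static bool
is_safe_stop (unsigned char const *p)
{
  return always_single[*p];
}
```

A table lookup keeps the per-byte test to a single array access, so
adding more bytes to the set does not slow down the hot path.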
--------_548EF5F900000000B2D6_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Disposition: attachment; filename="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Transfer-Encoding: base64 RnJvbSA1OGJmNjg5NDI5M2VlNTIxNDVlYmU1MjIzYWNkNjg1ZWYyNWY3NDRmIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDE1IERlYyAyMDE0IDIzOjQwOjE3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBpbXByb3ZlbWVudCBmb3IgY2hlY2tpbmcgb2YgbXVsdGlieXRlIGNoYXJhY3RlciBib3VuZGFy eQoKV2hlbiBmb3VuZCBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5ub3Qgb2NjdXIgaW5zaWRlIGEgbXVs dGlieXRlIGNoYXJhY3Rlcgp3ZSBjYW4gc2tpcCBjaGVjayBmb3IgbXVsdGlieXRlIGNoYXJhY3Rl ciBib3VuZGFyeSBiZWZvcmUgdGhlIGNoYXJhY3Rlci4KClRoZSBpbXByb3ZlbWVudCBzcGVlZHMg dXAgYWJvdXQgNDAlIGZvciBpbnB1dCBzdHJpbmcgd2hpY2ggZG9lc24ndCBtYXRjaApldmVuIHRo ZSBmaXJzdCBwYXJ0IG9mIGEgcGF0dGVybi4KCiogc3JjL2RmYS5jIChhbHdheXNfc2luZ2xlX2J5 dGUpOiBBZGQgYSBuZXcgdmFyaWFibGUuICBJdCBjYWNoZXMgd2hldGhlcgplYWNoIGJ5dGUgY2Fu IG9jY3VyIGluc2lkZSBhIG11bHRpYnl0ZSBjaGFyYWN0ZXIgb3Igbm90LgooZGZhYWx3YXlzc2Ip OiBBZGQgYSBuZXcgZnVuY3Rpb24uCihkZmFjb21wKTogVXNlIGl0Lgooc2tpcF9yZW1haW5zX21i KTogSWYgYW4gaW5wdXQgY2hhcmFjdGVyIGlzIHNpbmdsZSBieXRlcyB0aGF0IGNhbm5vdApvY2N1 ciBpbnNpZGUgYSBtdWx0aWJ5dGUgY2hhcmFjdGVyLCBza2lwIGNoZWNrIGZvciBtdWx0aWJ5dGUg Y2hhcmFjdGVyCmJvdW5kYXJ5IHVudGlsIHRoZXJlLgotLS0KIHNyYy9kZmEuYyB8IDE1ICsrKysr KysrKysrKysrKwogMSBmaWxlIGNoYW5nZWQsIDE1IGluc2VydGlvbnMoKykKCmRpZmYgLS1naXQg YS9zcmMvZGZhLmMgYi9zcmMvZGZhLmMKaW5kZXggODA2Y2IwNC4uMDU5YzViMiAxMDA2NDQKLS0t IGEvc3JjL2RmYS5jCisrKyBiL3NyYy9kZmEuYwpAQCAtNDUxLDYgKzQ1MSwxOCBAQCBzdHJ1Y3Qg ZGZhCiBzdGF0aWMgdm9pZCBkZmFtdXN0IChzdHJ1Y3QgZGZhICpkZmEpOwogc3RhdGljIHZvaWQg cmVnZXhwICh2b2lkKTsKIAorLyogVHJ1ZSBpZiBlYWNoIGJ5dGUgY2FuIG5vdCBvY2N1ciBpbnNp ZGUgYSBtdWx0aWJ5dGUgY2hhcmFjdGVyICAqLworc3RhdGljIGJvb2wgYWx3YXlzX3NpbmdsZV9i eXRlW05PVENIQVJdOworCitzdGF0aWMgdm9pZAorZGZhYWx3YXlzc2IgKHZvaWQpCit7CisgIHNp 
emVfdCBpOworICB1bnNpZ25lZCBjaGFyIGNvbnN0IHVjW10gPSB7ICdcMCcsICdcbicsICdccics ICcuJywgJy8nIH07CisgIGZvciAoaSA9IDA7IGkgPCBzaXplb2YgdWMgLyBzaXplb2YgdWNbMF07 ICsraSkKKyAgICBhbHdheXNfc2luZ2xlX2J5dGVbdWNbaV1dID0gdHJ1ZTsKK30KKwogc3RhdGlj IHZvaWQKIGRmYW1iY2FjaGUgKHN0cnVjdCBkZmEgKmQpCiB7CkBAIC0zMjc5LDYgKzMyOTEsOCBA QCBza2lwX3JlbWFpbnNfbWIgKHN0cnVjdCBkZmEgKmQsIHVuc2lnbmVkIGNoYXIgY29uc3QgKnAs CiAgICAgICAgICAgICAgICAgIHVuc2lnbmVkIGNoYXIgY29uc3QgKm1icCwgY2hhciBjb25zdCAq ZW5kLCB3aW50X3QgKndjcCkKIHsKICAgd2ludF90IHdjID0gV0VPRjsKKyAgaWYgKGFsd2F5c19z aW5nbGVfYnl0ZVsqcF0pCisgICAgcmV0dXJuIHA7CiAgIHdoaWxlIChtYnAgPCBwKQogICAgIG1i cCArPSBtYnNfdG9fd2NoYXIgKCZ3YywgKGNoYXIgY29uc3QgKikgbWJwLAogICAgICAgICAgICAg ICAgICAgICAgICAgIGVuZCAtIChjaGFyIGNvbnN0ICopIG1icCwgZCk7CkBAIC0zNzEzLDYgKzM3 MjcsNyBAQCB2b2lkCiBkZmFjb21wIChjaGFyIGNvbnN0ICpzLCBzaXplX3QgbGVuLCBzdHJ1Y3Qg ZGZhICpkLCBpbnQgc2VhcmNoZmxhZykKIHsKICAgZGZhaW5pdCAoZCk7CisgIGRmYWFsd2F5c3Ni ICgpOwogICBkZmFtYmNhY2hlIChkKTsKICAgZGZhcGFyc2UgKHMsIGxlbiwgZCk7CiAgIGRmYW11 c3QgKGQpOwotLSAKMi4yLjAKCg== --------_548EF5F900000000B2D6_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 15 12:44:13 2014 Received: (at 18777) by debbugs.gnu.org; 15 Dec 2014 17:44:13 +0000 Received: from localhost ([127.0.0.1]:47135 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y0Zgr-0002b4-1B for submit@debbugs.gnu.org; Mon, 15 Dec 2014 12:44:13 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:52568) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y0Zgn-0002ap-RQ for 18777@debbugs.gnu.org; Mon, 15 Dec 2014 12:44:10 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id DD534A6004D; Mon, 15 Dec 2014 09:44:03 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id 3FMrUN-yE4de; Mon, 15 Dec 2014 09:43:58 -0800 
(PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 13BFBA60051; Mon, 15 Dec 2014 09:43:58 -0800 (PST) Message-ID: <548F1DDA.8070308@cs.ucla.edu> Date: Mon, 15 Dec 2014 09:43:54 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Norihiro Tanaka , Eric Blake Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary References: <20141021000401.1015.27F6AC2D@kcn.ne.jp> <54453338.6090406@redhat.com> <20141215235931.B2D8.27F6AC2D@kcn.ne.jp> In-Reply-To: <20141215235931.B2D8.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 18777 Cc: 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) On 12/15/2014 06:59 AM, Norihiro Tanaka wrote: > +/* True if each byte can not occur inside a multibyte character */ > +static bool always_single_byte[NOTCHAR]; > + > +static void > +dfaalwayssb (void) > +{ > + size_t i; > + unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' }; > + for (i = 0; i < sizeof uc / sizeof uc[0]; ++i) > + always_single_byte[uc[i]] = true; > +} Can't we improve this when using_utf8 () is true? In that case, every ASCII character is always single byte. Also, the bytes 0xc0, 0xc1, and 0xf5 through 0xff can be added to the table: they are not single-byte characters but they are always encoding errors so they will be a character boundary as far as skip_remains_mb is concerned. This suggests that the table 'always_single_byte' should be renamed to something like 'always_character_boundary'. 
> wint_t wc = WEOF; > + if (always_single_byte[*p]) > + return p; This won't assign anything to *WCP, contrary to the documented API for for skip_remains_mb. This is OK (as callers don't care) but the API documentation should be changed to reflect the actual behavior. From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 16 07:42:37 2014 Received: (at 18777) by debbugs.gnu.org; 16 Dec 2014 12:42:37 +0000 Received: from localhost ([127.0.0.1]:47544 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y0rSW-0006jV-RL for submit@debbugs.gnu.org; Tue, 16 Dec 2014 07:42:37 -0500 Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:39500) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y0rST-0006jK-UL for 18777@debbugs.gnu.org; Tue, 16 Dec 2014 07:42:35 -0500 Received: from imp01 (mailgw5.kcn.ne.jp [61.86.15.231]) by mailgw06.kcn.ne.jp (Postfix) with ESMTP id A3BF3C8006 for <18777@debbugs.gnu.org>; Tue, 16 Dec 2014 21:42:31 +0900 (JST) Received: from mail02.kcn.ne.jp ([61.86.6.181]) by imp01 with bizsmtp id UCiX1p00J3uLcVp01CiX0m; Tue, 16 Dec 2014 21:42:31 +0900 X-OrgRCPT: 18777@debbugs.gnu.org Received: from [10.120.1.76] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail02.kcn.ne.jp (Postfix) with ESMTPA id 374FEBE8001; Tue, 16 Dec 2014 21:42:31 +0900 (JST) Date: Tue, 16 Dec 2014 21:42:32 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary In-Reply-To: <548F1DDA.8070308@cs.ucla.edu> References: <20141215235931.B2D8.27F6AC2D@kcn.ne.jp> <548F1DDA.8070308@cs.ucla.edu> Message-Id: <20141216214231.4D26.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_54902337000000004D16_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) --------_54902337000000004D16_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit

On Mon, 15 Dec 2014 09:43:54 -0800 Paul Eggert wrote:
> Can't we improve this when using_utf8 () is true? In that case, every
> ASCII character is always single byte. Also, the bytes 0xc0, 0xc1,
> and 0xf5 through 0xff can be added to the table: they are not
> single-byte characters but they are always encoding errors so they will
> be a character boundary as far as skip_remains_mb is concerned. This
> suggests that the table 'always_single_byte' should be renamed to
> something like 'always_character_boundary'.
>
> > wint_t wc = WEOF;
> > + if (always_single_byte[*p])
> > + return p;

Thanks for the review and suggestion. If using_utf8 () is true, we can
set always_character_boundary to true for every byte except 0x80-0xbf.

> This won't assign anything to *WCP, contrary to the documented API
> for skip_remains_mb. This is OK (as callers don't care) but the API
> documentation should be changed to reflect the actual behavior.

Oh! If WCP is needed, we must go through step by step, because the wide
character before P is set to *WCP. I fixed it and updated the API
documentation.
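The UTF-8 rule stated above (every byte is a boundary except the continuation bytes 0x80-0xbf) can be sketched as a table fill. This is only an illustration, not the attached patch: the names NBYTES, boundary, fill_utf8_boundary_table, is_always_boundary and check_utf8_boundary_table are hypothetical, and NBYTES stands in for NOTCHAR on the assumption that it is 256.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for NOTCHAR in dfa.c, assumed here to be 256.  */
enum { NBYTES = 256 };

/* boundary[b] is true if byte B can never occur inside a multibyte
   character, so a position holding B is always a character boundary.  */
static bool boundary[NBYTES];

/* Fill the table for a UTF-8 locale.  Only continuation bytes
   (bit pattern 10xxxxxx, i.e. 0x80-0xbf) can occur inside a character.
   ASCII bytes, lead bytes, and the always-invalid bytes 0xc0, 0xc1 and
   0xf5-0xff all start a new character or an encoding error.  */
void
fill_utf8_boundary_table (void)
{
  for (int i = 0; i < NBYTES; i++)
    boundary[i] = (i & 0xc0) != 0x80;
}

bool
is_always_boundary (unsigned char b)
{
  return boundary[b];
}

/* A few spot checks of the rule.  */
void
check_utf8_boundary_table (void)
{
  fill_utf8_boundary_table ();
  assert (is_always_boundary ('\n'));   /* ASCII */
  assert (is_always_boundary (0xe3));   /* lead byte */
  assert (is_always_boundary (0xc0));   /* always an encoding error */
  assert (!is_always_boundary (0x80));  /* continuation byte */
  assert (!is_always_boundary (0xbf));  /* continuation byte */
}
```

Note that the bytes 0xc0, 0xc1 and 0xf5-0xff need no special case under this rule, since none of them has the 10xxxxxx bit pattern.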
--------_54902337000000004D16_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Disposition: attachment; filename="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Transfer-Encoding: base64 RnJvbSA4MmQ0YjJlNTUyMDNkZTBmYmFiYzU0N2RlOWI5NWIyYTZlNTAxOGQwIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDE1IERlYyAyMDE0IDIzOjQwOjE3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBpbXByb3ZlbWVudCBmb3IgY2hlY2tpbmcgb2YgbXVsdGlieXRlIGNoYXJhY3RlciBib3VuZGFy eQoKV2hlbiBmb3VuZCBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5ub3Qgb2NjdXIgaW5zaWRlIGEgbXVs dGlieXRlIGNoYXJhY3Rlcgp3ZSBjYW4gc2tpcCBjaGVjayBmb3IgbXVsdGlieXRlIGNoYXJhY3Rl ciBib3VuZGFyeSBiZWZvcmUgdGhlIGNoYXJhY3Rlci4KClRoZSBpbXByb3ZlbWVudCBzcGVlZHMg dXAgYWJvdXQgNDAlIGZvciBpbnB1dCBzdHJpbmcgd2hpY2ggZG9lc24ndCBtYXRjaApldmVuIHRo ZSBmaXJzdCBwYXJ0IG9mIGEgcGF0dGVybi4KCiogc3JjL2RmYS5jIChhbHdheXNfY2hhcmFjdGVy X2JvdW5kYXJ5KTogQWRkIGEgbmV3IHZhcmlhYmxlLiAgSXQgY2FjaGVzCndoZXRoZXIgZWFjaCBi eXRlIGNhbiBvY2N1ciBpbnNpZGUgYSBtdWx0aWJ5dGUgY2hhcmFjdGVyIG9yIG5vdC4KKGRmYWFs d2F5c2NiKTogQWRkIGEgbmV3IGZ1bmN0aW9uLgooZGZhY29tcCk6IFVzZSBpdC4KKHNraXBfcmVt YWluc19tYik6IElmIGFuIGlucHV0IGNoYXJhY3RlciBpcyBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5u b3QKb2NjdXIgaW5zaWRlIGEgbXVsdGlieXRlIGNoYXJhY3Rlciwgc2tpcCBjaGVjayBmb3IgbXVs dGlieXRlIGNoYXJhY3Rlcgpib3VuZGFyeSB1bnRpbCB0aGVyZS4KLS0tCiBzcmMvZGZhLmMgfCAz MiArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrLQogMSBmaWxlIGNoYW5nZWQsIDMxIGlu c2VydGlvbnMoKyksIDEgZGVsZXRpb24oLSkKCmRpZmYgLS1naXQgYS9zcmMvZGZhLmMgYi9zcmMv ZGZhLmMKaW5kZXggODA2Y2IwNC4uYjY1YjdmMiAxMDA2NDQKLS0tIGEvc3JjL2RmYS5jCisrKyBi L3NyYy9kZmEuYwpAQCAtNDUxLDYgKzQ1MSwyOSBAQCBzdHJ1Y3QgZGZhCiBzdGF0aWMgdm9pZCBk ZmFtdXN0IChzdHJ1Y3QgZGZhICpkZmEpOwogc3RhdGljIHZvaWQgcmVnZXhwICh2b2lkKTsKIAor LyogVHJ1ZSBpZiBlYWNoIGJ5dGUgY2FuIG5vdCBvY2N1ciBpbnNpZGUgYSBtdWx0aWJ5dGUgY2hh cmFjdGVyICAqLworc3RhdGljIGJvb2wgYWx3YXlzX2NoYXJhY3Rlcl9ib3VuZGFyeVtOT1RDSEFS 
XTsKKworc3RhdGljIHZvaWQKK2RmYWFsd2F5c2NiICh2b2lkKQoreworICBpbnQgaTsKKyAgaWYg KHVzaW5nX3V0ZjggKCkpCisgICAgeworICAgICAgZm9yIChpID0gQ0hBUl9NSU47IGkgPD0gQ0hB Ul9NQVg7ICsraSkKKyAgICAgICAgeworICAgICAgICAgIHVuc2lnbmVkIGNoYXIgdWMgPSBpOwor ICAgICAgICAgIGFsd2F5c19jaGFyYWN0ZXJfYm91bmRhcnlbdWNdID0gISgodWMgJiAweGMwKSBe IDB4ODApOworICAgICAgICB9CisgICAgfQorICBlbHNlCisgICAgeworICAgICAgdW5zaWduZWQg Y2hhciBjb25zdCB1Y3NbXSA9IHsgJ1wwJywgJ1xuJywgJ1xyJywgJy4nLCAnLycgfTsKKyAgICAg IGZvciAoaSA9IDA7IGkgPCBzaXplb2YgdWNzIC8gc2l6ZW9mIHVjc1swXTsgKytpKQorICAgICAg ICBhbHdheXNfY2hhcmFjdGVyX2JvdW5kYXJ5W3Vjc1tpXV0gPSB0cnVlOworICAgIH0KK30KKwog c3RhdGljIHZvaWQKIGRmYW1iY2FjaGUgKHN0cnVjdCBkZmEgKmQpCiB7CkBAIC0zMjczLDEyICsz Mjk2LDE4IEBAIHRyYW5zaXRfc3RhdGUgKHN0cnVjdCBkZmEgKmQsIHN0YXRlX251bSBzLCB1bnNp Z25lZCBjaGFyIGNvbnN0ICoqcHAsCiAgICBHaXZlbiBERkEgc3RhdGUgZCwgdXNlIG1ic190b193 Y2hhciB0byBhZHZhbmNlIE1CUCB1bnRpbCBpdCByZWFjaGVzIG9yCiAgICBleGNlZWRzIFAuICBJ ZiBXQ1AgaXMgbm9uLU5VTEwsIHNldCAqV0NQIHRvIHRoZSBmaW5hbCB3aWRlIGNoYXJhY3Rlcgog ICAgcHJvY2Vzc2VkLCBvciBpZiBubyB3aWRlIGNoYXJhY3RlciBpcyBwcm9jZXNzZWQsIHNldCBp dCB0byBXRU9GLgotICAgQm90aCBQIGFuZCBNQlAgbXVzdCBiZSBubyBsYXJnZXIgdGhhbiBFTkQu ICAqLworICAgQm90aCBQIGFuZCBNQlAgbXVzdCBiZSBubyBsYXJnZXIgdGhhbiBFTkQuCisKKyAg IElmIG5leHQgY2hhcmFjdGVyIGlzIGNoYXJhY3RlciBib3VuZGFyeSwgaXQgYWx3YXlzIHJlYWNo ZXMgUC4gIFNvIGlmCisgICBXQ1AgaXMgTlVMTCwgY2FuIG9taXQgY2hlY2sgc3RlcCBieSBzdGVw IGFuZCBpbW1lZGlhdGVseSByZXR1cm4gUC4gICovCiBzdGF0aWMgdW5zaWduZWQgY2hhciBjb25z dCAqCiBza2lwX3JlbWFpbnNfbWIgKHN0cnVjdCBkZmEgKmQsIHVuc2lnbmVkIGNoYXIgY29uc3Qg KnAsCiAgICAgICAgICAgICAgICAgIHVuc2lnbmVkIGNoYXIgY29uc3QgKm1icCwgY2hhciBjb25z dCAqZW5kLCB3aW50X3QgKndjcCkKIHsKICAgd2ludF90IHdjID0gV0VPRjsKKyAgaWYgKHdjcCA9 PSBOVUxMICYmIGFsd2F5c19jaGFyYWN0ZXJfYm91bmRhcnlbKnBdKQorICAgIHJldHVybiBwOwor CiAgIHdoaWxlIChtYnAgPCBwKQogICAgIG1icCArPSBtYnNfdG9fd2NoYXIgKCZ3YywgKGNoYXIg Y29uc3QgKikgbWJwLAogICAgICAgICAgICAgICAgICAgICAgICAgIGVuZCAtIChjaGFyIGNvbnN0 
ICopIG1icCwgZCk7CkBAIC0zNzEzLDYgKzM3NDIsNyBAQCB2b2lkCiBkZmFjb21wIChjaGFyIGNv bnN0ICpzLCBzaXplX3QgbGVuLCBzdHJ1Y3QgZGZhICpkLCBpbnQgc2VhcmNoZmxhZykKIHsKICAg ZGZhaW5pdCAoZCk7CisgIGRmYWFsd2F5c2NiICgpOwogICBkZmFtYmNhY2hlIChkKTsKICAgZGZh cGFyc2UgKHMsIGxlbiwgZCk7CiAgIGRmYW11c3QgKGQpOwotLSAKMi4yLjAKCg== --------_54902337000000004D16_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 16 12:12:34 2014 Received: (at 18777) by debbugs.gnu.org; 16 Dec 2014 17:12:34 +0000 Received: from localhost ([127.0.0.1]:48057 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y0vfm-0006P6-5L for submit@debbugs.gnu.org; Tue, 16 Dec 2014 12:12:34 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:58717) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y0vfk-0006Oy-83 for 18777@debbugs.gnu.org; Tue, 16 Dec 2014 12:12:33 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id F3ED9A60084; Tue, 16 Dec 2014 09:12:30 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HdTHbL1OgioK; Tue, 16 Dec 2014 09:12:22 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 486B9A60029; Tue, 16 Dec 2014 09:12:22 -0800 (PST) Message-ID: <549067F5.2080802@cs.ucla.edu> Date: Tue, 16 Dec 2014 09:12:21 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Norihiro Tanaka Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary References: <20141215235931.B2D8.27F6AC2D@kcn.ne.jp> <548F1DDA.8070308@cs.ucla.edu> <20141216214231.4D26.27F6AC2D@kcn.ne.jp> In-Reply-To: <20141216214231.4D26.27F6AC2D@kcn.ne.jp> Content-Type: 
text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

On 12/16/2014 04:42 AM, Norihiro Tanaka wrote:
> Thanks for the review and suggestion. If using_utf8 () is true, we can
> set always_character_boundary to true except 0x80-0xbf.

Even better, thanks.

>> This won't assign anything to *WCP, contrary to the documented API
>> for skip_remains_mb. This is OK (as callers don't care) but the API
>> documentation should be changed to reflect the actual behavior.
> Oh! If WCP is needed, we must go through step by step, because the wide
> character before P is set to *WCP. I fixed it and updated the API
> documentation.

This part of the patch does too much work, as the caller inspects *WCP
only when skip_remains_mb returns a value not equal to p. So there's no
need for the "wcp == NULL &&" test in the patch. Instead, the documented
API can change, saying that *WCP is assigned to only if WCP is non-NULL
and the result is greater than p.
From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 16 18:22:27 2014 Received: (at 18777) by debbugs.gnu.org; 16 Dec 2014 23:22:28 +0000 Received: from localhost ([127.0.0.1]:48236 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y11Rj-0004IK-I1 for submit@debbugs.gnu.org; Tue, 16 Dec 2014 18:22:27 -0500 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:39075) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y11Rg-0004I5-Pb for 18777@debbugs.gnu.org; Tue, 16 Dec 2014 18:22:26 -0500 Received: from imp03 (mailgw7.kcn.ne.jp [61.86.15.238]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id 532C76C1BFB for <18777@debbugs.gnu.org>; Wed, 17 Dec 2014 08:22:20 +0900 (JST) Received: from mail02.kcn.ne.jp ([61.86.6.181]) by imp03 with bizsmtp id UPNL1p0063uLcVp01PNLhn; Wed, 17 Dec 2014 08:22:20 +0900 X-OrgRCPT: 18777@debbugs.gnu.org Received: from [10.120.1.76] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail02.kcn.ne.jp (Postfix) with ESMTPA id 109DFF10025; Wed, 17 Dec 2014 08:22:20 +0900 (JST) Date: Wed, 17 Dec 2014 08:22:20 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary In-Reply-To: <549067F5.2080802@cs.ucla.edu> References: <20141216214231.4D26.27F6AC2D@kcn.ne.jp> <549067F5.2080802@cs.ucla.edu> Message-Id: <20141217082220.BD6E.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On Tue, 16 Dec 2014 09:12:21 -0800 Paul Eggert wrote: > > This part of the patch does too much work, as the caller inspects *WCP > only when skip_remains_mb returns a value not equal to p. So there's > no need for the "wcp == NULL &&" test in the patch. Instead, the > documented API can change, saying that *WCP is assigned to only if WCP > is non-NULL and the result is greater than p. Thanks, you are right. However, first it is no longer portable after remove it. Second if it is compiled with GCC 4.3 or later, the function is inlined by and "WCP == NULL &&" will be pruned. From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 16 19:07:11 2014 Received: (at 18777) by debbugs.gnu.org; 17 Dec 2014 00:07:11 +0000 Received: from localhost ([127.0.0.1]:48265 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1290-0005RF-TK for submit@debbugs.gnu.org; Tue, 16 Dec 2014 19:07:11 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:53742) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y128y-0005R6-81 for 18777@debbugs.gnu.org; Tue, 16 Dec 2014 19:07:09 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 25378A60093; Tue, 16 Dec 2014 16:07:07 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id cksesy68HW8K; Tue, 16 Dec 2014 16:06:58 -0800 (PST) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) 
with ESMTPSA id 67AB5A60091; Tue, 16 Dec 2014 16:06:58 -0800 (PST) Message-ID: <5490C91E.8070007@cs.ucla.edu> Date: Tue, 16 Dec 2014 16:06:54 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Norihiro Tanaka Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary References: <20141216214231.4D26.27F6AC2D@kcn.ne.jp> <549067F5.2080802@cs.ucla.edu> <20141217082220.BD6E.27F6AC2D@kcn.ne.jp> In-Reply-To: <20141217082220.BD6E.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

Norihiro Tanaka wrote:
> However, first it is no longer portable after
> remove it.

"portable"? This issue is independent of platform, surely. By "portable"
did you mean "robust in the presence of future changes"?

> Second if it is compiled with GCC 4.3 or later, the function
> is inlined by and "WCP == NULL &&" will be pruned.

True, but I wasn't worried so much about that. I was worried about the
case where WCP != NULL: there, the inlined function will be slower
because it won't use the faster approach of checking
always_character_boundary[*p]: it'll always use the much-slower loop.
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 17 12:21:34 2014 Received: (at 18777) by debbugs.gnu.org; 17 Dec 2014 17:21:34 +0000 Received: from localhost ([127.0.0.1]:49142 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1II2-0004r4-2P for submit@debbugs.gnu.org; Wed, 17 Dec 2014 12:21:34 -0500 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:52116) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1IHz-0004qq-TS for 18777@debbugs.gnu.org; Wed, 17 Dec 2014 12:21:33 -0500 Received: from imp02 (mailgw6.kcn.ne.jp [61.86.15.232]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id D337D80325 for <18777@debbugs.gnu.org>; Thu, 18 Dec 2014 02:21:29 +0900 (JST) Received: from mail08.kcn.ne.jp ([61.86.6.187]) by imp02 with bizsmtp id UhMV1p00D426eXR01hMVAa; Thu, 18 Dec 2014 02:21:29 +0900 X-OrgRCPT: 18777@debbugs.gnu.org Received: from [10.120.1.47] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail08.kcn.ne.jp (Postfix) with ESMTPA id 761B412B802E; Thu, 18 Dec 2014 02:21:29 +0900 (JST) Date: Thu, 18 Dec 2014 02:21:30 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary In-Reply-To: <5490C91E.8070007@cs.ucla.edu> References: <20141217082220.BD6E.27F6AC2D@kcn.ne.jp> <5490C91E.8070007@cs.ucla.edu> Message-Id: <20141218022130.3F0B.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) On Tue, 16 Dec 2014 16:06:54 -0800 Paul Eggert wrote: > did you mean "robust in the presence of future changes? Yes. However, I might have made too big a deal of the effect about "Portable". > True, but I wasn't worried so much about that. I was worried about the > case where WCP != NULL: there, the inlined function will be slower > because it won't use the faster approach of checking > always_character_boundary[*p]: it'll always use the much-slower loop. If WCP != NULL, all of following code will be pruned, although I think that it is ignorable for the performance. if (wcp == NULL && always_character_boundary[*p]) return p; From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 17 12:46:36 2014 Received: (at 18777) by debbugs.gnu.org; 17 Dec 2014 17:46:36 +0000 Received: from localhost ([127.0.0.1]:49178 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1IgF-0008Gb-Qp for submit@debbugs.gnu.org; Wed, 17 Dec 2014 12:46:36 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:39416) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1IgD-0008GT-Mi for 18777@debbugs.gnu.org; Wed, 17 Dec 2014 12:46:34 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 41213A600A0; Wed, 17 Dec 2014 09:46:32 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id TKVY6o7lKgyd; Wed, 17 Dec 2014 09:46:23 -0800 (PST) Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU 
[131.179.64.200]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id A7547A6009F; Wed, 17 Dec 2014 09:46:23 -0800 (PST) Message-ID: <5491C161.3030900@cs.ucla.edu> Date: Wed, 17 Dec 2014 09:46:09 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Norihiro Tanaka Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary References: <20141217082220.BD6E.27F6AC2D@kcn.ne.jp> <5490C91E.8070007@cs.ucla.edu> <20141218022130.3F0B.27F6AC2D@kcn.ne.jp> In-Reply-To: <20141218022130.3F0B.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

On 12/17/2014 09:21 AM, Norihiro Tanaka wrote:
> If WCP != NULL, all of following code will be pruned, although I think
> that it is ignorable for the performance.
>
>   if (wcp == NULL && always_character_boundary[*p])
>     return p;

Yes, and that's the point: we don't want this if-statement to be pruned
if WCP != NULL. We want the code to return P right away in the typical
case where P is at a character boundary. If MBP is way less than P, this
will save the work of the following loop.
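To make the cost argument above concrete, here is a minimal sketch of the early return with no test on WCP, so the table lookup is used regardless of whether the caller wants a wide character. It is not the code in dfa.c: skip_to_boundary, toy_char_len, init_boundary_utf8, decode_calls and check_skip are hypothetical names, the decoder is a toy that only looks at UTF-8 lead bytes, and P is assumed to point at a readable byte before END.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

bool boundary[256];          /* true: byte is always a character boundary */
unsigned long decode_calls;  /* counts how often the slow path decodes */

/* UTF-8 rule from earlier in the thread: all bytes except 0x80-0xbf.  */
void
init_boundary_utf8 (void)
{
  for (int i = 0; i < 256; i++)
    boundary[i] = (i & 0xc0) != 0x80;
}

/* Toy decoder (an assumption of this sketch): return the length implied
   by a UTF-8 lead byte, or 1 for anything invalid or truncated.  */
static size_t
toy_char_len (const unsigned char *s, size_t avail)
{
  size_t n = (*s < 0x80 ? 1
              : (*s & 0xe0) == 0xc0 ? 2
              : (*s & 0xf0) == 0xe0 ? 3
              : (*s & 0xf8) == 0xf0 ? 4 : 1);
  decode_calls++;
  return n <= avail ? n : 1;
}

/* Advance MBP until it reaches or exceeds P, but return P at once when
   *P is a byte that is always a character boundary.  P must be < END.  */
const unsigned char *
skip_to_boundary (const unsigned char *p, const unsigned char *mbp,
                  const unsigned char *end)
{
  if (boundary[*p])
    return p;                /* fast path: no decoding at all */
  while (mbp < p)
    mbp += toy_char_len (mbp, (size_t) (end - mbp));
  return mbp;
}

void
check_skip (void)
{
  /* 'a', 'b', one three-byte character (e3 81 82), newline.  */
  const unsigned char buf[] = "ab\xe3\x81\x82\n";
  const unsigned char *end = buf + 6;

  init_boundary_utf8 ();
  decode_calls = 0;
  /* P at the newline: returned immediately, nothing is decoded.  */
  assert (skip_to_boundary (buf + 5, buf, end) == buf + 5);
  assert (decode_calls == 0);
  /* P inside the three-byte character: the loop runs and overshoots P.  */
  assert (skip_to_boundary (buf + 3, buf, end) == buf + 5);
  assert (decode_calls == 3);
}
```

When MBP is far behind P, the fast path skips one decode per intervening character, which is the saving described in the message above.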
From debbugs-submit-bounces@debbugs.gnu.org Wed Dec 17 18:50:26 2014 Received: (at 18777) by debbugs.gnu.org; 17 Dec 2014 23:50:26 +0000 Received: from localhost ([127.0.0.1]:49343 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1OML-0001k8-AJ for submit@debbugs.gnu.org; Wed, 17 Dec 2014 18:50:26 -0500 Received: from mailgw01.kcn.ne.jp ([61.86.7.208]:35425) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1OMH-0001jx-Ez for 18777@debbugs.gnu.org; Wed, 17 Dec 2014 18:50:22 -0500 Received: from imp03 (mailgw7.kcn.ne.jp [61.86.15.238]) by mailgw01.kcn.ne.jp (Postfix) with ESMTP id 2C34F8032A for <18777@debbugs.gnu.org>; Thu, 18 Dec 2014 08:50:19 +0900 (JST) Received: from mail01.kcn.ne.jp ([61.86.6.180]) by imp03 with bizsmtp id UnqJ1p0063t2w9Z01nqJqN; Thu, 18 Dec 2014 08:50:18 +0900 X-OrgRCPT: 18777@debbugs.gnu.org Received: from [10.120.1.47] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail01.kcn.ne.jp (Postfix) with ESMTPA id C12135A8232; Thu, 18 Dec 2014 08:50:18 +0900 (JST) Date: Thu, 18 Dec 2014 08:50:19 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary In-Reply-To: <5491C161.3030900@cs.ucla.edu> References: <20141218022130.3F0B.27F6AC2D@kcn.ne.jp> <5491C161.3030900@cs.ucla.edu> Message-Id: <20141218085019.770B.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/)

On Wed, 17 Dec 2014 09:46:09 -0800 Paul Eggert wrote:
> Yes, and that's the point: we don't want this if-statement to be pruned
> if WCP != NULL. We want the code to return P right away in the typical
> case where P is at a character boundary. If MBP is way less than P,
> this will save the work of the following loop.

When we return P, we must set *WCP to the wide character for the
previous character, not the next one.

For example, assume the following byte sequence in a Shift_JIS locale,
where the pair 0x95 0x5c is one multibyte character. Suppose
skip_remains_mb() is called with MBP = position (a) and P = position (d).

    0x41 0x95 0x5c 0x0a
    (a)  (b)  (c)  (d)

If WCP == NULL, we can return P right away. On the other hand, if
WCP != NULL, we must set *WCP to the wide character for 0x95 0x5c before
returning P.

Do you have any ideas for using always_character_boundary in that case?
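The byte sequence in the example above can be walked through with a small sketch. It assumes a simplified Shift_JIS model in which bytes 0x81-0x9f and 0xe0-0xfc are lead bytes of two-byte characters; real code would ask the locale through mbrtowc instead. The names sjis_is_lead, mark_char_starts and check_example are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model (an assumption of this sketch): these ranges are the
   lead bytes of two-byte Shift_JIS characters.  */
static bool
sjis_is_lead (unsigned char b)
{
  return (0x81 <= b && b <= 0x9f) || (0xe0 <= b && b <= 0xfc);
}

/* Scan S[0..N-1] from the start and set STARTS[i] to true exactly when
   a character begins at offset i.  */
void
mark_char_starts (const unsigned char *s, size_t n, bool *starts)
{
  size_t i = 0;
  for (size_t k = 0; k < n; k++)
    starts[k] = false;
  while (i < n)
    {
      starts[i] = true;
      /* A lead byte consumes its trail byte too, if one is present.  */
      i += (sjis_is_lead (s[i]) && i + 1 < n) ? 2 : 1;
    }
}

void
check_example (void)
{
  /* Positions (a) (b) (c) (d) from the message above.  */
  const unsigned char in[] = { 0x41, 0x95, 0x5c, 0x0a };
  bool starts[4];

  mark_char_starts (in, 4, starts);
  assert (starts[0]);   /* (a) 0x41 is a single-byte character */
  assert (starts[1]);   /* (b) 0x95 begins a two-byte character */
  assert (!starts[2]);  /* (c) 0x5c is a trail byte here */
  assert (starts[3]);   /* (d) 0x0a is a character boundary */
}
```

Under this model, position (c) shows why a byte such as 0x5c cannot go into the table for a non-UTF-8 locale, while 0x0a at position (d) can.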
From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 18 04:40:38 2014 Received: (at 18777) by debbugs.gnu.org; 18 Dec 2014 09:40:38 +0000 Received: from localhost ([127.0.0.1]:49477 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1XZU-0008RZ-19 for submit@debbugs.gnu.org; Thu, 18 Dec 2014 04:40:37 -0500 Received: from smtp.cs.ucla.edu ([131.179.128.62]:52385) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1XZM-0008RI-9u for 18777@debbugs.gnu.org; Thu, 18 Dec 2014 04:40:29 -0500 Received: from localhost (localhost.localdomain [127.0.0.1]) by smtp.cs.ucla.edu (Postfix) with ESMTP id 169F7A600C1; Thu, 18 Dec 2014 01:40:27 -0800 (PST) X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu Received: from smtp.cs.ucla.edu ([127.0.0.1]) by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id S7jOutCBv9nB; Thu, 18 Dec 2014 01:40:18 -0800 (PST) Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net [71.177.17.123]) by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 65240A60034; Thu, 18 Dec 2014 01:40:18 -0800 (PST) Message-ID: <5492A102.9020603@cs.ucla.edu> Date: Thu, 18 Dec 2014 01:40:18 -0800 From: Paul Eggert Organization: UCLA Computer Science Department User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0 MIME-Version: 1.0 To: Norihiro Tanaka Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary References: <20141218022130.3F0B.27F6AC2D@kcn.ne.jp> <5491C161.3030900@cs.ucla.edu> <20141218085019.770B.27F6AC2D@kcn.ne.jp> In-Reply-To: <20141218085019.770B.27F6AC2D@kcn.ne.jp> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: 
List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--) Norihiro Tanaka wrote: > if WCP != NULL, we must set a wide character for 0x95 0x5c to WCP before return P. Why? The (only) caller with WCP != NULL doesn't use *WCP when skip_remains_mb (D, P, ..., WCP) returns P. So it's OK to not set *WCP in that case. From debbugs-submit-bounces@debbugs.gnu.org Thu Dec 18 10:55:04 2014 Received: (at 18777) by debbugs.gnu.org; 18 Dec 2014 15:55:05 +0000 Received: from localhost ([127.0.0.1]:50279 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1dPr-0002Xr-TK for submit@debbugs.gnu.org; Thu, 18 Dec 2014 10:55:04 -0500 Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:40926) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1Y1dPo-0002XJ-02 for 18777@debbugs.gnu.org; Thu, 18 Dec 2014 10:55:02 -0500 Received: from imp02 (mailgw6.kcn.ne.jp [61.86.15.232]) by mailgw04.kcn.ne.jp (Postfix) with ESMTP id BED956C1209 for <18777@debbugs.gnu.org>; Fri, 19 Dec 2014 00:54:57 +0900 (JST) Received: from mail06.kcn.ne.jp ([61.86.6.185]) by imp02 with bizsmtp id V3ux1p0093zXHqt013uxSG; Fri, 19 Dec 2014 00:54:57 +0900 X-OrgRCPT: 18777@debbugs.gnu.org Received: from [10.120.1.76] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) by mail06.kcn.ne.jp (Postfix) with ESMTPA id 78F441BF0021; Fri, 19 Dec 2014 00:54:57 +0900 (JST) Date: Fri, 19 Dec 2014 00:54:58 +0900 From: Norihiro Tanaka To: Paul Eggert Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary In-Reply-To: <5492A102.9020603@cs.ucla.edu> References: <20141218085019.770B.27F6AC2D@kcn.ne.jp> <5492A102.9020603@cs.ucla.edu> Message-Id: <20141219005457.470B.27F6AC2D@kcn.ne.jp> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_5492F7920000000046FE_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 
2.65.07 [ja] X-Spam-Score: -0.0 (/) X-Debbugs-Envelope-To: 18777 Cc: Eric Blake , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/) --------_5492F7920000000046FE_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit On Thu, 18 Dec 2014 01:40:18 -0800 Paul Eggert wrote: > Why? The (only) caller with WCP != NULL doesn't use *WCP when > skip_remains_mb (D, P, ..., WCP) returns P. So it's OK to not set *WCP > in that case. Thanks, I understood that you said. You are right. I changed the patch so that always_character_boundary is not pruned even if WCP != NULL, and fixed the API document. --------_5492F7920000000046FE_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Disposition: attachment; filename="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Transfer-Encoding: base64 RnJvbSA5YzQ1MjVjM2IyMWVlOWIzMmIwNDIxNTE0YmFmNmQ3YmFjZDYxMDdlIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDE1IERlYyAyMDE0IDIzOjQwOjE3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBpbXByb3ZlbWVudCBmb3IgY2hlY2tpbmcgb2YgbXVsdGlieXRlIGNoYXJhY3RlciBib3VuZGFy eQoKV2hlbiBmb3VuZCBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5ub3Qgb2NjdXIgaW5zaWRlIGEgbXVs dGlieXRlIGNoYXJhY3Rlcgp3ZSBjYW4gc2tpcCBjaGVjayBmb3IgbXVsdGlieXRlIGNoYXJhY3Rl ciBib3VuZGFyeSBiZWZvcmUgdGhlIGNoYXJhY3Rlci4KClRoZSBpbXByb3ZlbWVudCBzcGVlZHMg dXAgYWJvdXQgNDAlIGZvciBpbnB1dCBzdHJpbmcgd2hpY2ggZG9lc24ndCBtYXRjaApldmVuIHRo ZSBmaXJzdCBwYXJ0IG9mIGEgcGF0dGVybi4KCiogc3JjL2RmYS5jIChhbHdheXNfY2hhcmFjdGVy X2JvdW5kYXJ5KTogQWRkIGEgbmV3IHZhcmlhYmxlLiAgSXQgY2FjaGVzCndoZXRoZXIgZWFjaCBi 
eXRlIGNhbiBvY2N1ciBpbnNpZGUgYSBtdWx0aWJ5dGUgY2hhcmFjdGVyIG9yIG5vdC4KKGRmYWFs d2F5c2NiKTogQWRkIGEgbmV3IGZ1bmN0aW9uLgooZGZhY29tcCk6IFVzZSBpdC4KKHNraXBfcmVt YWluc19tYik6IElmIGFuIGlucHV0IGNoYXJhY3RlciBpcyBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5u b3QKb2NjdXIgaW5zaWRlIGEgbXVsdGlieXRlIGNoYXJhY3Rlciwgc2tpcCBjaGVjayBmb3IgbXVs dGlieXRlIGNoYXJhY3Rlcgpib3VuZGFyeSB1bnRpbCB0aGVyZS4KLS0tCiBzcmMvZGZhLmMgfCAz NyArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrLS0tCiAxIGZpbGUgY2hhbmdlZCwg MzQgaW5zZXJ0aW9ucygrKSwgMyBkZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9zcmMvZGZhLmMg Yi9zcmMvZGZhLmMKaW5kZXggODA2Y2IwNC4uNTBiZjQ1ZCAxMDA2NDQKLS0tIGEvc3JjL2RmYS5j CisrKyBiL3NyYy9kZmEuYwpAQCAtNDUxLDYgKzQ1MSwyOSBAQCBzdHJ1Y3QgZGZhCiBzdGF0aWMg dm9pZCBkZmFtdXN0IChzdHJ1Y3QgZGZhICpkZmEpOwogc3RhdGljIHZvaWQgcmVnZXhwICh2b2lk KTsKIAorLyogVHJ1ZSBpZiBlYWNoIGJ5dGUgY2FuIG5vdCBvY2N1ciBpbnNpZGUgYSBtdWx0aWJ5 dGUgY2hhcmFjdGVyICAqLworc3RhdGljIGJvb2wgYWx3YXlzX2NoYXJhY3Rlcl9ib3VuZGFyeVtO T1RDSEFSXTsKKworc3RhdGljIHZvaWQKK2RmYWFsd2F5c2NiICh2b2lkKQoreworICBpbnQgaTsK KyAgaWYgKHVzaW5nX3V0ZjggKCkpCisgICAgeworICAgICAgZm9yIChpID0gQ0hBUl9NSU47IGkg PD0gQ0hBUl9NQVg7ICsraSkKKyAgICAgICAgeworICAgICAgICAgIHVuc2lnbmVkIGNoYXIgdWMg PSBpOworICAgICAgICAgIGFsd2F5c19jaGFyYWN0ZXJfYm91bmRhcnlbdWNdID0gISgodWMgJiAw eGMwKSBeIDB4ODApOworICAgICAgICB9CisgICAgfQorICBlbHNlCisgICAgeworICAgICAgdW5z aWduZWQgY2hhciBjb25zdCB1Y3NbXSA9IHsgJ1wwJywgJ1xuJywgJ1xyJywgJy4nLCAnLycgfTsK KyAgICAgIGZvciAoaSA9IDA7IGkgPCBzaXplb2YgdWNzIC8gc2l6ZW9mIHVjc1swXTsgKytpKQor ICAgICAgICBhbHdheXNfY2hhcmFjdGVyX2JvdW5kYXJ5W3Vjc1tpXV0gPSB0cnVlOworICAgIH0K K30KKwogc3RhdGljIHZvaWQKIGRmYW1iY2FjaGUgKHN0cnVjdCBkZmEgKmQpCiB7CkBAIC0zMjcx LDE0ICszMjk0LDIxIEBAIHRyYW5zaXRfc3RhdGUgKHN0cnVjdCBkZmEgKmQsIHN0YXRlX251bSBz LCB1bnNpZ25lZCBjaGFyIGNvbnN0ICoqcHAsCiAgICBjaGFyYWN0ZXIuCiAKICAgIEdpdmVuIERG QSBzdGF0ZSBkLCB1c2UgbWJzX3RvX3djaGFyIHRvIGFkdmFuY2UgTUJQIHVudGlsIGl0IHJlYWNo ZXMgb3IKLSAgIGV4Y2VlZHMgUC4gIElmIFdDUCBpcyBub24tTlVMTCwgc2V0ICpXQ1AgdG8gdGhl 
IGZpbmFsIHdpZGUgY2hhcmFjdGVyCi0gICBwcm9jZXNzZWQsIG9yIGlmIG5vIHdpZGUgY2hhcmFj dGVyIGlzIHByb2Nlc3NlZCwgc2V0IGl0IHRvIFdFT0YuCi0gICBCb3RoIFAgYW5kIE1CUCBtdXN0 IGJlIG5vIGxhcmdlciB0aGFuIEVORC4gICovCisgICBleGNlZWRzIFAuICBJZiBXQ1AgaXMgbm9u LU5VTEwgYW5kIHRoZSByZXN1bHQgaXMgZ3JlYXRlciB0aGFuIHAsIHNldAorICAgKldDUCB0byB0 aGUgZmluYWwgd2lkZSBjaGFyYWN0ZXIgcHJvY2Vzc2VkLCBvciBpZiBubyB3aWRlIGNoYXJhY3Rl cgorICAgaXMgcHJvY2Vzc2VkLCBzZXQgaXQgdG8gV0VPRi4gIEJvdGggUCBhbmQgTUJQIG11c3Qg YmUgbm8gbGFyZ2VyIHRoYW4KKyAgIEVORC4KKworICAgSWYgbmV4dCBjaGFyYWN0ZXIgaXMgY2hh cmFjdGVyIGJvdW5kYXJ5LCBpdCBhbHdheXMgcmVhY2hlcyBQLiAgU28gaWYKKyAgIFdDUCBpcyBO VUxMLCBjYW4gb21pdCBjaGVjayBzdGVwIGJ5IHN0ZXAgYW5kIGltbWVkaWF0ZWx5IHJldHVybiBQ LiAgKi8KIHN0YXRpYyB1bnNpZ25lZCBjaGFyIGNvbnN0ICoKIHNraXBfcmVtYWluc19tYiAoc3Ry dWN0IGRmYSAqZCwgdW5zaWduZWQgY2hhciBjb25zdCAqcCwKICAgICAgICAgICAgICAgICAgdW5z aWduZWQgY2hhciBjb25zdCAqbWJwLCBjaGFyIGNvbnN0ICplbmQsIHdpbnRfdCAqd2NwKQogewog ICB3aW50X3Qgd2MgPSBXRU9GOworICBpZiAoYWx3YXlzX2NoYXJhY3Rlcl9ib3VuZGFyeVsqcF0p CisgICAgcmV0dXJuIHA7CisKICAgd2hpbGUgKG1icCA8IHApCiAgICAgbWJwICs9IG1ic190b193 Y2hhciAoJndjLCAoY2hhciBjb25zdCAqKSBtYnAsCiAgICAgICAgICAgICAgICAgICAgICAgICAg ZW5kIC0gKGNoYXIgY29uc3QgKikgbWJwLCBkKTsKQEAgLTM3MTMsNiArMzc0Myw3IEBAIHZvaWQK IGRmYWNvbXAgKGNoYXIgY29uc3QgKnMsIHNpemVfdCBsZW4sIHN0cnVjdCBkZmEgKmQsIGludCBz ZWFyY2hmbGFnKQogewogICBkZmFpbml0IChkKTsKKyAgZGZhYWx3YXlzY2IgKCk7CiAgIGRmYW1i Y2FjaGUgKGQpOwogICBkZmFwYXJzZSAocywgbGVuLCBkKTsKICAgZGZhbXVzdCAoZCk7Ci0tIAoy LjIuMAoK --------_5492F7920000000046FE_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Fri Jan 16 23:27:31 2015 Received: (at 18777) by debbugs.gnu.org; 17 Jan 2015 04:27:31 +0000 Received: from localhost ([127.0.0.1]:59630 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.80) (envelope-from ) id 1YCKyu-0000yL-ND for submit@debbugs.gnu.org; Fri, 16 Jan 2015 23:27:30 -0500 Received: from conuserg004.nifty.com ([202.248.45.245]:16413 helo=conuserg004-v.nifty.com) by debbugs.gnu.org with esmtp (Exim 4.80) 
(envelope-from ) id 1YCIat-00077f-VY for 18777@debbugs.gnu.org; Fri, 16 Jan 2015 20:54:34 -0500 Received: from [10.120.1.76] (i118-21-128-66.s30.a048.ap.plala.or.jp [118.21.128.66]) (authenticated) by conuserg004-v.nifty.com with ESMTP id t0H1sKcn015854; Sat, 17 Jan 2015 10:54:21 +0900 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=nifty.com; s=mar2011msa; t=1421459661; bh=mzW3ytZM5TWn5divYIQ/F+6nyE+8rxC+j5WkxZ91AJM=; h=Date:From:To:Subject:Cc:In-Reply-To:References:Message-Id: MIME-Version:Content-Type:Content-Transfer-Encoding; b=j5U6OC4uBIINWyiNKnCCXQUSnAr6/Fo98SnahjxQ6nb8lMWJPoWseLG77RNKYFg5e JqwocOvkWrJ96k7Sk3NKB4fxyokG3XwovrcfQB+dhtwqWD6qcUpfu4wKQRnllUvmmk yvdxrkPkuie1BfT81ZThh9OLLV41V3BcxlIUJ07w= X-Nifty-SrcIP: [118.21.128.66] Date: Sat, 17 Jan 2015 10:54:21 +0900 From: Norihiro Tanaka To: 18777@debbugs.gnu.org Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary In-Reply-To: <20141219005457.470B.27F6AC2D@kcn.ne.jp> References: <5492A102.9020603@cs.ucla.edu> <20141219005457.470B.27F6AC2D@kcn.ne.jp> Message-Id: <20150117105419.A39F.D1735FA@nifty.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------_54B9BEBA00000000A38E_MULTIPART_MIXED_" Content-Transfer-Encoding: 7bit X-Mailer: Becky! ver. 2.65.07 [ja] X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 18777 X-Mailman-Approved-At: Fri, 16 Jan 2015 23:27:27 -0500 Cc: Paul Eggert X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: 0.0 (/) --------_54B9BEBA00000000A38E_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit On Fri, 19 Dec 2014 00:54:58 +0900 Norihiro Tanaka wrote: > On Thu, 18 Dec 2014 01:40:18 -0800 > Thanks, I understood that you said. You are right. 
I changed the patch > so that always_character_boundary is not pruned even if WCP != NULL, and > fixed the API document. I fixed a mismatch with the comment. It does not changes the behavior. We expect that skip_remains_mb() is inlined, and "*WCP = WC" is merged into "IF (P < MBP)" in caller. -- + exceeds P. If WCP is non-NULL and the result is greater than p, set + *WCP to the final wide character processed, or if no wide character + is processed, set it to WEOF. Both P and MBP must be no larger than + END. ........ - if (wcp != NULL) + if (wcp != NULL && p < mbp) --------_54B9BEBA00000000A38E_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Disposition: attachment; filename="0001-dfa-improvement-for-checking-of-multibyte-character-.patch" Content-Transfer-Encoding: base64 RnJvbSAwYWM4N2VmM2QzNWUxYmQxNGRjMGNmMWQzYzRhNjJlMzM1Yzg4ZjVlIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBNb24sIDE1IERlYyAyMDE0IDIzOjQwOjE3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gZGZh OiBpbXByb3ZlbWVudCBmb3IgY2hlY2tpbmcgb2YgbXVsdGlieXRlIGNoYXJhY3RlciBib3VuZGFy eQoKV2hlbiBmb3VuZCBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5ub3Qgb2NjdXIgaW5zaWRlIGEgbXVs dGlieXRlIGNoYXJhY3Rlcgp3ZSBjYW4gc2tpcCBjaGVjayBmb3IgbXVsdGlieXRlIGNoYXJhY3Rl ciBib3VuZGFyeSBiZWZvcmUgdGhlIGNoYXJhY3Rlci4KClRoZSBpbXByb3ZlbWVudCBzcGVlZHMg dXAgYWJvdXQgNDAlIGZvciBpbnB1dCBzdHJpbmcgd2hpY2ggZG9lc24ndCBtYXRjaApldmVuIHRo ZSBmaXJzdCBwYXJ0IG9mIGEgcGF0dGVybi4KCiogc3JjL2RmYS5jIChhbHdheXNfY2hhcmFjdGVy X2JvdW5kYXJ5KTogQWRkIGEgbmV3IHZhcmlhYmxlLiAgSXQgY2FjaGVzCndoZXRoZXIgZWFjaCBi eXRlIGNhbiBvY2N1ciBpbnNpZGUgYSBtdWx0aWJ5dGUgY2hhcmFjdGVyIG9yIG5vdC4KKGRmYWFs d2F5c2NiKTogQWRkIGEgbmV3IGZ1bmN0aW9uLgooZGZhY29tcCk6IFVzZSBpdC4KKHNraXBfcmVt YWluc19tYik6IElmIGFuIGlucHV0IGNoYXJhY3RlciBpcyBzaW5nbGUgYnl0ZXMgdGhhdCBjYW5u b3QKb2NjdXIgaW5zaWRlIGEgbXVsdGlieXRlIGNoYXJhY3Rlciwgc2tpcCBjaGVjayBmb3IgbXVs 
dGlieXRlIGNoYXJhY3Rlcgpib3VuZGFyeSB1bnRpbCB0aGVyZS4KLS0tCiBzcmMvZGZhLmMgfCAz OCArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrLS0tLQogMSBmaWxlIGNoYW5nZWQs IDM0IGluc2VydGlvbnMoKyksIDQgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvc3JjL2RmYS5j IGIvc3JjL2RmYS5jCmluZGV4IDgwNmNiMDQuLjI0MDYzODcgMTAwNjQ0Ci0tLSBhL3NyYy9kZmEu YworKysgYi9zcmMvZGZhLmMKQEAgLTQ1MSw2ICs0NTEsMjkgQEAgc3RydWN0IGRmYQogc3RhdGlj IHZvaWQgZGZhbXVzdCAoc3RydWN0IGRmYSAqZGZhKTsKIHN0YXRpYyB2b2lkIHJlZ2V4cCAodm9p ZCk7CiAKKy8qIFRydWUgaWYgZWFjaCBieXRlIGNhbiBub3Qgb2NjdXIgaW5zaWRlIGEgbXVsdGli eXRlIGNoYXJhY3RlciAgKi8KK3N0YXRpYyBib29sIGFsd2F5c19jaGFyYWN0ZXJfYm91bmRhcnlb Tk9UQ0hBUl07CisKK3N0YXRpYyB2b2lkCitkZmFhbHdheXNjYiAodm9pZCkKK3sKKyAgaW50IGk7 CisgIGlmICh1c2luZ191dGY4ICgpKQorICAgIHsKKyAgICAgIGZvciAoaSA9IENIQVJfTUlOOyBp IDw9IENIQVJfTUFYOyArK2kpCisgICAgICAgIHsKKyAgICAgICAgICB1bnNpZ25lZCBjaGFyIHVj ID0gaTsKKyAgICAgICAgICBhbHdheXNfY2hhcmFjdGVyX2JvdW5kYXJ5W3VjXSA9ICEoKHVjICYg MHhjMCkgXiAweDgwKTsKKyAgICAgICAgfQorICAgIH0KKyAgZWxzZQorICAgIHsKKyAgICAgIHVu c2lnbmVkIGNoYXIgY29uc3QgdWNzW10gPSB7ICdcMCcsICdcbicsICdccicsICcuJywgJy8nIH07 CisgICAgICBmb3IgKGkgPSAwOyBpIDwgc2l6ZW9mIHVjcyAvIHNpemVvZiB1Y3NbMF07ICsraSkK KyAgICAgICAgYWx3YXlzX2NoYXJhY3Rlcl9ib3VuZGFyeVt1Y3NbaV1dID0gdHJ1ZTsKKyAgICB9 Cit9CisKIHN0YXRpYyB2b2lkCiBkZmFtYmNhY2hlIChzdHJ1Y3QgZGZhICpkKQogewpAQCAtMzI3 MSwxOCArMzI5NCwyNCBAQCB0cmFuc2l0X3N0YXRlIChzdHJ1Y3QgZGZhICpkLCBzdGF0ZV9udW0g cywgdW5zaWduZWQgY2hhciBjb25zdCAqKnBwLAogICAgY2hhcmFjdGVyLgogCiAgICBHaXZlbiBE RkEgc3RhdGUgZCwgdXNlIG1ic190b193Y2hhciB0byBhZHZhbmNlIE1CUCB1bnRpbCBpdCByZWFj aGVzIG9yCi0gICBleGNlZWRzIFAuICBJZiBXQ1AgaXMgbm9uLU5VTEwsIHNldCAqV0NQIHRvIHRo ZSBmaW5hbCB3aWRlIGNoYXJhY3RlcgotICAgcHJvY2Vzc2VkLCBvciBpZiBubyB3aWRlIGNoYXJh Y3RlciBpcyBwcm9jZXNzZWQsIHNldCBpdCB0byBXRU9GLgotICAgQm90aCBQIGFuZCBNQlAgbXVz dCBiZSBubyBsYXJnZXIgdGhhbiBFTkQuICAqLworICAgZXhjZWVkcyBQLiAgSWYgV0NQIGlzIG5v bi1OVUxMIGFuZCB0aGUgcmVzdWx0IGlzIGdyZWF0ZXIgdGhhbiBwLCBzZXQKKyAgICpXQ1AgdG8g 
dGhlIGZpbmFsIHdpZGUgY2hhcmFjdGVyIHByb2Nlc3NlZCwgb3IgaWYgbm8gd2lkZSBjaGFyYWN0 ZXIKKyAgIGlzIHByb2Nlc3NlZCwgc2V0IGl0IHRvIFdFT0YuICBCb3RoIFAgYW5kIE1CUCBtdXN0 IGJlIG5vIGxhcmdlciB0aGFuCisgICBFTkQuCisKKyAgIElmIG5leHQgY2hhcmFjdGVyIGlzIGNo YXJhY3RlciBib3VuZGFyeSwgaXQgYWx3YXlzIHJlYWNoZXMgUC4gIFNvIGlmCisgICBXQ1AgaXMg TlVMTCwgY2FuIG9taXQgY2hlY2sgc3RlcCBieSBzdGVwIGFuZCBpbW1lZGlhdGVseSByZXR1cm4g UC4gICovCiBzdGF0aWMgdW5zaWduZWQgY2hhciBjb25zdCAqCiBza2lwX3JlbWFpbnNfbWIgKHN0 cnVjdCBkZmEgKmQsIHVuc2lnbmVkIGNoYXIgY29uc3QgKnAsCiAgICAgICAgICAgICAgICAgIHVu c2lnbmVkIGNoYXIgY29uc3QgKm1icCwgY2hhciBjb25zdCAqZW5kLCB3aW50X3QgKndjcCkKIHsK ICAgd2ludF90IHdjID0gV0VPRjsKKyAgaWYgKGFsd2F5c19jaGFyYWN0ZXJfYm91bmRhcnlbKnBd KQorICAgIHJldHVybiBwOwogICB3aGlsZSAobWJwIDwgcCkKICAgICBtYnAgKz0gbWJzX3RvX3dj aGFyICgmd2MsIChjaGFyIGNvbnN0ICopIG1icCwKICAgICAgICAgICAgICAgICAgICAgICAgICBl bmQgLSAoY2hhciBjb25zdCAqKSBtYnAsIGQpOwotICBpZiAod2NwICE9IE5VTEwpCisgIGlmICh3 Y3AgIT0gTlVMTCAmJiBwIDwgbWJwKQogICAgICp3Y3AgPSB3YzsKICAgcmV0dXJuIG1icDsKIH0K QEAgLTM3MTMsNiArMzc0Miw3IEBAIHZvaWQKIGRmYWNvbXAgKGNoYXIgY29uc3QgKnMsIHNpemVf dCBsZW4sIHN0cnVjdCBkZmEgKmQsIGludCBzZWFyY2hmbGFnKQogewogICBkZmFpbml0IChkKTsK KyAgZGZhYWx3YXlzY2IgKCk7CiAgIGRmYW1iY2FjaGUgKGQpOwogICBkZmFwYXJzZSAocywgbGVu LCBkKTsKICAgZGZhbXVzdCAoZCk7Ci0tIAoyLjIuMAoK --------_54B9BEBA00000000A38E_MULTIPART_MIXED_-- From debbugs-submit-bounces@debbugs.gnu.org Thu Apr 21 02:21:44 2016 Received: (at 18777) by debbugs.gnu.org; 21 Apr 2016 06:21:44 +0000 Received: from localhost ([127.0.0.1]:41882 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1at7zk-0003xy-6c for submit@debbugs.gnu.org; Thu, 21 Apr 2016 02:21:44 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:48968) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1at7zi-0003xk-Cf for 18777@debbugs.gnu.org; Thu, 21 Apr 2016 02:21:43 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 1FDB61609A8; Wed, 20 Apr 2016 
23:21:36 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id xnAGLzjGIoJc; Wed, 20 Apr 2016 23:21:32 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 0B0A3161254; Wed, 20 Apr 2016 23:21:32 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 8TnF4kWjoyTH; Wed, 20 Apr 2016 23:21:31 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id DDE1A1609A8; Wed, 20 Apr 2016 23:21:31 -0700 (PDT) To: Norihiro Tanaka From: Paul Eggert Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte Organization: UCLA Computer Science Department Message-ID: <57187168.40000@cs.ucla.edu> Date: Wed, 20 Apr 2016 23:21:28 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="------------030701080402070100010301" X-Spam-Score: -1.0 (-) X-Debbugs-Envelope-To: 18777 Cc: 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) This is a multi-part message in MIME format. --------------030701080402070100010301 Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit I'm attaching a revised patch, relative to the latest grep, to implement the idea of the Bug#18777 patch. This revision calls the new array "never_trail" instead of "always_character_boundary" to nail down the concept a bit more precisely. 
It also removes what appears to be an unnecessary p < mbp test, and adjusts to more-recent changes in the code.

I'm not installing this into the master branch on savannah, as we'd like to release a new 'grep' soon and this patch should probably wait until after the release.

--------------030701080402070100010301
Content-Type: text/x-diff; name="0001-dfa-speed-up-checking-for-character-boundary.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-dfa-speed-up-checking-for-character-boundary.patch"

>From 730d7a2138104cf6b692fc1fc41345180e87f117 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Wed, 20 Apr 2016 23:13:16 -0700
Subject: [PATCH] dfa: speed up checking for character boundary

This should help performance with gawk; not so much with grep.
Suggested by Norihiro Tanaka in: http://bugs.gnu.org/18777
* src/dfa.c (never_trail): New static var.
(dfasyntax): Initialize it.
(skip_remains_mb): Use it to speed up a common case in Gawk.
---
 src/dfa.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/dfa.c b/src/dfa.c
index 98ee4ac..e609801 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -651,6 +651,10 @@ static unsigned char eolbyte;
 /* Cache of char-context values. */
 static int sbit[NOTCHAR];
 
+/* If never_trail[B], the byte B cannot be a non-initial byte in a
+   multibyte character. */
+static bool never_trail[NOTCHAR];
+
 /* Set of characters considered letters. */
 static charclass letters;
 
@@ -712,6 +716,11 @@ dfasyntax (reg_syntax_t bits, int fold, unsigned char eol)
           setbit (uc, newline);
           break;
         }
+
+      /* POSIX requires that the five bytes in "\n\r./" (including the
+         terminating NUL) cannot occur inside a multibyte character. */
+      never_trail[uc] = (using_utf8 () ? (uc & 0xc0) != 0x80
+                         : strchr ("\n\r./", uc) != NULL);
     }
 }
 
@@ -3159,15 +3168,20 @@ transit_state (struct dfa *d, state_num s, unsigned char const **pp,
    that are not a single byte character nor the first byte of a multibyte
    character.
 
-   Given DFA state d, use mbs_to_wchar to advance MBP until it reaches or
-   exceeds P. If WCP is non-NULL, set *WCP to the final wide character
-   processed, or if no wide character is processed, set it to WEOF.
+   Given DFA state d, use mbs_to_wchar to advance MBP until it reaches
+   or exceeds P, and return the advanced MBP. If WCP is non-NULL and
+   the result is greater than P, set *WCP to the final wide character
+   processed, or to WEOF if no wide character is processed. Otherwise,
+   if WCP is non-NULL, *WCP may or may not be updated.
+
    Both P and MBP must be no larger than END. */
 static unsigned char const *
 skip_remains_mb (struct dfa *d, unsigned char const *p,
                  unsigned char const *mbp, char const *end, wint_t *wcp)
 {
   wint_t wc = WEOF;
+  if (never_trail[*p])
+    return p;
   while (mbp < p)
     mbp += mbs_to_wchar (&wc, (char const *) mbp,
                          end - (char const *) mbp, d);
-- 
2.5.5

--------------030701080402070100010301--
From debbugs-submit-bounces@debbugs.gnu.org Thu Apr 21 02:48:04 2016 Received: (at 18777) by debbugs.gnu.org; 21 Apr 2016 06:48:04 +0000 Received: from localhost ([127.0.0.1]:41897 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1at8PE-0004aX-0z for submit@debbugs.gnu.org; Thu, 21 Apr 2016 02:48:04 -0400 Received: from mail-oi0-f65.google.com ([209.85.218.65]:32816) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1at8PC-0004a3-Fh for 18777@debbugs.gnu.org; Thu, 21 Apr 2016 02:48:02 -0400 Received: by mail-oi0-f65.google.com with SMTP id f63so8667989oig.0 for <18777@debbugs.gnu.org>; Wed, 20 Apr 2016 23:48:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:from:date:message-id :subject:to:cc; bh=JMlBuw4ikHWEpb/bS+W6lljOb3pLQMQfj5XYKZPMSKQ=; b=SP8gSpDiYwxMsxNEe62WvSou7lZN5BmUu5L84BaRlUJES/tRY3rofmPxja3ygOjkkA 19iM7uB20aQDzm8eZxry/wz+4VpOpWSTEOiJyzpRaSeO3Pj6zOp4PfGpIPjs/XO3GI+J
76FQ/uztIvgmxPhC5Dz6ZNONmrTJcqQMNyDWTyyw+cOvbhdCotGn+8pKWz2aZI1cKGqw 3FBCKuUXux650TgRg7TEuKM+forwQXNKMSmOPWLkzovEGLiy+Kexw4zLpE9Tt/FH+jGU 6c/iOYv7i7TfZdQCpShEcfGOp06C8rSDJrsc53IKQQ75/XMmGrSLvLmYajhWJpUGL69w MwhA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:sender:in-reply-to:references:from :date:message-id:subject:to:cc; bh=JMlBuw4ikHWEpb/bS+W6lljOb3pLQMQfj5XYKZPMSKQ=; b=Yxwk3FMn9ghqkIJnpxcb+16WIO3CgouOd063bfzBPjA5KX6oxrGiuN78Tzz6e5luLQ Irwbbx5B8R7WdiBun6O0N1ZRVUQktUtmbQG/CWElK4IatW/tKzQJ7/WrzC4sM6JLPz3p j995JmXF2loUhfa1YJItyWBRPtIUMP8kyhyvWcZ3TWJNxl7LkvV6NFkZZBp/Edv8HZPz LXmR2vOBeh2qzddJoWU5zbDMtmtZe8mLe+5eKkSlUozUC/7vh4flRFPptYHfuN29XF0r RTBS9lRtpa50ObHreAAYGNUkgrEPtQxyeYpNsie3ysaRPjn8CnGRgUjj9m5LS5W8WTMJ cgPQ== X-Gm-Message-State: AOPr4FW0aBUQ1o/Wx0yWBCCzuHqa5cr04PewqpgoNRtukWj/ETkVHbNoQtdlRWIzowrH/fiIXqKJraUBWFyPnA== X-Received: by 10.157.1.120 with SMTP id 111mr4505304otu.172.1461221276817; Wed, 20 Apr 2016 23:47:56 -0700 (PDT) MIME-Version: 1.0 Received: by 10.202.175.193 with HTTP; Wed, 20 Apr 2016 23:47:37 -0700 (PDT) In-Reply-To: <57187168.40000@cs.ucla.edu> References: <20141021000401.1015.27F6AC2D@kcn.ne.jp> <57187168.40000@cs.ucla.edu> From: Jim Meyering Date: Wed, 20 Apr 2016 23:47:37 -0700 X-Google-Sender-Auth: R9n3RrWc-YVwO3YP6H0Bl2voDvs Message-ID: Subject: Re: bug#18777: [PATCH] dfa: improvement for checking of multibyte To: Paul Eggert Content-Type: text/plain; charset=UTF-8 X-Spam-Score: -0.5 (/) X-Debbugs-Envelope-To: 18777 Cc: Norihiro Tanaka , 18777@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.5 (/) On Wed, Apr 20, 2016 at 11:21 PM, Paul Eggert wrote: > I'm attaching a revised patch, relative to the latest grep, to implement the > idea 
of the Bug#18777 patch. This revision calls the new array "never_trail" > instead of "always_character_boundary" to nail down the concept a bit more > precisely. It also removes what appears to be an unnecessary p < mbp test, > and adjusts to more-recent changes in the code. > > I'm not installing this into the master branch on savannah, as we'd like to > release a new 'grep' soon and this patch should probably wait until after > the release. Thanks for deferring that. I hope to have time to release grep-2.25 tomorrow evening. From debbugs-submit-bounces@debbugs.gnu.org Mon May 02 01:59:21 2016 Received: (at 18777-done) by debbugs.gnu.org; 2 May 2016 05:59:21 +0000 Received: from localhost ([127.0.0.1]:32944 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ax6t7-00011E-HE for submit@debbugs.gnu.org; Mon, 02 May 2016 01:59:21 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:36812) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ax6t6-000113-JA for 18777-done@debbugs.gnu.org; Mon, 02 May 2016 01:59:20 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 31E191601AA for <18777-done@debbugs.gnu.org>; Sun, 1 May 2016 22:59:15 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 36LFboEQym8M for <18777-done@debbugs.gnu.org>; Sun, 1 May 2016 22:59:14 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 9095E161253 for <18777-done@debbugs.gnu.org>; Sun, 1 May 2016 22:59:14 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id onXGR8LgU7fT for <18777-done@debbugs.gnu.org>; Sun, 1 May 2016 22:59:14 -0700 (PDT) Received: from [192.168.1.9] (unknown [100.32.155.148]) by 
zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 778C21601AA for <18777-done@debbugs.gnu.org>; Sun, 1 May 2016 22:59:14 -0700 (PDT) To: 18777-done@debbugs.gnu.org From: Paul Eggert Subject: Re: [PATCH] dfa: improvement for checking of multibyte character boundary Organization: UCLA Computer Science Department Message-ID: <5726ECB2.808@cs.ucla.edu> Date: Sun, 1 May 2016 22:59:14 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.7.2 MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-Spam-Score: -1.0 (-) X-Debbugs-Envelope-To: 18777-done X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) I have installed this and am closing the bug report. From unknown Mon Aug 11 12:54:16 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Mon, 30 May 2016 11:24:05 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator
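
The never_trail idea discussed in this thread can be tried in isolation with a small sketch like the one below. It is not the code from src/dfa.c: it handles only the UTF-8 case, and utf8_char_len, init_never_trail_utf8 and skip_remains_mb_sketch are hypothetical names invented for illustration, with utf8_char_len standing in for grep's mbs_to_wchar. The sketch also assumes that P points at a readable byte strictly before END.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { NOTCHAR = 256 };

/* never_trail[b] is true if byte b cannot be a non-initial byte of a
   UTF-8 multibyte character, i.e. b is not of the form 10xxxxxx.  */
static bool never_trail[NOTCHAR];

static void
init_never_trail_utf8 (void)
{
  for (int b = 0; b < NOTCHAR; b++)
    never_trail[b] = (b & 0xc0) != 0x80;
}

/* Hypothetical stand-in for mbs_to_wchar: return the length in bytes of
   the character starting at S, looking at no more than N bytes.
   Invalid or truncated sequences count as one byte.  */
static size_t
utf8_char_len (unsigned char const *s, size_t n)
{
  size_t len = (*s < 0x80 ? 1
                : (*s & 0xe0) == 0xc0 ? 2
                : (*s & 0xf0) == 0xe0 ? 3
                : (*s & 0xf8) == 0xf0 ? 4
                : 1);
  if (len > n)
    return 1;
  for (size_t i = 1; i < len; i++)
    if ((s[i] & 0xc0) != 0x80)
      return 1;
  return len;
}

/* Advance MBP character by character until it reaches or exceeds P,
   and return it.  If *P can never be a trailing byte, P is already on
   a character boundary, so return P without scanning.  */
static unsigned char const *
skip_remains_mb_sketch (unsigned char const *p, unsigned char const *mbp,
                        unsigned char const *end)
{
  assert (p < end && mbp <= end);
  if (never_trail[*p])
    return p;
  while (mbp < p)
    mbp += utf8_char_len (mbp, (size_t) (end - mbp));
  return mbp;
}
```

The table lookup costs one load per call, which is why the fast path pays off for input that rarely advances past the first state of the pattern: most bytes are ASCII or lead bytes, and for those the character-by-character rescan from MBP is skipped entirely.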