From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 03:35:01 2023 Received: (at submit) by debbugs.gnu.org; 2 May 2023 07:35:01 +0000 Received: from localhost ([127.0.0.1]:41607 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptkXM-0007DN-IU for submit@debbugs.gnu.org; Tue, 02 May 2023 03:35:00 -0400 Received: from lists.gnu.org ([209.51.188.17]:48852) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptkXH-0007D8-H3 for submit@debbugs.gnu.org; Tue, 02 May 2023 03:34:58 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ptkXD-0008LW-Bc for bug-gnu-emacs@gnu.org; Tue, 02 May 2023 03:34:55 -0400 Received: from mout02.posteo.de ([185.67.36.66]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ptkX9-0007LP-So for bug-gnu-emacs@gnu.org; Tue, 02 May 2023 03:34:51 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout02.posteo.de (Postfix) with ESMTPS id 7052924028C for ; Tue, 2 May 2023 09:34:44 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683012884; bh=9hmokUBykZPsYxKdGVwwxSzsjcB+aFFtzdCG3O/htZ0=; h=From:To:Subject:Date:From; b=KWn6QP8rfHmS9VGMM9w41lO7x5fClZj382NEnVwOxZKkpEN3Iz+AD9CmobZHxFngW qLyIOXCFg31scFoMwWtJVRppd5IdZ9CgJHIGfjAepgoZVVSpYR6iUOYWP/3kRSSrLM tl5/SoNFf0k/vjVo7YKZgqVxl8JAGapwaszn5/fgONh17eFapwFGremUPDUvYqeC3Z qJty/T+ofsLH3vh/uAriyO01jj0okHuesYTK5LBrDGRg5QTzUipR49Lz9Nphf8WYuB pZpZKIBRIW+oUhyBMjareJu61V4EnDUAMPf3nfWY52jikFuVFfcR/r9oUr2AeEIpX5 KSFhQ42pqnSLQ== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4Q9X0G4Rv4z6tyM for ; Tue, 2 May 2023 09:34:38 +0200 (CEST) From: Ihor Radchenko To: bug-gnu-emacs@gnu.org Subject: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Date: Tue, 02 May 2023 07:37:39 +0000 Message-ID: 
<87ttwvgp4s.fsf@localhost> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" Received-SPF: pass client-ip=185.67.36.66; envelope-from=yantar92@posteo.net; helo=mout02.posteo.de X-Spam_score_int: -43 X-Spam_score: -4.4 X-Spam_bar: ---- X-Spam_report: (-4.4 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, RCVD_IN_DNSWL_MED=-2.3, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -1.3 (-) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.3 (--)

--=-=-=
Content-Type: text/plain

Tags: patch

Hello,

I am now studying the performance of Org mode parser on huge Org files.

I noticed that `org-element-parse-buffer' spends a significant (~10%) fraction of CPU time simply compiling regexp patterns. This happens because Org parser performs a huge number of repeated regexp searches as it incrementally parses the buffer. The searches happen on a fixed set of regexp patterns (several dozen).

I was able to get rid of the regex compilation-related slowdown simply by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).

Does anyone know if there are potential side effects of this increase if applied across Emacs? Or, alternatively, may Emacs provide an ability to store compiled regexp patterns from Elisp (similar to what `treesit-query-compile' does)?

I suspect that storing pre-compiled patterns may benefit a number of major modes that have to perform complex regexp matching.

Best,
Ihor

In GNU Emacs 30.0.50 (build 4, x86_64-pc-linux-gnu, GTK+ Version 3.24.37, cairo version 1.17.8) of 2023-05-02 built on localhost
Repository revision: a0a71ca12d585bca5173775f08eabae553e15659
Repository branch: master
Windowing system distributor 'The X.Org Foundation', version 11.0.12101008
System Description: Gentoo Linux

Configured using:
 'configure --with-native-compilation'

--=-=-=
Content-Type: text/patch
Content-Disposition: attachment; filename=0001-src-search.c-REGEXP_CACHE_SIZE-Increase-to-200.patch

>From f0d9c814ff0601fa3487ad3da1ae4dcdf511d850 Mon Sep 17 00:00:00 2001
Message-Id:
From: Ihor Radchenko
Date: Tue, 2 May 2023 09:28:21 +0200
Subject: [PATCH] * src/search.c (REGEXP_CACHE_SIZE): Increase to 200

---
 src/search.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/search.c b/src/search.c
index 0bb52c03eef..cfcce3c7293 100644
--- a/src/search.c
+++ b/src/search.c
@@ -34,7 +34,7 @@ Copyright (C) 1985-1987, 1993-1994, 1997-1999, 2001-2023 Free Software
 
 #include "regex-emacs.h"
 
-#define REGEXP_CACHE_SIZE 20
+#define REGEXP_CACHE_SIZE 200
 
 /* If the regexp is non-nil, then the buffer contains the compiled form
    of that regexp, suitable for searching.  */
-- 
2.40.0

--=-=-=
Content-Type: text/plain

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at --=-=-=-- From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 10:34:15 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 14:34:15 +0000 Received: from localhost ([127.0.0.1]:44785 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptr54-00062i-T6 for submit@debbugs.gnu.org; Tue, 02 May 2023 10:34:15 -0400 Received: from mail1474c50.megamailservers.eu ([91.136.14.74]:36640 helo=mail102c50.megamailservers.eu) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptr4y-000624-Ki for 63225@debbugs.gnu.org; Tue, 02 May 2023 10:34:13 -0400 X-Authenticated-User: mattiase@bredband.net DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=megamailservers.eu; s=maildub; t=1683038041; bh=lIcDI6tSrqXq9IyTIq4PYfCwpIWit9bhG9xNPUBWWrQ=; h=From:Subject:Date:Cc:To:From; b=FAStFQE+lXH51c7Hd1oFvF2P0uvUQT4Jyfob1jXCRKs749xQLiXihZTLkFRskbyq3 PfDPG3LvlhNzeLXJZ6BHSw0UVZwbb3HDiodWq2IaFOtUCU+km7s5BbYFhWpq7CrFFK dcMDzg3ANEHQaSPiKchudpx9EvqEtZ4KHuVkIqh4= Feedback-ID: mattiase@acm.or Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se [188.150.165.235]) (authenticated bits=0) by mail102c50.megamailservers.eu (8.14.9/8.13.1) with ESMTP id 342EXwXY006204; Tue, 2 May 2023 14:34:00 +0000 From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= Content-Type: multipart/mixed; boundary="Apple-Mail=_ECFA48ED-6129-4805-92CD-2E035048FD18" Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Subject: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) Message-Id: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> Date: Tue, 2 May 2023 16:33:58 +0200 To: Ihor Radchenko X-Mailer: Apple Mail (2.3654.120.0.1.15) X-VADE-SPAMSTATE: clean X-VADE-SPAMSCORE: 0 X-VADE-SPAMCAUSE: 
gggruggvucftvghtrhhoucdtuddrgedvhedrfedviedgjeekucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecujffquffvqffrkfetpdfqfgfvpdfgpfggqdevhedtnecuuegrihhlohhuthemuceftddunecunecujfgurhephfgtggfukfffvefvofesmhdtmherhhdtvdenucfhrhhomhepofgrthhtihgrshcugfhnghguvghgnohrugcuoehmrghtthhirghsvgesrggtmhdrohhrgheqnecuggftrfgrthhtvghrnhepgefftedvuddvfeffiefggfeglefhheeiudffueegieetkefgueeiuedvkeeitddunecukfhppedukeekrdduhedtrdduieehrddvfeehnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepudekkedrudehtddrudeihedrvdefhedphhgvlhhopehsmhhtphgtlhhivghnthdrrghpphhlvgdpmhgrihhlfhhrohhmpehmrghtthhirghsvgesrggtmhdrohhrghdpnhgspghrtghpthhtohepvddprhgtphhtthhopeihrghnthgrrhelvdesphhoshhtvghordhnvghtpdhrtghpthhtohepieefvddvheesuggvsggsuhhgshdrghhnuhdrohhrgh X-Origin-Country: SE X-Spam-Score: 1.0 (+) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -0.0 (/)

--Apple-Mail=_ECFA48ED-6129-4805-92CD-2E035048FD18
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=us-ascii

> I was able to get rid of the regex compilation-related slowdown simply
> by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).

Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.

That should let you find the working set size for your application, and ideally come up with a way to reduce it. Perhaps you could give us an idea of what these regexps look like and how they are used?

> Does anyone know if there are potential side effects of this increase if
> applied across Emacs? Or, alternatively, may Emacs provide an ability to
> store compiled regexp patterns from Elisp (similar to what
> `treesit-query-compile' does)?

I don't think it's necessarily a good idea to increase the size to 200 right away because of the linear cache lookup mechanism. Allowing the size to be changed at run time is probably less controversial (but arguably just as much of a crutch).

Introducing regexp objects that could store compiled regexps and be used instead of strings would be quite some work but probably worthwhile.

--Apple-Mail=_ECFA48ED-6129-4805-92CD-2E035048FD18
Content-Disposition: attachment; filename=0001-Add-regexp-cache-hit-miss-counters.patch
Content-Type: application/octet-stream; x-unix-mode=0644; name="0001-Add-regexp-cache-hit-miss-counters.patch"
Content-Transfer-Encoding: quoted-printable

From f1246af3cc558bd38527f320964bb0e0a1e74de0 Mon Sep 17 00:00:00 2001
From: Mattias Engdegård
Date: Sat, 7 Nov 2020 17:00:53 +0100
Subject: [PATCH 1/2] Add regexp cache hit/miss counters

---
 src/search.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/src/search.c b/src/search.c
index 0bb52c03eef..6f71f3d16c1 100644
--- a/src/search.c
+++ b/src/search.c
@@ -220,7 +220,10 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
               || EQ (cp->syntax_table, BVAR (current_buffer, syntax_table)))
           && !NILP (Fequal (cp->f_whitespace_regexp, Vsearch_spaces_regexp))
           && cp->buf.charset_unibyte == charset_unibyte)
-       break;
+        {
+          regexp_cache_hit++;
+          break;
+        }
 
       /* If we're at the end of the cache, compile into the last
          (least recently used) non-busy cell in the cache.  */
@@ -232,6 +235,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
           cp = *cpp;
        compile_it:
           eassert (!cp->busy);
+          regexp_cache_miss++;
          compile_pattern_1 (cp, pattern, translate, posix);
          break;
        }
@@ -3431,6 +3435,13 @@ syms_of_search (void)
 is to bind it with `let' around a small expression.  */);
   Vinhibit_changing_match_data = Qnil;
 
+  DEFVAR_INT("regexp-cache-hit", regexp_cache_hit,
+             doc: /* Regexp cache hit count.  Internal use only. */);
+  regexp_cache_hit = 0;
+  DEFVAR_INT("regexp-cache-miss", regexp_cache_miss,
+             doc: /* Regexp cache miss count.  Internal use only. */);
+  regexp_cache_miss = 0;
+
   defsubr (&Slooking_at);
   defsubr (&Sposix_looking_at);
   defsubr (&Sstring_match);
-- 
2.32.0 (Apple Git-132)

--Apple-Mail=_ECFA48ED-6129-4805-92CD-2E035048FD18
Content-Disposition: attachment; filename=0002-Make-regexp-cache-size-adjustable-at-run-time.patch
Content-Type: application/octet-stream; x-unix-mode=0644; name="0002-Make-regexp-cache-size-adjustable-at-run-time.patch"
Content-Transfer-Encoding: quoted-printable

From 81461c26b7e40e7c27804407571199a5247815d4 Mon Sep 17 00:00:00 2001
From: Mattias Engdegård
Date: Tue, 2 May 2023 16:23:11 +0200
Subject: [PATCH 2/2] Make regexp cache size adjustable at run-time

---
 src/alloc.c  |  2 ++
 src/lisp.h   |  1 +
 src/search.c | 85 +++++++++++++++++++++++++++++++++++++++-------------
 3 files changed, 67 insertions(+), 21 deletions(-)

diff --git a/src/alloc.c b/src/alloc.c
index d09fc41dec6..0b11b5469a9 100644
--- a/src/alloc.c
+++ b/src/alloc.c
@@ -6451,6 +6451,8 @@ garbage_collect (void)
   mark_terminals ();
   mark_kboards ();
   mark_threads ();
+  mark_regexp_cache ();
+
 #ifdef HAVE_PGTK
   mark_pgtkterm ();
 #endif
diff --git a/src/lisp.h b/src/lisp.h
index ab66109d5df..8b2beadefd6 100644
--- a/src/lisp.h
+++ b/src/lisp.h
@@ -4745,6 +4745,7 @@ XMODULE_FUNCTION (Lisp_Object o)
 extern void syms_of_fileio (void);
 
 /* Defined in search.c.  */
+extern void mark_regexp_cache (void);
 extern void shrink_regexp_cache (void);
 extern void restore_search_regs (void);
 extern void update_search_regs (ptrdiff_t oldstart,
diff --git a/src/search.c b/src/search.c
index 6f71f3d16c1..c454d5e1ca9 100644
--- a/src/search.c
+++ b/src/search.c
@@ -34,7 +34,7 @@ Copyright (C) 1985-1987, 1993-1994, 1997-1999, 2001-2023 Free Software
 
 #include "regex-emacs.h"
 
-#define REGEXP_CACHE_SIZE 20
+#define DEFAULT_REGEXP_CACHE_SIZE 20
 
 /* If the regexp is non-nil, then the buffer contains the compiled form
    of that regexp, suitable for searching.  */
@@ -55,7 +55,9 @@ #define REGEXP_CACHE_SIZE 20
 };
 
 /* The instances of that struct.  */
-static struct regexp_cache searchbufs[REGEXP_CACHE_SIZE];
+static struct regexp_cache *searchbufs;
+/* Allocated size of searchbufs array, in elements.  */
+static int regexp_cache_size = 0;
 
 /* The head of the linked list; points to the most recently used buffer.  */
 static struct regexp_cache *searchbuf_head;
@@ -158,7 +160,7 @@ clear_regexp_cache (void)
 {
   int i;
 
-  for (i = 0; i < REGEXP_CACHE_SIZE; ++i)
+  for (i = 0; i < regexp_cache_size; ++i)
     /* It's tempting to compare with the syntax-table we've actually changed,
        but it's not sufficient because char-table inheritance means that
        modifying one syntax-table can change others at the same time.  */
@@ -3376,18 +3378,68 @@ DEFUN ("newline-cache-check", Fnewline_cache_check, Snewline_cache_check,
 }
 ^L
 
-static void syms_of_search_for_pdumper (void);
+static void
+set_regexp_cache_size (int n)
+{
+  for (int i = 0; i < regexp_cache_size; i++)
+    xfree (searchbufs[i].buf.buffer);
+
+  size_t bytes = n * sizeof *searchbufs;
+  searchbufs = xrealloc (searchbufs, bytes);
+  memset (searchbufs, 0, bytes);
+  regexp_cache_size = n;
+
+  for (int i = 0; i < n; i++)
+    {
+      searchbufs[i].buf.allocated = 100;
+      searchbufs[i].buf.buffer = xmalloc (100);
+      searchbufs[i].buf.fastmap = searchbufs[i].fastmap;
+      searchbufs[i].regexp = Qnil;
+      searchbufs[i].f_whitespace_regexp = Qnil;
+      searchbufs[i].busy = false;
+      searchbufs[i].syntax_table = Qnil;
+      searchbufs[i].next = (i == regexp_cache_size-1 ? 0 : &searchbufs[i+1]);
+    }
+  searchbuf_head = &searchbufs[0];
+}
+
+DEFUN ("regexp-cache-size", Fregexp_cache_size, Sregexp_cache_size,
+       0, 0, 0,
+       doc: /* Current regexp cache size.  Internal use only.  */)
+  (void)
+{
+  return make_int (regexp_cache_size);
+}
+
+DEFUN ("set-regexp-cache-size", Fset_regexp_cache_size, Sset_regexp_cache_size,
+       1, 1, 0,
+       doc: /* Set the regexp cache size to N elements.  Internal use only.  */)
+  (Lisp_Object n)
+{
+  CHECK_FIXNUM (n);
+  EMACS_INT nelems = XFIXNUM (n);
+  if (nelems <= 0)
+    error ("regexp cache size must be positive");
+  set_regexp_cache_size (nelems);
+  return Qnil;
+}
 
 void
-syms_of_search (void)
+mark_regexp_cache (void)
 {
-  for (int i = 0; i < REGEXP_CACHE_SIZE; ++i)
+  for (int i = 0; i < regexp_cache_size; ++i)
     {
-      staticpro (&searchbufs[i].regexp);
-      staticpro (&searchbufs[i].f_whitespace_regexp);
-      staticpro (&searchbufs[i].syntax_table);
+      mark_object (searchbufs[i].regexp);
+      mark_object (searchbufs[i].f_whitespace_regexp);
+      mark_object (searchbufs[i].syntax_table);
     }
+}
 
+static void syms_of_search_for_pdumper (void);
+
+void
+syms_of_search (void)
+{
   /* Error condition used for failing searches.  */
   DEFSYM (Qsearch_failed, "search-failed");
 
@@ -3460,6 +3512,8 @@ syms_of_search (void)
   defsubr (&Smatch_data__translate);
   defsubr (&Sregexp_quote);
   defsubr (&Snewline_cache_check);
+  defsubr (&Sregexp_cache_size);
+  defsubr (&Sset_regexp_cache_size);
 
   pdumper_do_now_and_after_load (syms_of_search_for_pdumper);
 }
@@ -3467,16 +3521,5 @@ syms_of_search (void)
 static void
 syms_of_search_for_pdumper (void)
 {
-  for (int i = 0; i < REGEXP_CACHE_SIZE; ++i)
-    {
-      searchbufs[i].buf.allocated = 100;
-      searchbufs[i].buf.buffer = xmalloc (100);
-      searchbufs[i].buf.fastmap = searchbufs[i].fastmap;
-      searchbufs[i].regexp = Qnil;
-      searchbufs[i].f_whitespace_regexp = Qnil;
-      searchbufs[i].busy = false;
-      searchbufs[i].syntax_table = Qnil;
-      searchbufs[i].next = (i == REGEXP_CACHE_SIZE-1 ? 0 : &searchbufs[i+1]);
-    }
-  searchbuf_head = &searchbufs[0];
+  set_regexp_cache_size (DEFAULT_REGEXP_CACHE_SIZE);
 }
-- 
2.32.0 (Apple Git-132)

--Apple-Mail=_ECFA48ED-6129-4805-92CD-2E035048FD18--

From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 11:24:47 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 15:24:47 +0000 Received: from localhost ([127.0.0.1]:44842 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptrrz-0007gO-2J for submit@debbugs.gnu.org; Tue, 02 May 2023 11:24:47 -0400 Received: from eggs.gnu.org ([209.51.188.92]:52754) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptrru-0007g4-SI for
63225@debbugs.gnu.org; Tue, 02 May 2023 11:24:45 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ptrrp-0007GV-5l; Tue, 02 May 2023 11:24:37 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-version:References:Subject:In-Reply-To:To:From: Date; bh=GzyTW1zd79cWFyUI7R8nYeQZMFKbuDeJuEDHzW9n2tU=; b=QhZGUotufJ2DMmhAfmTd GhEw3ltd5hJ45jRHmiwA2B/XlhBh/7CFIo0UWUr2AfRiEvRC2+GiIy0hghqtbKrTP5PE1F1nwz/1e eioOI0S5rAofx2yMh88U1HLK9UrjF8GRynGu1D3p69lGwObLcosc/Pdh6R9vVuSaU3ZK0xF7PX1Hz ML3/HEXMXehuYHaVLOnnuuEH6uHw5R2j7InVqnbVyTKSl+kft58AsoN/2Mc9Ne+YX/qMZJW1FZh5N Qf2np4c0RPgBtG2hhQtogoesbJCcSmgciiuPDhSGtzF4bgx4qJ4cz+QTZTUTLKLLESM24LQLwtqwu 2uPuTVIUdH6UqA==; Received: from [87.69.77.57] (helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1ptrrn-0002Dv-GT; Tue, 02 May 2023 11:24:35 -0400 Date: Tue, 02 May 2023 18:25:18 +0300 Message-Id: <83sfcen4bl.fsf@gnu.org> From: Eli Zaretskii To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= In-Reply-To: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> (message from Mattias =?utf-8?Q?Engdeg=C3=A5rd?= on Tue, 2 May 2023 16:33:58 +0200) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) References: <87ttwvgp4s.fsf@localhost> <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org, yantar92@posteo.net X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) > Cc: 63225@debbugs.gnu.org > From: Mattias EngdegÄrd 
> Date: Tue, 2 May 2023 16:33:58 +0200
>
> Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.

Thanks, but the new primitives need to be documented (why do the doc strings say "internal use only"?), and also we should include the cache size in the output of memory-report.

> +static void
> +set_regexp_cache_size (int n)
> +{
> +  for (int i = 0; i < regexp_cache_size; i++)
> +    xfree (searchbufs[i].buf.buffer);
> +
> +  size_t bytes = n * sizeof *searchbufs;
> +  searchbufs = xrealloc (searchbufs, bytes);
> +  memset (searchbufs, 0, bytes);
> +  regexp_cache_size = n;
> +

Should this first check that the new size is identical to the old one, and if so, do nothing? Come to think of it, do we really need to realloc if the new size is smaller? And why zero out the cache when changing the size?

> +DEFUN ("regexp-cache-size", Fregexp_cache_size, Sregexp_cache_size,
> +       0, 0, 0,
> +       doc: /* Current regexp cache size.  Internal use only.  */)
> +  (void)
> +{
> +  return make_int (regexp_cache_size);
> +}

Since the size of the cache is a fixnum, why not use make_fixnum here? It's a tad faster.
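
The three review questions above (skip the work when the size is unchanged, avoid reallocating when shrinking, and keep surviving entries instead of zeroing everything) can be illustrated with a minimal standalone sketch. This is not Emacs code: `toy_entry`, `toy_cache`, `toy_cache_set` and `toy_cache_resize` are hypothetical names, the entries hold only a copy of the pattern text, and the sketch ignores the linked-list ordering, the `busy` flag and garbage-collector marking that the real cache has to deal with.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a regexp cache entry: it keeps
   only a copy of the pattern text, not a compiled pattern buffer.  */
struct toy_entry
{
  char *pattern;                /* NULL means the slot is empty.  */
};

struct toy_cache
{
  struct toy_entry *slots;
  int size;                     /* Number of usable slots.  */
  int capacity;                 /* Number of allocated slots, >= size.  */
};

/* Store a copy of PATTERN in slot I.  Return 0 on success, -1 if the
   copy could not be allocated.  */
int
toy_cache_set (struct toy_cache *cache, int i, const char *pattern)
{
  assert (i >= 0 && i < cache->size);
  char *copy = malloc (strlen (pattern) + 1);
  if (!copy)
    return -1;
  strcpy (copy, pattern);
  free (cache->slots[i].pattern);
  cache->slots[i].pattern = copy;
  return 0;
}

/* Resize CACHE to N usable slots, keeping every entry that still fits.
   Return 0 if the size was already N, 1 if the cache changed, and -1
   if growing failed (the cache is then left as it was).  */
int
toy_cache_resize (struct toy_cache *cache, int n)
{
  assert (n > 0);
  if (n == cache->size)
    return 0;                   /* Same size: nothing to do.  */

  /* Shrinking: release only the entries that fall off the end.  */
  for (int i = n; i < cache->size; i++)
    {
      free (cache->slots[i].pattern);
      cache->slots[i].pattern = NULL;
    }

  /* Growing past the allocation: reallocate, then initialize only the
     newly added slots so that existing entries survive.  */
  if (n > cache->capacity)
    {
      struct toy_entry *p = realloc (cache->slots, (size_t) n * sizeof *p);
      if (!p)
        return -1;
      for (int i = cache->capacity; i < n; i++)
        p[i].pattern = NULL;
      cache->slots = p;
      cache->capacity = n;
    }

  cache->size = n;
  return 1;
}
```

Starting from a zero-initialized `struct toy_cache`, a call such as `toy_cache_resize (&c, 4)` allocates four empty slots, a second identical call returns 0 without touching memory, and shrinking to 2 keeps the allocation and the first two entries. Whether the same trade-offs are right for the real cache depends on details this sketch leaves out.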
From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 11:28:44 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 15:28:44 +0000 Received: from localhost ([127.0.0.1]:44847 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptrvn-0007nQ-PY for submit@debbugs.gnu.org; Tue, 02 May 2023 11:28:43 -0400 Received: from mail-lf1-f42.google.com ([209.85.167.42]:47470) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptrvl-0007n7-Bh for 63225@debbugs.gnu.org; Tue, 02 May 2023 11:28:42 -0400 Received: by mail-lf1-f42.google.com with SMTP id 2adb3069b0e04-4edb26f762dso4755686e87.3 for <63225@debbugs.gnu.org>; Tue, 02 May 2023 08:28:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683041313; x=1685633313; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=p2e5irSVU3h72g8ebU1GAzExQaXh07PL2Wv67WPt86s=; b=THxy/VHXM3LukovZbwS3wROkH8iSUfw5fig5fk3Q2eSnRSWDbftlJRYDcrMsY7UfyI 5E33kZT3sVNdUGfijNUf32G986VLYO6Pul8W9sq93SXCbTS/K6ORQO9BkwqoKsUTR/AD 1fZBUQw8E7jkiY0d/gjPzI5zmQIviemT+qOkRvfB8ZBhVxfXjvDPF9ecfxVsgJDja5fx 9vZdxyPK9gEDoWfSTtGMQIthRgECIqUPMouuyfkRJ+oIjgO9q2/EQzxLJ3iYWwegCL64 t7Z0fMFgAiH12OGUTBDv0Wb9lFoU7JH1OI7++u2emKF8ih7q2Dc4wnbcZNhuDxism9mp 7z3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683041313; x=1685633313; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=p2e5irSVU3h72g8ebU1GAzExQaXh07PL2Wv67WPt86s=; b=TwMO1MC6/C0VuC7cjK4IV4H5YVkSJY6b5yBAAKHJxnHoAiqwRjp8fyI8kLzEudZFkT URBtYCyMOLTZum1mT1uooB1Ai9uuluDRVISMWAfgM+JynLBg192kKlloc79REiG0zQDq RC/YhfCSJXApRlILTuQW4Azxz/FMoAlvLvjY6a6Y9PpEN6xRGkgS9qWIoodRNNBfOv6b BuWu29KZSlJWxq3mWtIpcq9lha0CucUXcKaml8lu746dahQh0unQcpnfoI5IBGwXMoXH 
IRs5YyIt0sW9/JZmeWHbNXsdwaVqfwdbQIIhppRH0udyHhqPzPyzzAOvKh6a5KvNIZB0 RFaA== X-Gm-Message-State: AC+VfDyGRhDXQXycCxjneed8yunMfIJiqG2+joB/Y8gLSpOgzezaOGDo pHW4NTJryXy5L9q2uhAuvdw= X-Google-Smtp-Source: ACHHUZ7QVub0nN0g1HZ7+GKkmv048UUuv9z4c6MsgtUzaqwDMndRD2LDQXbOGADKrRXtlb25ZUTvHA== X-Received: by 2002:a19:7601:0:b0:4eb:4335:e104 with SMTP id c1-20020a197601000000b004eb4335e104mr107805lff.47.1683041313010; Tue, 02 May 2023 08:28:33 -0700 (PDT) Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se. [188.150.165.235]) by smtp.gmail.com with ESMTPSA id v1-20020ac25921000000b004d57fc74f2csm5379820lfi.266.2023.05.02.08.28.32 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 02 May 2023 08:28:32 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <83sfcen4bl.fsf@gnu.org> Date: Tue, 2 May 2023 17:28:31 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <2CE6FE12-70E4-49D2-8C06-1F2ADF2A6E39@gmail.com> References: <87ttwvgp4s.fsf@localhost> <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <83sfcen4bl.fsf@gnu.org> To: Eli Zaretskii X-Mailer: Apple Mail (2.3654.120.0.1.15) X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org, yantar92@posteo.net X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

On 2 May 2023, at 17:25, Eli Zaretskii wrote:

> Thanks, but the new primitives need to be documented

These patches were not proposed for inclusion in Emacs but to help Ihor solve his problems in other ways. Sorry about not making it clear.
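
The regexp cache thrashing diagnosed earlier in this thread can be reproduced with a small toy model. The sketch below assumes only what the patches themselves state: lookup is linear, the head of the list is the most recently used entry, and a miss compiles into the least recently used cell. Everything else is hypothetical: the `toy_lru_*` names are invented for illustration, patterns are represented by plain integer ids, and "compiling" is just counting a miss.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_SLOTS 64

/* Hypothetical toy model of a most-recently-used-first cache with
   linear lookup.  Patterns are integer ids >= 1; 0 marks an empty slot.  */
struct toy_lru
{
  int ids[TOY_MAX_SLOTS];       /* ids[0] is the most recently used.  */
  int size;                     /* Number of slots, <= TOY_MAX_SLOTS.  */
  long hits;
  long misses;
};

void
toy_lru_init (struct toy_lru *c, int size)
{
  assert (size > 0 && size <= TOY_MAX_SLOTS);
  memset (c, 0, sizeof *c);
  c->size = size;
}

/* Look up ID with a linear scan, count a hit or a miss, and move the
   entry to the front.  On a miss the last (least recently used) slot
   is overwritten, which stands in for recompiling the pattern.  */
void
toy_lru_lookup (struct toy_lru *c, int id)
{
  assert (id >= 1);
  int i;
  for (i = 0; i < c->size; i++)
    if (c->ids[i] == id)
      break;

  if (i < c->size)
    c->hits++;
  else
    {
      c->misses++;
      i = c->size - 1;
    }

  /* Shift the entries in front of slot I back by one, then put ID first.  */
  memmove (&c->ids[1], &c->ids[0], (size_t) i * sizeof c->ids[0]);
  c->ids[0] = id;
}

/* Cycle ROUNDS times over pattern ids 1..NPATTERNS and return the
   percentage of lookups that missed.  */
double
toy_lru_cyclic_miss_percent (int cache_size, int npatterns, int rounds)
{
  assert (npatterns > 0 && rounds > 0);
  struct toy_lru c;
  toy_lru_init (&c, cache_size);
  for (int r = 0; r < rounds; r++)
    for (int id = 1; id <= npatterns; id++)
      toy_lru_lookup (&c, id);
  return 100.0 * c.misses / (c.hits + c.misses);
}
```

In this model, cycling over 21 patterns with 20 slots misses on every lookup, because each pattern is evicted just before it is needed again, while 21 slots leave only the initial cold misses. A real workload does not cycle this regularly, so the figures only show the mechanism, not what Emacs would measure.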
From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 12:11:56 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 16:11:56 +0000 Received: from localhost ([127.0.0.1]:44886 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptsbb-0000mO-Fk for submit@debbugs.gnu.org; Tue, 02 May 2023 12:11:55 -0400 Received: from mout01.posteo.de ([185.67.36.65]:38751) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptsbW-0000m6-3H for 63225@debbugs.gnu.org; Tue, 02 May 2023 12:11:53 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 51F862402ED for <63225@debbugs.gnu.org>; Tue, 2 May 2023 18:11:43 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683043903; bh=dE01Hu82ZWOzemRW381LBYLMMAQcI9s11VQoQTf6Ng0=; h=From:To:Cc:Subject:Date:From; b=Kz407mE2QwNKTBwX34w/QXT5ZUupcgovybHlMNQAnkxVKua0e7SU9Vxp4moMaZhWn la5EIOibmAO0XyudhwGXbfVprGuZt+dQc/WsaQv5YEiMw3wg1NqFrhbx+G1WcvREOh 1ZPGl48jBOi1lekS7kqWR3q8zdG+ZaHJ0MSQaVBVLGzeyvVO6RIjwWWjnbqSK2Umhb mvnFWXHJ/i0cPW9tSG1+fpLOlzv47OWMPdwmqvXw7yElLe6zN+69DcPr5TLMinrTX8 xZJLZ9ZnUzlYU5Kt4ya7PUsxJ2LPiE8NGHmTVVSZpZxiapLoemycT1yAf+eH3JVNUt hweT8bJ5q4QNQ== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4Q9lSp2wVPz9rxb; Tue, 2 May 2023 18:11:38 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> Date: Tue, 02 May 2023 16:14:40 +0000 Message-ID: <874jou7lsf.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list 
List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

>> I was able to get rid of the regex compilation-related slowdown simply
>> by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).
>
> Indeed it sounds like you are suffering from regexp cache thrashing. I'm attaching two patches: one to measure the cache miss rate, and one that allows the regexp cache size to be changed at run time.

Here are the results:

Command:
(benchmark-progn
  (setq regexp-cache-hit 0
        regexp-cache-miss 0)
  (set-regexp-cache-size 42)
  (org-element-parse-buffer)
  nil)

Buffer size: 22Mb

| Cache size   | Hit     | Miss    | % miss from total | ~org-element-parse-buffer~ time |
|--------------+---------+---------+-------------------+---------------------------------|
| 20 (default) | 3219470 | 1491165 |             31.66 | 21.035765s (1.091127s in 2 GCs) |
| 40           | 4418377 |  293805 |              6.24 | 18.294018s (1.123854s in 2 GCs) |
| 42           | 4550483 |  161820 |              3.43 | 17.946184s (1.073528s in 2 GCs) |
| 45           | 4636222 |   76582 |              1.62 | 18.410150s (1.078844s in 2 GCs) |
| 50           | 4693497 |   44174 |              0.93 | 17.896177s (1.082944s in 2 GCs) |
| 60           | 4734712 |   10807 |              0.23 | 18.011224s (1.097961s in 2 GCs) |
| 80           | 4710155 |    1386 |              0.03 | 18.047544s (1.103518s in 2 GCs) |
| 100          | 4711821 |     399 |              0.01 | 17.880491s (1.102658s in 2 GCs) |
| 160          | 4711895 |     160 |              0.00 | 17.950772s (1.068975s in 2 GCs) |
| 320          | 4737968 |     393 |              0.01 | 17.773617s (1.089100s in 2 GCs) |
| 640          | 4737388 |     320 |              0.01 | 18.225701s (1.097688s in 2 GCs) |
| 1280         | 4711353 |     160 |              0.00 | 17.847522s (1.099575s in 2 GCs) |
| 2560         | 4711898 |     160 |              0.00 | 18.168488s (1.082394s in 2 GCs) |
| 5120         | 4711835 |     160 |              0.00 | 17.797036s (1.097445s in 2 GCs) |
#+TBLFM: $4=100*$3/($3+$2);%.2f

> That should let you find the working set size for your application,
> and ideally come up with a way to reduce it. Perhaps you could give us
> an idea of what these regexps look like and how they are used?

The Org parser is basically a giant `cond' of a number of regexp matches. See `org-element--current-element'. It is called repeatedly on every syntax element in the Org buffer (like heading, table, paragraph, etc).

Each clause in the `cond' additionally calls for a more complex series of regexps to look into smaller components of the parsed syntax elements. For example, see `org-element-keyword-parser'.

So, we are cycling across several dozen (more than regexp cache size) regexps repeatedly.

>> Does anyone know if there are potential side effects of this increase if
>> applied across Emacs? Or, alternatively, may Emacs provide an ability to
>> store compiled regexp patterns from Elisp (similar to what
>> `treesit-query-compile' does)?
>
> I don't think it's necessarily a good idea to increase the size to 200
> right away because of the linear cache lookup mechanism. Allowing the
> size to be changed at run time is probably less controversial (but
> arguably just as much of a crutch).

Fair point. Although overshooting within a single command does not appear to do much as long as we really re-use these regexps - everything gets cached.

> Introducing regexp objects that could store compiled regexps and be
> used instead of strings would be quite some work but probably
> worthwhile.

What exactly needs to be done? Assuming that regexp objects are not going to be readable, for simplicity.

> +DEFUN ("set-regexp-cache-size", Fset_regexp_cache_size, Sset_regexp_cache_size,
> +       1, 1, 0,
> +       doc: /* Set the regexp cache size to N elements.  Internal use only.  */)

If this is something to be used in practice, it will be more convenient to provide a macro like (with-regexp-cache-size N ).

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 13:30:15 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 17:30:15 +0000 Received: from localhost ([127.0.0.1]:44942 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pttpO-00032g-W5 for submit@debbugs.gnu.org; Tue, 02 May 2023 13:30:15 -0400 Received: from eggs.gnu.org ([209.51.188.92]:43822) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pttpL-00031T-Oz for 63225@debbugs.gnu.org; Tue, 02 May 2023 13:30:13 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pttpE-0007l1-SS; Tue, 02 May 2023 13:30:06 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-version:References:Subject:In-Reply-To:To:From: Date; bh=FpbeIcE1l/MvaWycYiRpM0AoiYDxRSt01LNv829X150=; b=TBlrL8TxGS+FZnyohfke MYuqavumumtRnzkdn8FhLDpu3ryGXrE0zRIx1IhAhfx7QM5jpuDAetrWiBXUpMkCS+CZooe0ZN5ZB nHTkcNiKqQUHZ73Vn0+ZFEE1YFCmECyH3mnEEAfKGirF6T55W9wmB2VVM3FmR6/gCPYArA8ByLDvG qJZlJuPnGhCmwce4OeKNLJp7Arhx28DW+Q2w88VJRtRHjU6FOTva2yFPh1vFFYQwh8tBFVXM/kpCr 4LQs8BO/Kd2yTV7O/DZsDJ+58tpC1S9P63ihgPbk1MCoxifgoGYGzSG7oZnuc7KIjtX5i36dW1M7/ G7XVpGp4i3Dg8A==; Received: from [87.69.77.57] (helo=home-c4e4a596f7) by fencepost.gnu.org with esmtpsa (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1pttpE-0003bK-CR; Tue, 02 May 2023 13:30:04 -0400 Date: Tue, 02 May 2023 20:30:50 +0300 Message-Id: <83r0rymyid.fsf@gnu.org> From: Eli Zaretskii To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= In-Reply-To: <2CE6FE12-70E4-49D2-8C06-1F2ADF2A6E39@gmail.com> (message from Mattias =?utf-8?Q?Engdeg=C3=A5rd?= on Tue, 2 May 2023 17:28:31 +0200) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) References: <87ttwvgp4s.fsf@localhost> 
<63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <83sfcen4bl.fsf@gnu.org> <2CE6FE12-70E4-49D2-8C06-1F2ADF2A6E39@gmail.com> MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org, yantar92@posteo.net X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

> From: Mattias Engdegård
> Date: Tue, 2 May 2023 17:28:31 +0200
> Cc: yantar92@posteo.net,
>  63225@debbugs.gnu.org
>
> On 2 May 2023 at 17:25, Eli Zaretskii wrote:
>
> > Thanks, but the new primitives need to be documented
>
> These patches were not proposed for inclusion in Emacs but to help
> Ihor solve his problems in other ways. Sorry about not making it clear.

I thought you said that exposing the cache size to Lisp _was_ the way
to better support Lisp programs that need to use huge amounts of
regular expressions, no? If not, how will those patches be able to
help Ihor, given that he wants to solve them in Emacs?
From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 13:55:03 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 17:55:03 +0000 Received: from localhost ([127.0.0.1]:44985 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptuDP-0003to-0E for submit@debbugs.gnu.org; Tue, 02 May 2023 13:55:03 -0400 Received: from mout01.posteo.de ([185.67.36.65]:48209) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptuDM-0003sg-TZ for 63225@debbugs.gnu.org; Tue, 02 May 2023 13:55:02 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 184E424034C for <63225@debbugs.gnu.org>; Tue, 2 May 2023 19:54:54 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683050095; bh=HazWvOwphNtazXRuixJH9DO3NwvcFTJEUMHfayWspJk=; h=From:To:Cc:Subject:Date:From; b=mmbR+12+njmRWZJx7oYOj4uI1mvklCcit3NlDnn7ljucAbtWxkNEdq9ALnb4VkxW9 dVl7icdGQvk2xQlyEmHZHjln4mNSGSCgTKZdEbEk6hH1eewpjZqs2Xr5ngLgYr+quF EeDDjl8Kw85vbeuPa7V8wp13v4j0IVDqxCv2KHf2bB5OvGuzKdoGb33N40fjD64wk6 fekfeOofhV57T6JVS/GFNxFg/zdT444wXY541+9gXPt2IX1NndhPhN5czW/BFw6LrG HF5x4gS8n/AsSFfM6z1mkHtOfPZRnXsymMfhsbhFodLWc4bNFM1y7hiVEM8PCI+rej cBQcUJA/mWiLw== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4Q9nly3Mc5z9rxB; Tue, 2 May 2023 19:54:54 +0200 (CEST) From: Ihor Radchenko To: Eli Zaretskii Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <83r0rymyid.fsf@gnu.org> References: <87ttwvgp4s.fsf@localhost> <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <83sfcen4bl.fsf@gnu.org> <2CE6FE12-70E4-49D2-8C06-1F2ADF2A6E39@gmail.com> <83r0rymyid.fsf@gnu.org> Date: Tue, 02 May 2023 17:58:01 +0000 Message-ID: <87wn1q62fq.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org, Mattias =?utf-8?Q?Engdeg=C3=A5rd?= X-BeenThere: 
debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Eli Zaretskii writes: >> These patches were not proposed for inclusion in Emacs but to help Ihor solve his problems in other ways. Sorry about not making it clear. > > I thought you said that exposing the cache size to Lisp _was_ the way > to better support Lisp programs that need to use huge amounts of > regular expressions, no? If not, how will those patches be able to > help Ihor, given that he wants to solve them in Emacs? Well. My original simplistic proposal was to increase REGEXP_CACHE_SIZE. Exposing this to Elisp certainly gives more flexibility, but also has downsides (due to linear search across the regexp cache). Ultimately, compiled regexp objects should be much more universal and will not require extensive benchmarking to balance between regexp compilation and increasing regexp cache search latency. But adding a new Elisp object type is a big deal. For now, we should probably study what is going on in my use case more generally and maybe figure out if some alternative approach could be better. If a simple solution will do, it may be good enough. -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at . 
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 17:00:39 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 21:00:39 +0000 Received: from localhost ([127.0.0.1]:45141 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptx71-0003Rf-Dm for submit@debbugs.gnu.org; Tue, 02 May 2023 17:00:39 -0400 Received: from mail-lf1-f46.google.com ([209.85.167.46]:48531) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptx6y-0003RP-UW for 63225@debbugs.gnu.org; Tue, 02 May 2023 17:00:38 -0400 Received: by mail-lf1-f46.google.com with SMTP id 2adb3069b0e04-4efe9a98736so4969080e87.1 for <63225@debbugs.gnu.org>; Tue, 02 May 2023 14:00:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683061231; x=1685653231; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=gNqFMrPOOW5jWj7PFAMFZfl7MenLYpZOK/1APyGD8BA=; b=dwzi4EyxUOgBCKeLLJOaFEPfYJF8trFaGW0vZrajPoCYH/6zBQABi/Zn8VGhr6oriz owpwQa0p1mMk5UpNwP4Nb/MUaWYmfIR+PLNtnBLB4ZD/0dQWcMal9XDFX/wdbd4mehI2 YsFQ0C03ltqvXR/+8YlFC8p8KjWqITDTQCzUo7diGxHBLTPaah9tHDLfNaorxn6dqfD3 TDcPMmaSW+C5riTeIBzca2wlUWoBl/YJzse/cNHI3yCTEUSHkbnI0fMaa7LEJ5ksgOcr ktB98TdCheLcCdNeAHqRYxz+beXBbRzfeslq03HQ0Yry/2A33x6ifJXvDrWZrsnpXQ96 NVhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683061231; x=1685653231; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=gNqFMrPOOW5jWj7PFAMFZfl7MenLYpZOK/1APyGD8BA=; b=hhjbxjBLzKAs2TqPsXS+mYv9sMYYmGQocmd5NQ3OFxE42iTZ/1nmZiobXxKAG9NB19 d6xvSsEGzZM6pKBtzYP8nWRgsUDrWtPTmMilzmJyaZsWO5REcXJ5a8uZDoCGu2M4j8tz Ajrm5AOm4/xfKWWRsIMTFFygDHcFBSM+RR5jPlDo/hH96avmHN2AZseyNXvjtmD75Wav 
C6D5itEQ/vM8x3LgNwtS6JUq+r0+UZPwUExFiVBon/OAleYkssVCkdY53rLK4dr1L5jN WfnZpnLxbGZzKDTnF/freIqGHW+XgeEk/OEki2WZ09dCzkyIA+20Wdix4G0tFNBQC5Q2 /nvw== X-Gm-Message-State: AC+VfDyoG2U+xpZ3AicllvUNG9PNSbV9+1evIU4l6KvcuREw3mdUQlSf pn+DRbH9nDpDutHZZDEFVtg= X-Google-Smtp-Source: ACHHUZ6w2HzVwZwYZFHchk54PQbeyMR2lbcuwtlsCa8UY7NQhYfhks/rqFz/K2yhVw2HaQvkwtGX/A== X-Received: by 2002:a05:6512:505:b0:4ef:d4ee:1a6a with SMTP id o5-20020a056512050500b004efd4ee1a6amr282692lfb.44.1683061230622; Tue, 02 May 2023 14:00:30 -0700 (PDT) Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se. [188.150.165.235]) by smtp.gmail.com with ESMTPSA id w3-20020ac254a3000000b004f0c9120a41sm1627557lfk.214.2023.05.02.14.00.29 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 02 May 2023 14:00:30 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <874jou7lsf.fsf@localhost> Date: Tue, 2 May 2023 23:00:29 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> To: Ihor Radchenko X-Mailer: Apple Mail (2.3654.120.0.1.15) X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) 2 maj 2023 kl. 
18.14 skrev Ihor Radchenko :

> | Cache size   |     Hit |    Miss | % miss from total | ~org-element-parse-buffer~ time |
> |--------------+---------+---------+-------------------+---------------------------------|
> | 20 (default) | 3219470 | 1491165 |             31.66 | 21.035765s (1.091127s in 2 GCs) |
> | 40           | 4418377 |  293805 |              6.24 | 18.294018s (1.123854s in 2 GCs) |
> | 42           | 4550483 |  161820 |              3.43 | 17.946184s (1.073528s in 2 GCs) |
> | 45           | 4636222 |   76582 |              1.62 | 18.410150s (1.078844s in 2 GCs) |

Good, this quite solidly puts the working set size at 40-odd elements.

> The Org parser is basically a giant `cond' of a number of regexp
> matches. See `org-element--current-element'.

A common way to handle this is to build a big regexp to match many
cases at the same time, essentially transforming

  (cond ((looking-at RE1) ...)
        ((looking-at RE2) ...)
        ...)

to

  (looking-at (rx (or (group RE1) (group RE2) ...)))
  (cond ((match-beginning 1) ...)
        ((match-beginning 2) ...)
        ...)

This reduces the number of regexps used and is also typically faster.
(Essentially this is what `syntax-propertize-rules` does but in a more
specialised context.)

Using tree-sitter for this could very well be even faster but it's not
guaranteed to be available.

Otherwise it's very much a matter of optimisation of everything,
including regexps. Minimise backtracking.
If you want to match five or more dashes, use "------*" instead of
"-\\{5,\\}". And so on.

It's also obviously a good idea not to generate regexps dynamically
each time if you can help it, and minimise consing in general.

>> Introducing regexp objects that could store compiled regexps and be
>> used instead of strings would be quite some work but probably
>> worthwhile.
>
> What exactly needs to be done? Assuming that regexp objects are not
> going to be readable, for simplicity.

A proper design, for starters. For example, we probably want them to be
usable in customised variables which calls for them to be readable.
> If this is something to be used in practice, it will be more = convenient > to provide a macro like (with-regexp-cache-size N ). Maybe, we'll see if it's something we need to add. From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 17:18:41 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 21:18:41 +0000 Received: from localhost ([127.0.0.1]:45162 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptxOS-0003wk-OD for submit@debbugs.gnu.org; Tue, 02 May 2023 17:18:41 -0400 Received: from mout01.posteo.de ([185.67.36.65]:48647) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptxOO-0003wJ-7x for 63225@debbugs.gnu.org; Tue, 02 May 2023 17:18:39 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 4EAB32403AD for <63225@debbugs.gnu.org>; Tue, 2 May 2023 23:18:30 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683062310; bh=XDlJFdqCLvJ4VzD2S9WZBaLOnQwwzVfVn/xvhrEIwNY=; h=From:To:Cc:Subject:Date:From; b=aNWgBWMLeioftZqn7d3fPpghXRTK7LG1EvbBDSq7AUrWgY/CtT6buC4omuPlXIVvA v2YlDssRmAeHxNB6DVUURO7V/kDQ4fXW6nKPL/bPzor1pcNY1L/UKxtDm/grVZNpl4 AZLgOX5LJGh9+jC54DsJpN83ccDm1wzc6cqcptgco5BqNr0HefVszeTMPWBabgH5vV mfZ+z91FYW6BY/6D1fwAU2CKlWKN1VGZQXaVOTyeI/9ySqZ4ooUc4vQOPkPdxQM7mg 8bLIf36o6VqE//F6brj77JVRq9KlVYeatn4C59uS8LsbHvc7T3aBzC4kvleqBveNw0 R4TgCr4meSqHw== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4Q9tGs3p9gz6tm4; Tue, 2 May 2023 23:18:29 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> Date: Tue, 02 May 2023 21:21:39 +0000 Message-ID: <87wn1qqvj0.fsf@localhost> MIME-Version: 
1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

>> The Org parser is basically a giant `cond' of a number of regexp
>> matches. See `org-element--current-element'.
>
> A common way to handle this is to build a big regexp to match many
> cases at the same time, essentially transforming
>
>   (cond ((looking-at RE1) ...)
>         ((looking-at RE2) ...)
>         ...)
>
> to
>
>   (looking-at (rx (or (group RE1) (group RE2) ...)))
>   (cond ((match-beginning 1) ...)
>         ((match-beginning 2) ...)
>         ...)
>
> This reduces the number of regexps used and is also typically faster.
> (Essentially this is what `syntax-propertize-rules` does but in a more
> specialised context.)

I tried this, and it did not give any noticeable improvement.
Most likely, because the actual `cond' is

(cond ((looking-at "foo") ()) ...)

Actually, I started looking into C-level perf data right after trying
to consolidate the regexps into one giant looking-at form and not
seeing any improvement. At that point, I had already cached most of the
dynamically generated regexps in there and ran out of reasonable ideas.

> Using tree-sitter for this could very well be even faster but it's not
> guaranteed to be available.

The available tree-sitter grammars for Org were about 1.5-2x faster in
my previous tests, but they do less granular parsing of fields, and
they are not complete. Org is not context-free and does not fit well
into GLR. And we are not going to use tree-sitter for development, to
avoid increasing contribution barriers.

> Otherwise it's very much a matter of optimisation of everything,
> including regexps. Minimise backtracking.
> If you want to match five or more dashes, use "------*" instead of
> "-\\{5,\\}". And so on.

This example sounds like something that regexp compilation should be
able to optimize, no? I do not easily see why the latter should cause
more CPU time compared to the former.

> It's also obviously a good idea not to generate regexps dynamically
> each time if you can help it, and minimise consing in general.

Sure. I was able to shave a few seconds off using this idea.

Other than the regexp compilation hotspot, the only remaining option I
see is rewriting the parser non-recursively
(`org-element--parse-elements' and `org-element--parse-objects').

>>> Introducing regexp objects that could store compiled regexps and be
>>> used instead of strings would be quite some work but probably
>>> worthwhile.
>>
>> What exactly needs to be done? Assuming that regexp objects are not
>> going to be readable, for simplicity.
>
> A proper design, for starters. For example, we probably want them to
> be usable in customised variables which calls for them to be readable.

Or, alternatively, the parsed regexps can be attached to string objects
internally. Then, regexp cache lookup will degenerate to looking into a
string object slot.

-- 
Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Tue May 02 19:37:03 2023 Received: (at 63225) by debbugs.gnu.org; 2 May 2023 23:37:03 +0000 Received: from localhost ([127.0.0.1]:45256 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptzYN-0001ou-2A for submit@debbugs.gnu.org; Tue, 02 May 2023 19:37:03 -0400 Received: from sonic304-22.consmr.mail.ne1.yahoo.com ([66.163.191.148]:43508) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1ptzYL-0001oQ-4J for 63225@debbugs.gnu.org; Tue, 02 May 2023 19:37:01 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1683070615; bh=PojXzKecAFAqjlheGPHwRLmbHw6rx3Vu/Lu776LJ/Y0=; h=From:To:Cc:Subject:In-Reply-To:References:Date:From:Subject:Reply-To; b=auh/hL8fZHZmNTh2mvLVYFijyPYGeLQiATBBUf4J7CRlOvnhNTFbon68QP+jcD8H80BOWIk364OETSPvJF9gEwIHfyKuyy0dtBmTad9Au4bHnwNSkLMMOydoGpoYEsu2Ccs895GFSqrUMQsAFVyFtsQsm/kO4vHGksMdBM4x5RqRcEjVKBYOciF7us1u6p5a9TDsUhzCcJNokKbw22nKa8a6z5kI3P//vMQqv3o+3k2gKaf9Wzk2iivu7uQbZVCXA8Q0vWedbZOgLLmUbfqLeO/S1hXdnYrpp6IMY4NSTblmXmg7Tm8R9yd1E3/8AHf5d4dyIgNFx7DX2qq+sblclQ== X-SONIC-DKIM-SIGN: v=1; a=rsa-sha256; c=relaxed/relaxed; d=yahoo.com; s=s2048; t=1683070615; bh=BAoQ0b8A2lXWyKkMQB2bBd611LYyaFVmzEo2I1FdVJm=; h=X-Sonic-MF:From:To:Subject:Date:From:Subject; b=JLaidhhZiKRArjq2wrSuWJ2Y/QN6E5rksdd39az5HzypCdzPHp/vbwjlbUAcT1RbhHalE0sr9+WIUdDVcPMsYjyW3hn6J/hXx8rl/97xe8s9gg7gYPehq8UJksr3+KU79QQ+CxMuXDCnc2LNQ0BStLLXzKXwMhIWSlzSemSRScDNuPHcpmwpjx4FIencDhCWikuA7Uths0Tkcm79w6pBmviiTDNL7+WOs8nyYklBkoKjhFk4vaqKjx28GX3uRh+AsGMOEMrYhkiVkB5cCg1Kczi6712bR5HNtd069XT3XN6ijfMh/TlUZcvt2YIJTbMTLF+HjI9B30fE/2A4g2RLIw== X-YMail-OSG: SWNl9voVM1lEVzo2hOqu5b3Yokn11sWzR0ED..bVYVWVRBm2nS4JDrDoEpVRMlJ 8wgqc9ciMdDSBkeWmVEOiH_ZaocEvv0_SKje_lZM_lXB_HfujRKemZQ5pOs5GqgE.k6byL7m.vgG jNgR_YnEQqzxVXMU0LR4r_SudonVtEd020s9hYusAU7MaAYAWO8MOcGt8QiZA7jUG6hReXU_UyG1 
Ovpe3dFUdOfM0GFVMNv4umUUSrktYe_zWIlil2EF2bPb8mj_axyKqope90WN9f3fJNRjCMjuXlk6 89AoDTuKutAdW3m7Lu3qIPIxZuNxt_NB.6qg0Qm5euEuYGnVV9ekP73dVhSSEvQj9jfQxKDHrZfe LwBVSHSsxaaUfaEygU3jXtQNKj6S53.RAH3gm5A.SXWrkdYNB0xELeLaUXv5zlle4By6MH7wAgqm RAMz9biGA8i3YW1_yazMZ1O4lD2Vzt424theqXXQW_FRPmqkNmEGy4bwbCF_LQDw4ShheiW8Ob0R L7.x9FCYq5vuTPw8y0kGVSwlt83jC9hTPdoBW3qhMaxqF.Q0kcTL0l8idlKLg5zKdZh7YEI20mxX gn5HYB_NLXTOLtSpSU.b.CILUzIkmRPDwrgAoudchpkT90UXMMwr2E0E_4enUpXZweMcCsFOWZF2 zpmiG.s6PjkpAYlXRD2qQ7LegPTJf9lreiXMJ.vlxH51L9WF0K9N6170QX.fr3IolRDK8boArMhx GxoploKk5xK_AlG09Ge_0xtUTwllWyKFbmvE2T06QvPz8kL5PvjVBRqk16pg2hWAtxgqdQwfCXFj SVJrJCBF.5nEGJ_B3fJAEb0uAbDXFGzoybQsv2775Pc3O3urr.gaMhXA6tqURMuQ.OGatclgNvcm rMqR8qui4WvJeR3waYPT1pF.emoydAHQNJd3sr0aVeanTJ4m_rHp2wetOk7U5m7mxZQhewJIVG6G 7clZBfv_PA3gzEqoDs6Okx84rycuv6tNz14rp40mJXS3TscCl48dEMDm35TXeg6qA3K4vKPpPeXj ezrpbKROJS1Eb5z6gmQcxOubFB06KgH5dmbQEigrlkadbMLgvw7TY3FD7zy5wToy2ClKnTcJkiLH Fzs0AO2w7mEmN7nMdEdN7_u99y0zBppP08QLhZ15742BTwfVX50Imoqj.0DHG51B9BOKa6Mw- X-Sonic-MF: X-Sonic-ID: 54181fd4-9539-4ea8-bd28-71a0bc58571f Received: from sonic.gate.mail.ne1.yahoo.com by sonic304.consmr.mail.ne1.yahoo.com with HTTP; Tue, 2 May 2023 23:36:55 +0000 Received: by hermes--production-sg3-6d6fb994f6-pcrg5 (Yahoo Inc. 
Hermes SMTP Server) with ESMTPA ID c7a243b9f9a6f9247a47af73228abde5; Tue, 02 May 2023 23:36:51 +0000 (UTC) From: Po Lu To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> ("Mattias =?utf-8?Q?Engdeg=C3=A5rd=22's?= message of "Tue, 2 May 2023 16:33:58 +0200") References: <87ttwvgp4s.fsf@localhost> <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> Date: Wed, 03 May 2023 07:36:46 +0800 Message-ID: <87ildacnld.fsf@yahoo.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Mailer: WebService/1.1.21417 mail.backend.jedi.jws.acl:role.jedi.acl.token.atz.jws.hermes.yahoo Content-Length: 3106 X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org, Ihor Radchenko X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

Mattias Engdegård writes:

>> I was able to get rid of the regex compilation-related slowdown simply
>> by increasing REGEXP_CACHE_SIZE 10x (see the attached patch).
>
> Indeed it sounds like you are suffering from regexp cache thrashing.
> I'm attaching two patches: one to measure the cache miss rate, and one
> that allows the regexp cache size to be changed at run time.
>
> That should let you find the working set size for your application,
> and ideally come up with a way to reduce it. Perhaps you could give us
> an idea of what these regexps look like and how they are used?
>
>> Does anyone know if there are potential side effects of this increase if
>> applied across Emacs? Or, alternatively, may Emacs provide an ability to
>> store compiled regexp patterns from Elisp (similar to what
>> `treesit-query-compile' does)?
>
> I don't think it's necessarily a good idea to increase the size to 200
> right away because of the linear cache lookup mechanism. Allowing the
> size to be changed at run time is probably less controversial (but
> arguably just as much of a crutch).
>
> Introducing regexp objects that could store compiled regexps and be
> used instead of strings would be quite some work but probably
> worthwhile.

Thanks for curing this instance of C programmer's disease.

> From f1246af3cc558bd38527f320964bb0e0a1e74de0 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?=
> Date: Sat, 7 Nov 2020 17:00:53 +0100
> Subject: [PATCH 1/2] Add regexp cache hit/miss counters
>
> ---
>  src/search.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/src/search.c b/src/search.c
> index 0bb52c03eef..6f71f3d16c1 100644
> --- a/src/search.c
> +++ b/src/search.c
> @@ -220,7 +220,10 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
>                || EQ (cp->syntax_table, BVAR (current_buffer, syntax_table)))
>            && !NILP (Fequal (cp->f_whitespace_regexp, Vsearch_spaces_regexp))
>            && cp->buf.charset_unibyte == charset_unibyte)
> -        break;
> +        {
> +          regexp_cache_hit++;
> +          break;
> +        }
>  
>        /* If we're at the end of the cache, compile into the last
>           (least recently used) non-busy cell in the cache.  */
> @@ -232,6 +235,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
>            cp = *cpp;
>          compile_it:
>            eassert (!cp->busy);
> +          regexp_cache_miss++;
>            compile_pattern_1 (cp, pattern, translate, posix);
>            break;
>          }
> @@ -3431,6 +3435,13 @@ syms_of_search (void)
>  is to bind it with `let' around a small expression.  */);
>    Vinhibit_changing_match_data = Qnil;
>  
> +  DEFVAR_INT("regexp-cache-hit", regexp_cache_hit,
> +              doc: /* Regexp cache hit count.
Internal use only. */); > + regexp_cache_hit =3D 0; > + DEFVAR_INT("regexp-cache-miss", regexp_cache_miss, > + doc: /* Regexp cache miss count. Internal use only. */); > + regexp_cache_miss =3D 0; Please put a space between `DEFVAR_INT' and `('. From debbugs-submit-bounces@debbugs.gnu.org Wed May 03 04:39:24 2023 Received: (at 63225) by debbugs.gnu.org; 3 May 2023 08:39:24 +0000 Received: from localhost ([127.0.0.1]:45526 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pu81E-0000Zz-Bq for submit@debbugs.gnu.org; Wed, 03 May 2023 04:39:24 -0400 Received: from mail-lf1-f49.google.com ([209.85.167.49]:62708) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pu818-0000Zf-O0 for 63225@debbugs.gnu.org; Wed, 03 May 2023 04:39:22 -0400 Received: by mail-lf1-f49.google.com with SMTP id 2adb3069b0e04-4f139de8cefso1410918e87.0 for <63225@debbugs.gnu.org>; Wed, 03 May 2023 01:39:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683103153; x=1685695153; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=lph4oKkInYP6BIr+GJlBGbxdEAw/GyFK4lF1nclYuR0=; b=W0ghVmICIWuy0AnuOUrehodkVvnfKUfx0zKMmZ1gu91tcY7QoMRF8pqfrXN1tA1X7y EAUtcgJp5EFK+XznIvaCN8fJnleeNFjNbUeYalmK0l4XYy3DC2763DGzA0hRJgsoo5l8 ejLTVnJ+xVuB+wsL2iOsDuJkOXtHlmwhuSkUY91cPJrqEFjin1ymc4x4BppN9LboV9tK zGMlUJIPllDUPX6BOUEDy7dSCwphG9uigVDD+SLtGrwF2AaacaIfFTXHUWjnG+nd8Rh8 C/17st1atzZgaJoC65g73RvSHRil9/IfErlbqB0bY8xpm1LmI8HsNnd/l5NOyH/NAQSc MAHg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683103153; x=1685695153; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=lph4oKkInYP6BIr+GJlBGbxdEAw/GyFK4lF1nclYuR0=; 
b=XTGvMj+Di9eWARBeeFyaNMTYfE7ul+4KNMcWCmz915uZ+BaGjWugtVOIgsZmrC3e0v Bc4WHaCy9nvLaiEL+BCC5dwft/Y6Mpq8zFP5qrU/4DMvfvB9lJJaxYIyFULJAdV8NlgC zVxkuWsyiUC8tsTnbP29fRRLfht7k36lYYm0JmkufNnqMoq0rrirr2paZwBiqJsWP/qq UWSiVv0paSUcONcul/xza6nJmouuNULcX+XjUz2C68CMHOVnSIXtyIizFGQg2fDWee1L FVRmKjiA05dt1+PGCvfLsqmHNNKDV7dNTc7k5u5jPVK2KWzPyLQ1TjJxEXbjlJZjqkEu nFkw== X-Gm-Message-State: AC+VfDwXevNAgu1YadJsyUezFp7Qz5/I7QDfeDlA1JYkGw6L675g0ZEf wt208FQv03y+oQ7LFAIZv1w= X-Google-Smtp-Source: ACHHUZ73HPHUCkb/gxtRISpuqT9bCmfKZXGzYelCnXmyX9YCVYxYnE21x/f3U5AjOnpMvi4CEXfOxQ== X-Received: by 2002:a05:6512:224f:b0:4ec:36d6:1517 with SMTP id i15-20020a056512224f00b004ec36d61517mr291714lfu.2.1683103152552; Wed, 03 May 2023 01:39:12 -0700 (PDT) Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se. [188.150.165.235]) by smtp.gmail.com with ESMTPSA id f11-20020a056512092b00b004b4b600c093sm5906020lft.92.2023.05.03.01.39.11 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Wed, 03 May 2023 01:39:11 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <87wn1qqvj0.fsf@localhost> Date: Wed, 3 May 2023 10:39:10 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> To: Ihor Radchenko X-Mailer: Apple Mail (2.3654.120.0.1.15) X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 
(-)

On 2 May 2023 at 23:21, Ihor Radchenko wrote:

> I tried this, and it did not give any noticeable improvement.
> Most likely, because the actual `cond' is
>
> (cond ((looking-at "foo") ()) ...)

I see, so it doesn't run through all top-level cases very often then? I
thought that would be the common path (plain text).

Would consolidating some of the secondary regexps help at all? What are
the most frequent branches in the parser?

Perhaps you just don't see much improvement until the working set of
regexps fits in the cache.

>> Otherwise it's very much a matter of optimisation of everything,
>> including regexps. Minimise backtracking.
>> If you want to match five or more dashes, use "------*" instead of
>> "-\\{5,\\}". And so on.
>
> This example sounds like something that regexp compilation should be
> able to optimize, no? I do not easily see why the latter should cause
> more CPU time compared to the former.

It's a trivial point and definitely not the source of your problems,
sorry! (Counted repetitions are slightly less efficient because they
need to maintain the counter, it's all done in a terrible way.)

The regexp compiler doesn't do much optimisation in order to keep the
translation fast. It doesn't even convert "[a]" to "a".

> Or, alternatively, the parsed regexps can be attached to string objects
> internally. Then, regexp cache lookup will degenerate to looking into a
> string object slot.

That would work too but we really don't want to make our strings any
fancier, they are already much too big and slow.
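
The remark about the working set needing to fit in the cache can be
illustrated with a small stand-alone simulation. This is only a sketch
under stated assumptions, not the code in search.c: it assumes a
move-to-front LRU list with linear lookup (as the thread describes the
cache), models each regexp as an integer id, and assumes the patterns
are always visited in the same cyclic order, which is a simplification
of what a parser really does. The names `simulate_misses' and MAX_CACHE
are made up for this example.

```c
#include <assert.h>
#include <string.h>

#define MAX_CACHE 1024   /* hypothetical upper bound for this sketch */

/* Count cache misses when ROUNDS passes are made over NPATTERNS
   distinct pattern ids (0 .. NPATTERNS-1, always in the same order)
   using a move-to-front LRU cache holding CACHE_SIZE entries.  */
long
simulate_misses (int cache_size, int npatterns, int rounds)
{
  int cache[MAX_CACHE];   /* cache[0] is the most recently used id */
  int used = 0;           /* number of occupied slots */
  long misses = 0;

  assert (cache_size > 0 && cache_size <= MAX_CACHE);

  for (int r = 0; r < rounds; r++)
    for (int id = 0; id < npatterns; id++)
      {
        int pos = 0;

        /* Linear lookup, front (most recent) to back (least recent).  */
        while (pos < used && cache[pos] != id)
          pos++;

        if (pos == used)
          {
            /* Miss: take a free slot if there is one, otherwise
               overwrite the last (least recently used) slot.  */
            misses++;
            if (used < cache_size)
              used++;
            pos = used - 1;
          }

        /* Move to front: shift entries [0, pos) back by one slot.  */
        memmove (&cache[1], &cache[0], (size_t) pos * sizeof cache[0]);
        cache[0] = id;
      }

  return misses;
}
```

With 40 patterns visited cyclically for 100 rounds, a 20-entry cache
and even a 39-entry cache miss on every lookup (4000 misses), whereas a
40-entry cache only takes the 40 initial misses. Real access patterns
are less regular, so the drop in the measured miss rates is softer, but
the shape is similar.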
From debbugs-submit-bounces@debbugs.gnu.org Wed May 03 05:33:01 2023
From: Ihor Radchenko
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com>
Date: Wed, 03 May 2023 09:36:01 +0000
Message-ID: <87wn1psqny.fsf@localhost>

Mattias Engdegård writes:

>> I tried this, and it did not give any noticeable improvement.
>> Most likely, because the actual `cond' is
>>
>> (cond ((looking-at "foo") ()) ...)
>
> I see, so it doesn't run through all top-level cases very often then? I thought that would be the common path (plain text).

You are indeed right. Top-level cases are run very often. So, what I
said does not make much sense.

Yet, in my tests, I am unable to see any improvement when I consolidate
the regexps.

If I do

(progn
  (set-regexp-cache-size 50)
  (org-element-parse-buffer)
  nil)

Without consolidation, but using `looking-at-p' as much as possible:

Profiler top
;;  4160  21%  + org-element--current-element
;;  2100  10%  + org-element--parse-elements
;;  1894   9%  + org-element--parse-objects
;;  1422   7%    Automatic GC
;;   871   4%  + org-element--headline-deferred
;;   806   4%  + apply
;;   796   4%  + org-element-create
;;   638   3%  + org-element--list-struct

Perf top
;;  16.72%  emacs  emacs  [.] re_match_2_internal
;;   7.16%  emacs  emacs  [.] exec_byte_code
;;   4.08%  emacs  emacs  [.] funcall_subr
;;   4.06%  emacs  emacs  [.] re_search_2

With consolidation into a giant rx (or ...) with groups:

;;  4158  21%  + org-element--current-element
;;  2163  11%  + org-element--parse-objects
;;  1796   9%  + org-element--parse-elements
;;  1276   6%    Automatic GC
;;   921   4%  + org-element--headline-deferred
;;   833   4%  + apply
;;   793   4%  + org-element-create
;;   660   3%  + org-element--list-struct

;;  16.44%  emacs  emacs  [.] re_match_2_internal
;;   7.03%  emacs  emacs  [.] exec_byte_code
;;   6.78%  emacs  emacs  [.] process_mark_stack
;;   4.05%  emacs  emacs  [.] re_search_2
;;   4.02%  emacs  emacs  [.] funcall_subr

The version with giant single rx form is actually slower overall (!),
making no difference at all in `org-element--current-element'.

> Perhaps you just don't see much improvement until the working set of regexps fits in the cache.

As you see, I now increased cache size to 50. No improvement.
Same with my observations on current master.

> The regexp compiler doesn't do much optimisation in order to keep the translation fast. It doesn't even convert "[a]" to "a".

I guess that it is another thing that could be improved if we were to
have compiled regexp objects. Compilation time would not matter as much.
Ideally, the compiler should do something similar to
what https://www.colm.net/open-source/ragel/ does.

>> Or, alternatively, the parsed regexps can be attached to string objects
>> internally. Then, regexp cache lookup will degenerate to looking into a
>> string object slot.
>
> That would work too but we really don't want to make our strings any fancier, they are already much too big and slow.

Then, what about making a compiled regexp object similar to a string, but
with the plist slot replaced by a compiled regexp slot? Maybe with some
other slots removed (I am not very familiar with the specifics of the
internal string representation).

AFAIU, compiled regexp read/write syntax can be uniquely represented
simply by a string. Something like #r"[a-z]+" (maybe even with special
handling for backslashes, like proposed in
https://yhetil.org/emacs-devel/4209edd83cfee7c84b2d75ebfcd38784fa21b23c.camel@crossproduct.net)

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Wed May 03 09:59:41 2023
From: Mattias Engdegård
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <87wn1psqny.fsf@localhost>
Date: Wed, 3 May 2023 15:59:29 +0200
Message-Id: <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com>

3 May 2023 at 11:36, Ihor Radchenko wrote:

> Yet, in my tests, I am unable to see any improvement when I consolidate
> the regexps.

That's odd, but do you get a better cache hit rate (assuming a cache size of 20)?

> The version with giant single rx form is actually slower overall (!),
> making no difference at all in `org-element--current-element'.

Can't say what's going on here, really. Normally a combined regexp shouldn't be slower. Are you sure you get the same parse?

> Ideally, the compiler should do something similar to
> what https://www.colm.net/open-source/ragel/ does.

Yes, constructing a DFA would be more realistic when it's less in danger of being thrown away at any time.

From debbugs-submit-bounces@debbugs.gnu.org Wed May 03 11:02:08 2023
From: Ihor Radchenko
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com>
Date: Wed, 03 May 2023 15:05:06 +0000
Message-ID: <87zg6lfobh.fsf@localhost>

Mattias Engdegård writes:

> 3 May 2023 at 11:36, Ihor Radchenko wrote:
>
>> Yet, in my tests, I am unable to see any improvement when I consolidate
>> the regexps.
>
> That's odd, but do you get a better cache hit rate (assuming a cache size of 20)?

With the default cache size of 20,

(benchmark-progn
  (setq regexp-cache-hit 0
        regexp-cache-miss 0)
  (set-regexp-cache-size 20)
  (org-element-parse-buffer)
  nil)

(cond ((looking-at-p ...) ...)) gives

misses: 1493570
hits: 3225203
% misses from total: 31%

giant rx + looking-at gives

misses: 1177242
hits: 3233553
% misses from total: 27%

>> The version with giant single rx form is actually slower overall (!),
>> making no difference at all in `org-element--current-element'.
>
> Can't say what's going on here, really. Normally a combined regexp
> shouldn't be slower. Are you sure you get the same parse?

All the tests are passing...
Note that I am using `looking-at-p'.

I now also tried replacing `looking-at-p' with `looking-at' and I get

  4880  21%  + org-element--current-element

(previous data with `looking-at-p')

  4160  21%  + org-element--current-element

with total time increasing compared to the version with `looking-at-p'
(21.743226s (1.364015s in 2 GCs) compared to 21.035765s (1.091127s in 2 GCs))

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Wed May 03 11:21:02 2023
From: Mattias Engdegård
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <87zg6lfobh.fsf@localhost>
Date: Wed, 3 May 2023 17:20:52 +0200
Message-Id: <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com>

3 May 2023 at 17:05, Ihor Radchenko wrote:

> (cond ((looking-at-p ...) ...)) gives
>
> misses: 1493570
> hits: 3225203
> % misses from total: 31%
>
> giant rx + looking-at gives
>
> misses: 1177242
> hits: 3233553
> % misses from total: 27%

Maybe you should instrument the regexp engine and log the pattern and whether compilation was needed to a file. Run on a reduced dataset, and see if the sequence of regexps being exercised, and their frequencies, are consistent with what you expect.
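For reference, the miss percentages quoted above follow from misses / (misses + hits). A minimal Emacs Lisp sketch of that arithmetic, under stated assumptions: `my/regexp-cache-miss-rate' is a hypothetical helper, and the `regexp-cache-hit' / `regexp-cache-miss' counters used earlier in the thread do not appear to be part of stock Emacs (presumably they come from a locally patched build), so the counts are passed in as plain numbers here.

```elisp
(defun my/regexp-cache-miss-rate (misses hits)
  "Return the percentage of regexp cache lookups that were MISSES.
MISSES and HITS are integer counts; the total number of lookups is
taken to be their sum.  Hypothetical helper, not part of Emacs."
  ;; Multiply by 100.0 first so the division is done in floating point.
  (/ (* 100.0 misses) (+ misses hits)))

;; Counts reported in the thread for the default cache size of 20.
(my/regexp-cache-miss-rate 1493570 3225203) ; series of `looking-at-p'
(my/regexp-cache-miss-rate 1177242 3233553) ; single combined rx form
```

With these inputs the two calls evaluate to roughly 31.7 and 26.7. The hit counts differ by only 8350 between the two runs, while the miss count drops by 316328, so consolidation removes some compilations but leaves most of the misses in place.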
From debbugs-submit-bounces@debbugs.gnu.org Wed May 03 11:59:58 2023
From: Ihor Radchenko
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com>
Date: Wed, 03 May 2023 16:02:52 +0000
Message-ID: <87o7n1v1w3.fsf@localhost>

Mattias Engdegård writes:

> Maybe you should instrument the regexp engine and log the pattern and whether compilation was needed to a file. Run on a reduced dataset, and see if the sequence of regexps being exercised, and their frequencies, are consistent with what you expect.

Sorry, but I am starting to lose track of the purpose here.

What is the aim of instrumenting regexp engine in this scenario?
I already know that additional regexps will be tested by individual
`org-element-X-parser' functions.

I am also not sure how to instrument the regexp engine and what I can
see there.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Thu May 04 05:24:43 2023
From: Mattias Engdegård
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <87o7n1v1w3.fsf@localhost>
Date: Thu, 4 May 2023 11:24:34 +0200
Message-Id: <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com>

3 May 2023 at 18:02, Ihor Radchenko wrote:

> What is the aim of instrumenting regexp engine in this scenario?
> I already know that additional regexps will be tested by individual
> `org-element-X-parser' functions.

I got the impression that the 'spine' of the parser, the sequence of `looking-at` calls in `org-element--current-element`, would frequently be run through in its entirety which means that consolidating these would reduce the number of working regexps by about 20 (if I'm counting correctly).

Now if as you suggest the parsing is dominated by sequences of regexps in the branches, it prompts the questions: which branches, what regexps, why are there so many of them, and is there anything that can be done to reduce their number?

> I am also not sure how to instrument the regexp engine and what I can
> see there.

Sorry, it is just what I who know nothing about the structure of Org would do to get a better view. You may find it easier to work at the Lisp level.
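One way to get that view at the Lisp level, sketched under assumptions: temporarily advise the regexp primitives and count how often each pattern string is passed in. `my/count-regexp-usage' is a hypothetical helper, not part of Emacs or Org. It only sees calls made from Lisp (searches started from C code are not recorded), in a native-compiled Emacs advising primitives relies on trampolines being available, and it says nothing about whether a given call hit the regexp cache. `looking-at-p' and `string-match-p' are thin wrappers that end up in `looking-at' and `string-match', so advising the primitives is expected to cover them as well.

```elisp
(defun my/count-regexp-usage (thunk)
  "Call THUNK and return an alist of (REGEXP . COUNT), most frequent first.
Hypothetical helper: counts calls to a few regexp primitives by
temporarily advising them while THUNK runs."
  (let* ((counts (make-hash-table :test #'equal))
         (primitives '(looking-at re-search-forward
                       re-search-backward string-match))
         ;; The regexp is the first argument of every primitive above.
         (recorder (lambda (regexp &rest _)
                     (puthash regexp
                              (1+ (gethash regexp counts 0))
                              counts)))
         (result nil))
    (unwind-protect
        (progn
          (dolist (fn primitives)
            (advice-add fn :before recorder))
          (funcall thunk))
      ;; Always remove the advice, even if THUNK signals an error.
      (dolist (fn primitives)
        (advice-remove fn recorder)))
    (maphash (lambda (regexp n) (push (cons regexp n) result)) counts)
    (sort result (lambda (a b) (> (cdr a) (cdr b))))))
```

For the case discussed here, something like (my/count-regexp-usage (lambda () (org-element-parse-buffer))) on a reduced Org file would list which pattern strings are exercised and how often, which can then be compared with the size of the regexp cache. The advice itself adds overhead, so the result is meant for counting, not for timing.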
From debbugs-submit-bounces@debbugs.gnu.org Thu May 04 08:55:10 2023
From: Ihor Radchenko
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com>
Date: Thu, 04 May 2023 12:58:13 +0000
Message-ID: <875y98jlsq.fsf@localhost>

Mattias Engdegård writes:

>> Ideally, the compiler should do something similar to
>> what https://www.colm.net/open-source/ragel/ does.
>
> Yes, constructing a DFA would be more realistic when it's less in danger of being thrown away at any time.

I tried to look closer into regex-emacs.c and I note that the regexp
compiler basically generates a specialized bytecode to be executed by
the regexp matcher later.

So, may the existing vector type be reused to store compiled regexp
pattern objects?

AFAIU, pvec_type will need to have one more item and the reader/printer
will need to be modified for a specialized print representation.
The compiled regexp itself will be stored as a unibyte string containing
the instructions.

This will also open a possibility to compile regexp patterns from Elisp,
if necessary.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Fri May 05 06:27:57 2023
From: Ihor Radchenko
To: Mattias Engdegård
Cc: 63225@debbugs.gnu.org
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To: <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com>
Date: Fri, 05 May 2023 10:31:01 +0000
Message-ID: <87ednvul22.fsf@localhost>

Mattias Engdegård writes:

>> What is the aim of instrumenting regexp engine in this scenario?
>> I already know that additional regexps will be tested by individual
>> `org-element-X-parser' functions.
>
> I got the impression that the 'spine' of the parser, the sequence of `looking-at` calls in `org-element--current-element`, would frequently be run through in its entirety which means that consolidating these would reduce the number of working regexps by about 20 (if I'm counting correctly).

Not exactly. The actual statistics are the following (of course, they
depend on the structure of the parsed file).

Below, I measured the time spent in different branches of cond.
The Note column describes the type of test in each cond branch.

| Depth |  count | Time, msec     | Note                              | Avg time, μsec/count | Element type  |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|     0 |  89592 | 31.094         | eq                                |           0.43315819 | item          |
|     1 |   1984 | 0.68           | eq                                |           0.45513710 | table row     |
|     2 | 206607 | 43.23          | eq                                |           0.24850257 | node-property |
|     3 |  72770 | 302.95         | looking-at-p, skip-chars          |            4.8545025 | headline      |
|     4 |  56000 | 39.190916      | memq                              |           0.69983779 | section       |
|     5 |   8231 | 26.129109      | looking-at-p, lookback            |            3.1744756 | planning      |
|     6 |  54852 | 503.97346      | looking-at-p, multiline, lookback |            9.1878776 | prop drawer   |
|     7 |  89510 | 78.514284      | bolp                              |           0.87715656 | paragraph     |
|     8 |  29610 | 79.589466      | looking-at-p                      |            2.6879252 | clock         |
|     9 |    231 | 1.644304       | eq                                |            7.1181991 | inlinetask    |
|    10 |      0 | tot: 1173 msec | eq                                |                  0/0 | affiliated    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    11 |     30 | 0.060081       | looking-at-p                      |               2.0027 | latex env     |
|    12 |  45443 | 187.41703      | looking-at-p                      |            4.1242222 | drawer        |
|    13 |     21 | 0.255528       | looking-at-p                      |               12.168 | fixed width   |
|    14 |    967 | 6.67522        | looking-at                        |            6.9030196 | block         |
|    15 |     53 | 0.342144       | looking-at-p                      |            6.4555472 | call          |
|    16 |      0 | 0              | looking-at-p                      |                  0/0 | dynblock      |
|    17 |     29 | 0.361915       | looking-at-p                      |            12.479828 | keyword       |
|    18 |      0 | 0              | eq                                |                  0/0 | paragraph     |
|    19 |      0 | 0              | looking-at-p                      |                  0/0 | footnote def  |
|    20 |      0 | 0              | looking-at-p                      |                  0/0 | rule          |
|    21 |      0 | 0              | looking-at-p                      |                  0/0 | diary sexp    |
|-------+--------+----------------+-----------------------------------+----------------------+---------------|
|    22 |     66 | 0.752823       | looking-at-p, re-search-forward   |            11.406409 | table         |
|    23 |  41509 | 303.39472      | looking-at-p                      |            7.3091310 | item          |
|    24 |   5340 | 41.188231      | t                                 |            7.7131519 | paragraph     |
|       |        | tot: 1713 msec |                                   |                      |               |

If I try to group regexps into one giant rx form and
then compare time spend in different cond branches, I get the following. (I grouped the regexps between horizontal rules in the above table) I tried two different files: (1) notes.org that is heavy on headlines; (2) org-manual that is heavy on actual text. Grouping with rx gives no noticeable impact.=20 | Depth | Avg time, =CE=BCs | Avg time, =CE=BCs | Avg time, =CE=BCs | Av= g time, =CE=BCs | | | (notes+no rx) | (notes+rx) | (manual+no rx) | (manual+rx) | |-------+---------------+--------------+----------------+--------------| | 0 | 0.34576248 | 0.35948186 | 0.43996679 | 0.44675874 | | 1 | 0.35749752 | 0.37239325 | 0.44559585 | 0.43868739 | | 2 | 0.18958309 | 0.20197035 | 0.29960921 | 0.29960921 | | 3 | 4.1282904 | 4.2407582 | 4.4482968 | 4.4711219 | | 4 | 0.61503580 | 0.59914459 | 0.64377158 | 0.63460540 | | 5 | 0.88028169 | 0.83916084 | 1.2820513 | 1.2820513 | | 6 | 2.6515244 | 2.6348024 | 2.6795055 | 2.7579648 | | 7 | 7.8175124 | 7.8262918 | 7.1999256 | 7.1996154 | | 8 | 0.75458424 | 0.75368242 | 0.70958084 | 0.72455090 | | 9 | 2.1446653 | 2.1466905 | 10. | 10. | | 10 | 5.2813853 | 5.2813853 | 5.4761905 | 6.5476190 | | 11 | 0./0 | 0./0 | 0./0 | 0./0 | | 12 | 2. | 2.3333333 | 0./0 | 0./0 | | 13 | 3.5030250 | 4.6581886 | 4.0623783 | 5.8718692 | | 14 | 11.428571 | 10.952381 | 2.6970634 | 3.3307573 | | 15 | 5.6508264 | 4.6177686 | 5.1308629 | 4.3741902 | | 16 | 6.2264151 | 4.1509434 | 0./0 | 0./0 | | 17 | 0./0 | 0./0 | 0./0 | 0./0 | | 18 | 10.689655 | 12. 
| 5.7134386 | 3.7413831 | | 19 | 0./0 | 0./0 | 0./0 | 0./0 | | 20 | 0./0 | 0./0 | 2.8888889 | 2.9444444 | | 21 | 0./0 | 0./0 | 0./0 | 0./0 | | 22 | 0./0 | 0./0 | 0./0 | 0./0 | | 23 | 10.746269 | 9.4029851 | 6.2695313 | 6.1328125 | | 24 | 6.4371193 | 6.2419339 | 6.0138782 | 5.8558211 | | 25 | 6.4154824 | 6.3855647 | 4.9707695 | 4.7727182 | | 26 | 0./0 | 0./0 | 0./0 | 0./0 | > Now if as you suggest the parsing is dominated by sequences of regexps in= the branches, it prompts the questions: which branches, what regexps, why = are there so many of them, and is there anything that can be done to reduce= their number? Oh. No. The parsing is dominated by `org-element--current-element'. I can clearly see it because the profiler hits `org-element--current-element', not the branches. I just had no idea what to make of your suggestion about Run on a reduced dataset, and see if the sequence of regexps being exercised, and their frequencies, are consistent with what you expect. Also, my testing showed that (looking-at (rx (or (group-n 1 (regexp org-element--latex-begin-environment)) (group-n 2 (regexp org-element-drawer-re)) (group-n 3 (regexp "[ \t]*:\\( \\|$\\)")) (group-n 7 (regexp org-element-dynamic-block-open-re)) (seq (group-n 4 (regexp "[ \t]*#\\+")) (or (seq "BEGIN_" (group-n 5 (1+ (not space)))) (group-n 6 "CALL:") (group-n 8 (1+ (not space)) ":") )) (group-n 9 (regexp org-footnote-definition-re)) (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$")) (group-n 11 "%%(")))) is actually slightly slower overall compared to a series of `looking-at-p'. AFAIU, because the `looking-at' needs to allocate match-data vector for all these 11 groups, which leads to ;; 6.78% emacs emacs [.] process= _mark_stack floating up in the perf top. --=20 Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at . 
Support Org development at , or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Fri May 05 12:27:06 2023
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
From: Mattias Engdegård
In-Reply-To: <87ednvul22.fsf@localhost>
Date: Fri, 5 May 2023 18:26:52 +0200
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org

On 5 May 2023 at 12:31, Ihor Radchenko wrote:

> Not exactly. The actual statistics is the following (of course, it is a
> subject of the actual parsed file structure).
>
> Below, I measured time spent in different branches of cond.

This is useful. It looks like drawers consume a lot of time, and list
items. I know very little about Org, but from afar it looks like all
drawers have the same basic form. Can't you recognise them with a single
regexp and then branch on the drawer type for subtype-specific
treatment?

There are micro-inefficiencies in the regexps here and there that you
might want to try fixing (although I can't promise any noticeable gain
from doing so):

 (defconst org-property-drawer-re
   (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
           "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
           "[ \t]*:END:[ \t]*$")

Look at the middle line in particular. Translated to rx, that part
becomes

 (*? (* (in "\t ")) ":" (+ (not (syntax whitespace))) ":"
     (? (in "\t ") (* nonl))
     (* (in "\t "))
     "\n")

There are too many ways this could match. Maybe you could change it to

 (*? (* (in " \t"))
     ":" (+ (not (in " \t\n:"))) ":"
     (* nonl)
     "\n")

which prevents a lot of unnecessary backtracking and does away with
parsing structure that doesn't matter here.

Another example:

 (defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$"

which is

 (: bol
    (* (in "\t "))
    ":"
    (group (+ (| wordchar (in "_-"))))   ; <--
    ":"
    (* (in "\t "))
    eol)

Making reasonable assumptions about characters, the line marked with an
arrow could become

    (group (+ (not (in " \t\n:"))))

but it's fine if you want to exclude more characters here, as long as
you avoid leaving backtrack points everywhere. (Character syntax is kind
of expensive too.)

Regarding list items, are you still calling (org-item-re) each time?
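
A quick way to see what the suggested drawer rewrite changes is to run
both forms over a few sample lines. The following is only a sketch: the
`my-' names and the sample strings are made up for illustration, and the
standard syntax table is forced so that `wordchar' does not depend on
whatever the current buffer's mode defines.

```elisp
(require 'rx)

;; The current form, as quoted above: word-syntax characters or "_-".
(defconst my-drawer-re-current
  (rx bol (* (in "\t ")) ":"
      (group (+ (| wordchar (in "_-"))))
      ":" (* (in "\t ")) eol))

;; The suggested form: anything except blanks, newline and the
;; terminating colon, so the name and its terminator are disjoint.
(defconst my-drawer-re-suggested
  (rx bol (* (in "\t ")) ":"
      (group (+ (not (in " \t\n:"))))
      ":" (* (in "\t ")) eol))

(defun my-drawer-compare (line)
  "Return (CURRENT . SUGGESTED) for LINE.
Each element is t when the corresponding regexp matches, else nil."
  (with-syntax-table (standard-syntax-table)
    (cons (and (string-match-p my-drawer-re-current line) t)
          (and (string-match-p my-drawer-re-suggested line) t))))

;; Both forms accept an ordinary drawer name.
(my-drawer-compare ":LOGBOOK:")   ; => (t . t)
;; They differ on a name containing punctuation.
(my-drawer-compare ":foo.bar:")   ; => (nil . t)
```

The second call shows that the faster form accepts names the current one
rejects, so any such rewrite has to be weighed against what the syntax
is supposed to allow.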

>> Now if as you suggest the parsing is dominated by sequences of
>> regexps in the branches, it prompts the questions: which branches,
>> what regexps, why are there so many of them, and is there anything
>> that can be done to reduce their number?
>
> Oh. No. The parsing is dominated by `org-element--current-element'. I
> can clearly see it because the profiler hits
> `org-element--current-element', not the branches.

Well there must be regexps being matched elsewhere since you did show
early on the working set to be above 40, not the ca. 20 in
org-element--current-element.

> I just had no idea what to make of your suggestion about
>
>    Run on a reduced dataset, and see if the sequence of regexps being
>    exercised, and their frequencies, are consistent with what you
>    expect.

Stupid printf-debugging actually, nothing fancier than that.
I'll see if I can put together a patch for you a bit later on.

> (looking-at
>  (rx
>   (or
>    (group-n 1 (regexp org-element--latex-begin-environment))
>    (group-n 2 (regexp org-element-drawer-re))
>    (group-n 3 (regexp "[ \t]*:\\( \\|$\\)"))
>    (group-n 7 (regexp org-element-dynamic-block-open-re))
>    (seq (group-n 4 (regexp "[ \t]*#\\+"))
>         (or
>          (seq "BEGIN_" (group-n 5 (1+ (not space))))
>          (group-n 6 "CALL:")
>          (group-n 8 (1+ (not space)) ":")
>          ))
>    (group-n 9 (regexp org-footnote-definition-re))
>    (group-n 10 (regexp "[ \t]*-\\{5,\\}[ \t]*$"))
>    (group-n 11 "%%("))))

This actually incurs some unnecessary run-time cost: the (regexp ...)
forms make this expand to a `concat` call to construct this rather long
regexp each time. Either only recompute it when any of the variables
(org-element--latex-begin-environment etc) change, or if you intend them
to be compile-time constants, make sure they are expanded as such.

> is actually slightly slower overall compared to a series of `looking-at-p'.
> AFAIU, because the `looking-at' needs to allocate match-data vector for
> all these 11 groups, which leads to
> ;;     6.78%  emacs  emacs  [.] process_mark_stack
> floating up in the perf top.

Quite sure that's the concat calls. Match data doesn't actually
contribute to any GC-level consing unless you reify it by calling
`match-data`, or indirectly through `save-match-data` (which I see that
you are using in several places -- try not to).

From debbugs-submit-bounces@debbugs.gnu.org Sat May 06 09:35:46 2023
From: Ihor Radchenko
To: Mattias Engdegård
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Sat, 06 May 2023 13:38:38 +0000
Message-ID: <87y1m1oa01.fsf@localhost>
Cc: 63225@debbugs.gnu.org

Mattias Engdegård writes:

>> Below, I measured time spent in different branches of cond.
>
> This is useful. It looks like drawers consume a lot of time, and list
> items. I know very little about Org, but from afar it looks like all
> drawers have the same basic form. Can't you recognise them with a
> single regexp and then branch on the drawer type for subtype-specific
> treatment?

I may, but it will be an even more complex regexp. Currently, ordinary
drawers have a somewhat complex :BEGIN: line, because they can have any
word there, while property drawers require a very complex match for the
lines inside. Also, property drawers only occur right after headings, as
marked by the appropriate parser flag. So, matching property drawers
mostly happens where they are supposed to be. If we try to match ordinary
drawers at the same time, it will actually be slower in practice.

> There are micro-inefficiencies in the regexps here and there that you
> might want to try fixing (although I can't promise any noticeable gain
> from doing so):
>
>  (defconst org-property-drawer-re
>    (concat "^[ \t]*:PROPERTIES:[ \t]*\n"
>            "\\(?:[ \t]*:\\S-+:\\(?:[ \t].*\\)?[ \t]*\n\\)*?"
>            "[ \t]*:END:[ \t]*$")
> ...
> There are too many ways this could match. Maybe you could change it to
>
>  (*? (* (in " \t"))
>      ":" (+ (not (in " \t\n:"))) ":"
>      (* nonl)
>      "\n")

Sure. Thanks! It was a sub-second improvement, but an improvement.

> Another example:
>
>  (defconst org-drawer-regexp "^[ \t]*:\\(\\(?:\\w\\|[-_]\\)+\\):[ \t]*$"
> ...
> Making reasonable assumptions about characters, the line marked with
> an arrow could become
>
>     (group (+ (not (in " \t\n:"))))

This will account for Org syntax change, so no.
Slight improvement in performance cannot justify syntax changes.

> Regarding list items, are you still calling (org-item-re) each time?

Yes and no. `org-item-re' now looks like

(defvar org--item-re-cache nil
  "Results cache for `org-item-re'.")
(defsubst org-item-re ()
  "Return the correct regular expression for plain lists."
  (or (plist-get
       org--item-re-cache
       (cons org-list-allow-alphabetical
             org-plain-list-ordered-item-terminator)
       #'equal)
   ...))

It should not give much overhead.

>> Oh. No. The parsing is dominated by `org-element--current-element'. I
>> can clearly see it because the profiler hits
>> `org-element--current-element', not the branches.
>
> Well there must be regexps being matched elsewhere since you did show
> early on the working set to be above 40, not the ca. 20 in
> org-element--current-element.

Of course. A larger number of regexps is matched in the individual
element parsers. They just don't contribute as much as
`org-element--current-element' individually and thus do not show up high
in the profiler.

For reference, I calculated the time taken in
`org-element--current-element' to decide about parsing a specific element
type (Time/Avg) vs. the time taken to actually parse it (Time2/Avg2).

(Note that the data below is for my WIP parser refactoring branch at
https://git.sr.ht/~yantar92/org-mode/tree/feature/org-element-ast/item/lisp/org-element.el;
The original, e.g. headline will be way slower)

| Depth |  Count | Time, msec | Time2, msec | Avg, μsec | Avg2, μsec | Type         |
|-------+--------+------------+-------------+-----------+------------+--------------|
|     0 |  89729 |  30.714894 |   1339.9075 |      0.34 |      14.93 | item         |
|     1 |   2074 |   0.779739 |   19.040295 |      0.38 |       9.18 | table row    |
|     2 | 207365 |   37.53366 |   1970.9524 |      0.18 |       9.50 | node         |
|     3 |  72849 |  303.36754 |   2448.6616 |      4.16 |      33.61 | headline     |
|     4 |  56076 |  33.117519 |   763.41927 |      0.59 |      13.61 | section      |
|     5 |    291 |   0.258913 |    2.622451 |      0.89 |       9.01 | comment      |
|     6 |   8247 |   23.15524 |   224.61437 |      2.81 |      27.24 | planning     |
|     7 |  54924 |  362.36612 |   523.11581 |      6.60 |       9.52 | prop drawer  |
|     8 |  89647 |  69.361279 |   761.29519 |      0.77 |       8.49 | paragraph    |
|     9 |  29652 |  67.658072 |   829.21937 |      2.28 |      27.97 | clock        |
|    10 |    231 |   1.285224 |    3.832217 |      5.56 |      16.59 | inlinetask   |
|    11 |      0 |          0 |           0 |      0.00 |       0.00 | keyword      |
|    12 |     30 |   0.059978 |    0.413909 |      2.00 |      13.80 | latex env    |
|    13 |  45401 |  159.57401 |   515.15776 |      3.51 |      11.35 | drawer       |
|    14 |     21 |   0.265039 |    0.265754 |     12.62 |      12.65 | fixed width  |
|    15 |    913 |   5.597659 |   17.326571 |      6.13 |      18.98 | block        |
|    16 |     53 |   0.355013 |    1.329438 |      6.70 |      25.08 | call         |
|    17 |      0 |          0 |           0 |      0.00 |       0.00 | dynblock     |
|    18 |     29 |   0.365553 |    0.494062 |     12.61 |      17.04 | keyword      |
|    19 |      0 |          0 |           0 |      0.00 |       0.00 | paragraph    |
|    20 |      0 |          0 |           0 |      0.00 |       0.00 | footnote def |
|    21 |      0 |          0 |           0 |      0.00 |       0.00 | hrule        |
|    22 |      0 |          0 |           0 |      0.00 |       0.00 | diary sexp   |
|    23 |     69 |   0.739084 |    1.459472 |     10.71 |      21.15 | table        |
|    24 |  41586 |  281.42632 |   1327.9897 |      6.77 |      31.93 | plain list   |
|    25 |   5370 |  36.202665 |   66.853115 |      6.74 |      12.45 | paragraph    |
#+TBLFM: $5=1000.0*$3/$2;%.2f::$6=1000.0*$4/$2;%.2f

>> I just had no idea what to make of your suggestion about
>>
>>    Run on a reduced dataset, and see if the sequence of regexps being
>>    exercised, and their frequencies, are consistent with what you
>>    expect.
>
> Stupid printf-debugging actually, nothing fancier than that.
> I'll see if I can put together a patch for you a bit later on.

I once tried #define REGEX_EMACS_DEBUG + regex_emacs_debug = 100000, but
it produced so much output that I cannot even open Emacs in reasonable
time because of the wall of output in terminal.

>> (looking-at
>>  (rx
>>   (or
>>    ...
>>    (group-n 11 "%%("))))
>
> This actually incurs some unnecessary run-time cost: the (regexp ...)
> forms make this expand to a `concat` call to construct this rather
> long regexp each time. Either only recompute it when any of the
> variables (org-element--latex-begin-environment etc) change, or if you
> intend them to be compile-time constants, make sure they are expanded
> as such.
>
>> is actually slightly slower overall compared to a series of `looking-at-p'.
>> AFAIU, because the `looking-at' needs to allocate match-data vector for
>> all these 11 groups, which leads to
>> ;;     6.78%  emacs  emacs  [.] process_mark_stack
>> floating up in the perf top.
>
> Quite sure that's the concat calls. Match data doesn't actually
> contribute to any GC-level consing unless you reify it by calling
> `match-data`, or indirectly through `save-match-data` (which I see
> that you are using in several places -- try not to).

After moving that giant rx into defconst, the parsing time is not
growing significantly anymore:

;; No rx: 17.947628s (1.373926s in 2 GCs)
;; rx:    18.058193s (1.379169s in 2 GCs)

But there is no improvement either...

[ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Sun May 07 06:33:06 2023
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
From: Mattias Engdegård
In-Reply-To: <87y1m1oa01.fsf@localhost>
Date: Sun, 7 May 2023 12:32:52 +0200
Message-Id: <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com>
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org

On 6 May 2023 at 15:38, Ihor Radchenko wrote:

> I may, but it will be even more complex regexp. Currently, ordinary
> drawers have somewhat complex :BEGIN: line, because they can have any
> word there, while property drawers require very complex match for the
> lines inside. Also, property drawers only occur right after headings, as
> marked by appropriate parser flag. So, matching property drawers mostly
> happens where they are supposed to be. If we try to match ordinary
> drawers at the same time, it will actually be slower in practice.

What I meant was that the consolidated root regexp could just match the
initial :BEGIN: line and then dispatch to different branches for parsers
specific to the drawer type. That would reduce complexity and time spent
at the critical parser root.

> This will account for Org syntax change, so no.

Don't dismiss it out of hand. I'm not trying to optimise a few regexps,
but to use examples to illustrate some useful principles that would help
you improve many of them yourself.

When matching something terminated by a specific character, it's
particularly useful if the regexp engine can be made to understand that
the terminator doesn't occur in what precedes it, as that enables it to
omit backtracking points. For example, in "a*b", the engine doesn't need
to save backtracking points for each 'a' matched since the sets {a} and
{b} are obviously disjoint.

In this case, the

    (group (+ (| wordchar (in "_-"))))

part is unnecessarily slow because it's an or-pattern, which also
inhibits that optimisation. Fortunately it can easily be rewritten as

    (group (+ (in "_-" word)))

which solves both problems.

> Slight improvement in performance cannot justify syntax changes.

Always question your assumptions. A slight change of spec may not be so
bad after all if it buys speed and/or improves our understanding of the
code. Do you know what characters have 'word' syntax in org-mode? If
not, better be careful before using them in regexps.

(Looks like org-tags-expand permanently adds @ and _ to the set of word
chars. A bug, surely?)

> (defvar org--item-re-cache nil
>   "Results cache for `org-item-re'.")
> (defsubst org-item-re ()
>   "Return the correct regular expression for plain lists."
>   (or (plist-get
>        org--item-re-cache
>        (cons org-list-allow-alphabetical
>              org-plain-list-ordered-item-terminator)
>        #'equal)
>    ...))
>
> It should not give much overhead.

Maybe, but you still cons each time. (And remember that the plist-get
equality funarg is new in Emacs 29.)

> A larger number of regexps is matched in the individual
> element parsers. They just don't contribute as much as
> `org-element--current-element' individually and thus do not show up high
> in the profiler.

Still, if called often enough they do outsized damage by evicting
regexps used elsewhere.

Also make sure that if the same regexp is used in multiple places, it
should always use the same `case-fold-search` value or they will be
considered different for cache purposes.

> [ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

Progress!
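
One way to avoid both the per-call cons and the Emacs 29-only
`plist-get' argument is to remember the last inputs and compare them
with `eq'. The following is a minimal sketch, not Org's code: all `my-'
names are hypothetical, and `my-item-re--compute' is a toy placeholder,
since the real regexp construction is elided in the quoted snippet. It
also keeps only the most recent entry rather than one per combination of
options.

```elisp
;; Hypothetical stand-ins for the two Org options that `org-item-re'
;; depends on; these are NOT the real Org variables.
(defvar my-allow-alphabetical nil)
(defvar my-ordered-terminator t)

(defvar my-item-re--cache nil
  "Vector [ALPHA TERM REGEXP] for the last computed regexp, or nil.")

(defun my-item-re--compute (alpha term)
  "Build a toy list-item regexp for ALPHA and TERM.
This is a placeholder, not the regexp Org actually uses."
  (format "^[ \t]*\\(?:[-+*]\\|\\(?:%s\\)%s\\)\\(?:[ \t]\\|$\\)"
          (if alpha "[0-9]+\\|[A-Za-z]" "[0-9]+")
          (cond ((eq term ?.) "\\.")
                ((eq term ?\)) ")")
                (t "[.)]"))))

(defun my-item-re ()
  "Return the list-item regexp, recomputing only when an option changed.
The hit path does two `eq' comparisons and allocates nothing."
  (let ((cache my-item-re--cache))
    (if (and cache
             (eq (aref cache 0) my-allow-alphabetical)
             (eq (aref cache 1) my-ordered-terminator))
        (aref cache 2)
      (let ((re (my-item-re--compute my-allow-alphabetical
                                     my-ordered-terminator)))
        (setq my-item-re--cache
              (vector my-allow-alphabetical my-ordered-terminator re))
        re))))
```

Repeated calls with unchanged options return the very same string
object, so callers always hand an identical pattern to the matcher.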

From debbugs-submit-bounces@debbugs.gnu.org Sun May 07 08:45:46 2023
From: Mattias Engdegård
Content-Type: multipart/mixed; boundary="Apple-Mail=_72D6074B-8E17-43A7-BA13-992D511AF733"
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date: Sun, 7 May 2023 14:45:36 +0200
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org

--Apple-Mail=_72D6074B-8E17-43A7-BA13-992D511AF733
Content-Type:
text/plain; charset=utf-8

On 5 May 2023 at 18:26, Mattias Engdegård wrote:

> Stupid printf-debugging actually, nothing fancier than that.

Here is some of that stupidity I promised. You probably want to use it with

  (set-regexp-trace-file "re.log")
  (unwind-protect
      (do-something-interesting)
    (set-regexp-trace-file nil))

so that you don't trace more than necessary. The file may become large, but it's useful data for off-line analysis, scripted or just looking at it in an editor.

The first letter of each line indicates a regexp cache hit (H) or miss (M).

--Apple-Mail=_72D6074B-8E17-43A7-BA13-992D511AF733
Content-Disposition: attachment; filename=0003-Regexp-tracing-add-set-regexp-trace-file.patch
Content-Type: application/octet-stream; x-unix-mode=0644; name="0003-Regexp-tracing-add-set-regexp-trace-file.patch"
Content-Transfer-Encoding: quoted-printable

From cd66a560a74d2ed94202cab278455544f0c9337c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mattias=20Engdeg=C3=A5rd?=
Date: Sun, 7 May 2023 14:05:31 +0200
Subject: [PATCH 3/3] Regexp tracing: add set-regexp-trace-file

---
 src/search.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/src/search.c b/src/search.c
index c454d5e1ca9..b378db152a2 100644
--- a/src/search.c
+++ b/src/search.c
@@ -34,6 +34,10 @@ Copyright (C) 1985-1987, 1993-1994, 1997-1999, 2001-2023 Free Software
 
 #include "regex-emacs.h"
 
+#include <stdio.h>
+
+static FILE *regexp_trace_file = NULL;
+
 #define DEFAULT_REGEXP_CACHE_SIZE 20
 
 /* If the regexp is non-nil, then the buffer contains the compiled form
@@ -200,6 +204,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 {
   struct regexp_cache *cp, **cpp, **lru_nonbusy;
 
+  bool cache_hit = false;
   for (cpp = &searchbuf_head, lru_nonbusy = NULL; ; cpp = &cp->next)
     {
       cp = *cpp;
@@ -224,6 +229,7 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 	  && cp->buf.charset_unibyte == charset_unibyte)
         {
           regexp_cache_hit++;
+	  cache_hit = true;
           break;
         }
 
@@ -243,6 +249,26 @@ compile_pattern (Lisp_Object pattern, struct re_registers *regp,
 	}
     }
 
+  if (regexp_trace_file) {
+    fprintf(regexp_trace_file, "%c \"", cache_hit ? 'H' : 'M');
+    ptrdiff_t n = SBYTES (pattern);
+    for (ptrdiff_t i = 0; i < n; i++) {
+      unsigned char c = SREF (pattern, i);
+      switch (c) {
+      case '"': case '\\': fprintf(regexp_trace_file, "\\%c", c); break;
+      case '\n': fprintf(regexp_trace_file, "\\n"); break;
+      case '\t': fprintf(regexp_trace_file, "\\t"); break;
+      default:
+	if (c < 32 || c == 127)
+	  fprintf(regexp_trace_file, "\\x%02x", c);
+	else
+	  putc(c, regexp_trace_file);
+	break;
+      }
+    }
+    fprintf(regexp_trace_file, "\"\n");
+  }
+
   /* When we get here, cp (aka *cpp) contains the compiled pattern,
      either because we found it in the cache or because we just compiled it.
      Move it to the front of the queue to mark it as most recently used.  */
@@ -3424,6 +3450,27 @@ DEFUN ("set-regexp-cache-size", Fset_regexp_cache_size, Sset_regexp_cache_size,
   return Qnil;
 }
 
+DEFUN ("set-regexp-trace-file", Fset_regexp_trace_file, Sset_regexp_trace_file,
+       1, 1, 0,
+       doc: /* Set the regexp trace file to FILE.  Internal use only.
+Use `nil' as argument to stop tracing.  */)
+  (Lisp_Object file)
+{
+  if (NILP (file)) {
+    fclose (regexp_trace_file);
+    regexp_trace_file = NULL;
+  } else {
+    CHECK_STRING (file);
+    if (regexp_trace_file)
+      Fset_regexp_trace_file (Qnil);
+    FILE *f = fopen (SSDATA (file), "a");
+    if (!f)
+      report_file_error ("opening regexp trace file", file);
+    regexp_trace_file = f;
+  }
+  return Qnil;
+}
+
 void
 mark_regexp_cache (void)
 {
@@ -3514,6 +3561,7 @@ syms_of_search (void)
   defsubr (&Snewline_cache_check);
   defsubr (&Sregexp_cache_size);
   defsubr (&Sset_regexp_cache_size);
+  defsubr (&Sset_regexp_trace_file);
 
   pdumper_do_now_and_after_load (syms_of_search_for_pdumper);
 }
-- 
2.32.0 (Apple Git-132)

--Apple-Mail=_72D6074B-8E17-43A7-BA13-992D511AF733--

From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 07:54:55 2023 Received: (at 63225) by debbugs.gnu.org; 8 May 2023 11:54:55 +0000 Received: from localhost ([127.0.0.1]:39482 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pvzSB-0002Da-2C for submit@debbugs.gnu.org;
Mon, 08 May 2023 07:54:55 -0400 Received: from mout01.posteo.de ([185.67.36.65]:48095) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pvzS8-0002D7-Uj for 63225@debbugs.gnu.org; Mon, 08 May 2023 07:54:54 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id E43132401B2 for <63225@debbugs.gnu.org>; Mon, 8 May 2023 13:54:46 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683546886; bh=TSKD8mLJ8anfL6nsJzsFEDN2+iROQjHFQN3qgqyU9x0=; h=From:To:Cc:Subject:Date:From; b=rJAHc+bmh+kaG7/nrLVDsjgrW6W/VYKgQWVW6QpnE9uoyQkFD+CaZ2STep3o6NU4u hOc9WZiCkd3PQhbTBcD0YJ+XN97rFvfGZ+eCSCNq8f58D4GtTdkTcw2xifdbuDE3EV sHeh1GAuEj6ZBgUyPgeRjE1ngFAwW9vQpQDFAAWfCp7nr4PQ/xgl69zqbY9baAzdwM R//PMifwzTevzNEl/FRqAtpwWRQzEccM7O2N8XmZhUrsrgq0FZQwt2RDS13rJLf83D oOZ4gz2h5vTu9R903i0iRIgphDi/WABEqGZCsPdOoLH7P584yJNQkDMCzy14lJP5X8 fsghGnVdShYZQ== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4QFKTf2JLsz9rxK; Mon, 8 May 2023 13:54:46 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87y1m1oa01.fsf@localhost> <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> Date: Mon, 08 May 2023 11:58:05 +0000 Message-ID: <87zg6fjar6.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 
quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

> What I meant was that the consolidated root regexp could just match the initial :BEGIN: line and then dispatch to different branches for parsers specific to the drawer type. That would reduce complexity and time spent at the critical parser root.

Let me elaborate on what I mean.
The relevant `cond' branches are:

(and (pcase mode
       (`planning (eq ?* (char-after (line-beginning-position 0))))
       ((or `property-drawer `top-comment)
        (save-excursion
          (beginning-of-line 0)
          (not (looking-at-p "[[:blank:]]*$"))))
       (_ nil))
     (looking-at-p org-property-drawer-re))

where org-property-drawer-re is

(rx ;; Drawer begin line.
    bol (0+ (in " \t")) ":PROPERTIES:" (0+ (in " \t")) "\n"
    ;; Node properties.
    (*? (0+ (in " \t")) ":" (+ (not (in " \t\n:"))) ":" (* nonl) "\n")
    ;; Drawer end line.
    (0+ (in " \t")) ":END:" (0+ (in " \t")) eol)

Note that this regexp is only matched when certain conditions are met, and that the beginning of the property drawer regexp matches the less general ":PROPERTIES:".

and [now part of the giant rx]

(rx line-start (0+ (any ?\s ?\t))
    ":" (1+ (any ?- ?_ word)) ":"
    (0+ (any ?\s ?\t)) line-end)

matches the more general ":[-_[:word:]]+:".

If we make ":[-_[:word:]]+:" a new root, how will it be helpful?

>> This will account for Org syntax change, so no.
>
> Don't dismiss it out of hand. I'm not trying to optimise a few regexps, but to use examples to illustrate some useful principles that would help you improve many of them yourself.
>
> When matching something terminated by a specific character, it's particularly useful if the regexp engine can be made to understand that the terminator doesn't occur in what precedes it, as that enables it to omit backtracking points. For example, in "a*b", the engine doesn't need to save backtracking points for each 'a' matched since the sets {a} and {b} are obviously disjoint.
>
> In this case, the
>
> (group (+ (| wordchar (in "_-"))))
>
> part is unnecessarily slow because it's an or-pattern, which also
> inhibits that optimisation.
> Fortunately it can easily be rewritten as
>
> (group (+ (in "_-" word)))
>
> which solves both problems.

But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) the same?

>> Slight improvement in performance cannot justify syntax changes.
>
> Always question your assumptions. A slight change of spec may not be so bad after all if it buys speed and/or improves our understanding of the code. Do you know what characters have 'word' syntax in org-mode? If not, better be careful before using them in regexps.

Honestly, I have no clue how syntax tables in Org mode work. I once tried to alter the syntax table inside code blocks and Org broke completely. So, I simply do not dare to touch syntax-dependent matches.

Ideally, Org should use dedicated, immutable syntax tables when parsing. The difficulty, though, is that we need to support various languages, including CJK, where syntax expectations are quite different from Latin. Your suggestions about using (not (in ...)) in place of (in ...) are good, but I am afraid of breaking CJK cases where people can use an unexpected set of characters. I was bitten by this in the past.

And there is the consideration of not breaking the syntax.
It is not just about the Elisp part: Org is a markup standard, and for drawers in particular we have defined the syntax like

https://orgmode.org/worg/org-syntax.html#Drawers

    :NAME:
    CONTENTS
    :end:

    NAME
        A string consisting of word-constituent characters, hyphens and underscores (-_).

To change this, we should weigh the possible impact on external Org parsers that are not implemented in Elisp.

> (Looks like org-tags-expand permanently adds @ and _ to the set of word chars. A bug, surely?)

Yeah. Fixed now.
https://git.savannah.gnu.org/cgit/emacs/org-mode.git/commit/?id=6e6354c07

>> (defvar org--item-re-cache nil
>>   "Results cache for `org-item-re'.")
>> (defsubst org-item-re ()
>> ...
>> It should not give much overhead.
>
> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)

Sure it does. It is just one of the variable parts of Org syntax that might be changed. There are ways to make this into a constant, but it is a fragile area of the code that I do not want to touch without a reason. (Especially given that I am not familiar with org-list.el.)

> Also make sure that if the same regexp is used in multiple places, it should always use the same `case-fold-search` value or they will be considered different for cache purposes.

I hope we do.
If only Emacs had a way to define `case-fold-search' right within the regexp itself.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 09:53:13 2023 Received: (at 63225) by debbugs.gnu.org; 8 May 2023 13:53:13 +0000 Received: from localhost ([127.0.0.1]:39619 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pw1Id-0005r2-ST for submit@debbugs.gnu.org; Mon, 08 May 2023 09:53:13 -0400 Received: from mout01.posteo.de ([185.67.36.65]:37639) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pw1Ia-0005qn-Ov for 63225@debbugs.gnu.org; Mon, 08 May 2023 09:53:10 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 3D326240114 for <63225@debbugs.gnu.org>; Mon, 8 May 2023 15:53:03 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683553983; bh=QrrTkI6AAXTzt9bl8ZKVPP0qXDhe5Z7hZL1K2du8u5M=; h=From:To:Cc:Subject:Date:From; b=iahrhwfXd0SA+klrkR98Khn9SyjWwr0/5jnErSZkMYq2UZ9BdWfKSYsEhDWAPmnYh SM+2mL8l6oBOY0XNCIpVKw+AkufrIWrkK4alEHX4gh2+p7nJarqOmW1CSzeBsQBJLY MidYjQuYnLNPywjq4zsq9skG96V7lZp3F6qi2e3pMG868Nq+pYr29GfZ+SRZe/wFc1 94jxpM/uCd1jTorVoZpEc8KhI5fcz6GKUUCvofJ5TQLr4TqoMKO/1NWrzmLm/9ue5O hM68g0aFxJi5xoQtXOgtolB1OxsvwdqDokcHbGgJkxo1SDgpQhvdMU18ksjaJoUHs5 8Jl8oyeJ8Hy5w== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4QFN665716z9rxP; Mon, 8 May 2023 15:53:02 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> 
<878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> Date: Mon, 08 May 2023 13:56:18 +0000 Message-ID: <87h6sn9bb1.fsf@localhost> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Mattias Engdegård writes:

> On 5 May 2023 at 18:26, Mattias Engdegård wrote:
>
>> Stupid printf-debugging actually, nothing fancier than that.
>
> Here is some of that stupidity I promised. You probably want to use it with
>
>   (set-regexp-trace-file "re.log")
>   (unwind-protect
>       (do-something-interesting)
>     (set-regexp-trace-file nil))

Thanks!

> so that you don't trace more than necessary. The file may become large, but it's useful data for off-line analysis, scripted or just looking at it in an editor.
> The first letter of each line indicates a regexp cache hit (H) or miss (M).

I am not sure what I can make of the hits/misses, but I am at least able to look into the frequency data, via

sort re.log | uniq -c > re-freq.log

It would be even nicer if, apart from frequency, there was information about the time taken to search for each regexp.
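As a rough sketch of what scripted off-line analysis of the trace could look like, the following Python groups the H/M records per pattern, so that patterns which keep falling out of the cache stand out. It assumes only the line format described above (a leading `H` or `M`, a space, then the quoted pattern). The function names `summarize_trace`, `miss_report` and `report_file` are made up for illustration and are not part of the patch.

```python
from collections import Counter


def summarize_trace(lines):
    """Map each traced pattern to a (hits, misses) pair.

    Each record is expected to look like 'H "regexp"' or 'M "regexp"';
    lines that do not start with 'H ' or 'M ' are ignored.
    """
    hits, misses = Counter(), Counter()
    for line in lines:
        line = line.rstrip("\n")
        if len(line) < 2 or line[0] not in "HM" or line[1] != " ":
            continue  # not a trace record
        pattern = line[2:]  # the quoted, escaped pattern, kept verbatim
        if line[0] == "H":
            hits[pattern] += 1
        else:
            misses[pattern] += 1
    return {p: (hits[p], misses[p]) for p in hits.keys() | misses.keys()}


def miss_report(summary, top=10):
    """Return (pattern, hits, misses, miss_ratio) rows, most misses first."""
    rows = sorted(summary.items(), key=lambda kv: (-kv[1][1], kv[0]))
    return [(p, h, m, m / (h + m)) for p, (h, m) in rows[:top]]


def report_file(path, top=10):
    """Print a miss report for a trace file such as re.log."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for pattern, h, m, ratio in miss_report(summarize_trace(f), top):
            print(f"{m:8d} misses {h:8d} hits {ratio:6.1%}  {pattern}")
```

Calling `report_file("re.log")` would list the patterns with the most cache misses first. The trace carries no timing data, so this says nothing about how long each search takes.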
--=-=-= Content-Type: text/plain Content-Disposition: inline; filename=re2-large.log Content-Transfer-Encoding: quoted-printable 624713 H "^\\(?4:[ \t]*\\)\\(?1::\\(?2:\\S-+\\):\\)\\(?:\\(?3:$\\)\\|[ \t]= +\\(?3:.*?\\)\\)\\(?5:[ \t]*\\)$" 293473 H "^\\*+ " 271943 H "\\(?:[_^][-{(*+.,[:alnum:]]\\)\\|[*+/=3D_~][^[:space:]]\\|\\<\\(= ?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr= \\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite= [*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?= \\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|= symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|T= extcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\= )?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:ent= ry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\= *\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle= \\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\= \|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:em= ac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\= \)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?= :elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|is= t-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(= ?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\= )?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev= \\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?= :bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|= upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\|\\[\\(?:cite[:/]\\|fn:\\= |\\(?:[0-9]\\|\\(?:%\\|/[0-9]*\\)\\]\\)\\|\\[\\)\\|@@\\|{{{\\|<\\(?:%%\\|<\= 
\|[0-9]\\|\\(?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:p= l\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\= |uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\= )\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:des= c\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Sm= artcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|s= hort\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel= \\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?= :l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*= \\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:= range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\= +\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)= ?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)= l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\= \(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref= \\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:searc= h\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\= \(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|noteci= te\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|= martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\)\\|\\$\\|\\\= \\\(?:[a-zA-Z[(]\\|\\\\[ \t]*$\\|_ +\\)\\|\\(?:call\\|src\\)_" 260619 H "\\(?:^\\|[^[:alnum:]]\\|\\c|\\)\\(foobar\\)\\(?:$\\|[^[:alnum:]]= \\|\\c|\\)" 222386 H "^[\t ]*#\\(?: \\|$\\)" 179559 H "^[ \t]*\\(\\(?:[-+*]\\|\\(?:[0-9]+\\|[A-Za-z]\\)[.)]\\)\\(?:[ \t= ]+\\|$\\)\\)\\(?:\\[@\\(?:start:\\)?\\([0-9]+\\|[A-Za-z]\\)\\][ \t]*\\)?\\(= ?:\\(\\[[ X-]\\]\\)\\(?:[ \t]+\\|$\\)\\)?\\(?:\\(.*\\)[ \t]+::\\(?:[ \t]+\\= |$\\)\\)?" 
152627 H "\\([ \t]*\\([-+]\\|\\(\\([0-9]+\\)[.)]\\)\\)\\|[ \t]+\\*\\)\\([ = \t]+\\|$\\)" 145600 H "\\(?:^[\t ]*\\(\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)\\)= \\)" 125515 H "\\(\\([0-9]\\{4\\}\\)-\\([0-9]\\{2\\}\\)-\\([0-9]\\{2\\}\\)\\( += [^]+0-9>\x0d\n -]+\\)?\\( +\\([0-9]\\{1,2\\}\\):\\([0-9]\\{2\\}\\)\\)?\\)" 117890 H ":" 105789 H "^[ \t]*\n[ \t]*\n" 102357 H "^[\t ]*:PROPERTIES:[\t ]*\n\\(?:[\t ]*:[^\t\n :]+:.*\n\\)*?[\t ]= *:END:[\t ]*$" 100268 H "\\(?:^[\t ]*CLOCK: \\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2= \\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)\\(?:--\\(?:\\[\\([[:digit:]]\\= {4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)[\t ]+=3D>= [\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\\)?[\t ]*$\\)" 96479 H "[[<]\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\= \(?: .*?\\)?\\)[]>]\\|\\(?:<[0-9]+-[0-9]+-[0-9]+[^>\n]+?\\+[0-9]+[dwmy]>\\)= \\|\\(?:<%%\\(?:([^>\n]+)\\)>\\)" 95804 H "\\([<[]\\(%%\\)?.*?\\)[]>]\\(?:--\\([[<]\\([[:digit:]]\\{4\\}-[[= :digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)[]>]\\)\\)?" 
95801 H "\\([.+]?\\+\\)\\([0-9]+\\)\\([hdwmy]\\)" 95801 H "\\(-\\)?-\\([0-9]+\\)\\([hdwmy]\\)" 95801 H "[012]?[0-9]:[0-5][0-9]\\(-\\([012]?[0-9]\\):\\([0-5][0-9]\\)\\)" 94944 H "^\\(?:\\*+ \\|\\[fn:[-_[:word:]]+\\]\\|%%(\\|[ \t]*\\(?:$\\||\\|= \\+\\(?:-+\\+\\)+[ \t]*$\\|#\\(?: \\|$\\|\\+\\(?:BEGIN_\\S-+\\|\\S-+\\(?:\\= [.*\\]\\)?:[ \t]*\\)\\)\\|:\\(?: \\|$\\|[-_[:word:]]+:[ \t]*$\\)\\|-\\{5,\\= }[ \t]*$\\|\\\\begin{\\([A-Za-z0-9*]+\\)}\\|\\(?:^[\t ]*CLOCK: \\(?:\\[\\([= [:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\= \)\\(?:--\\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\= }\\(?: .*?\\)?\\)\\]\\)[\t ]+=3D>[\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\= \)?[\t ]*$\\)\\|\\(?:[-+*]\\|\\(?:[0-9]+\\)[.)]\\)\\(?:[ \t]\\|$\\)\\)\\)" 89719 H "[-+*]" 70668 H "[ \t]*#\\+\\(?:\\(?1:\\(?:CAPTION\\|RESULTS\\)\\)\\(?:\\[\\(.*\\= )\\]\\)?\\|\\(?1:\\(?:DATA\\|HEADERS?\\|LABEL\\|NAME\\|PLOT\\|RES\\(?:NAME\= \|ULT\\)\\|\\(?:S\\(?:OURC\\|RCNAM\\)\\|TBLNAM\\)E\\)\\)\\|\\(?1:ATTR_[-_A-= Za-z0-9]+\\)\\):[ \t]*" 70409 H "\\(?1:^[ \t]*\\\\begin{[A-Za-z0-9*]+}\\)\\|\\(?2:^[\t ]*:[_[:wor= d:]-]+:[\t ]*$\\)\\|\\(?3:[ \t]*:\\( \\|$\\)\\)\\|\\(?7:^[\t ]*#\\+BEGIN:[\= t ]*\\)\\|\\(?4:[ \t]*#\\+\\)\\(?:BEGIN_\\(?5:[^[:space:]]+\\)\\|\\(?6:CALL= :\\)\\|\\(?8:[^[:space:]]+:\\)\\)\\|\\(?9:^\\[fn:\\([-_[:word:]]+\\)\\]\\)\= \|\\(?10:[ \t]*-----+[ \t]*$\\)\\|\\(?11:%%(\\)" 70409 H "[ \t]*$" 49780 H "^[ \t]*:END:[ \t]*$" 46940 H "[ \t]*|" 46800 H "[\t ]*\\+\\(?:-+\\+\\)+[\t ]*$" 43578 H "\\`[ \t\n\x0d]+" 43491 H "[ \t\n\x0d]+\\'" 43276 H "\\[#.\\][ \t]*" 43276 H "\\(CANCELLED\\|DO\\(?:ING\\|NE\\)\\|F\\(?:AILED\\|ROZEN\\)\\|HOL= D\\|MERGED\\|NEXT\\|REVIEW\\|SOMEDAY\\|T\\(?:ICKLER\\|ODO\\)\\|WAITING\\)\\= (?: \\|$\\)" 43276 H "\\(:[[:alnum:]_@#%:]+:\\)[ \t]*$" 43276 H "COMMENT\\(?: \\|$\\)" 41509 H "[ \t]*[A-Za-z0-9]" 33419 H "^[\t ]*:\\([_[:word:]-]+\\):[\t ]*$" 28976 H "\\(\\S-+\\)[ \t]*$" 27833 H "\\(?:^\\*\\{1,17\\} \\)" 20981 H 
"\\[\\[\\(\\(?:[^][\\]\\|\\\\\\(?:\\\\\\\\\\)*[][]\\|\\\\+[^][]\\= )+\\)]\\(?:\\[\\([^z-a]+?\\)]\\)?]" 20701 H "\\`file\\(?:\\+\\(.+\\)\\)?\\'" 19234 H "\\(\\\\+\\)\\(?:\\'\\|[][]\\)" 19064 H "[ \t]*\n[ \t]*" 19058 H "^\\([^:]*\\)\\(::?\\(.*\\)\\)?$" 19057 H "\\`\\.\\.?/" 19057 H "\\`\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?= :pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)= \\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]= \\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:d= esc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|= Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\= |short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:ab= el\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\= (?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\= \*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(= ?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:= \\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\= \)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\= \)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|= l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:amer= ef\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:sea= rch\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:= -\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|note= cite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\= \|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):" 17440 H "\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)" 15376 H "^[ \t]*$" 14348 H "^\\*\\{18,100\\}+ " 13471 H "\\(?:^\\*\\{1,7\\} \\)" 10158 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)/[^[:space:]]\\)" 
9779 H "\\(?:^\\*\\{1,6\\} \\)" 9122 H "\\(?:^\\*\\{1,5\\} \\)" 7577 H "[ \t]*#\\+BEGIN_\\(\\S-+\\)" 7531 H "^[ \t]*\\\\begin{\\([A-Za-z0-9*]+\\)}" 7494 H "[ \t]*#\\+\\(\\S-+\\)\\[.*\\]:" 6442 H "[ \t]*\\(.*?\\)[ \t]*\\(?:|\\|$\\)" 5151 H "\\(?:^\\*\\{1,4\\} \\)" 3732 H "\\(\\S-\\)\\([_^]\\)\\(\\(?:{\\([^{}]*?\\|\\(?:[^{}]*?{[^{}]*?}\= \)+[^{}]*?\\|\\(?:[^{}]*?{\\(?:[^{}]*?{[^{}]*?}\\)+[^{}]*?}\\)+[^{}]*?\\)}\= \)\\|\\(?:(\\([^()]*?\\|\\(?:[^()]*?([^()]*?)\\)+[^()]*?\\|\\(?:[^()]*?(\\(= ?:[^()]*?([^()]*?)\\)+[^()]*?)\\)+[^()]*?\\))\\)\\|\\(?:\\*\\|[+-]?[[:alnum= :].,\\]*[[:alnum:]]\\)\\)" 3494 H "[[:blank:]]*$" 3359 H "[.)]" 3159 H "\\(?:^\\*\\{1,8\\} \\)" 2154 H "[ \t]*#\\+BEGIN\\(:\\|_\\S-+\\)" 2065 H "^[ \t]*|-" 1871 H "\\(?:[^[:space:]]\\(/\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\= )" 1643 H "\\(?:\\<\\(?:\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\= |short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?= :pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]= \\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\= |Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|note= cite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(= ?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\= \|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:i= te\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*= \\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)= ?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(= ?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s= \\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:= p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscap= e\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|= 
n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?= :-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)= \\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\= |df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s= \\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\)= :\\(\\(?:[^][ \t\n()<>]\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\\)+= \\(?:[^[:punct:] \t\n]\\|/\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\= \)\\)\\)" 1557 H "\\\\\\(?:\\(?1:_ +\\)\\|\\(?1:there4\\|sup[123]\\|frac[13][24]\\= |[a-zA-Z]+\\)\\(?2:$\\|{}\\|[^[:alpha:]]\\)\\)" 963 H "\\\\\\\\[ \t]*$" 876 H "^[ \t]*,*\\(,\\)\\(?:\\*\\|#\\+\\)" 838 M "\\(\\([0-9]\\{4\\}\\)-\\([0-9]\\{2\\}\\)-\\([0-9]\\{2\\}\\)\\( += [^]+0-9>\x0d\n -]+\\)?\\( +\\([0-9]\\{1,2\\}\\):\\([0-9]\\{2\\}\\)\\)?\\)" 838 M "\\([<[]\\(%%\\)?.*?\\)[]>]\\(?:--\\([[<]\\([[:digit:]]\\{4\\}-[[= :digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)[]>]\\)\\)?" 
838 M "\\([.+]?\\+\\)\\([0-9]+\\)\\([hdwmy]\\)" 838 M "\\(-\\)?-\\([0-9]+\\)\\([hdwmy]\\)" 838 M "[012]?[0-9]:[0-5][0-9]\\(-\\([012]?[0-9]\\):\\([0-5][0-9]\\)\\)" 815 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\*[^[:space:]]\\)" 811 M "[[<]\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\= \(?: .*?\\)?\\)[]>]\\|\\(?:<[0-9]+-[0-9]+-[0-9]+[^>\n]+?\\+[0-9]+[dwmy]>\\)= \\|\\(?:<%%\\(?:([^>\n]+)\\)>\\)" 775 H "\\(?:[^[:space:]]\\(\\*\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)= \\)" 758 M "[ \t]*\n[ \t]*" 755 M "^\\([^:]*\\)\\(::?\\(.*\\)\\)?$" 755 M "\\`\\.\\.?/" 755 M "\\`\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?= :pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)= \\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]= \\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:d= esc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|= Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\= |short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:ab= el\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\= (?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\= \*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(= ?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:= \\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\= \)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\= \)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|= l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:amer= ef\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:sea= rch\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:= -\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|note= 
cite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\= \|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):" 755 M "\\(\\\\+\\)\\(?:\\'\\|[][]\\)" 752 H "[^ \x0d\t\n]" 751 H "\\(?:^\\*\\{1,9\\} \\)" 735 M "\\(\\S-+\\)[ \t]*$" 725 M "\\`file\\(?:\\+\\(.+\\)\\)?\\'" 714 M "\\[\\[\\(\\(?:[^][\\]\\|\\\\\\(?:\\\\\\\\\\)*[][]\\|\\\\+[^][]\\= )+\\)]\\(?:\\[\\([^z-a]+?\\)]\\)?]" 662 M "^\\*\\{18,100\\}+ " 641 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\+[^[:space:]]\\)" 635 M "[[:blank:]]*$" 601 H "^[ \t]*#\\+END_SRC[ \t]*$" 600 M "^[ \t]*$" 596 H "^[ \t]*#\\+BEGIN_SRC\\(?: +\\(\\S-+\\)\\)?\\(\\(?: +\\(?:-\\(?:l= \".+\"\\|[ikr]\\)\\|[-+]n\\(?: *[0-9]+\\)?\\)\\)+\\)?\\(.*\\)[ \t]*$" 586 M "\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)" 579 M "[ \t]*#\\+BEGIN_\\(\\S-+\\)" 578 M "^[ \t]*\\\\begin{\\([A-Za-z0-9*]+\\)}" 578 M "[ \t]*#\\+\\(\\S-+\\)\\[.*\\]:" 549 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)=3D[^[:space:]]\\)" 515 M "[ \t]*#\\+BEGIN\\(:\\|_\\S-+\\)" 507 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)/[^[:space:]]\\)" 482 M "[.)]" 438 M "\\(?:^[\t ]*\\(\\(?:\\(?:CLOSED\\|DEADLINE\\|SCHEDULED\\):\\)\\)= \\)" 430 M "^\\(?4:[ \t]*\\)\\(?1::\\(?2:\\S-+\\):\\)\\(?:\\(?3:$\\)\\|[ \t]= +\\(?3:.*?\\)\\)\\(?5:[ \t]*\\)$" 428 H "\\`(" 421 M "\\(?:^\\*\\{1,17\\} \\)" 411 M "^[\t ]*:PROPERTIES:[\t ]*\n\\(?:[\t ]*:[^\t\n :]+:.*\n\\)*?[\t ]= *:END:[\t ]*$" 404 H "\\[[0-9]*\\(%\\|/[0-9]*\\)\\]" 396 M "\\(\\S-\\)\\([_^]\\)\\(\\(?:{\\([^{}]*?\\|\\(?:[^{}]*?{[^{}]*?}\= \)+[^{}]*?\\|\\(?:[^{}]*?{\\(?:[^{}]*?{[^{}]*?}\\)+[^{}]*?}\\)+[^{}]*?\\)}\= \)\\|\\(?:(\\([^()]*?\\|\\(?:[^()]*?([^()]*?)\\)+[^()]*?\\|\\(?:[^()]*?(\\(= ?:[^()]*?([^()]*?)\\)+[^()]*?)\\)+[^()]*?\\))\\)\\|\\(?:\\*\\|[+-]?[[:alnum= :].,\\]*[[:alnum:]]\\)\\)" 395 H "\\`///*\\(.:\\)?/" 395 H "::\\(.*\\)\\'" 389 M "\\\\\\\\[ \t]*$" 340 M "^[ \t]*:END:[ \t]*$" 270 M "\\(?:^\\*\\{1,5\\} \\)" 260 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\*[^[:space:]]\\)" 257 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)~[^[:space:]]\\)" 
253 M "\\(?:^\\*\\{1,6\\} \\)" 253 M "\\(?:[^[:space:]]\\(\\*\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)= \\)" 240 M "\\(?:^\\*\\{1,4\\} \\)" 239 M "\\(?:\\<\\(?:\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\= |short\\(?:pl\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?= :pl\\)?\\)\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]= \\*\\|[pt]\\)\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\= |Gls\\(?:desc\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|note= cite\\)\\|Smartcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(= ?:pl\\)?\\|short\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\= \|b\\(?:abel\\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:i= te\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*= \\|t\\(?:\\*\\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)= ?\\|ref\\(?:range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(= ?:ile\\(?:\\+\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s= \\|texts?\\)?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:= p\\|symbo\\)l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscap= e\\)\\)\\|l\\(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|= n\\(?:ameref\\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?= :-\\(?:search\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)= \\|it\\(?:-\\(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\= |df\\|notecite\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s= \\(?:hell\\|martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\)= :\\(\\(?:[^][ \t\n()<>]\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\\)+= \\(?:[^[:punct:] \t\n]\\|/\\|(\\(?:[^][ \t\n()<>]\\|([^][ \t\n()<>]*)\\)*)\= \)\\)\\)" 210 M "\\(?:^\\*\\{1,7\\} \\)" 193 H "\\(?:[^[:space:]]\\(=3D\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)= \\)" 192 M "\\\\\\(?:\\(?1:_ 
+\\)\\|\\(?1:there4\\|sup[123]\\|frac[13][24]\\= |[a-zA-Z]+\\)\\(?2:$\\|{}\\|[^[:alpha:]]\\)\\)" 186 H "<\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:p= l\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\= |uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\= )\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:des= c\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Sm= artcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|s= hort\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel= \\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?= :l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*= \\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:= range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\= +\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)= ?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)= l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\= \(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref= \\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:searc= h\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\= \(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|noteci= te\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|= martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\([^>\n]*\\(?= :\n[ \t]*[^> \t\n][^>\n]*\\)*\\)>" 161 H "\\(?:^\\*\\{1,3\\} \\)" 156 M "\\(?:^\\*\\{1,8\\} \\)" 154 H "^ATTR_" 149 H "\\(?:^\\*\\{1,10\\} \\)" 147 M "[^ \x0d\t\n]" 144 M "^[ \t]*#\\+END_SRC[ \t]*$" 141 M "^[ \t]*#\\+BEGIN_SRC\\(?: +\\(\\S-+\\)\\)?\\(\\(?: +\\(?:-\\(?:l= \".+\"\\|[ikr]\\)\\|[-+]n\\(?: *[0-9]+\\)?\\)\\)+\\)?\\(.*\\)[ \t]*$" 135 H 
"\\(?:[^[:space:]]\\(~\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\= )" 122 M "\\(?:^\\*\\{1,3\\} \\)" 118 M "[-+*]" 117 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)=3D[^[:space:]]\\)" 111 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)~[^[:space:]]\\)" 108 M "[ \t]*[A-Za-z0-9]" 104 M "\\[[0-9]*\\(%\\|/[0-9]*\\)\\]" 103 M "\\(?:[^[:space:]]\\(/\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\= )" 101 M "^[ \t]*\n[ \t]*\n" 93 H "\\(\\s.\\|\\s-\\|\\s(\\|\\s)\\|\\s\"\\|'\\|$\\)" 92 M "<\\(A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:p= l\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\= |uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\= )\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:des= c\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Sm= artcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|s= hort\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel= \\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?= :l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*= \\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:= range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\= +\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)= ?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)= l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\= \(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref= \\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:searc= h\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\= \(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|noteci= te\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|= martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\([^>\n]*\\(?= :\n[ 
\t]*[^> \t\n][^>\n]*\\)*\\)>" 92 H "[ \t]*END[ \t]*$" 89 M "\\`[ \t\n\x0d]+" 89 M "[ \t\n\x0d]+\\'" 83 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)\\+[^[:space:]]\\)" 83 M "\\(?:[^[:space:]]\\(~\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\= )" 77 M "^ATTR_" 77 M "\\`///*\\(.:\\)?/" 77 M "::\\(.*\\)\\'" 57 H "\\\\[a-zA-Z]+\\*?\\(\\(\\[[^][\n{}]*\\]\\)\\|\\({[^{}\n]*}\\)\\)= *" 52 M "\\`(" 52 M "\\(?:^\\*\\{1,2\\} \\)" 51 H "^[ \t]*#\\+END_src[ \t]*$" 50 H "[ \t]*#\\+TBLFM: +\\(.*\\)[ \t]*$" 46 M "\\(?:[^[:space:]]\\(=3D\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)= \\)" 46 M "[ \t]*END[ \t]*$" 39 M "^[ \t]*|-" 39 M "^[ \t]*\\($\\|[^| \t]\\)" 39 M "[ \t]*\\(.*?\\)[ \t]*\\(?:|\\|$\\)" 39 M "[ \t]*#\\+TBLFM: +\\(.*\\)[ \t]*$" 37 M "\\(?:^\\*\\{1,9\\} \\)" 36 M "^[ \t]*\\(\\(?:[-+*]\\|\\(?:[0-9]+\\|[A-Za-z]\\)[.)]\\)\\(?:[ \t= ]+\\|$\\)\\)\\(?:\\[@\\(?:start:\\)?\\([0-9]+\\|[A-Za-z]\\)\\][ \t]*\\)?\\(= ?:\\(\\[[ X-]\\]\\)\\(?:[ \t]+\\|$\\)\\)?\\(?:\\(.*\\)[ \t]+::\\(?:[ \t]+\\= |$\\)\\)?" 35 M "^[ \t]*#\\+END_QUOTE[ \t]*$" 33 H "^[ \t]*: ?" 
32 H "\\\\end{equation}[ \t]*$" 29 H "^[ \t]*\\($\\|[^| \t]\\)" 29 H "\\(?:[^[:space:]]\\(\\+\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)= \\)" 26 M "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)_[^[:space:]]\\)" 23 M "^[ \t]*#\\+END_src[ \t]*$" 22 M "\\(?:[^[:space:]]\\(\\+\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)= \\)" 21 M "\\(\\s.\\|\\s-\\|\\s(\\|\\s)\\|\\s\"\\|'\\|$\\)" 21 M "\\(?:[^[:space:]]\\(_\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\= )" 21 H "\\(?:^\\*\\{1,2\\} \\)" 20 H "[ \t]*:\\( \\|$\\)" 19 M "\\(?:^\\*\\{1,10\\} \\)" 19 M "[\t ]*\\+\\(?:-+\\+\\)+[\t ]*$" 18 H "END[ \t]*$" 16 M "^[ \t]*#\\+END_quote[ \t]*$" 15 M "[ \t]*|" 15 H "\\(?:\\(?:^\\|[\"'({[:space:]-]\\)_[^[:space:]]\\)" 14 M "END[ \t]*$" 13 M "\\(?:^[\t ]*CLOCK: \\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2= \\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)\\(?:--\\(?:\\[\\([[:digit:]]\\= {4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\\)[\t ]+=3D>= [\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\\)?[\t ]*$\\)" 13 M "\\(?1:^[ \t]*\\\\begin{[A-Za-z0-9*]+}\\)\\|\\(?2:^[\t ]*:[_[:wor= d:]-]+:[\t ]*$\\)\\|\\(?3:[ \t]*:\\( \\|$\\)\\)\\|\\(?7:^[\t ]*#\\+BEGIN:[\= t ]*\\)\\|\\(?4:[ \t]*#\\+\\)\\(?:BEGIN_\\(?5:[^[:space:]]+\\)\\|\\(?6:CALL= :\\)\\|\\(?8:[^[:space:]]+:\\)\\)\\|\\(?9:^\\[fn:\\([-_[:word:]]+\\)\\]\\)\= \|\\(?10:[ \t]*-----+[ \t]*$\\)\\|\\(?11:%%(\\)" 13 M "[ \t]*$" 13 M "[ \t]*#\\+\\(?:\\(?1:\\(?:CAPTION\\|RESULTS\\)\\)\\(?:\\[\\(.*\\= )\\]\\)?\\|\\(?1:\\(?:DATA\\|HEADERS?\\|LABEL\\|NAME\\|PLOT\\|RES\\(?:NAME\= \|ULT\\)\\|\\(?:S\\(?:OURC\\|RCNAM\\)\\|TBLNAM\\)E\\)\\)\\|\\(?1:ATTR_[-_A-= Za-z0-9]+\\)\\):[ \t]*" 13 H "<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\\= )>>" 13 H "<<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\= \)>>>" 11 M "\\\\[a-zA-Z]+\\*?\\(\\(\\[[^][\n{}]*\\]\\)\\|\\({[^{}\n]*}\\)\\)= *" 11 M "<<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\= \)>>>" 11 H "\\(?:[^[:space:]]\\(_\\)\\(?:[!\"'),-.:;?[\\}[:space:]]\\|$\\)\\= )" 11 H 
"[ \t]*#\\+\\(\\S-*\\):" 10 M "^[ \t]*: ?" 10 M "[ \t]*:\\( \\|$\\)" 10 M "<<\\([^<>\n\x0d \t]\\|[^<>\n\x0d \t][^<>\n\x0d]*[^<>\n\x0d \t]\\= )>>" 9 H "^[ \t]*#\\+END_QUOTE[ \t]*$" 6 M "^\\(?:\\*+ \\|\\[fn:[-_[:word:]]+\\]\\|%%(\\|[ \t]*\\(?:$\\||\\|= \\+\\(?:-+\\+\\)+[ \t]*$\\|#\\(?: \\|$\\|\\+\\(?:BEGIN_\\S-+\\|\\S-+\\(?:\\= [.*\\]\\)?:[ \t]*\\)\\)\\|:\\(?: \\|$\\|[-_[:word:]]+:[ \t]*$\\)\\|-\\{5,\\= }[ \t]*$\\|\\\\begin{\\([A-Za-z0-9*]+\\)}\\|\\(?:^[\t ]*CLOCK: \\(?:\\[\\([= [:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\}\\(?: .*?\\)?\\)\\]\= \)\\(?:--\\(?:\\[\\([[:digit:]]\\{4\\}-[[:digit:]]\\{2\\}-[[:digit:]]\\{2\\= }\\(?: .*?\\)?\\)\\]\\)[\t ]+=3D>[\t ]+[[:digit:]]+:[[:digit:]][[:digit:]]\= \)?[\t ]*$\\)\\|\\(?:[-+*]\\|\\(?:[0-9]+\\)[.)]\\)\\(?:[ \t]\\|$\\)\\)\\)" 6 M "[ \t]*#\\+\\(\\S-*\\):" 6 H "^[ \t]*#\\+END_quote[ \t]*$" 5 M "^[\t ]*:\\([_[:word:]-]+\\):[\t ]*$" 5 M "\\([ \t]*\\([-+]\\|\\(\\([0-9]+\\)[.)]\\)\\)\\|[ \t]+\\*\\)\\([ = \t]+\\|$\\)" 5 M "\\(?:^\\*\\{1\\} \\)" 5 H "\\`\\(?:\\)?/\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(= ?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)= \\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:= [[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)*\\(?:\\(?:-\\|[[:alnum:]]+\\)\\(?:\\= (?::\\)\\(?:\\(?:[^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(?:[%._[:alnum:]-]+\\|\\= (?:\\[\\)\\(?:\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)\\(?:]\\)?\\)?\\)?= \\)?\\)?\\'" 5 H "\\`\\(.+\\.\\(?:7z\\|CAB\\|LZH\\|MSU\\|ZIP\\|a\\(?:pk\\|r\\)\\|c= \\(?:ab\\|pio\\|rate\\)\\|de\\(?:b\\|pot\\)\\|e\\(?:pub\\|xe\\)\\|iso\\|jar= \\|lzh\\|m\\(?:su\\|tree\\)\\|od[bfgpst]\\|pax\\|r\\(?:ar\\|pm\\)\\|shar\\|= t\\(?:ar\\|bz\\|gz\\|lz\\|xz\\|zst\\)\\|warc\\|x\\(?:ar\\|p[is]\\)\\|zip\\)= \\(?:\\.\\(?:Z\\|bz2\\|gz\\|l\\(?:rz\\|z\\(?:ma\\|[4o]\\)?\\)\\|uu\\|xz\\|z= st\\)\\)?\\)\\(/.*\\)\\'" 5 H "\\`/:" 5 H "\\.gpg\\(~\\|\\.~[0-9]+~\\)?\\'" 5 H "\\(?:^/\\)\\(\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(= 
?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)= \\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:= [[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)+\\)?\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\= \)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+= \\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:= \\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?::\\)\\([^\n\x0d]*\\'\\)" 5 H "\\(?:\\.tzst\\|\\.zst\\|\\.dz\\|\\.txz\\|\\.xz\\|\\.lzma\\|\\.lz= \\|\\.g?z\\|\\.\\(?:tgz\\|svgz\\|sifz\\)\\|\\.tbz2?\\|\\.bz2\\|\\.Z\\)\\(?:= ~\\|\\.~[-[:alnum:]:#@^._]+\\(?:~[[:digit:]]+\\)?~\\)?\\'" 3 M "^[ \t]*#\\+END_equation[ \t]*$" 3 M "\\`\\(?:\\)?/\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(= ?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)= \\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:= [[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)*\\(?:\\(?:-\\|[[:alnum:]]+\\)\\(?:\\= (?::\\)\\(?:\\(?:[^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(?:[%._[:alnum:]-]+\\|\\= (?:\\[\\)\\(?:\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)\\(?:]\\)?\\)?\\)?= \\)?\\)?\\'" 3 M "\\`\\(.+\\.\\(?:7z\\|CAB\\|LZH\\|MSU\\|ZIP\\|a\\(?:pk\\|r\\)\\|c= \\(?:ab\\|pio\\|rate\\)\\|de\\(?:b\\|pot\\)\\|e\\(?:pub\\|xe\\)\\|iso\\|jar= \\|lzh\\|m\\(?:su\\|tree\\)\\|od[bfgpst]\\|pax\\|r\\(?:ar\\|pm\\)\\|shar\\|= t\\(?:ar\\|bz\\|gz\\|lz\\|xz\\|zst\\)\\|warc\\|x\\(?:ar\\|p[is]\\)\\|zip\\)= \\(?:\\.\\(?:Z\\|bz2\\|gz\\|l\\(?:rz\\|z\\(?:ma\\|[4o]\\)?\\)\\|uu\\|xz\\|z= st\\)\\)?\\)\\(/.*\\)\\'" 3 M "\\`/:" 3 M "\\\\end{equation}[ \t]*$" 3 M "\\.gpg\\(~\\|\\.~[0-9]+~\\)?\\'" 3 M "\\(?:^/\\)\\(\\(?:\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\\)\\(?::\\)\\(= ?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+\\|\\(?:\\[\\)= \\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:\\(?:#\\)\\(?:= [[:digit:]]+\\)\\)?\\)?\\)\\(?:|\\)\\)+\\)?\\(?:\\(-\\|[[:alnum:]]\\{2,\\}\= 
\)\\(?::\\)\\(?:\\([^/:|[:blank:]]+\\)\\(?:@\\)\\)?\\(\\(?:[%._[:alnum:]-]+= \\|\\(?:\\[\\)\\(?:\\(?:[[:alnum:]]*:\\)+[.[:alnum:]]*\\)?\\(?:]\\)\\)\\(?:= \\(?:#\\)\\(?:[[:digit:]]+\\)\\)?\\)?\\)\\(?::\\)\\([^\n\x0d]*\\'\\)" 3 M "\\(?:\\.tzst\\|\\.zst\\|\\.dz\\|\\.txz\\|\\.xz\\|\\.lzma\\|\\.lz= \\|\\.g?z\\|\\.\\(?:tgz\\|svgz\\|sifz\\)\\|\\.tbz2?\\|\\.bz2\\|\\.Z\\)\\(?:= ~\\|\\.~[-[:alnum:]:#@^._]+\\(?:~[[:digit:]]+\\)?~\\)?\\'" 3 M "[A-Za-z]" 3 M "[0-9]+" 3 M "[ \t]*#\\+BEGIN_\\(\\S-+\\)[ \t]*\\(.*\\)[ \t]*$" 3 H "\\+$" 2 M "^[ \t]*#\\+END_EXAMPLE[ \t]*$" 2 M "^[ \t]*#\\+BEGIN_EXAMPLE\\(?: +\\(.*\\)\\)?" 2 M "\\[cite\\(?:/\\([/_[:alnum:]-]+\\)\\)?:[\t\n ]*" 2 M "\\+$" 2 M "@\\([!#-+./:<>-@^-`{-~[:word:]-]+\\)" 2 H "@\\([!#-+./:<>-@^-`{-~[:word:]-]+\\)" 1 M "^\\*+ " 1 M "^[\t ]*#\\(?: \\|$\\)" 1 M "^[ \t]*,*\\(,\\)\\(?:\\*\\|#\\+\\)" 1 M "^[ \t]*#\\+END_export[ \t]*$" 1 M "^[ \t]*#\\+END_EXPORT[ \t]*$" 1 M "^[ \t]*#\\+END_COMMENT[ \t]*$" 1 M "^[ \t]*#\\+CATEGORY:" 1 M "\\\\end{array}[ \t]*$" 1 M "\\[fn:\\(?:\\(?1:[-_[:word:]]+\\)?\\(:\\)\\|\\(?1:[-_[:word:]]+\= \)\\]\\)" 1 M "\\[#.\\][ \t]*" 1 M "\\.[^.]*\\'" 1 M "\\(CANCELLED\\|DO\\(?:ING\\|NE\\)\\|F\\(?:AILED\\|ROZEN\\)\\|HOL= D\\|MERGED\\|NEXT\\|REVIEW\\|SOMEDAY\\|T\\(?:ICKLER\\|ODO\\)\\|WAITING\\)\\= (?: \\|$\\)" 1 M "\\(?:~\\|\\.~[-[:alnum:]:#@^._]+\\(?:~[[:digit:]]+\\)?~\\)\\'" 1 M "\\(?:^\\|[^[:alnum:]]\\|\\c|\\)\\(foobar\\)\\(?:$\\|[^[:alnum:]]= \\|\\c|\\)" 1 M "\\(?:^\\*\\{1,11\\} \\)" 1 M "\\(?:[_^][-{(*+.,[:alnum:]]\\)\\|[*+/=3D_~][^[:space:]]\\|\\<\\(= ?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|cr= \\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\|uto\\(?:cite= [*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?= \\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:desc\\|\\(?:p\\|= symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Smartcites?\\|T= extcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\= 
)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel\\|ib\\(?:ent= ry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\= *\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*\\|ext\\|itle= \\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:range\\)?\\)\= \|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\+\\(?:\\(?:em= ac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)?\\|fullcite\= \)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)l\\)?\\|h\\(?= :elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\\(?:abel\\|is= t-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref\\|ews\\|o\\(= ?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:search\\|tree\\)\\= )?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\\(?:log\\|rev= \\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|notecite\\|rint\\(?= :bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|martcites?\\|= upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\):\\|\\[\\(?:cite[:/]\\|fn:\\= |\\(?:[0-9]\\|\\(?:%\\|/[0-9]*\\)\\]\\)\\|\\[\\)\\|@@\\|{{{\\|<\\(?:%%\\|<\= \|[0-9]\\|\\(?:A\\(?:CR\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:p= l\\)?\\)\\|cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|short\\(?:pl\\)?\\)\\= |uto\\(?:cite[*s]?\\|ref\\)\\)\\|C\\(?:ite\\(?:a\\(?:l\\(?:[pt]\\*\\|[pt]\\= )\\|uthor\\*?\\)\\|[pt]\\*\\|[pst]\\)?\\|ref\\(?:range\\)?\\)\\|Gls\\(?:des= c\\|\\(?:p\\|symbo\\)l\\)?\\|Notecite\\|P\\(?:arencites?\\|notecite\\)\\|Sm= artcites?\\|Textcites?\\|a\\(?:cr\\(?:full\\(?:pl\\)?\\|long\\(?:pl\\)?\\|s= hort\\(?:pl\\)?\\)\\|ttachment\\|uto\\(?:cite[*s]?\\|ref\\)\\)\\|b\\(?:abel= \\|ib\\(?:entry\\|liography\\(?:style\\)?\\|tex\\)\\)\\|c\\(?:ite\\(?:a\\(?= :l\\(?:[pt]\\*\\|[pt]\\)\\|uthor\\*?\\)\\|date\\*?\\|num\\|p\\*\\|t\\(?:\\*= \\|ext\\|itle\\*?\\)\\|url\\|year\\(?:\\*\\|par\\)?\\|[*pst]\\)?\\|ref\\(?:= 
range\\)?\\)\\|doi\\|e\\(?:l\\(?:feed\\|isp\\)\\|qref\\)\\|f\\(?:ile\\(?:\\= +\\(?:\\(?:emac\\|sy\\)s\\)\\)?\\|notecite\\|oot\\(?:cite\\(?:s\\|texts?\\)= ?\\|fullcite\\)\\|tp\\|ullcite\\)\\|gls\\(?:desc\\|link\\|\\(?:p\\|symbo\\)= l\\)?\\|h\\(?:elp\\|ttps?\\)\\|i\\(?:d\\|n\\(?:dex\\|fo\\|kscape\\)\\)\\|l\= \(?:abel\\|ist-of-\\(?:\\(?:figur\\|tabl\\)es\\)\\)\\|mailto\\|n\\(?:ameref= \\|ews\\|o\\(?:bibliography\\*?\\|cite\\|t\\(?:ecite\\|much\\(?:-\\(?:searc= h\\|tree\\)\\)?\\)\\)\\)\\|org\\(?:-\\(?:contact\\|ql-search\\)\\|it\\(?:-\= \(?:log\\|rev\\)\\)?\\)\\|p\\(?:a\\(?:geref\\|rencite[*s]?\\)\\|df\\|noteci= te\\|rint\\(?:bibliography\\|glossaries\\|index\\)\\)\\|ref\\|s\\(?:hell\\|= martcites?\\|upercites?\\)\\|te\\(?:l\\|xtcites?\\)\\|w3m\\)\\)\\|\\$\\|\\\= \\\(?:[a-zA-Z[(]\\|\\\\[ \t]*$\\|_ +\\)\\|\\(?:call\\|src\\)_" 1 M "\\(:[[:alnum:]_@#%:]+:\\)[ \t]*$" 1 M "[ \t]*#\\+BEGIN_EXPORT\\(?:[ \t]+\\(\\S-+\\)\\)?[ \t]*$" 1 M "COMMENT\\(?: \\|$\\)" 1 M ":" 1 H "^[ \t]*#\\+END_EXPORT[ \t]*$" 1 H "\\\\end{array}[ \t]*$" 1 H "\\[fn:\\(?:\\(?1:[-_[:word:]]+\\)?\\(:\\)\\|\\(?1:[-_[:word:]]+\= \)\\]\\)" 1 H "\\(?:^\\*\\{1\\} \\)" 1 H "[ \t]*#\\+BEGIN_EXPORT\\(?:[ \t]+\\(\\S-+\\)\\)?[ \t]*$" --=-=-= Content-Type: text/plain -- Ihor Radchenko // yantar92, Org mode contributor, Learn more about Org mode at . 
Support Org development at , or support my work at
--=-=-=--

From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 14:21:13 2023
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
From: Mattias Engdegård
In-Reply-To: <87zg6fjar6.fsf@localhost>
Date: Mon, 8 May 2023 20:21:03 +0200
Message-Id:
References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org>
 <874jou7lsf.fsf@localhost>
 <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com>
 <87wn1qqvj0.fsf@localhost>
 <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com>
 <87wn1psqny.fsf@localhost>
 <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com>
 <87zg6lfobh.fsf@localhost>
 <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com>
 <87o7n1v1w3.fsf@localhost>
 <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com>
 <87ednvul22.fsf@localhost>
 <87y1m1oa01.fsf@localhost>
 <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com>
 <87zg6fjar6.fsf@localhost>
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org

On 8 May 2023 at 13:58, Ihor Radchenko wrote:

> (save-excursion
>   (beginning-of-line 0)
>   (not (looking-at-p "[[:blank:]]*$"))))

I wonder if that last part isn't better written as

  (save-excursion
    (forward-line 0)                    ; faster than beginning-of-line
    (skip-chars-forward "[:blank:]")    ; faster than looking-at-p
    (not (eolp)))                       ; very cheap

which doesn't use regexps at all. Worth a try.

But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).

> [now part of the giant rx]
>
> (rx line-start (0+ (any ?\s ?\t))
>     ":" (1+ (any ?- ?_ word)) ":"
>     (0+ (any ?\s ?\t)) line-end)

Any reason you don't capture the part between the colons here, so that you don't need to match it later on?

> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?

"[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)

There is a regexp disassembler for the really curious but it doesn't come with Emacs.

> Your suggestions about using (not (in ...)) in place of (in ...) are
> good, but I am afraid to break CJK cases where people can use unexpected
> set of characters. I was bitten by this in the past.

In this case there's no need since you could gain some speed by the simple rewrite (or -> in) above, but there may be others where conditions are different.

>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
>
> Sure it does.
> It is just one of the variable parts of Org syntax that might be
> changed. There are ways to make this into constant, but it is a fragile
> area of the code that I do not want to touch without a reason.
> (Especially given that I am not familiar with org-list.el)

So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line

;; Package-Requires: ((emacs "26.1"))

in org.el should probably be updated, right?

> I hope we do. If only Emacs had a way to define `case-fold-search' right
> within the regexp itself.

I would like that too, but changing that isn't easy.

By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?

From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 15:32:47 2023
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
From: Mattias Engdegård
In-Reply-To: <87h6sn9bb1.fsf@localhost>
Date: Mon, 8 May 2023 21:32:37 +0200
Message-Id: <1473BC99-6A16-482F-B77E-0E7B25B4844E@gmail.com>
References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org>
 <874jou7lsf.fsf@localhost>
 <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com>
 <87wn1qqvj0.fsf@localhost>
 <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com>
 <87wn1psqny.fsf@localhost>
 <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com>
 <87zg6lfobh.fsf@localhost>
 <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com>
 <87o7n1v1w3.fsf@localhost>
 <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com>
 <87ednvul22.fsf@localhost>
 <87h6sn9bb1.fsf@localhost>
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org

On 8 May 2023 at 15:56, Ihor Radchenko wrote:

> I am not sure what I can make out of hits/misses, but I am at least able
> to look into frequency data, via sort re.log | uniq -c > re-freq.log

I'm mostly curious about the regexp cache behaviour. What cache size did you use in this run?
Hardly 20, given the low miss rate? It would be interesting to see what sequence of regexps most commonly cause thrashing.
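
To make the thrashing concern concrete, here is a rough sketch that models a fixed-size, move-to-front cache and counts misses for a sequence of lookups. It is only a model: it does not call into search.c, the eviction policy is an assumption about how such a cache typically behaves, and `my/lru-miss-count' and the "re-N" pattern names are hypothetical.

```elisp
(require 'cl-lib)

(defun my/lru-miss-count (patterns cache-size)
  "Return the number of cache misses when PATTERNS are looked up in order.
The cache is a hypothetical move-to-front list holding at most
CACHE-SIZE entries.  It models eviction only, not real compilation."
  (let ((cache nil)
        (misses 0))
    (dolist (pat patterns)
      (if (member pat cache)
          ;; Hit: move the entry to the front of the list.
          (setq cache (cons pat (delete pat cache)))
        ;; Miss: count it, add the entry at the front, and drop the
        ;; least recently used entry if the cache is over capacity.
        (cl-incf misses)
        (push pat cache)
        (when (> (length cache) cache-size)
          (setq cache (butlast cache)))))
    misses))

;; Cycle through 21 distinct patterns ten times.
(let ((cycle (cl-loop repeat 10
                      append (cl-loop for i below 21
                                      collect (format "re-%d" i)))))
  (list (my/lru-miss-count cycle 20)    ; one slot too few
        (my/lru-miss-count cycle 30)))  ; room for every pattern
;; => (210 21)
```

With one slot too few, every lookup in the cycle misses, while a cache large enough for the whole working set misses only on first use. A parser looping over a fixed set of a few dozen patterns would sit in the first regime if the cache held only 20 entries.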
> It would be even nicer if apart from frequency, there was information
> about time taken to search for each regexp.

That's a bit messier but could be done if really needed.

From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 15:35:56 2023
From: Ihor Radchenko
To: Mattias Engdegård
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
In-Reply-To:
References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org>
 <874jou7lsf.fsf@localhost>
 <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com>
 <87wn1qqvj0.fsf@localhost>
 <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com>
 <87wn1psqny.fsf@localhost>
 <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com>
 <87zg6lfobh.fsf@localhost>
 <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com>
 <87o7n1v1w3.fsf@localhost>
 <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com>
 <87ednvul22.fsf@localhost>
 <87y1m1oa01.fsf@localhost>
 <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com>
 <87zg6fjar6.fsf@localhost>
Date: Mon, 08 May 2023 19:38:59 +0000
Message-ID: <875y923964.fsf@localhost>
Cc: 63225@debbugs.gnu.org

Mattias Engdegård writes:

>> (save-excursion
>>   (beginning-of-line 0)
>>   (not (looking-at-p "[[:blank:]]*$"))))
>
> I wonder if that last part isn't better written as
>
>   (save-excursion
>     (forward-line 0)                    ; faster than beginning-of-line

Fair point. It is probably worth looking through Org sources and
replacing all those `beginning-of-line' and `end-of-line' calls. I doubt
that we even intend to use fields for real.

>     (skip-chars-forward "[:blank:]")    ; faster than looking-at-p
>     (not (eolp)))                       ; very cheap

Hmm. I feel confused.
Does this imply that simple regexps like

(looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))

should better be implemented as the following?

(and (bolp)
     (skip-chars-forward " \t")
     (eq (char-after) ?#)
     (forward-char)
     (or (eolp) (eq (char-after) ?\s)))

(I now start thinking that it might be more efficient to create a bunch
of char tables and step over them to match "regexp", just like any
finite automaton would)

> But yes, I sort of understand what you are getting at (except the business with the MODE parameter which is still a bit mysterious to me).

MODE parameter is used because we constrain what kinds of syntax
elements are allowed inside other. For example, see
`org-element-object-restrictions'. And within
`org-element--current-element', MODE is used, for example, to constrain
planning/property drawer to be only the first child of a parent heading,
parsed earlier.

>> [now part of the giant rx]
>>
>> (rx line-start (0+ (any ?\s ?\t))
>>     ":" (1+ (any ?- ?_ word)) ":"
>>     (0+ (any ?\s ?\t)) line-end)
>
> Any reason you don't capture the part between the colons here, so that you don't need to match it later on?

That would demand all the callers of `org-element-drawer-parser' to set
match data appropriately. Which is doable, but a headache for maintenance.
We sometimes do call parsers explicitly not from inside
`org-element--current-element'.

>> But why? Aren't (in word ?_ ?-) and (or word ?_ ?-) not the same?
>
> "[-_[:word:]]" and "\\w\\|[_-]" indeed match the same thing but they don't generate the same regexp bytecode -- the former is faster. (In this case rx makes a literal translation to those strings but we should probably make it optimise to the faster regexp.)
>
> There is a regexp disassembler for the really curious but it doesn't come with Emacs.

I really hope that I did not need to do all these workarounds specific
to current implementation pitfalls of Emacs regexp compiler.

>>> Maybe, but you still cons each time. (And remember that the plist-get equality funarg is new in Emacs 29.)
>>
>> Sure it does.
>> It is just one of the variable parts of Org syntax that might be
>> changed. There are ways to make this into constant, but it is a fragile
>> area of the code that I do not want to touch without a reason.
>> (Especially given that I am not familiar with org-list.el)
>
> So it's fine to use elisp constructs new in Emacs 29 in Org? Then the line
>
> ;; Package-Requires: ((emacs "26.1"))
>
> in org.el should probably be updated, right?

Nope. It is just that the link I shared is for WIP branch I am developing
using Emacs master. I will ensure backwards compatibility later. For
example, by converting `plist-get' to `assoc'.

>> I hope we do. If only Emacs had a way to define `case-fold-search' right
>> within the regexp itself.
>
> I would like that too, but changing that isn't easy.

I am sure that it is easy. For example, regexp command may optionally
accept a vector with first element being regexp and second element
setting a flag for `case-fold-search'. Of course, it is just one, maybe
not the best way to implement this. My suggestion in
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=63225#56 is also
compatible with this kind of approach.

> By the way, it seems that org-element-node-property-parser binds case-fold-search without actually using it. Bug?

It actually does nothing given that `org-property-re' does not contain
letters. I will remove it.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 15:41:37 2023 Received: (at 63225) by debbugs.gnu.org; 8 May 2023 19:41:37 +0000 Received: from localhost ([127.0.0.1]:41732 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pw6jp-0004qo-7K for submit@debbugs.gnu.org; Mon, 08 May 2023 15:41:37 -0400 Received: from mout01.posteo.de ([185.67.36.65]:39923) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pw6jn-0004qb-Dc for 63225@debbugs.gnu.org; Mon, 08 May 2023 15:41:36 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 7D0E424026B for <63225@debbugs.gnu.org>; Mon, 8 May 2023 21:41:29 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683574889; bh=/8y63xNg6o66rTT1qKAdopXkpLHw5Qpskrzf6K7MEuw=; h=From:To:Cc:Subject:Date:From; b=opVxJQ0SfdE1zuIeKr9kruHXxz0ta/HqWaUFv1QQ9OWKWsUqB/QKBZQTq6k8rB2TB O7enDCXSkHWaayrdf8fssty1zmyQDdLzCeBkHRnPkfuHZ1QT+BK+Rjbvk5z4lvk6Tc g8v54AdAAMrKjEAclM8cDoLjy4nyqxMC5+DSy3oNFVRkqQjUcOUiG1RuAGl6UpHT92 EboMfqPNE/dU9L5k8hp0Y7LczxXQlq8ekrE1cqPAw8Lj80885eR38uT8V1/7lu9Vei pgWIOk2A0iETBtruGxqnf49BkIBB3/ayvFlqUGOqivMZtV2MY1iIwiNEW/7iXrVrd8 8UQXCSI1uo9xQ== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4QFWr85PmCz6txG; Mon, 8 May 2023 21:41:28 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <1473BC99-6A16-482F-B77E-0E7B25B4844E@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> 
<281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87h6sn9bb1.fsf@localhost> <1473BC99-6A16-482F-B77E-0E7B25B4844E@gmail.com> Date: Mon, 08 May 2023 19:44:47 +0000 Message-ID: <87354638wg.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

> On 8 May 2023, at 15:56, Ihor Radchenko wrote:
>
>> I am not sure what I can make out of hits/misses, but I am at least able
>> to look into frequency data, via sort re.log | uniq -c > re-freq.log
>
> I'm mostly curious about the regexp cache behaviour. What cache size
> did you use in this run?

50

> Hardly 20, given the low miss rate? It would be interesting to see what sequence of regexps most commonly cause thrashing.

Here is the log: https://0x0.st/HZgH.log

>> It would be even nicer if apart from frequency, there was information
>> about time taken to search for each regexp.
>
> That's a bit messier but could be done if really needed.

From this discussion, I am, so far, getting the impression that Elisp
regexps can have various non-obvious pitfalls that may need to be
considered. However, Org uses so many regexps that optimizing them all
is not a viable option, especially when the optimization may involve
changing the syntax. Having the data on the major bottlenecks would at
least allow us to focus on the regexps that really slow things down in
practice.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Mon May 08 15:53:53 2023 Received: (at 63225) by debbugs.gnu.org; 8 May 2023 19:53:53 +0000 Received: from localhost ([127.0.0.1]:41744 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pw6vh-0005NA-IJ for submit@debbugs.gnu.org; Mon, 08 May 2023 15:53:53 -0400 Received: from mail-lf1-f49.google.com ([209.85.167.49]:48390) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pw6vd-0005Mt-Sr for 63225@debbugs.gnu.org; Mon, 08 May 2023 15:53:51 -0400 Received: by mail-lf1-f49.google.com with SMTP id 2adb3069b0e04-4efe9a98736so5688476e87.1 for <63225@debbugs.gnu.org>; Mon, 08 May 2023 12:53:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683575624; x=1686167624; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=jwdHg4wQ2zODUVs/Mfpom8mL+2A/2JfOj2EK7C1PFvM=; b=mHtx2TYfCWhBw/kVON03Ehy26T6H4+z96dYgteS3CJD0AY3PpnFZQr2GbIs/KaiYFo TrxHNDgrKC9w/ZPQ8IEPhKC+CcsLwwouMCD69HnUf5j0LcGmSTpn9sw8riWNSEsPXFyS 7Dk0iZkAWvwWGgHAYaB/so9iGjkLKkswZK3xLkB/0r556nRV3I4wdvr7J6NEuvBht3IV con0XyZQFb6l2cq1bF6Stc0DsNnBwRmRx7x/yQN17fFCOY7ZxqQkFZ5W2Sn0ZQi1mTwK c7qpb92AbLPdNssuOfcRzR/Nxs80nyzRb51ZuK0VKqi9VGeu04elX1NOC2VBL8MNqS51 pVGg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683575624; x=1686167624; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=jwdHg4wQ2zODUVs/Mfpom8mL+2A/2JfOj2EK7C1PFvM=; b=hXg1m43gBW3nvDx8NQL3f6XO6taC5XyJ+3eWCBoVv/7wx0tmOku+rQV9+D+y22oCh2 Pd+eOiOtsMn6Nfslj9eGwhXRzZorTwUqp7X++XzVLuV9p1sQhiq3ssAZUYLPzFmUAVbe c6yES+lk+NJaCjK5GuR+PInkG4iHdLrPgwREIEwItoJSDRBVPbzqM/jCsBSwIO0f8Qzm 
KtdZ3vFQIa1Qfk19koxUiv6Jv8uokJs82/oypAph9Y4WwDE/Itz/5OY7HZT2D4znaSQ3 HdnFBisxr9KTbOuHxe2oYZU1DKNmLZ8JlLUCBf14yO3/UtVX4wMGXV5EheO70W1x5oGI 6EUQ== X-Gm-Message-State: AC+VfDwYyF0YIR5+kArnieksvkiot4Su0n+bhPVEp0vj0zTPPc6NJTYY re0U1h0gxltQhwMdRdhkIdU= X-Google-Smtp-Source: ACHHUZ7etGlUYBku3iveHfflLNl3erFhZ7mf333BfjGnlta5opFBLHZP+s6sM/U0d5i/Mj+gOv3fhg== X-Received: by 2002:ac2:5d47:0:b0:4eb:341c:ecc5 with SMTP id w7-20020ac25d47000000b004eb341cecc5mr74517lfd.12.1683575623871; Mon, 08 May 2023 12:53:43 -0700 (PDT) Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se. [188.150.165.235]) by smtp.gmail.com with ESMTPSA id b3-20020ac247e3000000b004edb0882ce7sm83482lfp.133.2023.05.08.12.53.42 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 08 May 2023 12:53:43 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <875y923964.fsf@localhost> Date: Mon, 8 May 2023 21:53:42 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87y1m1oa01.fsf@localhost> <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> <87zg6fjar6.fsf@localhost> <875y923964.fsf@localhost> To: Ihor Radchenko X-Mailer: Apple Mail (2.3654.120.0.1.15) X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org 
X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

On 8 May 2023, at 21:38, Ihor Radchenko wrote:

> Does this imply that simple regexps like
>
> (looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))
>
> should better be implemented as the following?

Obviously this isn't practical or beneficial for any but the simplest patterns. Benchmark if you are concerned.
It is useful to know when (not) to use regexps.

> (I now start thinking that it might be more efficient to create a bunch
> of char tables and step over them to match "regexp", just like any
> finite automaton would)

I wish elisp were fast enough that we could do that in general.

> I really hope that I did not need to do all these workarounds specific to
> current implementation pitfalls of Emacs regexp compiler.

Some of them. We program for the system we have. Emacs is a very slowly moving target.

>> I would like that too, but changing that isn't easy.
>
> I am sure that it is easy.

I didn't mean technically. Code is easy to write.

> It actually does nothing given that `org-property-re' does not contain
> letters. I will remove it.

It doesn't even matter what org-property-re contains because it isn't used with a changed case-fold-search (which is bound using let, not let*).
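
The let versus let* remark above can be illustrated with a small,
self-contained sketch. The function name is hypothetical and is not part
of Org; the sketch only relies on the documented rule that `let'
evaluates all of its init forms before any of the new bindings take
effect, so a match performed in a sibling init form still sees the outer
value of `case-fold-search'.

```elisp
;;; -*- lexical-binding: t -*-
;; Hypothetical demo, not Org code: shows why binding `case-fold-search'
;; with `let' does not affect a match done in a sibling init form.
(defun my/let-vs-let*-demo ()
  "Return (LET-RESULT LET*-RESULT) for matching \"foo\" against \"FOO\"."
  (with-temp-buffer
    (insert "FOO")
    (goto-char (point-min))
    (let ((case-fold-search t))         ; outer value: ignore case
      (list
       ;; `let': the match runs before the nil binding exists,
       ;; so it is still case-insensitive and succeeds.
       (let ((case-fold-search nil)
             (hit (looking-at-p "foo")))
         hit)
       ;; `let*': bindings are made one after another, so the match
       ;; runs with `case-fold-search' already nil and fails.
       (let* ((case-fold-search nil)
              (hit (looking-at-p "foo")))
         hit)))))

(my/let-vs-let*-demo)                   ; => (t nil)
```

If the binding is meant to influence a match, either use `let*' or move
the match into the body of the `let'.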
From debbugs-submit-bounces@debbugs.gnu.org Tue May 09 04:33:20 2023 Received: (at 63225) by debbugs.gnu.org; 9 May 2023 08:33:20 +0000 Received: from localhost ([127.0.0.1]:42361 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwImd-0003at-Ng for submit@debbugs.gnu.org; Tue, 09 May 2023 04:33:20 -0400 Received: from mout01.posteo.de ([185.67.36.65]:42549) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwImc-0003ag-7M for 63225@debbugs.gnu.org; Tue, 09 May 2023 04:33:19 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id E8F3D240192 for <63225@debbugs.gnu.org>; Tue, 9 May 2023 10:33:11 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683621191; bh=7pLCMG3Okis5kf8eJMbioJAZPiZeTUXwNG5057TTc/4=; h=From:To:Cc:Subject:Date:From; b=QT968cQC50A5TbS4tiIzKcXiOb2LDN9HTB6VszH/M8hD8k9pFD8Tuqtu+FbQUsW0m dJtLKuiewE2RJRsy8xdpSQHLBme7pKfHQS+psOXtAwm+LM1nezWxrDxFczk4JdmOGc Jt+3joLQdJ06qV9ZvLgNKuE97MisRePB/C+4XLK/LVLK9lwU6zZ6iaTwNXhqby+vSh 1mUpaBW/FE/kJRSeX4lCvno0+AwmsiB56+oRRRNvEHpsx60LE28y/vJZvpcJ87m/jM bIqIN5oqCjfe8Sf01lXqZKlGmztWbhERmftH/1NlAQ74AARUCqYSfiklkaM7wqoSdt xZJR+QLL/zg/g== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4QFryb0jvRz9rxB; Tue, 9 May 2023 10:33:11 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Using char table-based finite-state machines as a replacement for re-search-forward (was: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)) In-Reply-To: <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> 
<281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87y1m1oa01.fsf@localhost> <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> <87zg6fjar6.fsf@localhost> <875y923964.fsf@localhost> <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> Date: Tue, 09 May 2023 08:36:25 +0000 Message-ID: <87ednp296e.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

>> (I now start thinking that it might be more efficient to create a bunch
>> of char tables and step over them to match "regexp", just like any
>> finite automaton would)
>
> I wish elisp were fast enough that we could do that in general.

I tried the following simple implementation for searching via char
tables:

(defun yant/make-char-re (string)
  "Create a regexp matching STRING using char tables.
Return entry char table.  The values are either char tables or return value for terminals.
This is a dumb version not trying to account for substring cycles."
  (let (root (table (make-char-table 'regexp-table)) (idx 0))
    (setq root table)
    (set-char-table-range root t root)
    (seq-map
     (lambda (char)
       (let ((next-table (make-char-table 'regexp-table root)))
         (aset table char next-table)
         (setq table next-table)))
     string)
    (set-char-table-range table t t)
    root))

(defun yant/search-forward (regexp-table)
  "Move point after next string matching REGEXP-TABLE."
  (let ((pos (point-marker)))
    (while (and (< pos (point-max))
                (char-table-p (setq regexp-table (aref regexp-table (char-after pos))))
                (move-marker pos (1+ pos))))
    (unless (char-table-p regexp-table)
      (goto-char pos))))

DEFUN ("auto-search-forward", Fauto_search_forward, Sauto_search_forward, 1, 1, 0,
       doc: /* Search forward using CHAR-TABLE.  (WIP) */)
  (Lisp_Object table)
{
  CHECK_TYPE (CHAR_TABLE_P (table), Qchar_table_p, table);
  ptrdiff_t pos_byte = PT_BYTE;
  ptrdiff_t pos_char = PT;

  if (pos_byte < BEGV_BYTE || pos_byte >= ZV_BYTE)
    return Qnil;

  table = CHAR_TABLE_REF (table, FETCH_CHAR (pos_byte));
  while (pos_byte < ZV_BYTE && CHAR_TABLE_P (table))
    {
      pos_byte += next_char_len (pos_byte);
      pos_char++;
      table = CHAR_TABLE_REF (table, FETCH_CHAR (pos_byte));
    }

  if (pos_byte >= ZV_BYTE)
    return Qnil;
  else
    {
      SET_PT (pos_char);
      return Qt;
    }
}

(setq yant/test (yant/make-char-re ":ID:"))
(benchmark-run 10 (org-with-wide-buffer (goto-char 1) (while (yant/search-forward yant/test))))
(benchmark-run 10 (org-with-wide-buffer (goto-char 1) (while (auto-search-forward yant/test))))
(benchmark-run 10 (org-with-wide-buffer (goto-char (point-min)) (while (re-search-forward ":ID:" nil t))))

- yant/search-forward :: (24.677463432 0 0.0)
- re-search-forward   :: (0.569693718 0 0.0)
- auto-search-forward :: (1.177098129 0 0.0)

The lisp version suffers from (1) Extra redirection when checking
function symbol value; (2) moving markers, as there is no efficient way
to move point forward from Elisp; `forward-char' is even slower.

auto-search-forward is about twice as slow as re-search-forward, but I
expect this to change if we ask for a more complex regexp.
auto-search-forward will not care about the complexity of the
finite-state machine provided.

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Tue May 09 07:59:11 2023 Received: (at 63225) by debbugs.gnu.org; 9 May 2023 11:59:11 +0000 Received: from localhost ([127.0.0.1]:42579 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwLzq-0006Ih-IS for submit@debbugs.gnu.org; Tue, 09 May 2023 07:59:10 -0400 Received: from mout01.posteo.de ([185.67.36.65]:50779) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwLzo-0006IR-Mz for 63225@debbugs.gnu.org; Tue, 09 May 2023 07:59:09 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 47882240069 for <63225@debbugs.gnu.org>; Tue, 9 May 2023 13:59:02 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683633542; bh=zh6PCkTSu/aGD4L6cgWpl4wuILXotDa4cV4/sddJiHM=; h=From:To:Cc:Subject:Date:From; b=VeEzdmoAyw4uMLUWCOOrWDRgWa1UfifO9n/fxZNzrBSHko50/zM07Govsak7seGzJ JSEIJ3mMaxF04xYXe78Bj6EafBUHEa1im2WR97FIdsCs1o6aWj8MP8NB7MoCrAb1kw rbCxELA617meQHdIDGV7Tsv3cmC+9ZhMFdmpJ+h6rsmYfffen0okzm7emh/TmOXMIa DOyveCuRU5gIY82ohdx4n0sqBhzeSToBQKkAARsFkFXpmT58Dcd25GAeZ9h15XW4YX Sa3jj8ffa2oodzsSfBRFjC3mvCnG8winFaR3cF2As4DDON2LTpIhHOwNR5ztXpDGCx LTwzf9vSXnd9A== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4QFxX557Q3z6tvr; Tue, 9 May 2023 13:59:01 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> 
<281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87y1m1oa01.fsf@localhost> <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> <87zg6fjar6.fsf@localhost> <875y923964.fsf@localhost> <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> Date: Tue, 09 May 2023 12:02:19 +0000 Message-ID: <871qjp1zn8.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

> On 8 May 2023, at 21:38, Ihor Radchenko wrote:
>
>> Does this imply that simple regexps like
>>
>> (looking-at-p (rx (seq bol (zero-or-more (any "\t ")) "#" (or " " eol))))
>>
>> should better be implemented as the following?
>
> Obviously this isn't practical or beneficial for any but the simplest patterns. Benchmark if you are concerned.
> It is useful to know when (not) to use regexps.

I did some benchmarking and your code does provide >2x improvement.
Most of it is coming from using `forward-line' instead of
`beginning-of-line'.

I benchmarked different variations:

(defun yant/test1 ()
  (save-excursion
    (forward-line 0)                  ; faster than beginning-of-line
    (skip-chars-forward "[:blank:]")  ; faster than looking-at-p
    (not (eolp))))                    ; very cheap

(defun yant/test2 ()
  (save-excursion
    (beginning-of-line 0)
    (not (looking-at-p "[[:blank:]]*$"))))

(defun yant/test3 ()
  (save-excursion
    (beginning-of-line 0)
    (skip-chars-forward "[:blank:]")  ; faster than looking-at-p
    (not (eolp))))

(defun yant/test4 ()
  (save-excursion
    (forward-line 0)                  ; faster than beginning-of-line
    (not (looking-at-p "[[:blank:]]*$"))))

- forward-line + skip-chars-forward      :: (2.980 2 0.648)
- beginning-of-line + looking-at-p       :: (7.189 2 0.653)
- beginning-of-line + skip-chars-forward :: (6.833 2 0.634)
- forward-line + looking-at-p            :: (3.180 2 0.663)

>> I really hope that I did not need to do all these workarounds specific to
>> current implementation pitfalls of Emacs regexp compiler.
>
> Some of them. We program for the system we have. Emacs is a very slowly moving target.

Will it make sense to use a combination of char-after and
skip-chars-forward every single time?

I am thinking about something along the lines of

(defconst org-fancy-re
  (propertize
   (rx bol (1+ "*") " ")
   'org-re-direct
   '(progn
      (and (bolp)
           (> (skip-chars-forward "*") 0)
           (prog1 (eq ?\s (char-after))
             (forward-char))))))

(define-inline org-fancy-looking-at-p (regexp)
  "Like `looking-at-p', but try `org-re-direct' property."
  (let ((sexp (and (inline-const-p regexp)
                   ;; `get-text-property' takes the position first.
                   (get-text-property 0 'org-re-direct (inline-const-val regexp)))))
    (if sexp
        (inline-quote (save-excursion ,sexp))
      (inline-quote (looking-at-p ,regexp)))))

>>> I would like that too, but changing that isn't easy.
>>
>> I am sure that it is easy.
>
> I didn't mean technically. Code is easy to write.

May you elaborate what is the blocker then?

-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at , or support my work at From debbugs-submit-bounces@debbugs.gnu.org Tue May 09 11:05:59 2023 Received: (at 63225) by debbugs.gnu.org; 9 May 2023 15:06:00 +0000 Received: from localhost ([127.0.0.1]:44139 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwOud-000477-KW for submit@debbugs.gnu.org; Tue, 09 May 2023 11:05:59 -0400 Received: from mail-lf1-f45.google.com ([209.85.167.45]:60675) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwOua-00046o-DT for 63225@debbugs.gnu.org; Tue, 09 May 2023 11:05:57 -0400 Received: by mail-lf1-f45.google.com with SMTP id 2adb3069b0e04-4f14ec8d72aso4729508e87.1 for <63225@debbugs.gnu.org>; Tue, 09 May 2023 08:05:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1683644750; x=1686236750; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:from:to:cc:subject :date:message-id:reply-to; bh=yGRULUaHPZro32JEMz/DvJ/OMtTsLYEsLqrNO8mx8KA=; b=KfmOHF7B5HB9dMeegxLNWitFqttMERNJ8HExyTSEoEeb1/fUO+GkT1UVHfeLTaudS6 2Bj0VtwHkFpedQadhsIzgymxJWMqqSF+Qj766UdZM5Rftb4kd114fIIEGEEuUYe3mmTR y2oEh3tszehUbItjIcM4emgWbWwBTA38vHXUVGrbXiuC4U8Gwohq83tB/r5dEJZmMf49 b4PQjVRcHdRVroxwXeCVf+2UxDZ1WqY7W2ugip5St131buU5IYJvE6Po+8es30eZ88Aj uBaO93Yc4paLx5HMNVSl12pRHMp8dcr20uO+lUdYfavcdK+DMyfFSHIo8isdAW5PDJUq b19A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683644750; x=1686236750; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:sender:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=yGRULUaHPZro32JEMz/DvJ/OMtTsLYEsLqrNO8mx8KA=; b=PRJ8w5dl2PMKIskip2Pl8uV2fmTYl5mgrxHTES5x01dTAfPulnNwsfDZLVZDJv9Vgs VLdVj9WEi7X/0jzqD5j7nXN9+e03DZ9el5uW43LlQ3usM8+4AHFpJA0Bf80vz+IDAXfS BSGH6fnslgzC0v6qnUJzDVpSYVixR3Y30iY1ZFzj1icnAzdeGiwX03dFZeOcK3UmDQnH 
sHie8hZWYdvAHN1dKeEBSXy3cEXufFfXp9J5carS9GOG1/oSGECxhIkbD2G236GUfUFO duvLycvreg1uVHs9Quf/KLgrK4WpxqKDQ/pMJaFecNJ61Wz8aNtj7fz05ICJ9jdQjaey vtnA== X-Gm-Message-State: AC+VfDyCluarA+bmTir0hbusyB4efDmU5TkLPYmTMW4uAH5zTAGPf3BM 6VkwgoTZ5BHssbE/QjaWGWM= X-Google-Smtp-Source: ACHHUZ4a9qHFvTfL/FhbQ3/6MY4SkFhRaCmvu2MKmAMgq/3ZRZ3cy5sELyMCHDIMAT3o72UnW9O9Yg== X-Received: by 2002:ac2:50d3:0:b0:4f0:80:d0c0 with SMTP id h19-20020ac250d3000000b004f00080d0c0mr881499lfm.63.1683644749853; Tue, 09 May 2023 08:05:49 -0700 (PDT) Received: from smtpclient.apple (c188-150-165-235.bredband.tele2.se. [188.150.165.235]) by smtp.gmail.com with ESMTPSA id u20-20020ac248b4000000b004edca9174bbsm376698lfg.148.2023.05.09.08.05.48 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Tue, 09 May 2023 08:05:48 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.15\)) Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) From: =?utf-8?Q?Mattias_Engdeg=C3=A5rd?= In-Reply-To: <871qjp1zn8.fsf@localhost> Date: Tue, 9 May 2023 17:05:47 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <83EDC4A9-5F1F-4A75-8271-BAFCC8943E53@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> <878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87y1m1oa01.fsf@localhost> <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> <87zg6fjar6.fsf@localhost> <875y923964.fsf@localhost> <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> <871qjp1zn8.fsf@localhost> To: Ihor Radchenko X-Mailer: Apple Mail (2.3654.120.0.1.15) X-Spam-Score: 0.0 (/) 
X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-)

On 9 May 2023, at 14:02, Ihor Radchenko wrote:

> - forward-line + skip-chars-forward      :: (2.980 2 0.648)
> - beginning-of-line + looking-at-p       :: (7.189 2 0.653)
> - beginning-of-line + skip-chars-forward :: (6.833 2 0.634)
> - forward-line + looking-at-p            :: (3.180 2 0.663)

Thanks for measuring. (The regexp cache usage is a secondary cost to looking-at-p that isn't covered by your benchmark.)

You may want to try the small improvement to skip-chars-forward that just arrived on master.

> Will it make sense to use a combination of char-after and
> skip-chars-forward every single time?

Maybe, depending on how complex that combination would be. Applications under regexp cache pressure (like Org) may gain more from it, but of course it's also a question of programming convenience.

> May you elaborate what is the blocker then?

Mostly time, but also coming up with a design that is compatible and reasonably future-safe, and convincing people that it's a good way forward (assuming it actually is). Emacs is a collaborative effort, after all.
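
As a concrete instance of the char-after plus skip-chars-forward
combination discussed above, here is a minimal sketch of a hand-written
matcher for the `#' comment regexp quoted earlier in the thread. The
function name is hypothetical and the code is not taken from Org. It
assumes the intent is a side-effect-free predicate like `looking-at-p',
hence the `save-excursion'. Note that `forward-char' returns nil, so it
cannot be used directly as a condition inside `and'.

```elisp
;;; -*- lexical-binding: t -*-
;; Hypothetical helper, not Org code.
(defun my/at-comment-start-p ()
  "Return t if the line starting at point looks like \"  # ...\" or \"  #\".
Intended to agree with
  (looking-at-p (rx bol (zero-or-more (any \"\\t \")) \"#\" (or \" \" eol)))
without going through the regexp engine or its cache."
  (and (bolp)                            ; `bol' in the regexp
       (save-excursion                   ; do not move point, like `looking-at-p'
         (skip-chars-forward " \t")      ; (zero-or-more (any "\t "))
         (and (eq (char-after) ?#)       ; "#"
              (progn
                (forward-char)           ; returns nil, so keep it out of `and'
                (or (eolp)               ; eol (also true at end of buffer)
                    (eq (char-after) ?\s)))))))
```

Whether this is actually faster than the regexp in a given Emacs build
is something to measure rather than assume, as suggested in the message
above.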
From debbugs-submit-bounces@debbugs.gnu.org Tue May 09 11:53:08 2023 Received: (at 63225) by debbugs.gnu.org; 9 May 2023 15:53:08 +0000 Received: from localhost ([127.0.0.1]:44186 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwPeF-0005U0-JM for submit@debbugs.gnu.org; Tue, 09 May 2023 11:53:07 -0400 Received: from mout01.posteo.de ([185.67.36.65]:39679) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1pwPeD-0005TT-OS for 63225@debbugs.gnu.org; Tue, 09 May 2023 11:53:06 -0400 Received: from submission (posteo.de [185.67.36.169]) by mout01.posteo.de (Postfix) with ESMTPS id 94BCA2402CA for <63225@debbugs.gnu.org>; Tue, 9 May 2023 17:52:59 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=posteo.net; s=2017; t=1683647579; bh=V7iSldnoj/v8RKYqczXqKoQdRyGpSEWIf96Ic7vR7ng=; h=From:To:Cc:Subject:Date:From; b=kOVhBQ1WnAf0CvPxFjLBbDk5oV2pYvDd+qNpoQ/EhFlNXaXbmPS8PcI9AMODrSjj5 wdbu26+uX+B+AjaoIDC8b3p2wGzMfiS23XwsFgfyW6CZr2RWA/8zrQpYTlGGhXvW5f 0xXl1nqJfJZDT3DG0KxrjVZRfy3li2ERQwkPW2T4i3zbD/UYFf0FHZsqf6P9HxgRa9 YBnmpAe4pNHBn5QjXHeCicymDenotZsSuzNFYsutOq69T9einZNO4gc15d7AGg/grl DmHlH9PceYMrlltbe/cs0rZZR5SS5RZuqm3RHYmQ0eqBtmfDAZw/THJqHUUb3Jgn/5 v47X5cbV354xQ== Received: from customer (localhost [127.0.0.1]) by submission (posteo.de) with ESMTPSA id 4QG2jy4TjSz6tx0; Tue, 9 May 2023 17:52:54 +0200 (CEST) From: Ihor Radchenko To: Mattias =?utf-8?Q?Engdeg=C3=A5rd?= Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c) In-Reply-To: <83EDC4A9-5F1F-4A75-8271-BAFCC8943E53@gmail.com> References: <63882A45-BD02-40D5-92FA-70175267BA3B@acm.org> <874jou7lsf.fsf@localhost> <37EED5F9-F1FE-46B6-B4FA-0B268B945123@gmail.com> <87wn1qqvj0.fsf@localhost> <34F4849A-CB39-4C96-9CC1-11ED723706DA@gmail.com> <87wn1psqny.fsf@localhost> <6DAF37F9-B236-4C33-8E30-0FCA47CCBCC5@gmail.com> <87zg6lfobh.fsf@localhost> <281B22C2-CD69-4495-A97C-E754446CA9A6@gmail.com> <87o7n1v1w3.fsf@localhost> 
<878E8D66-A548-42E6-B077-6068A8B131D8@gmail.com> <87ednvul22.fsf@localhost> <87y1m1oa01.fsf@localhost> <74CD5EF4-5424-40BA-8F80-D0FD89CB890F@gmail.com> <87zg6fjar6.fsf@localhost> <875y923964.fsf@localhost> <67AAB661-8B27-4D09-BF0D-6B76ABB54477@gmail.com> <871qjp1zn8.fsf@localhost> <83EDC4A9-5F1F-4A75-8271-BAFCC8943E53@gmail.com> Date: Tue, 09 May 2023 15:56:09 +0000 Message-ID: <877cthbism.fsf@localhost> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 63225 Cc: 63225@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---)

Mattias Engdegård writes:

> On 9 May 2023, at 14:02, Ihor Radchenko wrote:
>
>> - forward-line + skip-chars-forward      :: (2.980 2 0.648)
>> - beginning-of-line + looking-at-p       :: (7.189 2 0.653)
>> - beginning-of-line + skip-chars-forward :: (6.833 2 0.634)
>> - forward-line + looking-at-p            :: (3.180 2 0.663)
>
> You may want to try the small improvement to skip-chars-forward that just arrived on master.

I did not get anything meaningful here. Likely because my benchmark is
not very stable (the above results did not stay the same for different
Emacs session for example, except relative numbers).

[in the same order]
(4.171 2 0.420)
(6.740 2 0.419)
(5.977 1 0.210)
(4.262 2 0.431)

>> May you elaborate what is the blocker then?
>
> Mostly time, but also coming up with a design that is compatible and reasonably future-safe, and convincing people that it's a good way forward (assuming it actually is). Emacs is a collaborative effort, after all.

Then it is not a blocker, but rather "let's discuss it first in a
dedicated, clearly marked thread".
-- 
Ihor Radchenko // yantar92,
Org mode contributor,
Learn more about Org mode at .
Support Org development at ,
or support my work at

From debbugs-submit-bounces@debbugs.gnu.org Tue May 09 11:57:41 2023
Subject: Re: bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
From: Mattias Engdegård
In-Reply-To: <877cthbism.fsf@localhost>
Date: Tue, 9 May 2023 17:57:23 +0200
Message-Id: <4D1DBA49-1824-41B4-BEC7-1241E97B6C9C@gmail.com>
To: Ihor Radchenko
Cc: 63225@debbugs.gnu.org

On 9 May 2023, at 17:56, Ihor Radchenko wrote:

> I did not get anything meaningful here. Likely because my benchmark is
> not very stable (the above results did not stay the same for different
> Emacs session for example, except relative numbers).

Oh well. It should get rid of some stupid consing though, that was the main objective.