GNU bug report logs - #60410
[PATCH 0/7] mumi: Boolean prefixes in xapian indexing and others
Report forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:19:02 GMT) Full text and rfc822 format available.
Acknowledgement sent to Arun Isaac <arunisaac <at> systemreboot.net>: New bug report received and forwarded. Copy sent to guix-patches <at> gnu.org. (Thu, 29 Dec 2022 20:19:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi Ricardo,
This is a patchset that has been sleeping for some time in my local
git repo. So, I thought it was about time to send it over!
The main change is that some xapian prefixes should be indexed as
boolean prefixes. This makes the use of an implicit AND operator
unnecessary and lets xapian do the natural thing of ordering results
by relevance. I believe this improves the search significantly. Also,
since we retrieve search results by relevance, we can offload limiting
of search results to xapian. Thus, we improve performance as well.
For this patchset to be useful, mumi's xapian index will have to be
rebuilt. In general, it is good to periodically rebuild the xapian
index from scratch.
Regards,
Arun
Arun Isaac (7):
xapian: Index several terms as boolean and without positions.
xapian: Declare some prefixes as boolean.
xapian: Do not override the default OR implicit query operator.
messages: Remove unused set intersection feature in search-bugs.
messages: Offload limiting search results to xapian.
cache: Specify that cache! returns the cached value.
xapian: Preserve order of search results.
mumi/cache.scm | 3 +-
mumi/messages.scm | 29 ++++--------
mumi/xapian.scm | 109 +++++++++++++++++++++++++++++++---------------
3 files changed, 86 insertions(+), 55 deletions(-)
--
2.38.1
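The effect described in the cover letter can be illustrated with a toy model that does not use Xapian at all. In the sketch below, which is a deliberate simplification and not mumi's code, filter terms only include or exclude a document, free-text words are combined with OR and ranked by how many of them match, and the result limit is applied after ranking. The corpus, `score` and `toy-search` are hypothetical names, and the scoring is far cruder than Xapian's relevance weighting.

```scheme
(use-modules (srfi srfi-1))

;; Toy corpus: each document is (ID FILTER-TERMS FREE-TEXT-WORDS).
(define documents
  '((1 ("status:open") ("xapian" "index" "boolean"))
    (2 ("status:done") ("xapian" "index"))
    (3 ("status:open") ("xapian"))
    (4 ("status:open") ("unrelated"))))

(define (score doc words)
  "Count how many of WORDS occur in the free text of DOC."
  (count (lambda (word) (member word (third doc))) words))

(define* (toy-search filters words #:key (limit 10))
  "Return the ids of documents that carry every term in FILTERS and
match at least one of WORDS, best match first, at most LIMIT of them."
  (let* ((filtered (filter (lambda (doc)
                             (every (lambda (term)
                                      (member term (second doc)))
                                    filters))
                           documents))
         ;; Implicit OR: one matching word is enough to be a result.
         (matching (filter (lambda (doc) (positive? (score doc words)))
                           filtered))
         (ranked (sort matching
                       (lambda (a b)
                         (> (score a words) (score b words))))))
    (map first (take ranked (min limit (length ranked))))))

(toy-search '("status:open") '("xapian" "boolean"))
;; => (1 3)
```

Document 2 is dropped by the filter even though it matches "xapian", and document 3 is kept even though it lacks "boolean"; it merely ranks below document 1.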
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:01 GMT)
Message #8 received at 60410 <at> debbugs.gnu.org (full text, mbox):
* mumi/xapian.scm (index-files): Index bug number, submitter, authors,
owner, severity, tags, status, file and msgids as boolean terms. Index
bug number, severity, tags, status, file and msgids without position
information.
---
mumi/xapian.scm | 65 ++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 51 insertions(+), 14 deletions(-)
diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index 68169e8..06a54cd 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -1,6 +1,6 @@
;;; mumi -- Mediocre, uh, mail interface
;;; Copyright © 2020, 2022 Ricardo Wurmus <rekado <at> elephly.net>
-;;; Copyright © 2020 Arun Isaac <arunisaac <at> systemreboot.net>
+;;; Copyright © 2020, 2022 Arun Isaac <arunisaac <at> systemreboot.net>
;;;
;;; This program is free software: you can redistribute it and/or
;;; modify it under the terms of the GNU Affero General Public License
@@ -119,20 +119,57 @@ messages and index their contents in the Xapian database at DBPATH."
(term-generator (make-term-generator #:stem (make-stem "en")
#:document doc)))
;; Index fields with a suitable prefix. This allows for
- ;; searching separate fields as in subject:foo,
- ;; from:bar, etc.
- (index-text! term-generator bugid #:prefix "B")
- (index-text! term-generator submitter #:prefix "A")
- (index-text! term-generator authors #:prefix "XA")
+ ;; searching separate fields as in subject:foo, from:bar,
+ ;; etc. We do not keep track of the within document
+ ;; frequencies of terms that will be used for boolean
+ ;; filtering. We do not generate position information for
+ ;; fields that will not need phrase searching or NEAR
+ ;; searches.
+ (index-text! term-generator
+ bugid
+ #:prefix "B"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ submitter
+ #:prefix "A"
+ #:wdf-increment 0)
+ (index-text! term-generator
+ authors
+ #:prefix "XA"
+ #:wdf-increment 0)
(index-text! term-generator subjects #:prefix "S")
- (index-text! term-generator (or (bug-owner bug) "") #:prefix "XO")
- (index-text! term-generator (or (bug-severity bug) "normal") #:prefix "XS")
- (index-text! term-generator (or (bug-tags bug) "") #:prefix "XT")
- (index-text! term-generator (cond
- ((bug-done bug) "done")
- (else "open")) #:prefix "XSTATUS")
- (index-text! term-generator file #:prefix "F")
- (index-text! term-generator msgids #:prefix "XU")
+ (index-text! term-generator
+ (or (bug-owner bug) "")
+ #:prefix "XO"
+ #:wdf-increment 0)
+ (index-text! term-generator
+ (or (bug-severity bug) "normal")
+ #:prefix "XS"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ (or (bug-tags bug) "")
+ #:prefix "XT"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ (cond
+ ((bug-done bug) "done")
+ (else "open"))
+ #:prefix "XSTATUS"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ file
+ #:prefix "F"
+ #:wdf-increment 0
+ #:positions? #f)
+ (index-text! term-generator
+ msgids
+ #:prefix "XU"
+ #:wdf-increment 0
+ #:positions? #f)
;; Index subject and body without prefixes for general
;; search.
--
2.38.1
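To make the role of `#:wdf-increment 0` more concrete, here is a small model of a posting table, written without guile-xapian. It assumes that an indexer records, per term, a within-document frequency (wdf) that grows by the given increment each time the term is seen. `toy-index-text!` and `postings` are hypothetical names, and the tokenization is much simpler than what a real term generator does.

```scheme
(define* (toy-index-text! postings text
                          #:key (prefix "") (wdf-increment 1))
  "Add each whitespace-separated word of TEXT to the hash table POSTINGS
as a lowercased term with PREFIX prepended, raising the stored frequency
of that term by WDF-INCREMENT."
  (for-each (lambda (word)
              (let ((term (string-append prefix (string-downcase word))))
                (hash-set! postings term
                           (+ wdf-increment (hash-ref postings term 0)))))
            (string-tokenize text)))

(define postings (make-hash-table))

;; A filter-only field: the term is recorded, but carries no weight.
(toy-index-text! postings "open" #:prefix "XSTATUS" #:wdf-increment 0)
;; A free-text field: repeated words raise the frequency.
(toy-index-text! postings "search is slow search again" #:prefix "S")

(hash-ref postings "XSTATUSopen")  ; => 0
(hash-ref postings "Ssearch")      ; => 2
```

In this model the status term can still be looked up for filtering, but it contributes nothing to a frequency-based score, which is the intent stated in the patch comment.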
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:02 GMT)
Message #11 received at 60410 <at> debbugs.gnu.org (full text, mbox):
Some prefixes will only ever be used to filter the rest of the query
and not for matching approximately using relevance weighting
schemes. Such prefixes should be indexed as boolean prefixes.
* mumi/xapian.scm (parse-query*): Support boolean prefixes.
(search): Declare author, msgid, owner, severity, status, submitter
and tag as boolean prefixes.
---
mumi/xapian.scm | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index 06a54cd..7bf84d3 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -249,7 +249,7 @@ messages and index their contents in the Xapian database at DBPATH."
(invalid (pk invalid "")))
token))
-(define* (parse-query* querystring #:key stemmer stemming-strategy (prefixes '()))
+(define* (parse-query* querystring #:key stemmer stemming-strategy (prefixes '()) (boolean-prefixes '()))
(let ((queryparser (new-QueryParser))
(date-range-processor (new-DateRangeProcessor 0 "date:" 0))
(mdate-range-processor (new-DateRangeProcessor 1 "mdate:" 0)))
@@ -261,6 +261,10 @@ messages and index their contents in the Xapian database at DBPATH."
((field . prefix)
(QueryParser-add-prefix queryparser field prefix)))
prefixes)
+ (for-each (match-lambda
+ ((field . prefix)
+ (QueryParser-add-boolean-prefix queryparser field prefix)))
+ boolean-prefixes)
(QueryParser-add-rangeprocessor queryparser date-range-processor)
(QueryParser-add-rangeprocessor queryparser mdate-range-processor)
(let ((query (QueryParser-parse-query queryparser querystring
@@ -324,14 +328,14 @@ intact."
;; prefixes for field search.
(query (parse-query* querystring*
#:stemmer (make-stem "en")
- #:prefixes '(("submitter" . "A")
- ("author" . "XA")
- ("subject" . "S")
- ("owner" . "XO")
- ("severity" . "XS")
- ("tag" . "XT")
- ("status" . "XSTATUS")
- ("msgid" . "XU"))))
+ #:prefixes '(("subject" . "S"))
+ #:boolean-prefixes '(("author" . "XA")
+ ("msgid" . "XU")
+ ("owner" . "XO")
+ ("severity" . "XS")
+ ("status" . "XSTATUS")
+ ("submitter" . "A")
+ ("tag" . "XT"))))
(enq (enquire db query)))
;; Collapse on mergedwith value
(Enquire-set-collapse-key enq 2 1)
--
2.38.1
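The split between `#:prefixes` and `#:boolean-prefixes` can be pictured with a toy tokenizer. The sketch below is not how Xapian's query parser works internally; it only assumes that a `field:value` token for a known boolean field becomes a prefixed filter term, while everything else is treated as free text. `boolean-prefixes` here is a shortened copy of the alist in the patch, and `classify-token` is a hypothetical helper.

```scheme
(use-modules (ice-9 match))

;; Field name -> term prefix, as in the #:boolean-prefixes alist.
(define boolean-prefixes
  '(("status" . "XSTATUS")
    ("tag" . "XT")))

(define (classify-token token)
  "Return (filter . TERM) when TOKEN has the form FIELD:VALUE for a
field listed in BOOLEAN-PREFIXES, and (text . TOKEN) otherwise."
  (match (string-index token #\:)
    (#f (cons 'text token))
    (i (match (assoc-ref boolean-prefixes (substring token 0 i))
         (#f (cons 'text token))
         (prefix (cons 'filter
                       (string-append prefix
                                      (substring token (+ i 1)))))))))

(map classify-token (string-tokenize "status:open tag:patch xapian"))
;; => ((filter . "XSTATUSopen") (filter . "XTpatch") (text . "xapian"))
```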
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:02 GMT)
Message #14 received at 60410 <at> debbugs.gnu.org (full text, mbox):
An implicit AND operator is overly restrictive. It was only necessary
because prefixes that should have been indexed as boolean prefixes
were not.
* mumi/xapian.scm (parse-query*): Do not override the default OR
implicit query operator.
---
mumi/xapian.scm | 1 -
1 file changed, 1 deletion(-)
diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index 7bf84d3..ae01acc 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -253,7 +253,6 @@ messages and index their contents in the Xapian database at DBPATH."
(let ((queryparser (new-QueryParser))
(date-range-processor (new-DateRangeProcessor 0 "date:" 0))
(mdate-range-processor (new-DateRangeProcessor 1 "mdate:" 0)))
- (QueryParser-set-default-op queryparser (Query-OP-AND))
(QueryParser-set-stemmer queryparser stemmer)
(when stemming-strategy
(QueryParser-set-stemming-strategy queryparser stemming-strategy))
--
2.38.1
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:03 GMT)
Message #17 received at 60410 <at> debbugs.gnu.org (full text, mbox):
* mumi/messages.scm (search-bugs): Remove unused set intersection
feature.
---
mumi/messages.scm | 18 +++++++-----------
1 file changed, 7 insertions(+), 11 deletions(-)
diff --git a/mumi/messages.scm b/mumi/messages.scm
index fb305bb..75ac3b1 100644
--- a/mumi/messages.scm
+++ b/mumi/messages.scm
@@ -1,6 +1,6 @@
;;; mumi -- Mediocre, uh, mail interface
;;; Copyright © 2017, 2018, 2019, 2020, 2021 Ricardo Wurmus <rekado <at> elephly.net>
-;;; Copyright © 2018, 2019 Arun Isaac <arunisaac <at> systemreboot.net>
+;;; Copyright © 2018, 2019, 2022 Arun Isaac <arunisaac <at> systemreboot.net>
;;;
;;; This program is free software: you can redistribute it and/or
;;; modify it under the terms of the GNU Affero General Public License
@@ -250,16 +250,12 @@ PATCH-SET. If PATCH-SET is not provided, return all patches."
message-numbers)
"\n")))
-(define* (search-bugs query #:key (sets '()) (max 400))
- "Return a list of all bugs matching the given QUERY string.
-Intersect the result with the id sets in the list SETS."
- (let* ((ids (map string->number
- (search query)))
- (filtered (match sets
- (() ids)
- (_ (apply lset-intersection eq? ids sets)))))
- (status-with-cache (if (> (length filtered) max)
- (take filtered max) filtered))))
+(define* (search-bugs query #:key (max 400))
+ "Return a list of all bugs matching the given QUERY string."
+ (let ((ids (map string->number
+ (search query))))
+ (status-with-cache (if (> (length ids) max)
+ (take ids max) ids))))
(define (recent-bugs amount)
"Return up to AMOUNT bugs with most recent activity."
--
2.38.1
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:03 GMT)
Message #20 received at 60410 <at> debbugs.gnu.org (full text, mbox):
* mumi/messages.scm (search-bugs): Offload limiting search results to
max to xapian.
---
mumi/messages.scm | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/mumi/messages.scm b/mumi/messages.scm
index 75ac3b1..b3ae962 100644
--- a/mumi/messages.scm
+++ b/mumi/messages.scm
@@ -252,10 +252,8 @@ PATCH-SET. If PATCH-SET is not provided, return all patches."
(define* (search-bugs query #:key (max 400))
"Return a list of all bugs matching the given QUERY string."
- (let ((ids (map string->number
- (search query))))
- (status-with-cache (if (> (length ids) max)
- (take ids max) ids))))
+ (status-with-cache (map string->number
+ (search query #:pagesize max))))
(define (recent-bugs amount)
"Return up to AMOUNT bugs with most recent activity."
--
2.38.1
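For context on the code this patch removes: SRFI-1 `take` signals an error when the list is shorter than the requested count, which is why the old version checked `length` first. A minimal standalone version of that guard might look as follows; `take-at-most` is a hypothetical name and does not exist in mumi.

```scheme
(use-modules (srfi srfi-1))

(define (take-at-most lst n)
  "Return the first N elements of LST, or all of LST if it has fewer."
  (if (> (length lst) n)
      (take lst n)
      lst))

(take-at-most '(3 1 2) 2)  ; => (3 1)
(take-at-most '(3 1 2) 5)  ; => (3 1 2)
```

The guard fetches the whole result list only to throw most of it away, so passing the limit to the search itself avoids both the extra work and the guard.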
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:03 GMT)
Message #23 received at 60410 <at> debbugs.gnu.org (full text, mbox):
* mumi/cache.scm (cache!): Specify in the docstring that cache!
returns the cached value.
---
mumi/cache.scm | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mumi/cache.scm b/mumi/cache.scm
index 13b21f9..98a7856 100644
--- a/mumi/cache.scm
+++ b/mumi/cache.scm
@@ -1,5 +1,6 @@
;;; mumi -- Mediocre, uh, mail interface
;;; Copyright © 2020 Ricardo Wurmus <rekado <at> elephly.net>
+;;; Copyright © 2022 Arun Isaac <arunisaac <at> systemreboot.net>
;;;
;;; This program is free software: you can redistribute it and/or
;;; modify it under the terms of the GNU Affero General Public License
@@ -34,7 +35,7 @@ expired or return #F."
(define* (cache! key value
#:optional (ttl (%config 'cache-ttl)))
"Store VALUE for the given KEY and mark it to expire after TTL
-seconds."
+seconds. Return VALUE."
(let ((t (current-time)))
(hash-set! %cache key `(#:expires ,(+ t ttl) #:value ,value))
value))
--
2.38.1
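Documenting the return value matters because it lets callers use `cache!` in expression position. The sketch below shows the idiom with a stripped-down cache that has no expiry; `toy-cached?`, `toy-cache!`, `slow-status` and `status-of` are hypothetical stand-ins, not mumi's procedures.

```scheme
(define %toy-cache (make-hash-table))

(define (toy-cached? key)
  "Return the value stored for KEY, or #f if there is none."
  (hash-ref %toy-cache key))

(define (toy-cache! key value)
  "Store VALUE for KEY. Return VALUE."
  (hash-set! %toy-cache key value)
  value)

;; Count how often the expensive lookup actually runs.
(define lookups 0)

(define (slow-status id)
  (set! lookups (+ lookups 1))
  (list 'bug id))

(define (status-of ids)
  "Return a status for each of IDS, in the order given."
  (map (lambda (id)
         (or (toy-cached? id)
             (toy-cache! id (slow-status id))))
       ids))

(status-of '(3 1 2))  ; => ((bug 3) (bug 1) (bug 2))
(status-of '(2 3))    ; => ((bug 2) (bug 3)), served from the cache
lookups               ; => 3
```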
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Thu, 29 Dec 2022 20:25:04 GMT)
Message #26 received at 60410 <at> debbugs.gnu.org (full text, mbox):
Xapian orders search results by relevance. Preserve this order.
* mumi/xapian.scm (search): Reverse search results after consing to
preserve the original order.
* mumi/messages.scm (status-with-cache): Do not sort bugs by their bug
number. Preserve the order of bugs passed to this function.
---
mumi/messages.scm | 13 ++++---------
mumi/xapian.scm | 21 +++++++++++----------
2 files changed, 15 insertions(+), 19 deletions(-)
diff --git a/mumi/messages.scm b/mumi/messages.scm
index b3ae962..fd52571 100644
--- a/mumi/messages.scm
+++ b/mumi/messages.scm
@@ -64,15 +64,10 @@
(define (status-with-cache ids)
"Invoke GET-STATUS, but only on those IDS that have not been cached
yet. Return new results alongside cached results."
- (let* ((cached (filter-map cached? ids))
- (uncached-ids (lset-difference eq?
- ids
- (map bug-num cached)))
- (new (filter-map bug-status uncached-ids )))
- ;; Cache new things
- (map (lambda (bug) (cache! (bug-num bug) bug)) new)
- ;; Return everything from cache
- (sort (append cached new) (lambda (a b) (< (bug-num a) (bug-num b))))))
+ (map (lambda (id)
+ (or (cached? id)
+ (cache! id (bug-status id))))
+ ids))
(define (extract-name address)
(or (assoc-ref address 'name)
diff --git a/mumi/xapian.scm b/mumi/xapian.scm
index ae01acc..7ca5bb8 100644
--- a/mumi/xapian.scm
+++ b/mumi/xapian.scm
@@ -339,16 +339,17 @@ intact."
;; Collapse on mergedwith value
(Enquire-set-collapse-key enq 2 1)
;; Fold over the results, return bug id.
- (mset-fold (lambda (item acc)
- (cons
- (document-data (mset-item-document item))
- acc))
- '()
- ;; Get an Enquire object from the database with the
- ;; search results. Then, extract the MSet from the
- ;; Enquire object.
- (enquire-mset enq
- #:maximum-items pagesize))))))
+ (reverse
+ (mset-fold (lambda (item acc)
+ (cons
+ (document-data (mset-item-document item))
+ acc))
+ '()
+ ;; Get an Enquire object from the database with the
+ ;; search results. Then, extract the MSet from the
+ ;; Enquire object.
+ (enquire-mset enq
+ #:maximum-items pagesize)))))))
(define* (index! #:key full?)
"Index all Debbugs log files corresponding to the selected
--
2.38.1
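The need for `reverse` follows from how a left fold with `cons` builds its result. The example below uses SRFI-1 `fold` on a plain list and assumes that `mset-fold` visits items in rank order in the same way; apart from 60410, the ids are made up for illustration.

```scheme
(use-modules (srfi srfi-1))

;; Pretend these ids arrive best match first.
(define ranked '("60410" "59999" "12345"))

;; Consing onto the accumulator puts the last item visited at the front.
(fold cons '() ranked)
;; => ("12345" "59999" "60410")

;; Reversing once at the end restores the original order.
(reverse (fold cons '() ranked))
;; => ("60410" "59999" "12345")
```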
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Sat, 31 Dec 2022 18:12:02 GMT)
Message #29 received at 60410 <at> debbugs.gnu.org (full text, mbox):
Hi Arun,
thank you for your patches! I applied them all and then ran
./pre-inst-env scripts/mumi fetch
but got this error:
worker error: (keyword-argument-error #f Unrecognized keyword () (#:positions?))
> + ;; searching separate fields as in subject:foo, from:bar,
> + ;; etc. We do not keep track of the within document
> + ;; frequencies of terms that will be used for boolean
> + ;; filtering. We do not generate position information for
> + ;; fields that will not need phrase searching or NEAR
> + ;; searches.
> + (index-text! term-generator
> + bugid
> + #:prefix "B"
> + #:wdf-increment 0
> + #:positions? #f)
I made sure to update to guile-xapian 0.2.1, the latest commit, as far
as I can tell.
--
Ricardo
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Sat, 31 Dec 2022 23:03:01 GMT)
Message #32 received at 60410 <at> debbugs.gnu.org (full text, mbox):
Hi Ricardo,
> worker error: (keyword-argument-error #f Unrecognized keyword ()
> (#:positions?))
Oops! It looks like I have been working with some unpublished
guile-xapian code. I have pushed those guile-xapian commits, released
guile-xapian 0.3.0 and updated the Guix guile-xapian package. Hopefully,
it should work now. Could you try again?
Thanks,
Arun
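The error above is what Guile raises when a procedure defined with `define*` receives a keyword it does not declare. The following sketch reproduces it without guile-xapian; `old-index-text!` is a hypothetical stand-in for a version of the procedure that predates `#:positions?`, and its argument list is an assumption made for illustration.

```scheme
;; Stand-in for an older procedure that knows nothing about #:positions?.
(define* (old-index-text! text #:key (prefix "") (wdf-increment 1))
  (list text prefix wdf-increment))

(define (try thunk)
  "Call THUNK. Return the error key if it raises a keyword argument
error, and the value of THUNK otherwise."
  (catch 'keyword-argument-error
    thunk
    (lambda (key . args) key)))

;; Declared keywords are accepted.
(try (lambda ()
       (old-index-text! "60410" #:prefix "B" #:wdf-increment 0)))
;; => ("60410" "B" 0)

;; An undeclared keyword is rejected when the call is made.
(try (lambda ()
       (old-index-text! "60410" #:prefix "B" #:positions? #f)))
;; => keyword-argument-error
```

Because the check happens when the call is made, a caller written against a newer library fails only once the affected code path runs, as it did here during `mumi fetch`.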
Reply sent to Ricardo Wurmus <rekado <at> elephly.net>: You have taken responsibility. (Sun, 01 Jan 2023 12:15:01 GMT)
Notification sent to Arun Isaac <arunisaac <at> systemreboot.net>: bug acknowledged by developer. (Sun, 01 Jan 2023 12:15:02 GMT)
Message #37 received at 60410-done <at> debbugs.gnu.org (full text, mbox):
Hi Arun,
>> worker error: (keyword-argument-error #f Unrecognized keyword ()
>> (#:positions?))
>
> Oops! It looks like I have been working with some unpublished
> guile-xapian code. I have pushed those guile-xapian commits, released
> guile-xapian 0.3.0 and updated the Guix guile-xapian package. Hopefully,
> it should work now. Could you try again?
Thank you, this works!
I applied the changes.
--
Ricardo
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Sun, 01 Jan 2023 23:22:01 GMT)
Message #40 received at 60410 <at> debbugs.gnu.org (full text, mbox):
Hi Arun,
> Some prefixes will only ever be used to filter the rest of the query
> and not for matching approximately using relevance weighting
> schemes. Such prefixes should be indexed as boolean prefixes.
[…]
> @@ -324,14 +328,14 @@ intact."
> ;; prefixes for field search.
> (query (parse-query* querystring*
> #:stemmer (make-stem "en")
> - #:prefixes '(("submitter" . "A")
> - ("author" . "XA")
> - ("subject" . "S")
> - ("owner" . "XO")
> - ("severity" . "XS")
> - ("tag" . "XT")
> - ("status" . "XSTATUS")
> - ("msgid" . "XU"))))
> + #:prefixes '(("subject" . "S"))
> + #:boolean-prefixes '(("author" . "XA")
> + ("msgid" . "XU")
> + ("owner" . "XO")
> + ("severity" . "XS")
> + ("status" . "XSTATUS")
> + ("submitter" . "A")
> + ("tag" . "XT"))))
This breaks two tests, which allow searching for submitters with partial
names, e.g. “Ricardo” instead of my full name and email address.
I think we should move submitter, author, and owner back to the list of
regular prefixes.
--
Ricardo
Information forwarded to guix-patches <at> gnu.org: bug#60410; Package guix-patches. (Mon, 02 Jan 2023 17:02:02 GMT)
Message #43 received at 60410 <at> debbugs.gnu.org (full text, mbox):
Hi Ricardo,
> I think we should move submitter, author, and owner back to the list of
> regular prefixes.
You're right. Sorry, I missed that.
Regards,
Arun
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 31 Jan 2023 12:24:06 GMT)
bug unarchived. Request was from Felix Lechner <felix.lechner <at> lease-up.com> to control <at> debbugs.gnu.org. (Thu, 08 Feb 2024 17:26:02 GMT)
bug reassigned from package 'guix-patches' to 'mumi'. Request was from Felix Lechner <felix.lechner <at> lease-up.com> to control <at> debbugs.gnu.org. (Thu, 08 Feb 2024 17:26:03 GMT)
bug archived. Request was from Felix Lechner <felix.lechner <at> lease-up.com> to control <at> debbugs.gnu.org. (Thu, 08 Feb 2024 17:26:03 GMT)
bug unarchived. Request was from Felix Lechner <felix.lechner <at> lease-up.com> to control <at> debbugs.gnu.org. (Fri, 23 Feb 2024 13:25:03 GMT)
bug archived. Request was from Felix Lechner <felix.lechner <at> lease-up.com> to control <at> debbugs.gnu.org. (Fri, 23 Feb 2024 13:25:03 GMT)