From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 13 03:03:04 2024
From: aurtzy <aurtzy@gmail.com>
To: guix-patches@gnu.org
Subject: [PATCH] ui: Add more nuance to relevance scoring.
Date: Fri, 13 Sep 2024 03:02:25 -0400
Message-ID: <c882a1a5d8085e513c5c3d8bc997e3dd8f4460bb.1726210587.git.aurtzy@gmail.com>
X-Mailer: git-send-email 2.46.0
MIME-Version: 1.0
X-Debbugs-Cc: Christopher Baines <guix@cbaines.net>, Josselin Poiret <dev@jpoiret.xyz>, Ludovic Courtès <ludo@gnu.org>, Mathieu Othacehe <othacehe@gnu.org>, Simon Tournier <zimon.toutoune@gmail.com>, Tobias Geerinckx-Rice <me@tobias.gr>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: aurtzy <aurtzy@gmail.com>
X-BeenThere: debbugs-submit@debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit@debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request@debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request@debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces@debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces@debbugs.gnu.org>

Fixes <https://issues.guix.gnu.org/70689>.

* guix/ui.scm (char-set:word-border): New variable.
(relevance): Update docstring.
[whole-word-score, exact-match-score]: New variables.
[score]: Score whole words such that matching a whole word will always put an
object higher than another if the other does not match any whole words.  Exact
matches are given similar treatment.  Score matches slightly higher than the
baseline if they have one word boundary, with the assumption that they are
more likely to be part of compound words rather than simply substrings.  Only
count a maximum of one scored match per field to limit putting too much weight
on terms that happen to be very common.
[score][string-ref-border?]: New procedure.

Change-Id: I8e3d7a20bf296485355d1c191fe3fee5ef6490c8
---

Hello!

This is an attempt to improve guix's search functionality for cases like the
linked issue.

Elaborating on some parts of my implementation:

I opted to switch to counting a maximum of one match per field, which helps
with cases where a common subword matches /many/ times in packages with longer
descriptions, pushing more relevant packages down.  In multi-term searches,
the unique terms - which are naturally rarer - also contribute to a larger
percentage of the score as a result of these changes.
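
As a rough illustration of the per-field cap (a minimal sketch, not the code
in this patch; =count-all= and =count-capped= are hypothetical names made up
for this example, and the description string is invented):

```scheme
(use-modules (ice-9 regex))

;; Hypothetical helpers for illustration only; neither exists in guix/ui.scm.

(define (count-all regexp str)
  ;; Old behaviour for substring matches: every match adds one point.
  (fold-matches regexp str 0
                (lambda (m total) (+ total 1))))

(define (count-capped regexp str)
  ;; Capped behaviour: the field keeps only its single best match, so
  ;; repeated substring matches stay at one point.
  (fold-matches regexp str 0
                (lambda (m best) (max best 1))))

(define rx (make-regexp "sh" regexp/icase))

;; Made-up description containing "sh" six times, never as a whole word.
(define description
  "Publish and refresh shared caches, then flush the finished shards.")

(display (count-all rx description))    ;prints 6
(newline)
(display (count-capped rx description)) ;prints 1
(newline)
```

With the cap, a long description full of incidental =sh= substrings
contributes no more than a short one that matches once.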

Having matches with only one word boundary be scored as 2 instead of 1 was
done with the reasoning that a term is more likely to be part of a compound
word name (and thus more relevant) if it is a prefix or suffix; for example,
"gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.
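
To make the three cases concrete, here is a small sketch that classifies each
match by its word borders.  It reuses the character classes from
=char-set:word-border= in the patch, but =match-kinds= itself is a
hypothetical helper written for this message, and it does not treat exact
matches separately:

```scheme
(use-modules (ice-9 regex) (srfi srfi-14))

;; Same character classes as char-set:word-border in the patch.
(define char-set:word-border
  (char-set-union char-set:digit
                  char-set:punctuation
                  char-set:symbol
                  char-set:whitespace))

;; Hypothetical helper: return one symbol per match of REGEXP in STR.
(define (match-kinds regexp str)
  (define (border? k)
    ;; Positions outside STR count as borders.
    (or (< k 0)
        (>= k (string-length str))
        (char-set-contains? char-set:word-border (string-ref str k))))
  (map (lambda (m)
         (let ((start? (border? (1- (match:start m))))
               (end?   (border? (match:end m))))
           (cond ((and start? end?) 'whole-word)
                 ((or start? end?)  'one-border)
                 (else              'substring))))
       (list-matches regexp str)))

(define tor (make-regexp "tor" regexp/icase))
(define gl  (make-regexp "gl" regexp/icase))

(match-kinds tor "tor-client")         ;(whole-word)
(match-kinds tor "torbrowser")         ;(one-border)
(match-kinds tor "ghc-storablevector") ;(substring one-border)
(match-kinds gl  "OpenGL")             ;(one-border)
```

In the patch, a field then keeps only the best of these: the sum of all
weights for a whole word, 2 for one border, 1 for a plain substring.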

In an effort to minimize regressions in scoring, I've compiled a set of test
cases and their expected results, which - if useful - might also be usable in
future work:

| Keyword(s) with poor  | Expectations                                  |
| results before        |                                               |
|-----------------------+-----------------------------------------------|
| dig                   | ~bind~ near top.                              |
| rsh                   | ~inetutils~ near top.                         |
| c                     | C language related.                           |
| c compiler            | Compiler-related C stuff.                     |
| r                     | R language related.                           |
| tor                   | Tor related; ~torbrowser~ somewhere near top. |
| gcc                   | ~gcc-toolchain~ near top.                     |
|-----------------------+-----------------------------------------------|
| Keyword(s) with mixed |                                               |
| results before        |                                               |
|-----------------------+-----------------------------------------------|
| gl                    | GL related.                                   |
| sh                    | Shell-related.                                |
|-----------------------+-----------------------------------------------|
| Keyword(s) with good  |                                               |
| results before        |                                               |
|-----------------------+-----------------------------------------------|
| gcc toolchain         | ~gcc-toolchain~ near top.                     |
| python                | ~python~ at top.                              |
| python language       | ~python~ at top.                              |
| python minimal        | ~python{,2}-minimal~ and friends near top.    |
| sync files            | File synchronization related.                 |
| sdl2                  | ~sdl2~ at top.                                |

However, some of these cases might be a bit too abstract, so I'm not sure how
sufficient this testing is.  Note that I only did minimal testing with =guix
system search= and =guix home search= which - while seemingly fine - could be
more rigorous (am I forgetting any other commands?).

Going over the results of these changes on the test cases:

There were notable improvements searching:
- =rsh=: ~inetutils~ now shows up at the top when searching =rsh=, with
  another relevant (but previously buried) ~emacs-tramp~ at second place.
- =c=: Searches for =c= return results related to the language now, whereas
  before it was a lot of unrelated packages that simply had the most =c=
  characters.
- =dig=: While not the first result, ~bind~ is now displayed as tied for 3rd
  in relevance score, showing up within 10 packages.
- =r=: Previously in a similar situation as C.  Now ~r~ shows up at the top,
  with other R-related packages under it.
- =gl=: The =gl= test case's results are slightly improved.  Before, there
  were some non-relevant packages with the =gl= substring near the top, which
  is no longer the case.
- =sh=: As a common subword, searching =sh= led to a mix of relevant and less
  relevant results at the top.  A good majority are now shell-related.
- =tor=: ~tor~ shows up at the top both before and after, but previously it
  had lots of non-relevant packages under it; the previously buried
  ~torbrowser~ now accompanies other more relevant results near the top.
- =gcc=: ~gcc-toolchain~ is now a top result, compared to ~gccgo~ at the top
  before (and even ~gdc-toolchain~ also being higher; upstream name being
  "gcc" seems to have caused that).


There are slight regressions with searching:
- =sync files=: The new algorithm puts a few less-relevant results at the top
  compared to before, but otherwise seems like a shuffling of the old results.
- =sdl2=: ~sdl2~'s top rank is overtaken by two libraries.


If I didn't mention a test case from the table, it's probably because results
were at least consistent or better (and I think I've written too much to read
already).

Closing this message on an unrelated note for future work: I stumbled on an
interesting idea while looking for test cases which suggested reducing the
score of a programming library when its language is not included in search
terms.  It's out of scope for the current issue, but I thought I'd mention it
anyway for potential further improvements.

Cheers,

aurtzy

 guix/ui.scm | 54 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 45 insertions(+), 9 deletions(-)

diff --git a/guix/ui.scm b/guix/ui.scm
index 966f0611f6..420f1f7501 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -19,6 +19,7 @@
 ;;; Copyright © 2018 Steve Sprang <scs@stevesprang.com>
 ;;; Copyright © 2022 Taiju HIGASHI <higashi@taiju.info>
 ;;; Copyright © 2022 Liliana Marie Prikler <liliana.prikler@gmail.com>
+;;; Copyright © 2024 aurtzy <aurtzy@gmail.com>
 ;;;
 ;;; This file is part of GNU Guix.
 ;;;
@@ -1678,22 +1679,57 @@ (define* (package->recutils p port #:optional (width (terminal-columns))
 ;;; Searching.
 ;;;
 
+(define char-set:word-border (char-set-union char-set:digit
+                                             char-set:punctuation
+                                             char-set:symbol
+                                             char-set:whitespace))
+
 (define (relevance obj regexps metrics)
-  "Compute a \"relevance score\" for OBJ as a function of its number of
-matches of REGEXPS and accordingly to METRICS.  METRICS is list of
-field/weight pairs, where FIELD is a procedure that returns a string or list
-of strings describing OBJ, and WEIGHT is a positive integer denoting the
-weight of this field in the final score.
+  "Compute a \"relevance score\" for OBJ as a function of its matches of REGEXPS and
+according to METRICS.  METRICS is a list of field/weight pairs, where FIELD is a
+procedure that returns a string or list of strings describing OBJ, and WEIGHT is a
+positive integer denoting the weight of this field in the final score.
 
 A score of zero means that OBJ does not match any of REGEXPS.  The higher the
 score, the more relevant OBJ is to REGEXPS."
+  ;; Ensure that objects with whole word matches always score greater than (or equal
+  ;; to) objects that only match substrings.
+  (define whole-word-score (apply + (map (match-lambda
+                                           ((_ . weight) weight))
+                                         metrics)))
+  (define exact-match-score (* whole-word-score 2))
+
   (define (score regexp str)
+    (define (string-ref-border? k)
+      (if (<= 0 k (1- (string-length str)))
+          (char-set-contains? char-set:word-border (string-ref str k))
+          #t))
+
     (fold-matches regexp str 0
                   (lambda (m score)
-                    (+ score
-                       (if (string=? (match:substring m) str)
-                           5             ;exact match
-                           1)))))
+                    (cond
+                     ((string=? (match:substring m) str)
+                      exact-match-score)
+                     ((>= score whole-word-score)
+                      ;; No need to compute further if score is already max
+                      ;; possible score
+                      score)
+                     (else
+                      (let ((start-border?
+                             (string-ref-border? (1- (match:start m))))
+                            (end-border?
+                             (string-ref-border? (match:end m))))
+                        (max score
+                             (cond
+                              ((and start-border? end-border?)
+                               whole-word-score)
+                              ((or start-border? end-border?)
+                               ;; If the match only has one border, it could still be
+                               ;; part of a compound word, and thus be more likely to
+                               ;; be relevant than if it was just a substring.
+                               2)
+                              (else
+                               1)))))))))
 
   (define (regexp->score regexp)
     (let ((score-regexp (lambda (str) (score regexp str))))

base-commit: b6d5a7f5836739dab884b49a64ca354794dd845f
-- 
2.45.2


From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 13 10:25:28 2024
From: Simon Tournier <zimon.toutoune@gmail.com>
To: 73220@debbugs.gnu.org
Subject: [PATCH v2] ui: Add partial match relevance scoring.
Date: Fri, 13 Sep 2024 16:24:06 +0200
Message-ID: <fdb82e6274c5d0bbc3470b09ca73cccf4abb5a9a.1726237401.git.zimon.toutoune@gmail.com>
X-Mailer: git-send-email 2.46.0
MIME-Version: 1.0
X-Debbugs-Cc: Christopher Baines <guix@cbaines.net>, Josselin Poiret <dev@jpoiret.xyz>, Ludovic Courtès <ludo@gnu.org>, Mathieu Othacehe <othacehe@gnu.org>, Simon Tournier <zimon.toutoune@gmail.com>, Tobias Geerinckx-Rice <me@tobias.gr>
Content-Transfer-Encoding: 8bit
Cc: aurtzy@gmail.com, Simon Tournier <zimon.toutoune@gmail.com>

* guix/ui.scm (char-set:delimiters): New variable.
(relevance)[string-match-term?]: New procedure.
[score]: Use it.

Change-Id: If2edc0e08b338a0064f73425db60d688c0535fb0
---
 guix/ui.scm | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/guix/ui.scm b/guix/ui.scm
index 966f0611f6..a8d1d120a4 100644
--- a/guix/ui.scm
+++ b/guix/ui.scm
@@ -1678,6 +1678,14 @@ (define* (package->recutils p port #:optional (width (terminal-columns))
 ;;; Searching.
 ;;;
 
+(define char-set:delimiters (char-set-xor
+                             (char-set #\-) ;remove from punctuation
+                             (char-set-union (char-set #\nul)
+                                             (char-set #\newline)
+                                             char-set:punctuation
+                                             char-set:symbol
+                                             char-set:whitespace)))
+
 (define (relevance obj regexps metrics)
   "Compute a \"relevance score\" for OBJ as a function of its number of
 matches of REGEXPS and accordingly to METRICS.  METRICS is list of
@@ -1687,13 +1695,28 @@ (define (relevance obj regexps metrics)
 
 A score of zero means that OBJ does not match any of REGEXPS.  The higher the
 score, the more relevant OBJ is to REGEXPS."
+  (define (string-match-term? regex-match str)
+    (let* ((start (match:start regex-match))
+           (char:start (if (= 0 start)
+                           #\nul
+                           (string-ref str (1- start))))
+           (end (match:end regex-match))
+           (char:end (if (= end (string-length str))
+                         #\nul
+                         (string-ref str end))))
+      (and (char-set-contains? char-set:delimiters char:start)
+           (char-set-contains? char-set:delimiters char:end))))
+
   (define (score regexp str)
     (fold-matches regexp str 0
                   (lambda (m score)
                     (+ score
-                       (if (string=? (match:substring m) str)
-                           5             ;exact match
-                           1)))))
+                       (cond
+                        ((string=? (match:substring m) str)
+                         5)             ;exact match
+                        ((string-match-term? m str)
+                         3)             ;XXX
+                        (else 1))))))
 
   (define (regexp->score regexp)
     (let ((score-regexp (lambda (str) (score regexp str))))

base-commit: 98bc13b9ea5f22a60de6c289d59072638001e08e
prerequisite-patch-id: 912de410e3d8a0796e83bfa50047debb0030b624
prerequisite-patch-id: 9c72d45734a13bd80021b14b562ed1b6238aa7ca
prerequisite-patch-id: 952cbe8dad322348d00f15125b512d34aaad8009
prerequisite-patch-id: fa6543fd5e6ec54a5036335aa5fa2b3a52675610
prerequisite-patch-id: cd68729ed441ec8235fde738e1f19669b570b099
prerequisite-patch-id: 53c5439602662bd61a3729aedf9327dfee5e9956
prerequisite-patch-id: a7edcd751c7a127f76b9c8e33ee425b6e800cfd7
prerequisite-patch-id: 29c1b2b9fcc017cff904ff3c1a32f65a6d54bad8
prerequisite-patch-id: 71757f95077bb7812f9d5a4e942c15b152ec7ac9
-- 
2.45.2


From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 13 10:25:36 2024
From: Simon Tournier <zimon.toutoune@gmail.com>
To: aurtzy <aurtzy@gmail.com>, 73220@debbugs.gnu.org
Subject: Re: [bug#73220] [PATCH] ui: Add more nuance to relevance scoring.
In-Reply-To: <c882a1a5d8085e513c5c3d8bc997e3dd8f4460bb.1726210587.git.aurtzy@gmail.com>
References: <c882a1a5d8085e513c5c3d8bc997e3dd8f4460bb.1726210587.git.aurtzy@gmail.com>
Date: Fri, 13 Sep 2024 16:12:19 +0200
Message-ID: <87a5gbve0s.fsf@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Cc: Josselin Poiret <dev@jpoiret.xyz>, aurtzy <aurtzy@gmail.com>,
 Mathieu Othacehe <othacehe@gnu.org>,
 Ludovic =?utf-8?Q?Court=C3=A8s?= <ludo@gnu.org>,
 Tobias Geerinckx-Rice <me@tobias.gr>, Christopher Baines <guix@cbaines.net>

Hi,

On Fri, 13 Sep 2024 at 03:02, aurtzy <aurtzy@gmail.com> wrote:

> Fixes <https://issues.guix.gnu.org/70689>.

Thanks!


> | Keyword(s) with poor  | Expectations                                  |
> | results before        |                                               |
> |-----------------------+-----------------------------------------------|
> | dig                   | ~bind~ near top.                              |

Hmm, indeed, and I do not know if we can improve here.  Well, it’s hard
to improve for short terms, BTW.

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8
name: go-go-uber-org-dig
relevance: 104

name: rust-num-bigint-dig
relevance: 78

name: rust-num-bigint-dig
relevance: 78
--8<---------------cut here---------------end--------------->8---

Compared to current:

--8<---------------cut here---------------start------------->8---
$ guix search dig | recsel -p name,relevance | head -8
name: sysdig
relevance: 24

name: texlive-pedigree-perl
relevance: 13

name: ruby-net-http-digest-auth
relevance: 13
--8<---------------cut here---------------end--------------->8---

Indeed, 17th position is better than 609th.  But if you add a term such
as ’dns’, bang! :-)  Well, BTW the description of ’bind’ could be
improved a bit because the word network does not appear.  Anyway. :-)


Hmm, why does this:

    guix search ' dig$' dig | recsel -p name,relevance | head -8

not return the package ’bind’?


> | rsh                   | ~inetutils~ near top.                         |

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8
name: inetutils
relevance: 26

name: emacs-tramp
relevance: 26

name: rust-borsh-schema-derive-internal
relevance: 22
--8<---------------cut here---------------end--------------->8---

Compared to current:

--8<---------------cut here---------------start------------->8---
$ guix search rsh | recsel -p name,relevance | head -8
name: go-sigs-k8s-io-yaml
relevance: 14

name: python-pymarshal
relevance: 13

name: emacs-powershell
relevance: 13
--8<---------------cut here---------------end--------------->8---


> | c                     | C language related.                           |
> | c compiler            | Compiler-related C stuff.                     |

This cannot be improved.


> | r                     | R language related.                           |

Usually, I add the prefix ^r\- and I do not have issues searching for R
packages.  For instance, searching for ^r\- plus a keyword works well.

    $ guix search ^r\- cyto | recsel -CP name | cut -f1 -d'-' | uniq -c
         29 r

Somehow, I do not think we can improve here.  I mean, the improvement is
to document the usage of prefixes.  Similarly for ghc (haskell), ocaml,
python, etc.


> | tor                   | Tor related; ~torbrowser~ somewhere near top. |

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 208

name: tor-client
relevance: 169

name: torsocks
relevance: 103
--8<---------------cut here---------------end--------------->8---

Compared to current:

--8<---------------cut here---------------start------------->8---
$ guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 47

name: ghc-storablevector
relevance: 29

name: tor-client
relevance: 28
--8<---------------cut here---------------end--------------->8---

However, the position of torbrowser moves from 225th to 19th.

    $ guix search tor | recsel -P name | grep -n torbrowser
    225:torbrowser

    $ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser
    19:torbrowser

As with ’dig’, the description of the ’torbrowser’ package could be
improved, because ’guix search tor browser’ returns nothing.


> | gcc                   | ~gcc-toolchain~ near top.                     |

Indeed, something is unexpected.  Well, first:

    $ guix search gcc | recsel -CP name | uniq | head -8
    gccgo
    gfortran-toolchain
    gdc-toolchain
    gcc-toolchain
    gcc-cross-x86_64-w64-mingw32-toolchain
    gcc-cross-or1k-elf-toolchain
    gcc-cross-i686-w64-mingw32-toolchain
    gcc-cross-avr-toolchain

    $ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8
         18 llvm
         12 gcc-toolchain
          6 libgccjit
          6 gccgo
          3 isl
          2 libstdc++-doc
          2 java-commons-cli
          2 gdc-toolchain

In other words, packages with multiple versions degrade the experience.
Well, that had already been “improved” [1] with some SEO. ;-)  Indeed,
maybe the relevance should be improved.

Second, gccgo has a relevance score of 22 with the only term =E2=80=99gcc=
=E2=80=99,
compared to gcc-toolchain scoring at 15.

    gccgo        gcc-toolchain
  4 * 1 * 1      4 * 1 * 1
+ 2 * 5 * 1    + 2 * 1 * 1
+ 1 * 0        + 1 * 0
+ 3 * 1 * 1    + 3 * 1 * 1
+ 2 * 0        + 2 * 1 * 3
+ 1 * 5 * 1    + 1 * 0
= 22           = 15
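
The two sums can be checked mechanically.  A minimal Python sketch,
assuming each row of the table reads as field weight * per-match score *
number of matches (an interpretation of the table, not output of
’guix search’).  The names below are hypothetical, and the last lines
show what the total would be if the three description matches scored 5:

```python
# Each row of the table is read as (field weight, per-match score, match count).
# This reading is an assumption; the numbers are copied from the table.
GCCGO = [(4, 1, 1), (2, 5, 1), (1, 0, 0), (3, 1, 1), (2, 0, 0), (1, 5, 1)]
GCC_TOOLCHAIN = [(4, 1, 1), (2, 1, 1), (1, 0, 0), (3, 1, 1), (2, 1, 3), (1, 0, 0)]


def relevance(rows):
    """Sum weight * score * count over all fields."""
    return sum(weight * score * count for weight, score, count in rows)


print(relevance(GCCGO))          # 22
print(relevance(GCC_TOOLCHAIN))  # 15

# Hypothetical: the three description matches score 5 instead of 1.
fixed = [(2, 5, 3) if row == (2, 1, 3) else row for row in GCC_TOOLCHAIN]
print(relevance(fixed))          # 39
```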

This is unexpected and, IMHO, that’s a bug!  In the description of
gcc-toolchain, the term ’gcc’ appears 3 times, but each occurrence only
scores ’1’ instead of ’5’.

As the patch tries to address, the main issue is:

  (define (score regexp str)
    (fold-matches regexp str 0
                  (lambda (m score)
                    (+ score
                       (if (string=? (match:substring m) str)
                           5             ;exact match
                           1)))))

Here, the exact-match test only succeeds when the match spans the whole
string; a whole-word match inside a longer string does not count.  For
instance, one would expect the term ’gcc’ to match exactly in
“some GCC thing”.  With the current implementation, that’s not the
case.  A snippet mimicking the scoring procedure shows it:

--8<---------------cut here---------------start------------->8---
scheme@(guix-user)> ,use(ice-9 regex)
scheme@(guix-user)> (define regexp (make-regexp "gcc" regexp/icase))
scheme@(guix-user)> (define str "some GCC thing")
scheme@(guix-user)> (fold-matches regexp str 0
    (lambda (m res)
      (+ res
        (if (string=? (match:substring m) str)
          5 1))))
$2 = 1
--8<---------------cut here---------------end--------------->8---
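
The same behaviour can be reproduced outside Guile.  A minimal Python
sketch, assuming a direct transliteration of the procedure above;
’score_whole_word’ is a hypothetical variant written only for
illustration and is not the v2 patch:

```python
import re


def score(pattern, text):
    """Transliteration of the quoted procedure: 5 only when a match
    equals the entire string, otherwise 1 per match."""
    return sum(5 if m.group(0) == text else 1 for m in pattern.finditer(text))


def score_whole_word(pattern, text):
    """Hypothetical variant: a match also counts as exact when it is
    delimited by non-alphanumeric characters or by the string ends."""
    total = 0
    for m in pattern.finditer(text):
        before = text[m.start() - 1] if m.start() > 0 else " "
        after = text[m.end()] if m.end() < len(text) else " "
        whole = not (before.isalnum() or after.isalnum())
        total += 5 if whole else 1
    return total


gcc = re.compile("gcc", re.IGNORECASE)
print(score(gcc, "some GCC thing"))             # 1
print(score(gcc, "gcc"))                        # 5
print(score_whole_word(gcc, "some GCC thing"))  # 5
print(score_whole_word(gcc, "gccgo"))           # 1
```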


See v2 for my proposal fixing this.

Please note that this v2 gives the same ranking for torbrowser.  It
also improves the situation with gcc-toolchain.

--8<---------------cut here---------------start------------->8---
$ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain
1:gcc-toolchain
2:gcc-toolchain
3:gcc-toolchain
4:gcc-toolchain
5:gcc-toolchain
6:gcc-toolchain
7:gcc-toolchain
8:gcc-toolchain
9:gcc-toolchain
10:gcc-toolchain
11:gcc-toolchain
12:gcc-toolchain

$ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser
7:torbrowser

$ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind
44:bind
--8<---------------cut here---------------end--------------->8---

However, inetutils is still at 44th with the single term ’rsh’.  I
would suggest tweaking its description.


Bah, maybe it is then a bit slower on cold caches?  Hum?!  Well, I have
not investigated, with or without your patch. :-) That is something that
could be investigated, especially the performance of ’char-set’
operations.


1: https://issues.guix.gnu.org/43342


> I opted to switch to counting a maximum of one match per field, which helps
> with cases where a common subword matches /many/ times in packages with longer
> descriptions, pushing more relevant packages down.  In multi-term searches,
> the unique terms - which are naturally rarer - also contribute to a larger
> percentage of the score as a result of these changes.

> Having matches with only one word boundary be scored as 2 instead of 1 was
> done with the reasoning that a term is more likely to be part of a compound
> word name (and thus more relevant) if it is a prefix or suffix; for example,
> "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.

[...]

> Closing this message on an unrelated note for future work: I stumbled on an
> interesting idea while looking for test cases which suggested reducing the
> score of a programming library when its language is not included in search
> terms.  It's out of scope for the current issue, but I thought I'd mention it
> anyways for potential further improvements.

Well, years ago I thought about implementing TF-IDF [2,3].  Other ideas
[4] are floating around.  Then, we spent some time making “guix
search” faster [5] and today my TODO is about having an extension
relying on Guile-Xapian.

Therefore, I would prefer to keep the ’relevance’ more or less predictable
by only counting the number of occurrences and applying some weights.
Otherwise, for what my opinion is worth, the direction would not be to
re-invent an algorithm but rather to implement a well-known one.
TF-IDF [3] is one, Okapi BM25 [6] is another, etc.  All in all,
that is what Xapian provides. ;-) And it does it very well!  That’s why I
would be tempted to have a Guix extension relying on Guile-Xapian for
indexing and searching (fast!).

Cheers,
simon


2: Re: Organizing packages
zimoun <zimon.toutoune@gmail.com>
Tue, 16 Jul 2019 19:04:26 +0200
id:CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com
https://lists.gnu.org/archive/html/guix-devel/2019-07
https://yhetil.org/guix/CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg@mail.gmail.com

3: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

4: Inverted index to accelerate guix package search
Arun Isaac <arunisaac@systemreboot.net>
Sun, 12 Jan 2020 20:33:51 +0530
id:cu7h810emy0.fsf@systemreboot.net
https://lists.gnu.org/archive/html/guix-devel/2020-01
https://yhetil.org/guix/cu7h810emy0.fsf@systemreboot.net

5: [bug#39258] Faster guix search using an sqlite cache
Arun Isaac <arunisaac@systemreboot.net>
Fri, 24 Jan 2020 01:21:57 +0530
id:cu7pnfaar36.fsf@systemreboot.net
https://issues.guix.gnu.org/39258
https://issues.guix.gnu.org/msgid/cu7pnfaar36.fsf@systemreboot.net
https://yhetil.org/guix/cu7pnfaar36.fsf@systemreboot.net

6: https://en.wikipedia.org/wiki/Okapi_BM25


From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 13 20:19:16 2024
Content-Type: multipart/alternative;
 boundary="------------21s0DpniSld8nE19l6UNkpfL"
Message-ID: <4eea8048-fb10-40b5-a16b-09c96932ccb0@gmail.com>
Date: Fri, 13 Sep 2024 20:17:52 -0400
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
From: aurtzy <aurtzy@gmail.com>
Subject: Re: [PATCH] ui: Add more nuance to relevance scoring.
To: 73220@debbugs.gnu.org
References: <87a5gbve0s.fsf@gmail.com>
Content-Language: en-US
In-Reply-To: <87a5gbve0s.fsf@gmail.com>
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 73220
Cc: aurtzy <aurtzy@gmail.com>, Simon Tournier <zimon.toutoune@gmail.com>

This is a multi-part message in MIME format.
--------------21s0DpniSld8nE19l6UNkpfL
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

Hi Simon,

On 9/13/24 10:12, Simon Tournier wrote:

>> | tor                   | Tor related; ~torbrowser~ somewhere near top. |
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8
> name: tor
> relevance: 208
>
> name: tor-client
> relevance: 169
>
> name: torsocks
> relevance: 103
> --8<---------------cut here---------------end--------------->8---
>
> Compared to current:
>
> --8<---------------cut here---------------start------------->8---
> $ guix search tor | recsel -p name,relevance | head -8
> name: tor
> relevance: 47
>
> name: ghc-storablevector
> relevance: 29
>
> name: tor-client
> relevance: 28
> --8<---------------cut here---------------end--------------->8---
>
> However, the position of torbrowser moves from 225th to 19th.
>
>      $ guix search tor | recsel -P name | grep -n torbrowser
>      225:torbrowser
>
>      $ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser
>      19:torbrowser
>
> As with ’dig’, the description of the ’torbrowser’ package could be
> improved, because ’guix search tor browser’ returns nothing.

Does ~torbrowser~ not appear as the third result in all three cases for 
you when running =guix search tor browser=?

Otherwise, if you meant =guix search tor= to find ~torbrowser~: perhaps
it should be ranked higher, but it could be argued that patch v1's
behavior is still better in this respect, considering that all results
above ~torbrowser~ are indeed related to Tor.

>> | Keyword(s) with poor  | Expectations                                  |
>> | results before        |                                               |
>> |-----------------------+-----------------------------------------------|
>> | dig                   | ~bind~ near top.                              |
> Hum, indeed and I do not know if we can improve here.  Well, it’s hard
> to improve for short terms, BTW.
>
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8
> name: go-go-uber-org-dig
> relevance: 104
>
> name: rust-num-bigint-dig
> relevance: 78
>
> name: rust-num-bigint-dig
> relevance: 78
> --8<---------------cut here---------------end--------------->8---
>
> Compared to current:
>
> --8<---------------cut here---------------start------------->8---
> $ guix search dig | recsel -p name,relevance | head -8
> name: sysdig
> relevance: 24
>
> name: texlive-pedigree-perl
> relevance: 13
>
> name: ruby-net-http-digest-auth
> relevance: 13
> --8<---------------cut here---------------end--------------->8---
>
> Indeed, 17th position is better than 609th.  But if you add a term such as
> ’dns’, bang! :-)  Well, BTW the description of ’bind’ could be a bit
> improved because the word ’network’ does not appear.  Anyway. :-)

[...]

>> | rsh                   | ~inetutils~ near top.                         |
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8
> name: inetutils
> relevance: 26
>
> name: emacs-tramp
> relevance: 26
>
> name: rust-borsh-schema-derive-internal
> relevance: 22
> --8<---------------cut here---------------end--------------->8---
>
> Compared to current:
>
> --8<---------------cut here---------------start------------->8---
> $ guix search rsh | recsel -p name,relevance | head -8
> name: go-sigs-k8s-io-yaml
> relevance: 14
>
> name: python-pymarshal
> relevance: 13
>
> name: emacs-powershell
> relevance: 13
> --8<---------------cut here---------------end--------------->8---

[...]

>> | gcc                   | ~gcc-toolchain~ near top.                     |
> Indeed, something is unexpected.  Well, first:
>
>      $ guix search gcc | recsel -CP name | uniq | head -8
>      gccgo
>      gfortran-toolchain
>      gdc-toolchain
>      gcc-toolchain
>      gcc-cross-x86_64-w64-mingw32-toolchain
>      gcc-cross-or1k-elf-toolchain
>      gcc-cross-i686-w64-mingw32-toolchain
>      gcc-cross-avr-toolchain
>
>      $ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8
>           18 llvm
>           12 gcc-toolchain
>            6 libgccjit
>            6 gccgo
>            3 isl
>            2 libstdc++-doc
>            2 java-commons-cli
>            2 gdc-toolchain
>
> In other words, packages with multiple versions degrade the experience.
> Well, that had already been “improved” [1] with some SEO. ;-)  Indeed,
> maybe the relevance should be improved.
>
> Second, gccgo has a relevance score of 22 with the single term ’gcc’,
> compared to gcc-toolchain scoring 15.
>
>      gccgo        gcc-toolchain
>    4 * 1 * 1      4 * 1 * 1
> + 2 * 5 * 1    + 2 * 1 * 1
> + 1 * 0        + 1 * 0
> + 3 * 1 * 1    + 3 * 1 * 1
> + 2 * 0        + 2 * 1 * 3
> + 1 * 5 * 1    + 1 * 0
> = 22           = 15
>
> This is unexpected and, IMHO, that’s a bug!  In the description of
> gcc-toolchain, the term ’gcc’ appears 3 times, but each occurrence only
> scores ’1’ instead of ’5’.
>
> As the patch tries to address, the main issue is:
>
>    (define (score regexp str)
>      (fold-matches regexp str 0
>                    (lambda (m score)
>                      (+ score
>                         (if (string=? (match:substring m) str)
>                             5             ;exact match
>                             1)))))
>
> Here, the exact-match test only succeeds when the match spans the whole
> string; a whole-word match inside a longer string does not count.  For
> instance, one would expect the term ’gcc’ to match exactly in
> “some GCC thing”.  With the current implementation, that’s not the
> case.  A snippet mimicking the scoring procedure shows it:
>
> --8<---------------cut here---------------start------------->8---
> scheme@(guix-user)> ,use(ice-9 regex)
> scheme@(guix-user)> (define regexp (make-regexp "gcc" regexp/icase))
> scheme@(guix-user)> (define str "some GCC thing")
> scheme@(guix-user)> (fold-matches regexp str 0
>      (lambda (m res)
>        (+ res
>          (if (string=? (match:substring m) str)
>            5 1))))
> $2 = 1
> --8<---------------cut here---------------end--------------->8---
>
>
> See v2 for my proposal fixing this.
>
> Please note that this v2 gives the same ranking for torbrowser.  It
> also improves the situation with gcc-toolchain.
>
> --8<---------------cut here---------------start------------->8---
> $ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain
> 1:gcc-toolchain
> 2:gcc-toolchain
> 3:gcc-toolchain
> 4:gcc-toolchain
> 5:gcc-toolchain
> 6:gcc-toolchain
> 7:gcc-toolchain
> 8:gcc-toolchain
> 9:gcc-toolchain
> 10:gcc-toolchain
> 11:gcc-toolchain
> 12:gcc-toolchain
>
> $ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser
> 7:torbrowser
>
> $ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind
> 44:bind
> --8<---------------cut here---------------end--------------->8---
>
> However, inetutils is still at 44th with the single term ’rsh’.  I
> would suggest tweaking its description.

And including a relevant part of your message from #70689:

> Again, considering the case at hand: if, instead of the 3 picked arbitrarily in
> v2 of #73220, we picked 7, then inetutils would be ranked first.
>
> Yeah, maybe 3 isn’t enough… And maybe 7 is a good choice.

What do you think about setting the value to the sum of all weights in
~metrics~ as I did in patch v1? My logic is that an object is almost
always going to be relevant if it contains a whole word match, compared
to "maybe relevant" if it only matches substrings, so it would be
reasonable to show most of the objects with whole word matches
first. This improves or maintains consistency of relevant results in the
test cases with shorter terms, and also reduces the need for guesswork
in choosing arbitrary numbers that may or may not work.

Note that I also gave the same treatment to exact match scores, although
not as heavily weighted (only double the whole word score in v1).

In the case of ~inetutils~, for example, this formula guarantees that if 
I were to search =rsh= - which is a common subword, but itself has a 
very unique meaning - ~inetutils~ /always/ shows up at or near the top 
along with other rsh-related packages, assuming no exact matches.

In other words, the intention would be to have the calculations set up 
such that they implicitly "categorize" object rankings into a (rough) 
hierarchy of the following:

|--------------------------------------------|
| Objects with at least one exact match      |
|--------------------------------------------|
| Objects with at least one whole word match |
|--------------------------------------------|
| Objects with only substring matches        |
|--------------------------------------------|
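
To make the boundary between the last two rows concrete, here is a
minimal Python sketch.  It assumes the field weights are the leading
factors of the gccgo/gcc-toolchain table quoted above, that each field
is counted at most once, and it leaves out the score of 2 for matches
with a single word boundary.  All names are hypothetical and this is a
simplification, not the patch itself:

```python
# Field weights: the leading factors of the quoted score table.  Treating
# them as the weights in `metrics` is an assumption.
WEIGHTS = [4, 2, 1, 3, 2, 1]

SUBSTRING = 1                 # match inside a longer word
WHOLE_WORD = sum(WEIGHTS)     # 13: sum of all weights, as proposed for v1
EXACT = 2 * WHOLE_WORD        # 26: double the whole-word score


def relevance(field_scores):
    """Weighted sum with at most one counted match per field."""
    return sum(w * s for w, s in zip(WEIGHTS, field_scores))


# Best case for an object with only substring matches: every field matches.
best_substring_only = relevance([SUBSTRING] * len(WEIGHTS))

# Worst case for an object with a whole-word match: a single match, in a
# lowest-weighted field, and nothing anywhere else.
worst_whole_word = relevance([0, 0, WHOLE_WORD, 0, 0, 0])

print(best_substring_only)  # 13
print(worst_whole_word)     # 13
```

Under these assumptions an object with a whole word match never scores
below an object with only substring matches, although the two extremes
can tie, hence "rough".  If a field could count several substring
matches, the first number would have no upper bound.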

>> I opted to switch to counting a maximum of one match per field, which helps
>> with cases where a common subword matches /many/ times in packages with longer
>> descriptions, pushing more relevant packages down.  In multi-term searches,
>> the unique terms - which are naturally rarer - also contribute to a larger
>> percentage of the score as a result of these changes.
>>
>> Having matches with only one word boundary be scored as 2 instead of 1 was
>> done with the reasoning that a term is more likely to be part of a compound
>> word name (and thus more relevant) if it is a prefix or suffix; for example,
>> "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.
> [...]
>
>> Closing this message on an unrelated note for future work: I stumbled on an
>> interesting idea while looking for test cases which suggested reducing the
>> score of a programming library when its language is not included in search
>> terms.  It's out of scope for the current issue, but I thought I'd mention it
>> anyways for potential further improvements.
> Well, years ago I thought about implementing TF-IDF [2,3].  Other ideas
> [4] are floating around.  Then, we spent some time making “guix
> search” faster [5] and today my TODO is about having an extension
> relying on Guile-Xapian.
>
> Therefore, I would prefer to keep the ’relevance’ more or less predictable
> by only counting the number of occurrences and applying some weights.
> Otherwise, for what my opinion is worth, the direction would not be to
> re-invent an algorithm but rather to implement a well-known one.
> TF-IDF [3] is one, Okapi BM25 [6] is another, etc.  All in all,
> that is what Xapian provides. ;-) And it does it very well!  That’s why I
> would be tempted to have a Guix extension relying on Guile-Xapian for
> indexing and searching (fast!).

Yes, I had thought about trying something like TF-IDF while looking into 
the issue, but it seemed much less trivial than changes to a scoring 
function. The count-once-per-field change was supposed to at least 
tangentially mimic this behavior and reduce bias towards objects that 
happen to have very long descriptions but aren't very relevant. It's 
also needed for my "categorization" math to hold.

> Hum, why does this:
>
>      guix search ' dig$' dig | recsel -p name,relevance | head -8
>
> not return the package ’bind’?

It appears the ~regexp/newline~ flag needs to be set for ~make-regexp~. 
A quick test adding it here [1] seemed to work.
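
The effect of such a flag can be illustrated outside Guile.  A minimal
Python sketch, assuming a hypothetical multi-line description (the real
text of ~bind~ is not quoted here) and using ~re.MULTILINE~ as the
closest Python analogue of ~regexp/newline~ for the ~$~ anchor:

```python
import re

# Hypothetical multi-line description, for illustration only.
description = "Tools such as dig\nfor querying name servers."

# Without a newline-aware flag, `$` only matches at the very end of the
# string (or just before a trailing newline), so this does not match.
print(re.search(r" dig$", description) is not None)                # False

# With re.MULTILINE, `$` also matches right before each newline.
print(re.search(r" dig$", description, re.MULTILINE) is not None)  # True
```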


My main concern with v2 is that I don't think whole words are weighted
heavily enough, but it provides a simpler solution that still offers
improvement, so I'm happy either way.

Thanks for the feedback!

[1] https://git.savannah.gnu.org/cgit/guix.git/tree/guix/scripts/package.scm#n897

Cheers,

aurtzy

--------------21s0DpniSld8nE19l6UNkpfL
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 8bit

<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Hi Simon,<br>
    </p>
    <p>On 9/13/24 10:12, Simon Tournier wrote:<span
      style="white-space: pre-wrap">
</span></p>
    <blockquote type="cite">
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">| tor                   | Tor related; ~torbrowser~ somewhere near top. |
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">--8&lt;---------------cut here---------------start-------------&gt;8---
$ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 208

name: tor-client
relevance: 169

name: torsocks
relevance: 103
--8&lt;---------------cut here---------------end---------------&gt;8---

Compared to current:

--8&lt;---------------cut here---------------start-------------&gt;8---
$ guix search tor | recsel -p name,relevance | head -8
name: tor
relevance: 47

name: ghc-storablevector
relevance: 29

name: tor-client
relevance: 28
--8&lt;---------------cut here---------------end---------------&gt;8---

However, the position of torbrowser moves from 225th to 19th.

    $ guix search tor | recsel -P name | grep -n torbrowser
    225:torbrowser

    $ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser
    19:torbrowser

As with ’dig’, the description of the ’torbrowser’ package could be
improved, because ’guix search tor browser’ returns nothing.
</pre>
    </blockquote>
    <p>Does ~torbrowser~ not appear as the third result in all three
      cases for you when running =guix search tor browser=?</p>
    <p>Otherwise, if you meant =guix search tor= to find ~torbrowser<span
      style="white-space: pre-wrap">~: perhaps it should be ranked higher, but it could be argued that patch v1's behavior is still better in this respect, considering that all results above ~torbrowser~ are indeed related to Tor.</span></p>
    <blockquote type="cite">
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">| Keyword(s) with poor  | Expectations                                  |
| results before        |                                               |
|-----------------------+-----------------------------------------------|
| dig                   | ~bind~ near top.                              |
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">Hum, indeed and I do not know if we can improve here.  Well, it’s hard
to improve for short terms, BTW.

--8&lt;---------------cut here---------------start-------------&gt;8---
$ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8
name: go-go-uber-org-dig
relevance: 104

name: rust-num-bigint-dig
relevance: 78

name: rust-num-bigint-dig
relevance: 78
--8&lt;---------------cut here---------------end---------------&gt;8---

Compared to current:

--8&lt;---------------cut here---------------start-------------&gt;8---
$ guix search dig | recsel -p name,relevance | head -8
name: sysdig
relevance: 24

name: texlive-pedigree-perl
relevance: 13

name: ruby-net-http-digest-auth
relevance: 13
--8&lt;---------------cut here---------------end---------------&gt;8---

Indeed, 17th position is better than 609th.  But if you add a term such as
’dns’, bang! :-)  Well, BTW the description of ’bind’ could be a bit
improved because the word ’network’ does not appear.  Anyway. :-)
</pre>
    </blockquote>
    <p><span style="white-space: pre-wrap">[...]
</span></p>
    <blockquote type="cite">
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">| rsh                   | ~inetutils~ near top.                         |
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">--8&lt;---------------cut here---------------start-------------&gt;8---
$ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8
name: inetutils
relevance: 26

name: emacs-tramp
relevance: 26

name: rust-borsh-schema-derive-internal
relevance: 22
--8&lt;---------------cut here---------------end---------------&gt;8---

Compared to current:

--8&lt;---------------cut here---------------start-------------&gt;8---
$ guix search rsh | recsel -p name,relevance | head -8
name: go-sigs-k8s-io-yaml
relevance: 14

name: python-pymarshal
relevance: 13

name: emacs-powershell
relevance: 13
--8&lt;---------------cut here---------------end---------------&gt;8---
</pre>
    </blockquote>
    <p><span style="white-space: pre-wrap">[...]
</span></p>
    <blockquote type="cite">
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">| gcc                   | ~gcc-toolchain~ near top.                     |
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">Indeed, something is unexpected.  Well, first:

    $ guix search gcc | recsel -CP name | uniq | head -8
    gccgo
    gfortran-toolchain
    gdc-toolchain
    gcc-toolchain
    gcc-cross-x86_64-w64-mingw32-toolchain
    gcc-cross-or1k-elf-toolchain
    gcc-cross-i686-w64-mingw32-toolchain
    gcc-cross-avr-toolchain

    $ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8
         18 llvm
         12 gcc-toolchain
          6 libgccjit
          6 gccgo
          3 isl
          2 libstdc++-doc
          2 java-commons-cli
          2 gdc-toolchain

In other words, packages with multiple versions degrade the experience.
Well, that had already been “improved” [1] with some SEO. ;-)  Indeed,
maybe the relevance should be improved.

Second, gccgo has a relevance score of 22 with the single term ’gcc’,
compared to gcc-toolchain scoring 15.

    gccgo        gcc-toolchain
  4 * 1 * 1      4 * 1 * 1  
+ 2 * 5 * 1    + 2 * 1 * 1  
+ 1 * 0        + 1 * 0      
+ 3 * 1 * 1    + 3 * 1 * 1  
+ 2 * 0        + 2 * 1 * 3  
+ 1 * 5 * 1    + 1 * 0      
= 22           = 15         

This is unexpected and, IMHO, that’s a bug!  In the description of
gcc-toolchain, the term ’gcc’ appears 3 times, but each occurrence only
scores ’1’ instead of ’5’.

As the patch tries to address, the main issue is:

  (define (score regexp str)
    (fold-matches regexp str 0
                  (lambda (m score)
                    (+ score
                       (if (string=? (match:substring m) str)
                           5             ;exact match
                           1)))))

Here, the exact-match test only succeeds when the match spans the whole
string; a whole-word match inside a longer string does not count.  For
instance, one would expect the term ’gcc’ to match exactly in
“some GCC thing”.  With the current implementation, that’s not the
case.  A snippet mimicking the scoring procedure shows it:

--8&lt;---------------cut here---------------start-------------&gt;8---
scheme@(guix-user)&gt; ,use(ice-9 regex)
scheme@(guix-user)&gt; (define regexp (make-regexp "gcc" regexp/icase))
scheme@(guix-user)&gt; (define str "some GCC thing")
scheme@(guix-user)&gt; (fold-matches regexp str 0
    (lambda (m res)
      (+ res
        (if (string=? (match:substring m) str)
          5 1))))
$2 = 1
--8&lt;---------------cut here---------------end---------------&gt;8---


See v2 for my proposal fixing this.

Please note that this v2 gives the same ranking for torbrowser.  It
also improves the situation with gcc-toolchain.

--8&lt;---------------cut here---------------start-------------&gt;8---
$ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain
1:gcc-toolchain
2:gcc-toolchain
3:gcc-toolchain
4:gcc-toolchain
5:gcc-toolchain
6:gcc-toolchain
7:gcc-toolchain
8:gcc-toolchain
9:gcc-toolchain
10:gcc-toolchain
11:gcc-toolchain
12:gcc-toolchain

$ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser
7:torbrowser

$ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind
44:bind
--8&lt;---------------cut here---------------end---------------&gt;8---

However, inetutils is still at 44th with the single term ’rsh’.  I
would suggest tweaking its description.
</pre>
    </blockquote>
    <p>And including a relevant part of your message from #70689:</p>
    <p>
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">Again, considering the case at hand: if, instead of the 3 picked arbitrarily in
v2 of #73220, we picked 7, then inetutils would be ranked first.

Yeah, maybe 3 isn’t enough… And maybe 7 is a good choice.</pre>
      </blockquote>
      What do you think about setting the value to the sum of all
      weights in ~metrics~ as I did in patch v1? My logic is that an
      object is almost always going to be relevant if it contains a
      whole word match compared to "maybe relevant" if it only matches
      substrings, so it would be reasonable to thus show most of the
      objects with whole word matches first. This improves or maintains
      consistency of relevant results in the test cases with shorter
      terms, and also reduces the need for guesswork with choosing
      arbitrary numbers that may or may not work.</p>
    <p>Note that I also gave the same treatment to exact match scores,
      although not as heavily weighted (only double the whole word
      score in v1).<br>
    </p>
    <p>In the case of ~inetutils~, for example, this formula guarantees
      that if I were to search =rsh= - which is a common subword, but
      itself has a very unique meaning - ~inetutils~ /always/ shows up
      at or near the top along with other rsh-related packages, assuming
      no exact matches.<span style="white-space: pre-wrap">
</span></p>
    <p><span style="white-space: pre-wrap">In other words, the intention would be to have the calculations set up such that they implicitly "categorize" object rankings into a (rough) hierarchy of the following:</span></p>
    <p><span style="white-space: pre-wrap">
<font face="monospace">|--------------------------------------------|
| Objects with at least one exact match      |
|--------------------------------------------|
| Objects with at least one whole word match |
|--------------------------------------------|
| Objects with only substring matches        |
|--------------------------------------------|

</font></span></p>
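    <p>As a rough illustration of that hierarchy, here is a minimal
      Guile sketch. The field weights, the ~relevance~ procedure and the
      match representation are all made up for the example and are not
      the code from either patch. It also ignores exact matches and the
      prefix/suffix score of 2, and assumes at most one counted match
      per field.</p>
    <pre>(use-modules (srfi srfi-1))

;; Hypothetical field weights, for illustration only.
(define weights '((name . 4) (synopsis . 3) (description . 2)))

(define substring-score 1)

;; Whole-word score set to the sum of all weights, as proposed above.
(define whole-word-score (apply + (map cdr weights)))

;; MATCHES is an alist of (field . kind) with at most one entry per
;; field; KIND is either 'substring or 'whole-word.
(define (relevance matches)
  (fold (lambda (entry total)
          (let ((weight (assq-ref weights (car entry)))
                (score (if (eq? (cdr entry) 'whole-word)
                           whole-word-score
                           substring-score)))
            (+ total (* weight score))))
        0
        matches))

;; Substring matches in every field: 4 + 3 + 2 = 9
(relevance '((name . substring)
             (synopsis . substring)
             (description . substring)))

;; One whole-word match in the lowest-weighted field: 2 * 9 = 18
(relevance '((description . whole-word)))
</pre>
    <p>With these made-up numbers, a single whole-word match in the
      lowest-weighted field still outranks substring matches in every
      field. That only holds because each field is counted once;
      without that cap, repeated substring matches could exceed any
      fixed whole-word score.</p>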
    <blockquote type="cite">
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">I opted to switch to counting a maximum of one match per field, which helps
with cases where a common subword matches /many/ times in packages with longer
descriptions, pushing more relevant packages down.  In multi-term searches,
the unique terms - which are naturally rarer - also contribute to a larger
percentage of the score as a result of these changes.
</pre>
      </blockquote>
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">Having matches with only one word boundary be scored as 2 instead of 1 was
done with the reasoning that a term is more likely to be part of a compound
word name (and thus more relevant) if it is a prefix or suffix; for example,
"gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser.
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">[...]

</pre>
      <blockquote type="cite">
        <pre wrap="" class="moz-quote-pre">Closing this message on an unrelated note for future work: I stumbled on an
interesting idea while looking for test cases which suggested reducing the
score of a programming library when its language is not included in search
terms.  It's out of scope for the current issue, but I thought I'd mention it
anyways for potential further improvements.
</pre>
      </blockquote>
      <pre wrap="" class="moz-quote-pre">Well, years ago I thought about implementing TF-IDF [2,3].  Other ideas
[4] are floating around.  Then, we spent some time for making “guix
search” faster [5] and today my TODO is about having an extension
relying on Guile-Xapian.

Therefore, I would prefer keep the ’relevance’ more or less predictable
by only counting the number of occurrences and apply some weights.
Else, for what my opinion is worth, the direction would not be to
re-invent an algorithm but maybe implement some already well-known ones.
TF-IDF [3] is one or Okapi-BM25 is another one, etc.  In all in all,
that what Xapian provides. ;-) And it does it very well!  That’s why I
would be tempted to have a Guix extension relying on Guile-Xapin for
indexing and searching (fast!).
</pre>
    </blockquote>
    <p>Yes, I had thought about trying something like TF-IDF while
      looking into the issue, but it seemed much less trivial than
      changes to a scoring function. The count-once-per-field change was
      supposed to at least tangentially mimic this behavior and reduce
      bias towards objects that happen to have very long descriptions
      but aren't very relevant. It's also needed for my "categorization"
      math to hold.<br>
    </p>
    <p> </p>
    <blockquote type="cite">
      <pre wrap="" class="moz-quote-pre">Hum, why this:

    guix search ' dig$' dig | recsel -p name,relevance | head -8

does not return the package ’bind’?
</pre>
    </blockquote>
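    <p>It appears the ~regexp/newline~ flag needs to be passed to
      ~make-regexp~, so that =$= can match at the end of each line
      rather than only at the very end of the text being searched. A
      quick test adding it here [1] seemed to work.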
    </p>
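    <p>For reference, the difference can be seen in isolation with a
      small Guile snippet. The ~text~ string is a made-up stand-in for
      a multi-line description, not taken from any real package.</p>
    <pre>(use-modules (ice-9 regex))

;; Hypothetical multi-line text standing in for a package description.
(define text "DNS tools such as dig\nand nslookup")

;; Without the flag, $ only matches at the very end of the string,
;; so this should return #f.
(regexp-exec (make-regexp " dig$") text)

;; With regexp/newline, $ also matches just before a newline,
;; so this should return the matched substring " dig".
(let ((m (regexp-exec (make-regexp " dig$" regexp/newline) text)))
  (and m (match:substring m)))
</pre>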
    <p><br>
    </p>
    <p>My main concern with v2 is that I don't think whole words are
      weighted heavily enough. Still, it is a simpler solution that
      offers an improvement, so I'm happy either way.</p>
    <p>Thanks for the feedback!<br>
    </p>
    <p>[1]
<a class="moz-txt-link-freetext" href="https://git.savannah.gnu.org/cgit/guix.git/tree/guix/scripts/package.scm#n897">https://git.savannah.gnu.org/cgit/guix.git/tree/guix/scripts/package.scm#n897</a></p>
    <p>Cheers,</p>
    <p>aurtzy<br>
    </p>
  </body>
</html>

--------------21s0DpniSld8nE19l6UNkpfL--