GNU bug report logs - #70689
guix search doesn't weigh word matches higher than subword matches

Package: guix;

Reported by: Richard Sent <richard <at> freakingpenguin.com>

Date: Wed, 1 May 2024 02:19:02 UTC

Severity: normal

To reply to this bug, email your comments to 70689 AT debbugs.gnu.org.

Report forwarded to bug-guix <at> gnu.org:
bug#70689; Package guix. (Wed, 01 May 2024 02:19:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Richard Sent <richard <at> freakingpenguin.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Wed, 01 May 2024 02:19:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Richard Sent <richard <at> freakingpenguin.com>
To: bug-guix <at> gnu.org
Subject: guix search doesn't weigh word matches higher than subword matches
Date: Tue, 30 Apr 2024 22:18:03 -0400

Hi Guix!

When running guix search, relevance in the synopsis and description
fields is computed strictly by the number of matches, counting whole-word
and subword matches alike. Ideally, if a search string matches an
isolated word in a field, that result should be considered more relevant
than one that merely matches a subword, even multiple times.

To illustrate, imagine trying to find what package provides the `rsh`
binary by running `$ guix search rsh`. This binary is part of
`inetutils`, whose description field contains:

> Inetutils is a collection of common network programs, such as an ftp
> client and server, a telnet client and server, an rsh client and
> server, and hostname.

Most likely, this is what the user is interested in. However, inetutils
does not show up until roughly the 75th result, with a relevance of 2
(the lowest possible relevance).

Almost every search result beforehand contains the string "rsh" as a
component of another word, such as "marshaling", "powershell", and
"hershey". However, these match multiple times and are weighted
significantly higher.
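
To make the word/subword distinction concrete, here is a minimal Python sketch. It is not how guix search is implemented (Guix is written in Scheme); `count_matches` is a hypothetical helper, and the `made_up` sentence is invented for illustration. Only the inetutils description is taken from the package.

```python
import re

def count_matches(term, text):
    """Return (whole_word, subword) match counts of term in text, ignoring case."""
    escaped = re.escape(term)
    total = len(re.findall(escaped, text, flags=re.IGNORECASE))
    # \b marks a word boundary, so this only counts isolated occurrences.
    whole = len(re.findall(r"\b" + escaped + r"\b", text, flags=re.IGNORECASE))
    return whole, total - whole

inetutils = ("Inetutils is a collection of common network programs, such as an ftp "
             "client and server, a telnet client and server, an rsh client and "
             "server, and hostname.")
# Invented sentence, used only to show subword hits.
made_up = "Marshaling helpers for powershell scripts and hershey fonts."

print(count_matches("rsh", inetutils))  # (1, 0)
print(count_matches("rsh", made_up))    # (0, 3)
```

Counting by raw occurrences, the invented sentence beats inetutils three to one, even though none of its hits is the word "rsh".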

Ideally, guix search should rate inetutils higher because the string
"rsh" occurs as its own word, not as a component of another, unrelated
word. (Very, very few people would search "rsh" looking for matches with
"hershey", even if "hershey" occurs multiple times.)

Another example of where this can happen is with "dig", part of the bind
package. Searching for "dig" returns garbage because "dig" is a common
subword. Bind is scored with a relevance of 2, even though bind's
description emphasises that dig is part of it.

This would improve the experience when searching with strings that
commonly occur as subwords.

Since this change can't occur in a vacuum, care should be taken not to
reduce the effectiveness of other reasonably foreseeable search queries.

-- 
Take it easy,
Richard Sent
Making my computer weirder one commit at a time.

Information forwarded to bug-guix <at> gnu.org:
bug#70689; Package guix. (Wed, 01 May 2024 13:46:01 GMT) Full text and rfc822 format available.

Message #8 received at 70689 <at> debbugs.gnu.org (full text, mbox):

From: bokr <at> bokr.com
To: Richard Sent <richard <at> freakingpenguin.com>
Cc: 70689 <at> debbugs.gnu.org
Subject: Re: bug#70689: guix search doesn't weigh word matches higher than
 subword matches
Date: Wed, 1 May 2024 15:45:05 +0200

I like your proposal :)

I'm wondering how [1] compares in what it does for your use(ful) case.
(I am not familiar with Hyper Estraier beyond being prompted for gnu.org searching)

[1] <https://directory.fsf.org/wiki/Hyper_Estraier>

--
Regards,
Bengt Richter

Information forwarded to bug-guix <at> gnu.org:
bug#70689; Package guix. (Fri, 13 Sep 2024 07:16:02 GMT) Full text and rfc822 format available.

Message #11 received at 70689 <at> debbugs.gnu.org (full text, mbox):

From: aurtzy <aurtzy <at> gmail.com>
To: 70689 <at> debbugs.gnu.org
Cc: Richard Sent <richard <at> freakingpenguin.com>, bokr <at> bokr.com
Subject: Re: guix search doesn't weigh word matches higher than subword matches
Date: Fri, 13 Sep 2024 03:13:41 -0400

Hi Richard and bokr,

I've proposed changes to relevance scoring that should help with this 
issue, if you'd like to try it out here: https://issues.guix.gnu.org/73220

Cheers,

aurtzy

Information forwarded to bug-guix <at> gnu.org:
bug#70689; Package guix. (Fri, 13 Sep 2024 15:10:02 GMT) Full text and rfc822 format available.

Message #14 received at 70689 <at> debbugs.gnu.org (full text, mbox):

From: Simon Tournier <zimon.toutoune <at> gmail.com>
To: Richard Sent <richard <at> freakingpenguin.com>
Cc: aurtzy <aurtzy <at> gmail.com>, 70689 <at> debbugs.gnu.org
Subject: Re: bug#70689: guix search doesn't weigh word matches higher than
 subword matches
Date: Fri, 13 Sep 2024 17:08:19 +0200

Hi,

On Tue, 30 Apr 2024 at 22:18, Richard Sent <richard <at> freakingpenguin.com> wrote:

>> Inetutils is a collection of common network programs, such as an ftp
>> client and server, a telnet client and server, an rsh client and
>> server, and hostname.
>
> Most likely, this is what the user is interested in. However, inetutils
> does not show up until roughly the 75th result, with a relevance of 2
> (the lowest possible relevance).

Using Guix 056910e, I get:

    $ guix search rsh | recsel -CP name | grep -n inetutils
    76:inetutils

Then using the proposed v2 patch#73220 [1], I get:

    $ ./pre-inst-env guix search rsh | recsel -CP name | grep -n inetutils
    34:inetutils

Well, that’s not perfect but a bit better.

> Almost every search result beforehand contains the string "rsh" as a
> component of another word, such as "marshaling", "powershell", and
> "hershey". However, these match multiple times and are weighted
> significantly higher.

Well, if we consider the current implementation, the relevance score of
the highest-ranked result reads:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 2 * 1   synopsis
        + 2 * 4 * 1   description
        + 1 * 0       file-name
        = 14

where each line means: field-weight * number of matches * weight-match

Compared to inetutils:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 0       synopsis
        + 2 * 1 * 1   description
        + 1 * 0       file-name
        = 2
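
The two tallies above can be reproduced with a short Python sketch. This is only an illustration of the arithmetic, not Guix's actual Scheme code: the field weights and match counts are the ones quoted in this message, while `FIELD_WEIGHTS` and `relevance` are hypothetical names.

```python
# Field weights as quoted in the tallies above.
FIELD_WEIGHTS = {
    "name": 4,
    "upstream-name": 2,
    "outputs": 1,
    "synopsis": 3,
    "description": 2,
    "file-name": 1,
}

def relevance(match_counts, weight_match=1):
    """Sum field-weight * matches * weight-match over the fields that matched."""
    return sum(FIELD_WEIGHTS[field] * count * weight_match
               for field, count in match_counts.items())

top_result = {"synopsis": 2, "description": 4}   # subword matches only
inetutils = {"description": 1}                   # one whole-word match

print(relevance(top_result))  # 14
print(relevance(inetutils))   # 2
```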

Well, this case cannot be improved much.  First, the field-weights are
almost optimal [2].  Second, the number of occurrences depends on the
description; maybe it could be improved, I have not checked yet.

And v2 of #73220 replaces the value of weight-match: the term ’rsh’ in
“an rsh client” should have a higher score than in “uses `json.Marshal'
and `json.Unmarshal'”.

In other words, with v2 the inetutils score reads:

          4 * 0       name
        + 2 * 0       upstream-name
        + 1 * 0       outputs
        + 3 * 0       synopsis
        + 2 * 1 * 3   description
        + 1 * 0       file-name
        = 6

I think this addresses your suggestion.

> Ideally, guix search should rate inetutils higher because the string
> "rsh" occurs as its own word, not as a component of another, unrelated
> word. (Very, very few people would search "rsh" looking for matches with
> "hershey", even if "hershey" occurs multiple times.)

Again, considering the case at hand: if, instead of the 3 arbitrarily
picked in v2 of #73220, we picked 7, then inetutils would be ranked first.

Yeah, maybe 3 isn’t enough… And maybe 7 is a good choice.
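
The choice between 3 and 7 comes down to simple arithmetic, sketched below in Python. It assumes, for illustration only, that the top subword-only result keeps its score of 14 once the patch is applied; `whole_word_score` and `smallest_weight_reaching` are hypothetical helpers, not part of Guix.

```python
import math

DESCRIPTION_WEIGHT = 2   # field weight for the description, as quoted above
TOP_SUBWORD_SCORE = 14   # assumed to stay unchanged under the patch

def whole_word_score(weight_match, matches=1, field_weight=DESCRIPTION_WEIGHT):
    """Score of one field containing `matches` whole-word hits."""
    return field_weight * matches * weight_match

def smallest_weight_reaching(target, matches=1, field_weight=DESCRIPTION_WEIGHT):
    """Smallest integer weight-match whose score is at least `target`."""
    return math.ceil(target / (field_weight * matches))

for weight in (1, 3, 7):
    print(weight, whole_word_score(weight))   # 1 -> 2, 3 -> 6, 7 -> 14

print(smallest_weight_reaching(TOP_SUBWORD_SCORE))  # 7
```

Under that assumption, 7 is the smallest weight at which a single whole-word match in the description reaches the score of 14; it equals that score rather than exceeding it.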

Do you have other examples than ’rsh’?

> Another example of where this can happen is with "dig", part of the bind
> package. Searching for "dig" returns garbage because "dig" is a common
> subword. Bind is scored with a relevance of 2, even though bind's
> description emphasises that dig is part of it.

Please note that using v2 of #73220 with a weight of 7, bind is
returned “third”, with a relevance of 14 (behind 24 and 20).

However, it appears 8th in the list because the order of packages
having the same relevance score is arbitrary.  It just depends on how
the modules are walked.  Therefore, we cannot do much, IMHO.
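
To illustrate how ties can push a package down the list, here is a small Python sketch. The package names other than bind and the two traversal orders are invented; only the scores 24, 20 and 14 come from the observation above. It relies on Python's `sorted()` being stable and makes no claim about how Guix itself sorts.

```python
def rank(packages):
    """Sort (name, relevance) pairs by decreasing relevance.

    sorted() is stable, so packages with equal relevance keep the
    order in which they were supplied.
    """
    return sorted(packages, key=lambda pkg: pkg[1], reverse=True)

def position(name, ranked):
    """1-based position of `name` in a ranked list."""
    return [pkg_name for pkg_name, _ in ranked].index(name) + 1

# Hypothetical traversal order in which five other packages scoring 14
# happen to be visited before bind.
walk_a = [("other-1", 14), ("other-2", 14), ("other-3", 14), ("other-4", 14),
          ("other-5", 14), ("bind", 14), ("top-a", 24), ("top-b", 20)]
# Same packages, but bind is visited first.
walk_b = [("bind", 14)] + [pkg for pkg in walk_a if pkg[0] != "bind"]

print(position("bind", rank(walk_a)))  # 8
print(position("bind", rank(walk_b)))  # 3
```

The scores are identical in both runs; only the order in which the packages were visited changes where bind lands among the entries scoring 14.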

Cheers,
simon

1: https://issues.guix.gnu.org/73220#1

2: Re: Search improvements (Was: Opposition to new single-letter package name "t")
zimoun <zimon.toutoune <at> gmail.com>
Tue, 09 Mar 2021 19:37:23 +0100
id:CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw <at> mail.gmail.com
https://lists.gnu.org/archive/html/guix-devel/2021-03
https://yhetil.org/guix/CAJ3okZ3+hn0nJP98OhnZYLWJvhLGpdTUK+jB0hoM5JArQxO=zw <at> mail.gmail.com

This bug report was last modified 329 days ago.
