Package: guix-patches;
Reported by: aurtzy <aurtzy <at> gmail.com>
Date: Fri, 13 Sep 2024 07:04:02 UTC
Severity: normal
Tags: patch
Message #11 received at 73220 <at> debbugs.gnu.org (full text, mbox):
From: Simon Tournier <zimon.toutoune <at> gmail.com> To: aurtzy <aurtzy <at> gmail.com>, 73220 <at> debbugs.gnu.org Cc: Josselin Poiret <dev <at> jpoiret.xyz>, aurtzy <aurtzy <at> gmail.com>, Mathieu Othacehe <othacehe <at> gnu.org>, Ludovic Courtès <ludo <at> gnu.org>, Tobias Geerinckx-Rice <me <at> tobias.gr>, Christopher Baines <guix <at> cbaines.net> Subject: Re: [bug#73220] [PATCH] ui: Add more nuance to relevance scoring. Date: Fri, 13 Sep 2024 16:12:19 +0200
Hi, On Fri, 13 Sep 2024 at 03:02, aurtzy <aurtzy <at> gmail.com> wrote: > Fixes <https://issues.guix.gnu.org/70689>. Thanks! > | Keyword(s) with poor | Expectations | > | results before | | > |-----------------------+-----------------------------------------------| > | dig | ~bind~ near top. | Hum, indeed and I do not know if we can improve here. Well, it’s hard to improve for short terms, BTW. --8<---------------cut here---------------start------------->8--- $ ./pre-inst-env guix search dig | recsel -p name,relevance | head -8 name: go-go-uber-org-dig relevance: 104 name: rust-num-bigint-dig relevance: 78 name: rust-num-bigint-dig relevance: 78 --8<---------------cut here---------------end--------------->8--- Compared to current: --8<---------------cut here---------------start------------->8--- $ guix search dig | recsel -p name,relevance | head -8 name: sysdig relevance: 24 name: texlive-pedigree-perl relevance: 13 name: ruby-net-http-digest-auth relevance: 13 --8<---------------cut here---------------end--------------->8--- Indeed, 17th position is better than 609th. But if you add a term as ’dns’, bang! :-) Well, BTW the description of ’bind’ could be a bit improved because the word network does not appear. Anyway. :-) Hum, why this: guix search ' dig$' dig | recsel -p name,relevance | head -8 does not return the package ’bind’? > | rsh | ~inetutils~ near top. | --8<---------------cut here---------------start------------->8--- $ ./pre-inst-env guix search rsh | recsel -p name,relevance | head -8 name: inetutils relevance: 26 name: emacs-tramp relevance: 26 name: rust-borsh-schema-derive-internal relevance: 22 --8<---------------cut here---------------end--------------->8--- Compared to current: --8<---------------cut here---------------start------------->8--- $ guix search rsh | recsel -p name,relevance | head -8 name: go-sigs-k8s-io-yaml relevance: 14 name: python-pymarshal relevance: 13 name: emacs-powershell relevance: 13 --8<---------------cut here---------------end--------------->8--- > | c | C language related. | > | c compiler | Compiler-related C stuff. | This cannot be improved. > | r | R language related. | Usually, I add the prefix ^r\- and I do not have issue with search for r packages. For instance, search ^r\- keyword and it works well. $ guix search ^r\- cyto | recsel -CP name | cut -f1 -d'-' | uniq -c 29 r Somehow, I do not think we can improve here. I mean, the improvement is to document the usage of prefixes. Similarly for ghc (haskell), ocaml, python, etc. > | tor | Tor related; ~torbrowser~ somewhere near top. | --8<---------------cut here---------------start------------->8--- $ ./pre-inst-env guix search tor | recsel -p name,relevance | head -8 name: tor relevance: 208 name: tor-client relevance: 169 name: torsocks relevance: 103 --8<---------------cut here---------------end--------------->8--- Compared to current: --8<---------------cut here---------------start------------->8--- $ guix search tor | recsel -p name,relevance | head -8 name: tor relevance: 47 name: ghc-storablevector relevance: 29 name: tor-client relevance: 28 --8<---------------cut here---------------end--------------->8--- However, the position move from 225th to 19th. $ guix search tor | recsel -P name | grep -n torbrowser 225:torbrowser $ ./pre-inst-env guix search tor | recsel -P name | grep -n torbrowser 19:torbrowser Similarly as ’dig’, the description of ’torbrowser’ package could be improvement. Because ’guix search tor browser’ returns nothing. > | gcc | ~gcc-toolchain~ near top. | Indeed, something is unexpected. Well, first: $ guix search gcc | recsel -CP name | uniq | head -8 gccgo gfortran-toolchain gdc-toolchain gcc-toolchain gcc-cross-x86_64-w64-mingw32-toolchain gcc-cross-or1k-elf-toolchain gcc-cross-i686-w64-mingw32-toolchain gcc-cross-avr-toolchain $ guix search gcc | recsel -CP name | uniq -c | sort -rn | head -8 18 llvm 12 gcc-toolchain 6 libgccjit 6 gccgo 3 isl 2 libstdc++-doc 2 java-commons-cli 2 gdc-toolchain Other said, the packages with multi-versions decrease the experience. Well, that had already by “improved” [1] with some SEO. ;-) Indeed, maybe the relevance should be improved. Second, gccgo has a relevance score of 22 with the only term ’gcc’, compared to gcc-toolchain scoring at 15. gccgo gcc-toolchain 4 * 1 * 1 4 * 1 * 1 + 2 * 5 * 1 + 2 * 1 * 1 + 1 * 0 + 1 * 0 + 3 * 1 * 1 + 3 * 1 * 1 + 2 * 0 + 2 * 1 * 3 + 1 * 5 * 1 + 1 * 0 = 22 = 15 This is unexpected. And, IMHO that’s bug! In the description of gcc-toolchain, the term ’gcc’ appears 3 times but it only score with ’1’ instead of ’5’. As the patch try to address, the main issue is: (define (score regexp str) (fold-matches regexp str 0 (lambda (m score) (+ score (if (string=? (match:substring m) str) 5 ;exact match 1))))) Here the exact match does not consider a substring exact match. For instance, one would consider that the term ’gcc’ exactly matches in “some GCC thing”. Considering the current implementation, that’s not the case. For instance, a snippet as the procedure ’scoring’: --8<---------------cut here---------------start------------->8--- scheme@(guix-user)> ,use(ice-9 regex) scheme@(guix-user)> (define regexp (make-regexp "gcc" regexp/icase)) scheme@(guix-user)> (define str "some GCC thing") scheme@(guix-user)> (fold-matches regexp str 0 (lambda (m res) (+ res (if (string=? (match:substring m) str) 5 1)))) $2 = 1 --8<---------------cut here---------------end--------------->8--- See v2 for my proposal fixing this. Please note that this v2 gives the same ranking for torbrowser. And also improve the situation with gcc-toolchain. --8<---------------cut here---------------start------------->8--- $ ./pre-inst-env guix search gcc | recsel -CP name | grep -n gcc-toolchain 1:gcc-toolchain 2:gcc-toolchain 3:gcc-toolchain 4:gcc-toolchain 5:gcc-toolchain 6:gcc-toolchain 7:gcc-toolchain 8:gcc-toolchain 9:gcc-toolchain 10:gcc-toolchain 11:gcc-toolchain 12:gcc-toolchain $ ./pre-inst-env guix search tor | recsel -CP name | grep -n torbrowser 7:torbrowser $ ./pre-inst-env guix search dig | recsel -CP name | grep -n bind 44:bind --8<---------------cut here---------------end--------------->8--- However, inetutils is still at 44th with the only one term ’rsh’. I would suggest to do some tweak with the description. Bah maybe it is then a bit slower on cold caches? Hum?! Well, I have not investigated, neither with your patch. :-) Well, that something that could be investigated; especially the performance of ’char-set’ operations. 1: https://issues.guix.gnu.org/43342 > I opted to switch to counting a maximum of one match per field, which helps > with cases where a common subword matches /many/ times in packages with longer > descriptions, pushing more relevant packages down. In multi-term searches, > the unique terms - which are naturally rarer - also contribute to a larger > percentage of the score as a result of these changes. > Having matches with only one word boundary be scored as 2 instead of 1 was > done with the reasoning that a term is more likely to be part of a compound > word name (and thus more relevant) if it is a prefix or suffix; for example, > "gl" in OpenGL, "borg" in borgmatic, and "tor" in torbrowser. [...] > Closing this message on an unrelated note for future work: I stumbled on an > interesting idea while looking for test cases which suggested reducing the > score of a programming library when its language is not included in search > terms. It's out of scope for the current issue, but I thought I'd mention it > anyways for potential further improvements. Well, years ago I thought about implementing TF-IDF [2,3]. Other ideas [4] are floating around. Then, we spent some time for making “guix search” faster [5] and today my TODO is about having an extension relying on Guile-Xapian. Therefore, I would prefer keep the ’relevance’ more or less predictable by only counting the number of occurrences and apply some weights. Else, for what my opinion is worth, the direction would not be to re-invent an algorithm but maybe implement some already well-known ones. TF-IDF [3] is one or Okapi-BM25 is another one, etc. In all in all, that what Xapian provides. ;-) And it does it very well! That’s why I would be tempted to have a Guix extension relying on Guile-Xapin for indexing and searching (fast!). Cheers, simon 2: Re: Organizing packages zimoun <zimon.toutoune <at> gmail.com> Tue, 16 Jul 2019 19:04:26 +0200 id:CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg <at> mail.gmail.com https://lists.gnu.org/archive/html/guix-devel/2019-07 https://yhetil.org/guix/CAJ3okZ0LaJzWDBA7bjqZew_jAmtt1rj9PJhevwrtBiA_COXENg <at> mail.gmail.com 3: https://en.wikipedia.org/wiki/Tf%E2%80%93idf 4: Inverted index to accelerate guix package search Arun Isaac <arunisaac <at> systemreboot.net> Sun, 12 Jan 2020 20:33:51 +0530 id:cu7h810emy0.fsf <at> systemreboot.net https://lists.gnu.org/archive/html/guix-devel/2020-01 https://yhetil.org/guix/cu7h810emy0.fsf <at> systemreboot.net 5: [bug#39258] Faster guix search using an sqlite cache Arun Isaac <arunisaac <at> systemreboot.net> Fri, 24 Jan 2020 01:21:57 +0530 id:cu7pnfaar36.fsf <at> systemreboot.net https://issues.guix.gnu.org/39258 https://issues.guix.gnu.org/msgid/cu7pnfaar36.fsf <at> systemreboot.net https://yhetil.org/guix/cu7pnfaar36.fsf <at> systemreboot.net 6: https://en.wikipedia.org/wiki/Okapi_BM25
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.