GNU bug report logs - #39258
Faster guix search using an sqlite cache

Package: guix-patches;

Reported by: Arun Isaac <arunisaac <at> systemreboot.net>

Date: Thu, 23 Jan 2020 19:53:02 UTC

Severity: important

Done: Arun Isaac <arunisaac <at> systemreboot.net>

Bug is archived. No further changes may be made.

From: zimoun <zimon.toutoune <at> gmail.com>
To: 39258 <at> debbugs.gnu.org
Cc: arunisaac <at> systemreboot.net, mail <at> ambrevar.xyz, ludo <at> gnu.org, zimoun <zimon.toutoune <at> gmail.com>
Subject: [bug#39258] [PATCH v4 0/3] Faster cache generation (similar as v3)
Date: Sun,  3 May 2020 17:01:51 +0200
Dear,

The aim of v4 is to keep the same search performance as v3 while drastically reducing the time needed to generate the cache.  On my laptop, the overhead is now about 4 seconds, compared to more than 20 seconds for v2 and v3.

--8<---------------cut here---------------start------------->8---
# default
time guix build /gnu/store/0nfpp82mqglpwvl1nbfpaphw5db2ivcp-guix-package-cache.drv --check
# v4
time guix build /gnu/store/y78gfh1n7m3kyrj8wsqj25qc2cbc1a4d-guix-package-cache.drv --check
--8<---------------cut here---------------end--------------->8---

|      | default  | v4        |
|------+----------+-----------|
| real | 0m6.012s | 0m10.244s |
| user | 0m0.541s | 0m0.542s  |
| sys  | 0m0.033s | 0m0.032s  |


In v3, the cache is built using 'cons' and 'fold-packages' (a wrapper around 'fold-module-public-variables').  v4 instead modifies the existing procedure 'generate-package-cache', which uses 'vhash' and 'fold-module-public-variables*', so that it stores additional information.

Therefore the cache '/lib/guix/package.cache' contains more information.  (The v4 structure of 'package.cache' is a quick draft, so the details should be discussed; an interesting move would be to have a structured (binary and all strings) S-exp, because it could become an entry point for exporting the package list to JSON.  WDYT?)
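
To make the difference between the two approaches concrete, here is a minimal sketch, assuming toy entries of the form (name version synopsis).  The entry layout and the helper names 'lookup-list' and 'lookup-vhash' are hypothetical; this is not the actual layout of 'package.cache' nor the code of 'generate-package-cache'.

```scheme
(use-modules (ice-9 vlist)
             (srfi srfi-1))

;; Hypothetical cache entries: (name version synopsis).  Toy values only.
(define entries
  '(("hello" "2.10" "Hello, GNU world")
    ("libb2" "0.98.1" "BLAKE2 hash library")))

;; List-based accumulation: every entry is consed onto a plain list.
(define list-cache
  (fold cons '() entries))

;; vhash-based accumulation: entries are indexed by name.
(define vhash-cache
  (fold (lambda (entry result)
          (vhash-cons (first entry) (cdr entry) result))
        vlist-null
        entries))

;; Lookup by name is a linear scan over the list ...
(define (lookup-list name)
  (find (lambda (entry) (string=? (first entry) name)) list-cache))

;; ... and a hash lookup in the vhash.  'vhash-assoc' returns a
;; (key . value) pair or #f, so keep only the value.
(define (lookup-vhash name)
  (and=> (vhash-assoc name vhash-cache) cdr))
```

Both structures are persistent, so either one fits a fold; the vhash only pays off when entries have to be looked up by key while the cache is being built.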


Now we are comparing apples to apples, and the cost of computing BM25 (v2) is not free at all.  Remember that BM25 is the state of the art in information retrieval (relevance ranking) and that it is delegated to Xapian (v2).  I do not know whether there is a performance bottleneck between Guix, Guile-Xapian and Xapian itself, but for sure the computation of BM25 is not free.  More about that soon.

To be clear about BM25 and caching, what I have in mind is:
  1. "guix search --build-index" optionally run by the user if they want, for example, the BM25 ranking.
  2. Use BM25 metrics to detect poor package meta-data (synopsis and description); if it is worthwhile, why not add another checker to "guix lint".

However, ranking is another story and I am not yet convinced whether BM25 fits Guix's needs.
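
For reference, the cost of BM25 comes from evaluating a score of the following kind for every query term and every candidate package.  This is a minimal sketch of the textbook Okapi BM25 formula, assuming a toy in-memory corpus; the names 'corpus', 'bm25' and 'rank' and the parameters k1 = 1.2 and b = 0.75 are illustrative choices, and this is not how Xapian or Guile-Xapian implement it.

```scheme
(use-modules (srfi srfi-1))

;; Hypothetical toy corpus: package name paired with its tokenized
;; description.
(define corpus
  '(("libb2"  . ("crypto" "library" "hash" "blake2"))
    ("hello"  . ("hello" "world" "program"))
    ("libfoo" . ("library" "for" "foo" "things" "and" "more" "things"))))

;; Usual free parameters of BM25 (assumed values).
(define k1 1.2)
(define b 0.75)

;; Number of occurrences of TERM in TOKENS.
(define (term-frequency term tokens)
  (count (lambda (token) (string=? token term)) tokens))

;; Number of documents of the corpus containing TERM.
(define (document-frequency term)
  (count (lambda (doc) (member term (cdr doc))) corpus))

;; Average document length over the corpus.
(define average-length
  (/ (apply + (map (lambda (doc) (length (cdr doc))) corpus))
     (length corpus)))

;; Inverse document frequency, in the variant that stays non-negative.
(define (idf term)
  (let ((n (document-frequency term))
        (total (length corpus)))
    (log (+ 1 (/ (+ (- total n) 0.5) (+ n 0.5))))))

;; BM25 score of the document TOKENS for the list of terms QUERY.
(define (bm25 query tokens)
  (let ((len (length tokens)))
    (fold (lambda (term score)
            (let ((f (term-frequency term tokens)))
              (+ score
                 (* (idf term)
                    (/ (* f (+ k1 1))
                       (+ f (* k1 (+ (- 1 b)
                                     (* b (/ len average-length))))))))))
          0
          query)))

;; Pairs (name . score), sorted by decreasing score.
(define (rank query)
  (sort (map (lambda (doc) (cons (car doc) (bm25 query (cdr doc))))
             corpus)
        (lambda (x y) (> (cdr x) (cdr y)))))
```

With this corpus, (rank '("crypto" "library")) puts "libb2" first since it is the only entry matching both terms.  Even in this toy form, each query needs document frequencies and length statistics over the whole corpus, which is why an index has to be built beforehand.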



* Details
~~~~~~~~~

The patches apply on top of commit a357849f5b (they are not yet rebased).

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env guix pull --branch=search-v4 --url=$PWD -p /tmp/v4
--8<---------------cut here---------------end--------------->8---


Same test as in the previous benchmark (cold cache).

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env /tmp/v4/bin/guix search crypto library \
     | recsel -p name | grep libb2
name: libb2

real    0m0.784s
user    0m0.810s
sys     0m0.037s
--8<---------------cut here---------------end--------------->8---

The option '--load-path' turns off the cache, and the search falls back to the usual 'fold-packages'.

--8<---------------cut here---------------start------------->8---
time ./pre-inst-env /tmp/v4/bin/guix search -L /tmp/my-pkgs crypto library \
     | recsel -C -p name | grep libb2
name: libb2

real    0m2.446s
user    0m1.872s
sys     0m0.187s
--8<---------------cut here---------------end--------------->8---



* Still draft
~~~~~~~~~~~~~

 1. The name 'fold-packages*' may be misleading since it does not return "true" packages.

--8<---------------cut here---------------start------------->8---
(define (get-hello p r)
  (if (string=? (package-name p) "hello")
      p
      r))
(define no-cache   (fold-packages  get-hello '()))
(define from-cache (fold-packages* get-hello '()))

(equal? no-cache from-cache)
;;; #f
--8<---------------cut here---------------end--------------->8---

    Suggestions for another name are welcome if this is an issue.

 2. The function 'package->recutils' in 'guix/ui.scm' is modified, but the change is not ideal.

--8<---------------cut here---------------start------------->8---
          (match (package-supported-systems p)
            (('cache supported-systems)
             (string-join supported-systems))
            (_
             (string-join (package-transitive-supported-systems p)))))
--8<---------------cut here---------------end--------------->8---

    However, it avoids duplicating code, which is what v3 does.


 3. Deprecated packages are displayed (bug in v3 too).

 4. The impolite '@@' is used to access the private license constructor.

 5. Commit messages are incomplete, as are the copyright headers, etc.
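
Items 1 and 2 above can be illustrated with a minimal sketch, assuming a record rebuilt from the cache stores its supported systems as a list tagged with the symbol 'cache'.  The record type '<pkg>', its accessors and 'systems->string' are hypothetical stand-ins, not the Guix '<package>' record nor the actual code of 'package->recutils'.

```scheme
(use-modules (srfi srfi-9)
             (ice-9 match))

;; Hypothetical stand-in for Guix's <package> record.
(define-record-type <pkg>
  (make-pkg name version supported-systems)
  pkg?
  (name pkg-name)
  (version pkg-version)
  (supported-systems pkg-supported-systems))

;; A "true" package: the field holds a plain list of systems.
(define full
  (make-pkg "hello" "2.10" '("x86_64-linux" "i686-linux")))

;; Assumed shape of a record rebuilt from the cache: the field is tagged
;; with the symbol 'cache' and wraps the precomputed list.
(define from-cache
  (make-pkg "hello" "2.10" '(cache ("x86_64-linux" "i686-linux"))))

;; Same dispatch idea as in the 'match' shown in item 2: use the cached
;; list when the tag is present, otherwise fall back to the regular
;; field (the real code computes the transitive systems there).
(define (systems->string p)
  (match (pkg-supported-systems p)
    (('cache systems) (string-join systems))
    (systems (string-join systems))))
```

Both records print the same systems and answer to the same accessors, yet (equal? full from-cache) is false because their fields differ, which is the situation described in item 1.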



* Next?
~~~~~~~

IMHO, simply caching improves the current situation:

 - a bit of extra time at pull time (less than 5s on my machine)
 + speed up at search time (2x faster)
 * maintainable code?

Is this going in the right direction?
Could you advise on how to make the code more compliant?
Could you test on your machines to have another point of comparison?



Best regards,
simon


zimoun (3):
  DRAFT packages: Add fields to packages cache.
  DRAFT packages: Add new procedure 'fold-packages*'.
  DRAFT guix package: Use cache in 'find-packages-by-description'.

 gnu/packages.scm         | 98 ++++++++++++++++++++++++++++++++++++++--
 guix/scripts/package.scm |  2 +-
 guix/ui.scm              | 29 +++++++-----
 tests/packages.scm       | 31 +++++++++++++
 4 files changed, 143 insertions(+), 17 deletions(-)

-- 
2.26.1





This bug report was last modified 37 days ago.
