GNU bug report logs - #42162
gforge.inria.fr to be taken off-line in Dec. 2020


Package: guix;

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Thu, 2 Jul 2020 07:34:01 UTC

Severity: important

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.



Message #79 received at 42162 <at> debbugs.gnu.org:

From: zimoun <zimon.toutoune <at> gmail.com>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: 42162 <at> debbugs.gnu.org,
 Maurice Brémond <Maurice.Bremond <at> inria.fr>,
 Ludovic Courtès <ludo <at> gnu.org>
Subject: Re: bug#42162: Recovering source tarballs
Date: Thu, 27 Aug 2020 11:41:24 +0200

Hi,

On Wed, 26 Aug 2020 at 17:11, Timothy Sample <samplet <at> ngyro.com> wrote:
> zimoun <zimon.toutoune <at> gmail.com> writes:
>
>> One question is how this database scales?
>>
>> For example, a quick back-of-the-envelope estimate gives ~1.2GB of
>> metadata for ~14k packages and then an increase of ~700MB per year,
>> both with Ludo’s code [1].
>>
>> [1] <http://issues.guix.gnu.org/issue/42162#11>
>
> It’s a good question.  A good part of the size comes from the
> representation rather than the data.  Compression helps a lot here.  I
> have a database of 3,912 packages.  It’s 295M uncompressed (which is a
> little better than your estimation).  If I pass each file through Lzip,
> it shrinks down to 60M.  That’s more like 15.5K per package, which is
> almost an order of magnitude smaller than the estimation you used
> (120K).  I think that makes the numbers rather pleasant, but it comes at
> the expense of easy storing in Git.

Thank you for these numbers.  Really interesting!
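
For what it is worth, the per-package figures can be re-derived with a
few lines of Python.  This is only a rough sketch: it assumes that “M”
and “K” are binary units (MiB, KiB) and that the totals quoted above are
rounded, so the result is close to, but not exactly, the 15.5K you give.
The helper names are made up for the illustration.

```python
# Back-of-the-envelope check of the sizes quoted in the message.
# Assumption: "M" and "K" are binary units (MiB and KiB).
KIB_PER_MIB = 1024

packages = 3912          # packages in the database
uncompressed_mib = 295   # total size, uncompressed
lzip_mib = 60            # total size, each file passed through Lzip


def kib_per_package(total_mib, n_packages):
    """Average size of one package's metadata, in KiB."""
    return total_mib * KIB_PER_MIB / n_packages


def projected_mib(kib_each, n_packages):
    """Total size in MiB if every package costs `kib_each` KiB."""
    return kib_each * n_packages / KIB_PER_MIB


raw_kib = kib_per_package(uncompressed_mib, packages)
lz_kib = kib_per_package(lzip_mib, packages)

print(f"uncompressed: {raw_kib:.1f} KiB per package")
print(f"lzip:         {lz_kib:.1f} KiB per package")
print(f"~14k packages with lzip: {projected_mib(lz_kib, 14000):.0f} MiB")
```

The last line is a naive linear extrapolation to ~14k packages; it
ignores any difference in tarball size between the sampled packages and
the rest.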

First, I do not know if the database needs to be stored with Git.  What
would be the advantage? (naive question :-))


On SWH T2430 [1], you explain the “default-header” trick to cut down the
size.  Nice!

Moreover, the format is a long list, e.g.,

--8<---------------cut here---------------start------------->8---
(headers
    ((name "raptor2-2.0.15/")
     (mode 493)
     (mtime 1414909500)
     (chksum 4225)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/")
     (mode 493)
     (mtime 1414909497)
     (chksum 4797)
     (typeflag 53))
    ((name "raptor2-2.0.15/build/ltversion.m4")
     (size 690)
     (mtime 1414908273)
     (chksum 5958))

     […])
--8<---------------cut here---------------end--------------->8---

which is human-readable.  Is it useful?


Instead, one could imagine shorter keywords:

    ((na "raptor2-2.0.15/")
     (mo 493)
     (mt 1414909500)
     (ch 4225)
     (ty 53))

which, using your database (commit fc50927), reduces the size from 295MB
to 279MB.
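
To make the saving concrete for a single entry, here is a minimal
sketch in Python.  It is not the code used to produce the numbers
above: the `SHORT` table only contains the five abbreviations shown in
this message, and `to_sexp` and `shorten` are hypothetical helpers
written for the illustration (the serializer is naive and does not
escape quotes inside strings).

```python
# Hypothetical abbreviation table: only these five pairs appear in the
# message; fields without an abbreviation (e.g. "size") are kept as-is.
SHORT = {
    "name": "na",
    "mode": "mo",
    "mtime": "mt",
    "chksum": "ch",
    "typeflag": "ty",
}


def to_sexp(header):
    """Serialize a list of (key, value) pairs as one s-expression."""
    fields = []
    for key, value in header:
        text = f'"{value}"' if isinstance(value, str) else str(value)
        fields.append(f"({key} {text})")
    return "(" + " ".join(fields) + ")"


def shorten(header):
    """Replace each long keyword by its abbreviation, if it has one."""
    return [(SHORT.get(key, key), value) for key, value in header]


header = [
    ("name", "raptor2-2.0.15/"),
    ("mode", 493),
    ("mtime", 1414909500),
    ("chksum", 4225),
    ("typeflag", 53),
]

long_text = to_sexp(header)
short_text = to_sexp(shorten(header))
saved = len(long_text) - len(short_text)

print(long_text)
print(short_text)
print(f"saved {saved} bytes on this entry")
```

The gain per entry is just the sum of the characters dropped from each
keyword, which is why the overall reduction stays modest.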

Or even a plain list:

   (\x00 "raptor2-2.0.15/" 493 1414909500 4225 53)
   (\x01 "raptor2-2.0.15/build/ltversion.m4" 690 1414908273 5958)

where the first element gives the “type” of the list, to make things
easier for the reader.
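
A reader for such tagged lists could look like the following Python
sketch.  It is only an assumption about how the idea might be
implemented: the `LAYOUTS` table contains just the two field orders
that can be inferred from the two example lines, and `encode` and
`decode` are hypothetical names.

```python
# Tag -> field order.  Only the two layouts that can be read off the
# two example lines are listed; a real table would need more entries.
LAYOUTS = {
    0: ("name", "mode", "mtime", "chksum", "typeflag"),
    1: ("name", "size", "mtime", "chksum"),
}


def encode(header):
    """Turn a dict of header fields into a tuple (tag, value, ...)."""
    for tag, fields in LAYOUTS.items():
        if set(fields) == set(header):
            return (tag,) + tuple(header[field] for field in fields)
    raise ValueError(f"no layout for fields {sorted(header)}")


def decode(record):
    """Inverse of encode: use the tag to restore the field names."""
    tag, *values = record
    fields = LAYOUTS[tag]
    if len(values) != len(fields):
        raise ValueError("record length does not match its layout")
    return dict(zip(fields, values))


directory = {
    "name": "raptor2-2.0.15/",
    "mode": 493,
    "mtime": 1414909500,
    "chksum": 4225,
    "typeflag": 53,
}
regular = {
    "name": "raptor2-2.0.15/build/ltversion.m4",
    "size": 690,
    "mtime": 1414908273,
    "chksum": 5958,
}

for entry in (directory, regular):
    record = encode(entry)
    assert decode(record) == entry  # the round trip loses nothing
    print(record)
```

The keywords disappear from every entry and live only once, in the
layout table, at the price of a format that is harder to read by eye.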


Well, the 2 naive questions are: does it make sense to
 - have the database stored under Git?
 - have a human-readable format?


Thank you again for pushing this topic forward. :-)

All the best,
simon

[1] https://forge.softwareheritage.org/T2430#47522




This bug report was last modified 2 years and 288 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.