GNU bug report logs - #48437
[PATCH] lint: archival: Lookup content in Disarchive database.

Previous Next

Package: guix-patches;

Reported by: Ludovic Courtès <ludo <at> gnu.org>

Date: Sat, 15 May 2021 10:29:02 UTC

Severity: normal

Tags: patch

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 48437 in the body.
You can then email your comments to 48437 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to guix-patches <at> gnu.org:
bug#48437; Package guix-patches. (Sat, 15 May 2021 10:29:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ludovic Courtès <ludo <at> gnu.org>:
New bug report received and forwarded. Copy sent to guix-patches <at> gnu.org. (Sat, 15 May 2021 10:29:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: guix-patches <at> gnu.org
Cc: Timothy Sample <samplet <at> ngyro.com>,
 Ludovic Courtès <ludo <at> gnu.org>
Subject: [PATCH] lint: archival: Lookup content in Disarchive database.
Date: Sat, 15 May 2021 12:28:14 +0200
* guix/lint.scm (lookup-disarchive-spec): New procedure.
(check-archival): When 'lookup-content' returns #f, call
'lookup-disarchive-spec'.
* guix/download.scm (%disarchive-mirrors): Make public.
---
 guix/download.scm |  1 +
 guix/lint.scm     | 31 +++++++++++++++++++++++++++----
 2 files changed, 28 insertions(+), 4 deletions(-)

Hello!

This patch makes the ‘archival’ checker check the Disarchive database(s)
when SWH ‘lookup-content’ returns #f.  For example, before the patch,
we get:

  $ guix lint -c archival guile-json
  gnu/packages/guile.scm:622:12: guile-json <at> 4.5.2: source not archived on Software Heritage

After the patch, we get nothing (success) thanks to Disarchive metadata
available at:

  https://disarchive.ngyro.com/sha256/1ab046ec36b1c44c041ac275568d818784d71fab9a5d95f9128cfe8a25051933

It assumes that the swhid found in the Disarchive metadata is valid, a
reasonable assumption IMO.

Thoughts?

Ludo’.

diff --git a/guix/download.scm b/guix/download.scm
index 72094e7318..b6eb97e6fa 100644
--- a/guix/download.scm
+++ b/guix/download.scm
@@ -35,6 +35,7 @@
   #:use-module (srfi srfi-1)
   #:use-module (srfi srfi-26)
   #:export (%mirrors
+            %disarchive-mirrors
             (url-fetch* . url-fetch)
             url-fetch/executable
             url-fetch/tarbomb
diff --git a/guix/lint.scm b/guix/lint.scm
index 1bebfe03d3..c6ad54ddeb 100644
--- a/guix/lint.scm
+++ b/guix/lint.scm
@@ -30,6 +30,7 @@
 
 (define-module (guix lint)
   #:use-module (guix store)
+  #:autoload   (guix base16) (bytevector->base16-string)
   #:use-module (guix base32)
   #:use-module (guix diagnostics)
   #:use-module (guix download)
@@ -1227,6 +1228,23 @@ upstream releases")
                             #:field 'source)))))))
 
 
+(define (lookup-disarchive-spec hash)
+  "Return true if Disarchive mirrors have a spec for HASH, false otherwise."
+  (any (lambda (mirror)
+         (with-networking-fail-safe
+          (format #f (G_ "failed to access Disarchive database at ~a")
+                  mirror)
+          #f
+          (let* ((url (string-append mirror
+                                     (symbol->string
+                                      (content-hash-algorithm hash))
+                                     "/"
+                                     (bytevector->base16-string
+                                      (content-hash-value hash))))
+                 (response (http-head url)))
+            (= 200 (response-code response)))))
+       %disarchive-mirrors))
+
 (define (check-archival package)
   "Check whether PACKAGE's source code is archived on Software Heritage.  If
 it's not, and if its source code is a VCS snapshot, then send a \"save\"
@@ -1302,10 +1320,15 @@ try again later")
                                         (symbol->string
                                          (content-hash-algorithm hash)))
                    (#f
-                    (list (make-warning package
-                                        (G_ "source not archived on Software \
-Heritage")
-                                        #:field 'source)))
+                    ;; If SWH doesn't have HASH as is, it may be because it's
+                    ;; a hand-crafted tarball.  In that case, check whether
+                    ;; the Disarchive database has an entry for that tarball.
+                    (if (lookup-disarchive-spec hash)
+                        '()
+                        (list (make-warning package
+                                            (G_ "source not archived on Software \
+Heritage and missing from the Disarchive database")
+                                            #:field 'source))))
                    ((? content?)
                     '())))
                '()))))
-- 
2.31.1





Information forwarded to guix-patches <at> gnu.org:
bug#48437; Package guix-patches. (Tue, 18 May 2021 03:20:02 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Timothy Sample <samplet <at> ngyro.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: guix-patches <at> gnu.org
Subject: Re: [PATCH] lint: archival: Lookup content in Disarchive database.
Date: Mon, 17 May 2021 23:19:01 -0400
Hello,

Ludovic Courtès <ludo <at> gnu.org> writes:

> This patch makes the ‘archival’ checker check the Disarchive database(s)
> when SWH ‘lookup-content’ returns #f.  [...]
>
> It assumes that the swhid found in the Disarchive metadata is valid, a
> reasonable assumption IMO.
>
> Thoughts?

One thing to consider is that just because Disarchive has captured an
archive’s metadata and computed the SWHID of its contents doesn’t mean
that the contents are actually in the SWH archive.  (Maybe that’s what
you meant when you wrote about valid IDs above.)  It would be neat if
the lint check looked up the SWHID to see if it exists.  Unfortunately,
Disarchive doesn’t make getting the underlying SWHID easy at the moment.
One option would be to pass a resolver to “disarchive-assemble” that
exfiltrates the ID using “set!”.  Another one would be to “read” the
specification and search for a form like ‘(swhid "swh:1:dir:...")’.
Neither is particularly lovely....

Other than that, the code looks good and everything seems to work.  :)


-- Tim




Information forwarded to guix-patches <at> gnu.org:
bug#48437; Package guix-patches. (Tue, 18 May 2021 21:48:02 GMT) Full text and rfc822 format available.

Message #11 received at 48437 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Timothy Sample <samplet <at> ngyro.com>
Cc: 48437 <at> debbugs.gnu.org
Subject: Re: bug#48437: [PATCH] lint: archival: Lookup content in Disarchive
 database.
Date: Tue, 18 May 2021 23:47:45 +0200
Hi!

Timothy Sample <samplet <at> ngyro.com> skribis:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>
>> This patch makes the ‘archival’ checker check the Disarchive database(s)
>> when SWH ‘lookup-content’ returns #f.  [...]
>>
>> It assumes that the swhid found in the Disarchive metadata is valid, a
>> reasonable assumption IMO.
>>
>> Thoughts?
>
> One thing to consider is that just because Disarchive has captured an
> archive’s metadata and computed the SWHID of its contents doesn’t mean
> that the contents are actually in the SWH archive.  (Maybe that’s what
> you meant when you wrote about valid IDs above.)

Yes, I thought we could assume the contents were necessarily in the
archive.

> It would be neat if the lint check looked up the SWHID to see if it
> exists.  Unfortunately, Disarchive doesn’t make getting the underlying
> SWHID easy at the moment.  One option would be to pass a resolver to
> “disarchive-assemble” that exfiltrates the ID using “set!”.  Another
> one would be to “read” the specification and search for a form like
> ‘(swhid "swh:1:dir:...")’.  Neither is particularly lovely....

Hmm yeah.  There’s no API to deserialize the (disarchive …) sexp as a
record, right?

> Other than that, the code looks good and everything seems to work.  :)

Maybe we can assume (with a comment) that the SWHID points to a valid
content, and when creating the Disarchive database we actually make sure
this is the case?

More generally, we need to talk about that database, how to create it
and maintain it.  :-)

Thanks,
Ludo’.




Information forwarded to guix-patches <at> gnu.org:
bug#48437; Package guix-patches. (Fri, 21 May 2021 10:28:01 GMT) Full text and rfc822 format available.

Message #14 received at 48437 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 48437 <at> debbugs.gnu.org
Cc: Timothy Sample <samplet <at> ngyro.com>,
 Ludovic Courtès <ludo <at> gnu.org>
Subject: [PATCH v2] lint: archival: Lookup content in Disarchive database.
Date: Fri, 21 May 2021 12:27:03 +0200
* guix/lint.scm (lookup-disarchive-spec): New procedure.
(check-archival): When 'lookup-content' returns #f, call
'lookup-disarchive-spec'.  Call 'lookup-directory' on the result of
'lookup-directory'.
* guix/download.scm (%disarchive-mirrors): Make public.
* tests/lint.scm ("archival: missing content"): Set
'%disarchive-mirrors'.
("archival: content unavailable but disarchive available"): New test.
---
 guix/download.scm |  1 +
 guix/lint.scm     | 62 ++++++++++++++++++++++++++++++++++++++++++++---
 tests/lint.scm    | 34 +++++++++++++++++++++++---
 3 files changed, 89 insertions(+), 8 deletions(-)

Hi!

This new version checks that the SWH IDs that appear in a Disarchive
entry are indeed available at archive.softwareheritage.org.

It also adds a test for that.

Ludo'.

diff --git a/guix/download.scm b/guix/download.scm
index 72094e7318..b6eb97e6fa 100644
--- a/guix/download.scm
+++ b/guix/download.scm
@@ -35,6 +35,7 @@
   #:use-module (srfi srfi-1)
   #:use-module (srfi srfi-26)
   #:export (%mirrors
+            %disarchive-mirrors
             (url-fetch* . url-fetch)
             url-fetch/executable
             url-fetch/tarbomb
diff --git a/guix/lint.scm b/guix/lint.scm
index 1bebfe03d3..a2d6418b85 100644
--- a/guix/lint.scm
+++ b/guix/lint.scm
@@ -30,6 +30,7 @@
 
 (define-module (guix lint)
   #:use-module (guix store)
+  #:autoload   (guix base16) (bytevector->base16-string)
   #:use-module (guix base32)
   #:use-module (guix diagnostics)
   #:use-module (guix download)
@@ -1227,6 +1228,43 @@ upstream releases")
                             #:field 'source)))))))
 
 
+(define (lookup-disarchive-spec hash)
+  "If Disarchive mirrors have a spec for HASH, return the list of SWH
+directory identifiers the spec refers to.  Otherwise return #f."
+  (define (extract-swh-id spec)
+    ;; Return the list of SWH directory identifiers SPEC refers to, where SPEC
+    ;; is a Disarchive sexp.  Instead of attempting to parse it, traverse it
+    ;; in a pretty unintelligent fashion.
+    (let loop ((sexp spec)
+               (ids '()))
+      (match sexp
+        ((? string? str)
+         (let ((prefix "swh:1:dir:"))
+           (if (string-prefix? prefix str)
+               (cons (string-drop str (string-length prefix)) ids)
+               ids)))
+        ((head tail ...)
+         (loop tail (loop head ids)))
+        (_ ids))))
+
+  (any (lambda (mirror)
+         (with-networking-fail-safe
+          (format #f (G_ "failed to access Disarchive database at ~a")
+                  mirror)
+          #f
+          (guard (c ((http-get-error? c) #f))
+            (let* ((url (string-append mirror
+                                       (symbol->string
+                                        (content-hash-algorithm hash))
+                                       "/"
+                                       (bytevector->base16-string
+                                        (content-hash-value hash))))
+                   (port (http-fetch (string->uri url) #:text? #t))
+                   (spec (read port)))
+              (close-port port)
+              (extract-swh-id spec)))))
+       %disarchive-mirrors))
+
 (define (check-archival package)
   "Check whether PACKAGE's source code is archived on Software Heritage.  If
 it's not, and if its source code is a VCS snapshot, then send a \"save\"
@@ -1302,10 +1340,26 @@ try again later")
                                         (symbol->string
                                          (content-hash-algorithm hash)))
                    (#f
-                    (list (make-warning package
-                                        (G_ "source not archived on Software \
-Heritage")
-                                        #:field 'source)))
+                    ;; If SWH doesn't have HASH as is, it may be because it's
+                    ;; a hand-crafted tarball.  In that case, check whether
+                    ;; the Disarchive database has an entry for that tarball.
+                    (match (lookup-disarchive-spec hash)
+                      (#f
+                       (list (make-warning package
+                                           (G_ "source not archived on Software \
+Heritage and missing from the Disarchive database")
+                                           #:field 'source)))
+                      (directory-ids
+                       (match (find (lambda (id)
+                                      (not (lookup-directory id)))
+                                    directory-ids)
+                         (#f '())
+                         (id
+                          (list (make-warning package
+                                              (G_ "
+Disarchive entry refers to non-existent SWH directory '~a'")
+                                              (list id)
+                                              #:field 'source)))))))
                    ((? content?)
                     '())))
                '()))))
diff --git a/tests/lint.scm b/tests/lint.scm
index a2c8665142..d54fafc1d2 100644
--- a/tests/lint.scm
+++ b/tests/lint.scm
@@ -1,7 +1,7 @@
 ;;; GNU Guix --- Functional package management for GNU
 ;;; Copyright © 2012, 2013 Cyril Roelandt <tipecaml <at> gmail.com>
 ;;; Copyright © 2014, 2015, 2016 Eric Bavier <bavier <at> member.fsf.org>
-;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020 Ludovic Courtès <ludo <at> gnu.org>
+;;; Copyright © 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021 Ludovic Courtès <ludo <at> gnu.org>
 ;;; Copyright © 2015, 2016 Mathieu Lirzin <mthl <at> gnu.org>
 ;;; Copyright © 2016 Hartmut Goebel <h.goebel <at> crazy-compilers.com>
 ;;; Copyright © 2017 Alex Kost <alezost <at> gmail.com>
@@ -1008,10 +1008,13 @@
                      (method url-fetch)
                      (uri "http://example.org/foo.tgz")
                      (sha256 (make-bytevector 32))))
-         (warnings (with-http-server '((404 "Not archived."))
+         (warnings (with-http-server '((404 "Not archived.")
+                                       (404 "Not in Disarchive database."))
                      (parameterize ((%swh-base-url (%local-url)))
-                       (check-archival (dummy-package "x"
-                                                      (source origin)))))))
+                       (mock ((guix download) %disarchive-mirrors
+                              (list (%local-url)))
+                             (check-archival (dummy-package "x"
+                                                            (source origin))))))))
     (warning-contains? "not archived" warnings)))
 
 (test-equal "archival: content available"
@@ -1027,6 +1030,29 @@
       (parameterize ((%swh-base-url (%local-url)))
         (check-archival (dummy-package "x" (source origin)))))))
 
+(test-equal "archival: content unavailable but disarchive available"
+  '()
+  (let* ((origin   (origin
+                     (method url-fetch)
+                     (uri "http://example.org/foo.tgz")
+                     (sha256 (make-bytevector 32))))
+         (disarchive (object->string
+                      '(disarchive (version 0)
+                                   ...
+                                   "swh:1:dir:aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")))
+         ;; https://archive.softwareheritage.org/api/1/directory/
+         (directory "[ { \"checksums\": {},
+                         \"dir_id\": \"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa\",
+                         \"type\": \"file\",
+                         \"name\": \"README\"
+                         \"length\": 42 } ]"))
+    (with-http-server `((404 "")                  ;lookup-content
+                        (200 ,disarchive)         ;Disarchive database lookup
+                        (200 ,directory))         ;lookup-directory
+      (mock ((guix download) %disarchive-mirrors (list (%local-url)))
+            (parameterize ((%swh-base-url (%local-url)))
+              (check-archival (dummy-package "x" (source origin))))))))
+
 (test-assert "archival: missing revision"
   (let* ((origin   (origin
                      (method git-fetch)
-- 
2.31.1





Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Sat, 22 May 2021 21:53:02 GMT) Full text and rfc822 format available.

Notification sent to Ludovic Courtès <ludo <at> gnu.org>:
bug acknowledged by developer. (Sat, 22 May 2021 21:53:02 GMT) Full text and rfc822 format available.

Message #19 received at 48437-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 48437-done <at> debbugs.gnu.org
Cc: Timothy Sample <samplet <at> ngyro.com>
Subject: Re: bug#48437: [PATCH] lint: archival: Lookup content in Disarchive
 database.
Date: Sat, 22 May 2021 23:52:31 +0200
Hi,

Ludovic Courtès <ludo <at> gnu.org> skribis:

> * guix/lint.scm (lookup-disarchive-spec): New procedure.
> (check-archival): When 'lookup-content' returns #f, call
> 'lookup-disarchive-spec'.  Call 'lookup-directory' on the result of
> 'lookup-directory'.
> * guix/download.scm (%disarchive-mirrors): Make public.
> * tests/lint.scm ("archival: missing content"): Set
> '%disarchive-mirrors'.
> ("archival: content unavailable but disarchive available"): New test.

Following our discussion on IRC, I pushed this variant as
bc4d81d267830a3b1ccb63198f4100cc836e4e4e.

Thanks for taking a look!

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 20 Jun 2021 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 3 years and 362 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.