Package: guix;
Reported by: Marius Bakke <marius <at> gnu.org>
Date: Sun, 31 May 2020 09:52:01 UTC
Severity: normal
Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Bug is archived. No further changes may be made.
Message #29 received at 41625 <at> debbugs.gnu.org (full text, mbox):
From: Ludovic Courtès <ludo <at> gnu.org> To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com> Cc: 41625 <at> debbugs.gnu.org, marius <at> gnu.org Subject: Re: [PATCH v2] offload: Handle a possible EOF response from read-repl-response. Date: Wed, 26 May 2021 11:14:32 +0200
Hi, Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis: >>> + (info (G_ "Testing ~a build machines defined in '~a'...~%") >>> (length machines) machine-file) >>> - (let* ((names (map build-machine-name machines)) >>> - (sockets (map build-machine-daemon-socket machines)) >>> - (sessions (map (cut open-ssh-session <> %short-timeout) machines)) >>> - (nodes (map remote-inferior sessions))) >>> - (for-each assert-node-has-guix nodes names) >>> - (for-each assert-node-repl nodes names) >>> - (for-each assert-node-can-import sessions nodes names sockets) >>> - (for-each assert-node-can-export sessions nodes names sockets) >>> - (for-each close-inferior nodes) >>> - (for-each disconnect! sessions)))) >>> + (par-for-each check-machine-availability machines))) >> >> Why not! IMO this should go in a separate patch, though, since it’s not >> related. > > For me, it is related in that retrying all the checks of *every* build > offload machine would be too expensive; it already takes 32 s for my 4 > offload machines; retrying this for up to 3 times would mean waiting for > a minute and half, which I don't find reasonable (imagine on berlin!). I see. So I’d say it’s a prerequisite (a patch that must come before) but not entirely the same thing. I’m nitpicking! We should make sure it doesn’t trigger thread-safety issues in libssh or anything like that (running it repeatedly on a large machines.scm should give us some confidence). >>> +(define (check-machine-availability machine) >>> + "Check whether MACHINE is available. Exit with an error upon failure." >>> + ;; Sometimes, the machine remote port may return EOF, presumably because the >>> + ;; connection was lost. Retry up to 3 times. >>> + (let loop ((retries 3)) >>> + (guard (c ((inferior-connection-lost? c) >>> + (let ((retries-left (1- retries))) >>> + (if (> retries-left 0) >>> + (begin >>> + (format (current-error-port) >>> + (G_ "connection to machine ~s lost; retrying~%") >>> + (build-machine-name machine)) >>> + (loop (retries-left))) >>> + (leave (G_ "connection repeatedly lost with machine '~a'~%") >>> + (build-machine-name machine)))))) >> >> I’m afraid we’re papering over problems here. > > I had that thought too, but then also realized that even if this was > papering over a problem, it'd be a good one to paper over as this > problem can legitimately happen in practice, due to the network's > inherently shaky nature. It seems better to be ready for it. Also, my > hopes in being able to troubleshoot such a difficult to reproduce > networking issue are rather low. Yes, but note that this is just for ‘guix offload test’. The actual code run while offloading will still fail badly. >> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on >> berlin enough to reproduce the issue? If so, we could monitor/strace >> sshd on overdrive1 to get a better understanding of what’s going on. > > It's actually difficult to trigger it; it seems to happen mostly on the > first try after a long time without connecting to the machine; on the > 2nd and later tries, everything is smooth. Waiting a few minutes is not > enough to re-trigger the problem. > > I've managed to see the problem a few lucky times with: > > while true; do guix offload test /etc/guix/machines.scm overdrive1; done > > I don't have a password set for my user on overdrive1, so can't attach > strace to sshd, but yeah, we could try to capture it and see if we can > understand what's going on. OK. > From c52172502749a4d194dc51db9d2c394cb15e8d07 Mon Sep 17 00:00:00 2001 > From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com> > Date: Tue, 25 May 2021 08:42:06 -0400 > Subject: [PATCH] offload: Handle a possible EOF response from > read-repl-response. > > Fixes <https://issues.guix.gnu.org/41625>. > > * guix/scripts/offload.scm (check-machine-availability): Refactor so that it > takes a single machine object, to allow for retrying a single machine. Handle > the case where the checks raised an exception due to the connection to the > build machine having been lost, and retry up to 3 times. Ensure the cleanup > code is run in all situations. > (check-machines-availability): New procedure. Call > CHECK-MACHINES-AVAILABILITY in parallel, which improves performance (about > twice as fast with 4 build machines, from ~30 s to ~15 s). > * guix/inferior.scm (&inferior-connection-lost): New condition type. > (read-repl-response): Raise a condition of the above type when reading EOF > from the build machine's port. [...] > +(define-condition-type &inferior-connection-lost &error > + inferior-connection-lost?) Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That would allow the handler to present more information as to which inferior is failing. Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’. > + (format (current-error-port) > + (G_ "connection to machine '~a' lost; retrying~%") > + (build-machine-name machine)) You can use ‘info’ instead of ‘format’. Otherwise LGTM, thanks! Ludo’.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.