GNU bug report logs - #41625
Sporadic guix-offload crashes due to EOF errors

Package: guix;

Reported by: Marius Bakke <marius <at> gnu.org>

Date: Sun, 31 May 2020 09:52:01 UTC

Severity: normal

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.


Message #26 received at 41625 <at> debbugs.gnu.org (full text, mbox):

From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 41625 <at> debbugs.gnu.org, marius <at> gnu.org
Subject: Re: [PATCH v2] offload: Handle a possible EOF response from
 read-repl-response.
Date: Tue, 25 May 2021 23:18:17 -0400
[Message part 1 (text/plain, inline)]
Hi Ludovic,

Ludovic Courtès <ludo <at> gnu.org> writes:

[...]

>>  (define* (read-repl-response port #:optional inferior)
>>    "Read a (guix repl) response from PORT and return it as a Scheme object.
>>  Raise '&inferior-exception' when an exception is read from PORT."
>> @@ -241,6 +246,10 @@ Raise '&inferior-exception' when an exception is read from PORT."
>>    (match (read port)
>>      (('values objects ...)
>>       (apply values (map sexp->object objects)))
>> +    ;; Unexpectedly read EOF from the port.  This can happen for example when
>> +    ;; the underlying connection for PORT was lost with Guile-SSH.
>> +    (? eof-object?
>> +       (raise (condition (&inferior-connection-lost))))
>
> The match clause syntax is incorrect; should be:
>
>  ((? eof-object?)
>   (raise …))

Good catch, fixed.
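
For reference, here is a minimal standalone sketch of the corrected clause
shape, assuming plain (ice-9 match).  The 'classify-response' procedure and
the symbols it returns are made up for illustration and are not part of the
patch; the point is only that each clause is (PATTERN BODY ...), so the
predicate pattern has to be parenthesized on its own:

```scheme
(use-modules (ice-9 match))

;; Hypothetical helper, not part of the patch: classify what 'read' returned.
(define (classify-response obj)
  (match obj
    ;; A normal reply: the symbol 'values' followed by zero or more objects.
    (('values objects ...)
     (list 'values objects))
    ;; The predicate pattern is itself a parenthesized form: (? PREDICATE).
    ((? eof-object?)
     'eof)
    ;; Anything else is unexpected in this sketch.
    (_
     'unexpected)))

;; Reading from an exhausted port yields the EOF object.
(display (classify-response (read (open-input-string ""))))
(newline)
;; Reading a well-formed reply yields a list headed by 'values'.
(display (classify-response (read (open-input-string "(values 1 2)"))))
(newline)
```

In the patch itself, the EOF clause raises '&inferior-connection-lost'
instead of returning a symbol.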

>> +    (info (G_ "Testing ~a build machines defined in '~a'...~%")
>>            (length machines) machine-file)
>> -    (let* ((names    (map build-machine-name machines))
>> -           (sockets  (map build-machine-daemon-socket machines))
>> -           (sessions (map (cut open-ssh-session <> %short-timeout) machines))
>> -           (nodes    (map remote-inferior sessions)))
>> -      (for-each assert-node-has-guix nodes names)
>> -      (for-each assert-node-repl nodes names)
>> -      (for-each assert-node-can-import sessions nodes names sockets)
>> -      (for-each assert-node-can-export sessions nodes names sockets)
>> -      (for-each close-inferior nodes)
>> -      (for-each disconnect! sessions))))
>> +    (par-for-each check-machine-availability machines)))
>
> Why not!  IMO this should go in a separate patch, though, since it’s not
> related.

For me, it is related in that retrying all the checks of *every* build
offload machine would be too expensive; it already takes 32 s for my 4
offload machines, so retrying this up to 3 times would mean waiting for
about a minute and a half, which I don't find reasonable (imagine on
berlin!).

>> +(define (check-machine-availability machine)
>> +  "Check whether MACHINE is available.  Exit with an error upon failure."
>> +  ;; Sometimes, the machine remote port may return EOF, presumably because the
>> +  ;; connection was lost.  Retry up to 3 times.
>> +  (let loop ((retries 3))
>> +    (guard (c ((inferior-connection-lost? c)
>> +               (let ((retries-left (1- retries)))
>> +                 (if (> retries-left 0)
>> +                     (begin
>> +                       (format (current-error-port)
>> +                               (G_ "connection to machine ~s lost; retrying~%")
>> +                               (build-machine-name machine))
>> +                       (loop retries-left))
>> +                     (leave (G_ "connection repeatedly lost with machine '~a'~%")
>> +                            (build-machine-name machine))))))
>
> I’m afraid we’re papering over problems here.

I had that thought too, but then also realized that even if this were
papering over a problem, it'd be a good one to paper over, as this
problem can legitimately happen in practice due to the network's
inherently shaky nature.  It seems better to be ready for it.  Also, my
hopes of being able to troubleshoot such a difficult-to-reproduce
networking issue are rather low.
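
To show the retry idea in isolation, here is a minimal sketch using only
SRFI-34 and SRFI-35.  It is not the code from the patch: the
'&connection-lost' condition type stands in for
'&inferior-connection-lost', and 'call-with-retries' and 'flaky-check'
are hypothetical names used to simulate a connection that drops twice
before succeeding:

```scheme
(use-modules (srfi srfi-34)    ; guard, raise
             (srfi srfi-35))   ; define-condition-type, condition

;; Hypothetical stand-in for '&inferior-connection-lost'.
(define-condition-type &connection-lost &error
  connection-lost?)

;; Hypothetical helper: call THUNK up to ATTEMPTS times, retrying only
;; when it raises '&connection-lost'.  Once no attempts are left, the
;; condition is not handled here and propagates to the caller.
(define (call-with-retries thunk attempts)
  (let loop ((attempts-left attempts))
    (guard (c ((and (connection-lost? c) (> attempts-left 1))
               (format (current-error-port) "connection lost; retrying~%")
               (loop (- attempts-left 1))))
      (thunk))))

;; Simulated check that loses the connection on its first two calls.
(define calls 0)
(define (flaky-check)
  (set! calls (+ calls 1))
  (if (< calls 3)
      (raise (condition (&connection-lost)))
      'available))

(display (call-with-retries flaky-check 3))
(newline)
```

Only the connection-lost condition triggers a retry here; any other
error raised by the thunk goes straight to the caller, so genuine
failures are not hidden by the loop.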

> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
> berlin enough to reproduce the issue?  If so, we could monitor/strace
> sshd on overdrive1 to get a better understanding of what’s going on.

It's actually difficult to trigger it; it seems to happen mostly on the
first try after a long time without connecting to the machine; on the
2nd and later tries, everything is smooth.  Waiting a few minutes is not
enough to re-trigger the problem.

I've managed to see the problem a few lucky times with:

--8<---------------cut here---------------start------------->8---
while true; do guix offload test /etc/guix/machines.scm overdrive1; done
--8<---------------cut here---------------end--------------->8---

I don't have a password set for my user on overdrive1, so I can't attach
strace to sshd, but yeah, we could try to capture it and see if we can
understand what's going on.

Attached is v2 of the patch, with the match clause fixed.

[0001-offload-Handle-a-possible-EOF-response-from-read-rep.patch (text/x-patch, attachment)]
[Message part 3 (text/plain, inline)]
Thanks!

Maxim

This bug report was last modified 3 years and 54 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.