GNU bug report logs -
#41625
Sporadic guix-offload crashes due to EOF errors
Reported by: Marius Bakke <marius <at> gnu.org>
Date: Sun, 31 May 2020 09:52:01 UTC
Severity: normal
Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Bug is archived. No further changes may be made.
Hi Ludovic,
Ludovic Courtès <ludo <at> gnu.org> writes:
[...]
> I see. So I’d say it’s a prerequisite (a patch that must come before)
> but not entirely the same thing. I’m nitpicking!
Eh, it's okay :-). Splitting changes into the right unit is a problem
that is akin to naming things; it's hard! I welcome your suggestion.
> We should make sure it doesn’t trigger thread-safety issues in libssh or
> anything like that (running it repeatedly on a large machines.scm should
> give us some confidence).
It seems fine so far, but I've only tested in a loop with 4 build
machines. When it nears completion I'll give it a shot on berlin.
[...]
> Yes, but note that this is just for ‘guix offload test’. The actual
> code run while offloading will still fail badly.
Ah, thanks for pointing that out; I somehow thought that this machine
status checking code was a prelude to every offloaded build.
[...]
>> I don't have a password set for my user on overdrive1, so can't attach
>> strace to sshd, but yeah, we could try to capture it and see if we can
>> understand what's going on.
>
> OK.
I'd be happy to try strace when you are available. You can ping me on
the chat. It's been more than 8 hours since I tried, so I should be
able to trigger the problem :-).
[...]
> Perhaps worth adding an ‘inferior’ and/or ‘port’ field. That would
> allow the handler to present more information as to which inferior is
> failing.
>
> Maybe ‘premature-eof’ would be more accurate than ‘connection-lost’.
Good suggestions. I'll implement them.
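For illustration only, here is a minimal sketch of what such a condition
and a retry around it could look like, using plain SRFI-34/SRFI-35.
The names '&premature-eof', 'premature-eof-port' and
'call-with-eof-retries' are hypothetical and are not taken from the
actual patch; the real code in Guix may be organized differently.

```scheme
(use-modules (srfi srfi-34)    ; guard, raise
             (srfi srfi-35))   ; define-condition-type, condition

;; Hypothetical condition type carrying the port on which EOF was seen,
;; so that a handler can report which connection failed.
(define-condition-type &premature-eof &error
  premature-eof-error?
  (port premature-eof-port))

;; Hypothetical helper: call THUNK, and call it again when it raises
;; &premature-eof, at most MAX-RETRIES times.  Once the retries are
;; exhausted, or for any other condition, the exception propagates.
(define (call-with-eof-retries max-retries thunk)
  (let loop ((attempt 0))
    (guard (c ((and (premature-eof-error? c)
                    (< attempt max-retries))
               (format (current-error-port)
                       "premature EOF on ~a; retrying~%"
                       (premature-eof-port c))
               (loop (+ attempt 1))))
      (thunk))))

;; Usage: a thunk that fails twice before succeeding.
(define attempts 0)

(define result
  (call-with-eof-retries
   3
   (lambda ()
     (set! attempts (+ attempts 1))
     (if (< attempts 3)
         (raise (condition (&premature-eof (port 'fake-port))))
         'ok))))
```

The 'port' field is what would let the handler say which connection
failed; an 'inferior' field could be added to the condition type in the
same way.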
>> + (format (current-error-port)
>> + (G_ "connection to machine '~a' lost; retrying~%")
>> + (build-machine-name machine))
>
> You can use ‘info’ instead of ‘format’.
That also. Thanks!
On another note, I was able to 'exercise' the fix. The exception is
raised, but instead of the operation being retried, something fails
with the following backtrace:
--8<---------------cut here---------------start------------->8---
guix offload: Testing 1 build machines defined in '/etc/guix/machines.scm'...
connection to machine 'overdrive1.guix.gnu.org' lost; retrying
Backtrace:
In ice-9/boot-9.scm:
1752:10 10 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
In unknown file:
9 (apply-smob/0 #<thunk 7f915c028f60>)
In ice-9/boot-9.scm:
724:2 8 (call-with-prompt _ _ #<procedure default-prompt-handler (k proc)>)
In ice-9/eval.scm:
619:8 7 (_ #(#(#<directory (guile-user) 7f915c022c80>)))
In guix/ui.scm:
2161:12 6 (run-guix-command _ . _)
In ice-9/boot-9.scm:
1752:10 5 (with-exception-handler _ _ #:unwind? _ #:unwind-for-type _)
1747:15 4 (with-exception-handler #<procedure 7f91576bf0c0 at ice-9/boot-9.scm:1831:7 (exn)> _ # _ # …)
In srfi/srfi-1.scm:
634:9 3 (for-each #<procedure check-machine-availability (a)> (#<<build-machine> name: "overdriv…>))
In ice-9/eval.scm:
191:35 2 (_ #(#(#(#<directory (guix scripts offload) 7f9159852780> 3 #<<build-machine> na…> …) …) …))
Exception thrown while printing backtrace:
In procedure frame-local-ref: Argument 2 out of range: 1
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
Wrong type to apply: 2
--8<---------------cut here---------------end--------------->8---
I haven't been able to pinpoint what fails yet. Notice that in the above
code I've replaced par-for-each with plain for-each, suspecting it might
have something to do with it, but it appears unrelated.
Thanks,
Maxim
This bug report was last modified 3 years and 54 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.