GNU bug report logs - #41625
Sporadic guix-offload crashes due to EOF errors

Package: guix;

Reported by: Marius Bakke <marius <at> gnu.org>

Date: Sun, 31 May 2020 09:52:01 UTC

Severity: normal

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#41625: closed (Sporadic guix-offload crashes due to EOF errors)
Date: Sat, 26 Mar 2022 05:04:01 +0000
[Message part 1 (text/plain, inline)]
Your message dated Sat, 26 Mar 2022 01:03:37 -0400
with message-id <87v8w1wbiu.fsf_-_ <at> gmail.com>
and subject line Re: bug#41625: Sporadic guix-offload crashes due to EOF errors
has caused the debbugs.gnu.org bug report #41625,
regarding Sporadic guix-offload crashes due to EOF errors
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
41625: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=41625
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Marius Bakke <marius <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: Sporadic guix-offload crashes due to EOF errors
Date: Sun, 31 May 2020 11:51:54 +0200
[Message part 3 (text/plain, inline)]
Hello,

During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:

--8<---------------cut here---------------start------------->8---
building /gnu/store/87655bh9rqcr29qasl1c4yj3skmxkyiz-kfilemetadata-5.70.0.drv...
process 12989 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 12989 acquired build slot '/var/guix/offload/dover.guix.info:9023/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.167:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2223/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2222/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.165:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
           4 (apply-smob/0 #<thunk 7f3344d296c0>)
In ice-9/boot-9.scm:
    718:2  3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
    619:8  2 (_ #(#(#<directory (guile-user) 7f3344933f00>)))
In guix/ui.scm:
  1936:12  1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
   742:22  0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line
--8<---------------cut here---------------end--------------->8---

This is strange, because guix/scripts/offload.scm:742 is wrapped in a
(unless (eof-object? ...)) block.

When this happens, the build command terminates, along with any other
builds that it had started concurrently.  Builds from other clients
are unaffected, of course.

I have also seen this occur on my personal offloading setup once in a
blue moon, but I don't know what could have caused it.
[Message part 5 (message/rfc822, inline)]
From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Marius Bakke <marius <at> gnu.org>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 41625-done <at> debbugs.gnu.org
Subject: Re: bug#41625: Sporadic guix-offload crashes due to EOF errors
Date: Sat, 26 Mar 2022 01:03:37 -0400
Hello,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hi Marius,
>
> Marius Bakke <marius <at> gnu.org> writes:
>
>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>>
>>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>>>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>>>> sshd on overdrive1 to get a better understanding of what’s going on.
>>>
>>> It's actually difficult to trigger it; it seems to happen mostly on the
>>> first try after a long time without connecting to the machine; on the
>>> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
>>> enough to re-trigger the problem.
>>>
>>> I've managed to see the problem a few lucky times with:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>>> --8<---------------cut here---------------end--------------->8---
>>
>> I used to be able to reproduce it by inducing a high load on the target
>> machine and letting Guix keep trying to connect.  But now I did that,
>> and set the overload threshold to 0.0 for good measure, and Guix has been
>> waiting patiently for two hours without failure.
>>
>> So AFAICT this bug has been fixed.  Perhaps Berlin or the Overdrive
>> simply needs to be updated?
>
> Ah!  Do you have root access to overdrive1?  It'd be interesting to
> reconfigure it to update the guix-daemon and see if the problem
> vanishes.

Good news: this seems to be resolved with the newer Guile-SSH 0.15.1, where
a long delay before some output is returned no longer triggers an EOF
response (the client now keeps waiting instead).  I believe it was fixed
by this commit [0].

Many thanks to Artyom Poptsov for fixing it!

Closing.

Maxim

[0]  https://github.com/artyom-poptsov/guile-ssh/commit/fefaab9e925d015b01abc7c76ea4017c373ad895
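
A note on reproduction: the 'while true' loop quoted above runs forever
and does not keep the output of the failing run.  Below is a minimal
sketch of a bounded variant that stops at the first failure and saves
that run's output.  The function name run_until_failure and its
arguments are made up for illustration, and the sketch assumes that a
failing run exits with a non-zero status, which this thread does not
confirm for every failure mode.

```shell
# Hypothetical helper, not part of Guix.
# Usage: run_until_failure MAX_ATTEMPTS LOG_FILE COMMAND [ARGS...]
# Runs COMMAND until it exits non-zero or MAX_ATTEMPTS runs have passed.
# Returns 0 if a failure was observed, 1 otherwise.
run_until_failure() {
    max_attempts=$1
    log_file=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Overwrite the log each time so it holds only the latest run.
        if ! "$@" >"$log_file" 2>&1; then
            echo "attempt $attempt failed; output saved to $log_file"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max_attempts attempts"
    return 1
}
```

For example, with an arbitrary attempt count and log path:

  run_until_failure 100 /tmp/offload-test.log \
    guix offload test /etc/guix/machines.scm overdrive1

Since the failure was reported mostly on the first connection after a
long idle period, a tight loop like this may rarely hit it.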


This bug report was last modified 3 years and 53 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.