GNU bug report logs - #41625
Sporadic guix-offload crashes due to EOF errors

Package: guix;

Reported by: Marius Bakke <marius <at> gnu.org>

Date: Sun, 31 May 2020 09:52:01 UTC

Severity: normal

Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Marius Bakke <marius <at> gnu.org>
Subject: bug#41625: closed (Re: bug#41625: Sporadic guix-offload crashes
 due to EOF errors)
Date: Sat, 26 Mar 2022 05:04:01 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#41625: Sporadic guix-offload crashes due to EOF errors

which was filed against the guix package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 41625 <at> debbugs.gnu.org.

-- 
41625: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=41625
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
To: Marius Bakke <marius <at> gnu.org>
Cc: Ludovic Courtès <ludo <at> gnu.org>, 41625-done <at> debbugs.gnu.org
Subject: Re: bug#41625: Sporadic guix-offload crashes due to EOF errors
Date: Sat, 26 Mar 2022 01:03:37 -0400
Hello,

Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:

> Hi Marius,
>
> Marius Bakke <marius <at> gnu.org> writes:
>
>> Maxim Cournoyer <maxim.cournoyer <at> gmail.com> writes:
>>
>>>> Is running ‘guix offload test /etc/guix/machines.scm overdrive1’ on
>>>> berlin enough to reproduce the issue?  If so, we could monitor/strace
>>>> sshd on overdrive1 to get a better understanding of what’s going on.
>>>
>>> It's actually difficult to trigger it; it seems to happen mostly on the
>>> first try after a long time without connecting to the machine; on the
>>> 2nd and later tries, everything is smooth.  Waiting a few minutes is not
>>> enough to re-trigger the problem.
>>>
>>> I've managed to see the problem a few lucky times with:
>>>
>>> --8<---------------cut here---------------start------------->8---
>>> while true; do guix offload test /etc/guix/machines.scm overdrive1; done
>>> --8<---------------cut here---------------end--------------->8---
>>
>> I used to be able to reproduce it by inducing a high load on the target
>> machine and just let Guix keep trying to connect.  But now I did that,
>> and set overload threshold to 0.0 for good measure, and Guix has been
>> waiting patiently for two hours without failure.
>>
>> So AFAICT this bug has been fixed.  Perhaps Berlin or the Overdrive
>> simply needs to be updated?
>
> Ah!  Do you have root access to overdrive1?  It'd be interesting to
> reconfigure it to update the guix-daemon and see if the problem
> vanishes.

Good news: this seems to be resolved with the newer Guile-SSH 0.15.1,
where long delays in returning output no longer trigger an EOF response
(the client now keeps waiting instead).  I believe it was fixed by this
commit [0].

Many thanks to Artyom Poptsov for fixing it!

Closing.

Maxim

[0]  https://github.com/artyom-poptsov/guile-ssh/commit/fefaab9e925d015b01abc7c76ea4017c373ad895
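
The reproduction loop quoted above runs forever and keeps going past a
failure.  A minimal sketch of a variant that stops at the first failing
attempt and prints a UTC timestamp, so the failure time can be compared
with the sshd logs on the build machine.  The function name
`run_until_failure` and the attempt limit are illustrative assumptions,
not part of Guix:

```shell
#!/bin/sh
# Run a command repeatedly and stop at the first failure.
# Usage: run_until_failure MAX_ATTEMPTS COMMAND [ARGS...]
# Returns 0 if every attempt succeeded, 1 at the first failing attempt.
# (Hypothetical helper for illustration; not shipped with Guix.)
run_until_failure() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        if ! "$@"; then
            # Report which attempt failed and when, on stderr.
            echo "attempt $attempt failed at $(date -u +%Y-%m-%dT%H:%M:%SZ)" >&2
            return 1
        fi
        attempt=$((attempt + 1))
    done
    echo "no failure in $max attempts" >&2
    return 0
}
```

With the command from this thread, it could be invoked as
`run_until_failure 100 guix offload test /etc/guix/machines.scm overdrive1`.
Since the failure was mostly seen on the first connection after a long
idle period, a tight loop like this may still need many runs to hit it.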

[Message part 3 (message/rfc822, inline)]
From: Marius Bakke <marius <at> gnu.org>
To: bug-guix <at> gnu.org
Subject: Sporadic guix-offload crashes due to EOF errors
Date: Sun, 31 May 2020 11:51:54 +0200
[Message part 4 (text/plain, inline)]
Hello,

During 'guix build -s aarch64-linux dolphin' on Berlin, I got this crash:

--8<---------------cut here---------------start------------->8---
building /gnu/store/87655bh9rqcr29qasl1c4yj3skmxkyiz-kfilemetadata-5.70.0.drv...
process 12989 acquired build slot '/var/guix/offload/overdrive1.guixsd.org:52522/1'
process 12989 acquired build slot '/var/guix/offload/dover.guix.info:9023/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.167:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.163:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2223/1'
process 12989 acquired build slot '/var/guix/offload/141.80.167.168:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.173:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.176:22/0'
process 12989 acquired build slot '/var/guix/offload/localhost:2222/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.165:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.169:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.181:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.170:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.174:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.180:22/0'
process 12989 acquired build slot '/var/guix/offload/141.80.167.161:22/0'
Backtrace:
In ice-9/boot-9.scm:
  1736:10  5 (with-exception-handler _ _ #:unwind? _ # _)
In unknown file:
           4 (apply-smob/0 #<thunk 7f3344d296c0>)
In ice-9/boot-9.scm:
    718:2  3 (call-with-prompt _ _ #<procedure default-prompt-handle…>)
In ice-9/eval.scm:
    619:8  2 (_ #(#(#<directory (guile-user) 7f3344933f00>)))
In guix/ui.scm:
  1936:12  1 (run-guix-command _ . _)
In guix/scripts/offload.scm:
   742:22  0 (guix-offload . _)

guix/scripts/offload.scm:742:22: In procedure guix-offload:
Throw to key `match-error' with args `("match" "no matching pattern" #<eof>)'.
guix build: error: unexpected EOF reading a line
--8<---------------cut here---------------end--------------->8---

This is strange, because guix/scripts/offload.scm:742 is wrapped in a
(unless (eof-object? ...)) block.

When this happens, the build command terminates, along with any other
builds that it had started concurrently.  Builds from other clients
were unaffected, of course.

I have also seen this occur on my personal offloading setup once in a
blue moon, but I don't know what could have caused it.

This bug report was last modified 3 years and 110 days ago.
