GNU bug report logs -
#41625
Sporadic guix-offload crashes due to EOF errors
Reported by: Marius Bakke <marius <at> gnu.org>
Date: Sun, 31 May 2020 09:52:01 UTC
Severity: normal
Done: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
Bug is archived. No further changes may be made.
Message #50 received at 41625 <at> debbugs.gnu.org (full text, mbox):
Hi,
Maxim Cournoyer <maxim.cournoyer <at> gmail.com> skribis:
> Now that I have root access to overdrive1, I could strace the sshd
> process (I just did 'strace -p340', noting the process of sshd displayed
> with 'herd status sshd'):
>
> pselect6(87, [3 4], NULL, NULL, NULL, NULL) = 1 (in [3])
> accept(3, {sa_family=AF_INET, sin_port=htons(33262), sin_addr=inet_addr("66.158.152.121")}, [128->16]) = 5
> fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
> pipe2([6, 7], 0) = 0
> socketpair(AF_UNIX, SOCK_STREAM, 0, [8, 9]) = 0
> clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xffff8e0ef0e0) = 644
> close(7) = 0
> close(9) = 0
> write(8, "\0\0\1\245\0", 5) = 5
> write(8, "\0\0\1\234\nPort 22\nPermitRootLogin no\n"..., 420) = 420
> close(8) = 0
> close(5) = 0
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> getpid() = 340
> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
> read(6, "\0", 1) = 1
> pselect6(87, [3 4 6], NULL, NULL, NULL, NULL) = 1 (in [6])
> read(6, "", 1) = 0
OK, so it looks as if the client disconnected right away. Hard to tell
exactly why that happened. :-/ Perhaps turning on libssh debugging on
the client side could help (by uncommenting “#:log-verbosity 'protocol”
in (guix ssh)).
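
The last two read calls in the trace differ only in their return value: the first returns one byte, the second returns 0, which is how read(2) reports end-of-file once the write end of a pipe has been closed. Below is a minimal Python sketch of that pattern, using a local pipe as a stand-in. It only illustrates the EOF convention and does not reproduce what sshd does with its descriptors.

```python
import os

# Create a pipe; r is the read end, w is the write end.
r, w = os.pipe()

os.write(w, b"\0")  # the writer sends a single byte...
os.close(w)         # ...and then closes its end of the pipe

first = os.read(r, 1)   # one byte is available, so one byte is returned
second = os.read(r, 1)  # nothing left and no writer: empty result means EOF
os.close(r)

print(repr(first), repr(second))
```

An empty result from a blocking read is therefore not an error by itself; it only says the peer is gone, which is why the trace alone cannot tell why the other side went away.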
> From c7b2ec1c58adf8c795df0a6aaf075dbc331f41e8 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
> Date: Thu, 27 May 2021 08:44:44 -0400
> Subject: [PATCH 1/2] offload: Parallelize machine check in offload test.
>
> * guix/scripts/offload.scm (check-machine-availability): Refactor so that it
> takes a single machine object. Ensure the cleanup code is always run.
> (check-machines-availability): New procedure. Call
> CHECK-MACHINE-AVAILABILITY in parallel, which improves performance (about
> twice as fast with 4 build machines, from ~30 s to ~15 s).
I remain wary of this change: it could lead to subtle non-deterministic
bugs (of the kind that keeps you busy for weeks), I personally don’t
mind if ‘guix offload test’ takes 30s on berlin, and the intermingled
output may make diagnostics less clear.
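
To illustrate the trade-off, here is a small Python sketch of checking machines concurrently while keeping the report in a fixed order: each check returns its messages instead of printing them, and they are printed only after all checks have finished. This is a sketch of the general idea, not what the patch does; `check_machine`, `check_machines`, and the machine names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def check_machine(name):
    """Hypothetical stand-in for one machine check; returns its log lines."""
    time.sleep(0.05)  # simulate network latency
    return [f"{name}: connecting", f"{name}: OK"]

def check_machines(names):
    """Run all checks concurrently and return the reports in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        # map() yields results in the order of `names`, regardless of
        # which check happens to finish first.
        return list(pool.map(check_machine, names))

machines = ["machine-a", "machine-b", "machine-c", "machine-d"]
reports = check_machines(machines)
for lines in reports:
    print("\n".join(lines))
```

Buffering the output this way avoids interleaved messages, but it does nothing about the other concern, namely non-deterministic bugs in the checks themselves.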
> From b5558777617e4674a150895458d57d202de56120 Mon Sep 17 00:00:00 2001
> From: Maxim Cournoyer <maxim.cournoyer <at> gmail.com>
> Date: Tue, 25 May 2021 08:42:06 -0400
> Subject: [PATCH 2/2] offload: Handle a possible EOF response from
> read-repl-response.
>
> Partially fixes <https://issues.guix.gnu.org/41625>.
>
> * guix/scripts/offload.scm (check-machine-availability): Handle the case where
> the checks raised an exception due to receiving EOF prematurely, and retry up
> to 3 times.
> * guix/inferior.scm (&inferior-premature-eof): New condition type.
> (read-repl-response): Raise a condition of the above type when reading EOF
> from the build machine's port.
LGTM!
Thanks,
Ludo’.
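
The approach described in the second patch (turn a premature EOF into a dedicated condition, then retry the check a bounded number of times) can be sketched as follows. This is a Python illustration only, not the Guile code from the patch: `PrematureEOF`, `read_response`, and `check_with_retry` are hypothetical names, and reading “up to 3 times” as three attempts in total is an assumption.

```python
import io

class PrematureEOF(Exception):
    """Hypothetical analogue of the &inferior-premature-eof condition."""

def read_response(port):
    """Read one line-based response; raise if the port is already at EOF."""
    line = port.readline()
    if line == "":
        raise PrematureEOF("EOF while reading response")
    return line.rstrip("\n")

def check_with_retry(open_port, max_attempts=3):
    """Read a response from a fresh port, retrying on premature EOF."""
    for attempt in range(1, max_attempts + 1):
        try:
            return read_response(open_port())
        except PrematureEOF:
            if attempt == max_attempts:
                raise  # give up after the last attempt

# Simulated connections: the first two hit EOF, the third succeeds.
ports = iter([io.StringIO(""), io.StringIO(""), io.StringIO("ok\n")])
result = check_with_retry(lambda: next(ports))
print(result)
```

Using a distinct exception type keeps the retry narrow: only the premature-EOF case is retried, while any other failure still propagates immediately.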
This bug report was last modified 3 years and 54 days ago.