GNU bug report logs - #64297
[Cuirass] Remote server not picking up job, losing workers

Previous Next

Package: guix;

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Mon, 26 Jun 2023 08:55:02 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 64297 in the body.
You can then email your comments to 64297 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#64297; Package guix. (Mon, 26 Jun 2023 08:55:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Mon, 26 Jun 2023 08:55:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: bug-guix <at> gnu.org
Subject: [Cuirass] Remote server not picking up job, losing workers
Date: Mon, 26 Jun 2023 10:54:12 +0200
As of cuirass <at> 1.1.0-16.b825967, ‘cuirass remote-server’ appears to not
pick jobs as quickly as it should and to lose sight of workers (you can
see them come and go on <https://ci.guix.gnu.org/workers>).

/var/log/cuirass-remote-worker.log shows that it does build things, but
only sporadically.  Then there are things like:

  2023-06-26 10:07:58 warning: Poll loop busy during 3404 seconds.

This is presumably related to Cuirass commit
c4743b54720e86b0e0b0295fb6d33977e4293644 (previously ‘remote-worker’
would have a database worker thread; now it opens a new connection every
time—a stopgap before it’s fiberized, but apparently not a good one).

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#64297; Package guix. (Fri, 30 Jun 2023 15:46:02 GMT) Full text and rfc822 format available.

Message #8 received at 64297 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 64297 <at> debbugs.gnu.org
Subject: Re: bug#64297: [Cuirass] Remote server not picking up job, losing
 workers
Date: Fri, 30 Jun 2023 17:45:48 +0200
Ludovic Courtès <ludovic.courtes <at> inria.fr> skribis:

> As of cuirass <at> 1.1.0-16.b825967, ‘cuirass remote-server’ appears to not
> pick jobs as quickly as it should and to lose sight of workers (you can
> see them come and go on <https://ci.guix.gnu.org/workers>).
>
> /var/log/cuirass-remote-worker.log shows that it does build things, but
> only sporadically.  Then there are things like:
>
>   2023-06-26 10:07:58 warning: Poll loop busy during 3404 seconds.
>
> This is presumably related to Cuirass commit
> c4743b54720e86b0e0b0295fb6d33977e4293644 (previously ‘remote-worker’
> would have a database worker thread; now it opens a new connection every
> time—a stopgap before it’s fiberized, but apparently not a good one).

Apparently this has to do with squee calling ‘current-read-waiter’
(i.e., poll(2)) while waiting for its response and passing checking the
wrong FD for some reason, as in this case:

--8<---------------cut here---------------start------------->8---
18484 15:01:00 connect(55, {sa_family=AF_UNIX, sun_path="/tmp/ephemeralpg.58xnKh/.s.PGSQL.5432"}, 110) = 0 <0.000019>
18484 15:01:00 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000014>
18484 15:01:00 getsockname(55, {sa_family=AF_UNIX}, [128 => 2]) = 0 <0.000010>
18484 15:01:00 poll([{fd=55, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=55, revents=POLLOUT}]) <0.000011>
18484 15:01:00 sendto(55, "\0\0\0!\0\3\0\0user\0ludo\0database\0test\0\0", 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000014>
18484 15:01:00 poll([{fd=55, events=POLLIN|POLLERR}], 1, -1 <unfinished ...>
18484 15:01:00 <... poll resumed>)      = 1 ([{fd=55, revents=POLLIN}]) <0.001786>
18484 15:01:00 recvfrom(55, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0\0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MD"..., 16384, 0, NULL, NULL) = 376 <0.000016>
18484 15:01:00 sendto(55, "P\0\0\0\366\0\nUPDATE Builds SET status = -2, worker = null FROM\n(SELECT id FROM Workers"..., 291, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
18484 15:01:00 <... sendto resumed>)    = 291 <0.000589>
18484 15:01:00 poll([{fd=54, events=POLLIN}], 1, -1 <unfinished ...>
18484 15:03:14 <... poll resumed>)      = 1 ([{fd=54, revents=POLLIN}]) <134.319198>
18484 15:03:14 recvfrom(55,  <unfinished ...>
18484 15:03:14 <... recvfrom resumed>"1\0\0\0\0042\0\0\0\4n\0\0\0\4C\0\0\0\rUPDATE 0\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 35 <0.000026>
18484 15:03:14 sendto(55, "P\0\0\0V\0DELETE FROM Workers WHERE\n(extract(epoch from now())::int - last_seen) >  "..., 131, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
18484 15:03:14 <... sendto resumed>)    = 131 <0.000084>
18484 15:03:14 poll([{fd=54, events=POLLIN}], 1, -1) = 1 ([{fd=54, revents=POLLNVAL}]) <0.000013>
18484 15:03:14 recvfrom(55, "1\0\0\0\0042\0\0\0\4n\0\0\0\4C\0\0\0\rDELETE 4\0Z\0\0\0\5I", 16384, 0, NULL, NULL) = 35 <0.000016>
18484 15:03:14 sendto(55, "X\0\0\0\4", 5, MSG_NOSIGNAL, NULL, 0) = 5 <0.000025>
18484 15:03:14 close(55)                = 0 <0.000018>
18484 15:03:14 openat(AT_FDCWD, "/etc/localtime", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) <0.000019>
18484 15:03:14 write(2, "2023-06-30T15:03:14 warning: Poll loop busy during 134 seconds.\n", 64) = 64 <0.000024>
--8<---------------cut here---------------end--------------->8---

In this case FD 54 is a connection with a worker process; terminating
that process led poll(2) to return, thus unblocking the “poll loop”.

The problem is most likely with the connection-to-port caching in
squee’s ‘connection-socket-port’, as can be seen in this other trace
where I added ‘pk’ calls in ‘connection-socket-port’:

--8<---------------cut here---------------start------------->8---
19468 15:37:43 connect(55, {sa_family=AF_UNIX, sun_path="/tmp/ephemeralpg.58xnKh/.s.PGSQL.5432"}, 110) = 0 <0.000018>
19468 15:37:43 getsockopt(55, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 <0.000010>
19468 15:37:43 getsockname(55, {sa_family=AF_UNIX}, [128 => 2]) = 0 <0.000009>
19468 15:37:43 poll([{fd=55, events=POLLOUT|POLLERR}], 1, -1) = 1 ([{fd=55, revents=POLLOUT}]) <0.000009>
19468 15:37:43 sendto(55, "\0\0\0!\0\3\0\0user\0ludo\0database\0test\0\0", 33, MSG_NOSIGNAL, NULL, 0) = 33 <0.000012>
19468 15:37:43 poll([{fd=55, events=POLLIN|POLLERR}], 1, -1) = 1 ([{fd=55, revents=POLLIN}]) <0.002109>
19468 15:37:43 recvfrom(55, "R\0\0\0\10\0\0\0\0S\0\0\0\26application_name\0\0S\0\0\0\31client_encoding\0UTF8\0S\0\0\0\27DateStyle\0ISO, MD"..., 16384, 0, NULL, NULL) = 376 <0.000009>
19468 15:37:43 sendto(55, "P\0\0\0\366\0\nUPDATE Builds SET status = -2, worker = null FROM\n(SELECT id FROM Workers"..., 291, MSG_NOSIGNAL, NULL, 0) = 291 <0.000012>
19468 15:37:43 write(1, "\n", 1)        = 1 <0.000015>
19468 15:37:43 ioctl(54, TCGETS, 0x7ffd8fde6660) = -1 ENOTTY (Inappropriate ioctl for device) <0.000012>
19468 15:37:43 write(1, ";;; (cached #<pg-conn cdac20 (connected)> #<input-output: file 54>)\n", 68) = 68 <0.000012>
--8<---------------cut here---------------end--------------->8---

Above we open a fresh connection on FD 55, but ‘connection-socket-port’
determines that the connection object is cached and associated with a
port for FD 54.

To be continued…

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#64297; Package guix. (Fri, 30 Jun 2023 22:43:02 GMT) Full text and rfc822 format available.

Message #11 received at 64297 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: 64297 <at> debbugs.gnu.org
Subject: Re: bug#64297: [Cuirass] Remote server not picking up job, losing
 workers
Date: Sat, 01 Jul 2023 00:42:44 +0200
Ludovic Courtès <ludo <at> gnu.org> skribis:

> The problem is most likely with the connection-to-port caching in
> squee’s ‘connection-socket-port’, as can be seen in this other trace
> where I added ‘pk’ calls in ‘connection-socket-port’:

Confirmed, with a fix!

  https://notabug.org/cwebber/guile-squee/pulls/8

Ludo’.




Information forwarded to bug-guix <at> gnu.org:
bug#64297; Package guix. (Sat, 01 Jul 2023 10:31:02 GMT) Full text and rfc822 format available.

Message #14 received at 64297 <at> debbugs.gnu.org (full text, mbox):

From: Christopher Baines <mail <at> cbaines.net>
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: 64297 <at> debbugs.gnu.org
Subject: Re: bug#64297: [Cuirass] Remote server not picking up job, losing
 workers
Date: Sat, 01 Jul 2023 11:28:54 +0100
[Message part 1 (text/plain, inline)]
Ludovic Courtès <ludo <at> gnu.org> writes:

> Ludovic Courtès <ludo <at> gnu.org> skribis:
>
>> The problem is most likely with the connection-to-port caching in
>> squee’s ‘connection-socket-port’, as can be seen in this other trace
>> where I added ‘pk’ calls in ‘connection-socket-port’:
>
> Confirmed, with a fix!
>
>   https://notabug.org/cwebber/guile-squee/pulls/8

I've merged that change, updated guile-squee in Guix, pulled on berlin,
reconfigured and restarted Cuirass now.

It seems to be building some now stuff at least.
[signature.asc (application/pgp-signature, inline)]

Reply sent to Ludovic Courtès <ludo <at> gnu.org>:
You have taken responsibility. (Sat, 01 Jul 2023 16:59:02 GMT) Full text and rfc822 format available.

Notification sent to Ludovic Courtès <ludovic.courtes <at> inria.fr>:
bug acknowledged by developer. (Sat, 01 Jul 2023 16:59:02 GMT) Full text and rfc822 format available.

Message #19 received at 64297-done <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Christopher Baines <mail <at> cbaines.net>
Cc: 64297-done <at> debbugs.gnu.org
Subject: Re: bug#64297: [Cuirass] Remote server not picking up job, losing
 workers
Date: Sat, 01 Jul 2023 18:57:53 +0200
Hello,

Christopher Baines <mail <at> cbaines.net> skribis:

> Ludovic Courtès <ludo <at> gnu.org> writes:
>
>> Ludovic Courtès <ludo <at> gnu.org> skribis:
>>
>>> The problem is most likely with the connection-to-port caching in
>>> squee’s ‘connection-socket-port’, as can be seen in this other trace
>>> where I added ‘pk’ calls in ‘connection-socket-port’:
>>
>> Confirmed, with a fix!
>>
>>   https://notabug.org/cwebber/guile-squee/pulls/8
>
> I've merged that change, updated guile-squee in Guix, pulled on berlin,
> reconfigured and restarted Cuirass now.

Awesome, thanks a lot!  It does seem to be working hard now.

(I’ve just reconfigured as well, as I hadn’t seen your message…)

Ludo’.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 30 Jul 2023 11:24:15 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 14 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.