GNU bug report logs - #67839
shepherd: sometimes hangs on `guix system reconfigure`

Previous Next

Package: guix;

Reported by: Attila Lendvai <attila <at> lendvai.name>

Date: Fri, 15 Dec 2023 19:22:02 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 67839 in the body.
You can then email your comments to 67839 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Fri, 15 Dec 2023 19:22:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Attila Lendvai <attila <at> lendvai.name>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Fri, 15 Dec 2023 19:22:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila <at> lendvai.name>
To: "bug-guix <at> gnu.org" <bug-guix <at> gnu.org>
Subject: shepherd: sometimes hangs on `guix system reconfigure`
Date: Fri, 15 Dec 2023 19:20:36 +0000
my fellow hackers,

i'm going to attach two patches that is essentially just adding a couple of asserts that trigger a test failure (tests/replacement.sh).

my current codebase (https://codeberg.org/attila-lendvai-patches/shepherd/commits/branch/attila) logs a whole lot more information, and has a more sophisticated error handling. triggering the same error on that codebase shows that the first assert is already failing (the one that is before spawning the new fiber for the controller of the replacement).

maybe the root cause is this: https://github.com/wingo/fibers/issues/29

HTH,

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Angry people want you to see how powerful they are… loving people want you to see how powerful You are.”
	— Chief Red Eagle





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Fri, 15 Dec 2023 19:38:01 GMT) Full text and rfc822 format available.

Message #8 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila.lendvai <at> gmail.com>
To: 67839 <at> debbugs.gnu.org
Cc: Attila Lendvai <attila <at> lendvai.name>
Subject: [PATCH 1/2] shepherd: Move root-service start under
 with-process-monitor.
Date: Fri, 15 Dec 2023 20:37:08 +0100
* modules/shepherd.scm (main): move the (start-service root-service) under the
dynamic extent of with-process-monitor, so that (current-process-monitor) is
valid for the root-service, too.
---
 modules/shepherd.scm | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index efc5517..77c6d18 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -451,12 +451,12 @@ fork in the child process."
         (run-fibers
          (lambda ()
            (with-service-registry
+             (with-process-monitor
 
-             ;; Register and start the 'root' service.
-             (register-services (list root-service))
-             (start-service root-service)
+               ;; Register and start the 'root' service.
+               (register-services (list root-service))
+               (start-service root-service)
 
-             (with-process-monitor
                ;; Replace the default 'system*' binding with one that
                ;; cooperates instead of blocking on 'waitpid'.  Replace
                ;; 'primitive-load' (in C as of 3.0.9) with one that does
-- 
2.41.0





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Fri, 15 Dec 2023 19:39:02 GMT) Full text and rfc822 format available.

Message #11 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila.lendvai <at> gmail.com>
To: 67839 <at> debbugs.gnu.org
Cc: Attila Lendvai <attila <at> lendvai.name>
Subject: [PATCH 2/2] service: Add two asserts that will make
 tests/replacement.sh fail.
Date: Fri, 15 Dec 2023 20:37:10 +0100
* modules/shepherd/service.scm (spawn-service-controller): Add two asserts.
This is the bug that causes `guix system reconfigure ...` to sometimes hang,
and subsequently all shepherd commands, because a match-error flies out from
the service-controller of a replaced service, and thus its fiber dies.
---
 modules/shepherd/service.scm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index c3bdf44..0ee6929 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -382,9 +382,11 @@ denoting what the service provides."
 
 (define (spawn-service-controller service)
   "Return a channel over which @var{service} may be controlled."
+  (assert (current-process-monitor))
   (let ((channel (make-channel)))
     (spawn-fiber
      (lambda ()
+       (assert (current-process-monitor))
        ;; The controller writes to its current output port via 'local-output'.
        ;; Make sure that goes to the right port.  If the controller got a
        ;; wrong output port, it could crash and stop responding just because a
-- 
2.41.0





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Fri, 15 Dec 2023 20:34:03 GMT) Full text and rfc822 format available.

Message #14 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: "Timo Wilken" <guix <at> twilken.net>
To: "Attila Lendvai" <attila <at> lendvai.name>
Cc: 67538 <at> debbugs.gnu.org, 67230 <at> debbugs.gnu.org, 67839 <at> debbugs.gnu.org,
 65178 <at> debbugs.gnu.org
Subject: Re: Shepherd stops responding during "guix system reconfigure"
Date: Fri, 15 Dec 2023 21:33:15 +0100
On Fri Dec 15, 2023 at 8:47 PM CET, Attila Lendvai wrote:
> i think i have found the root cause of this, as documented here: https://issues.guix.gnu.org/67839
>
> that issue contains patches for shepherd to reproduce it in its test suite.

Thank you very much for this, Attila!

Are the patch in 67839 and/or your branch "attila" linked from there in a
state that I could test them locally? Would it be valuable to you if I ran a
patched Shepherd and sent logs and/or backtraces as I encountered them?




Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Fri, 15 Dec 2023 21:25:05 GMT) Full text and rfc822 format available.

Message #17 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila <at> lendvai.name>
To: Timo Wilken <guix <at> twilken.net>
Cc: 67538 <at> debbugs.gnu.org, 67230 <at> debbugs.gnu.org, 67839 <at> debbugs.gnu.org,
 65178 <at> debbugs.gnu.org
Subject: Re: Shepherd stops responding during "guix system reconfigure"
Date: Fri, 15 Dec 2023 21:24:15 +0000
> Thank you very much for this, Attila!


you're welcome! :)


> Are the patch in 67839 and/or your branch "attila" linked from there in a
> state that I could test them locally? Would it be valuable to you if I ran a
> patched Shepherd and sent logs and/or backtraces as I encountered them?


it's nice of you, but not really. now that we have a failing test case in shepherd's unit tests that can reproduce it much easier.

with #67839 you would only get you an extra "Assertion failed" message over master, without much useful output.

as for my branch, it would emit a lot of useful log, including backtraces, but i keep force-pushing into it. i'm running my servers with it, though, so if you feel really adventurous, and want to join the debugging, then you can try... otherwise it's too much in flux.

what we need to focus on now is making shepherd's test suite run clean again, one way or another. then i can test it in a real life environment, and report back with any possible findings.

-- 
• attila lendvai
• PGP: 963F 5D5F 45C7 DFCD 0A39
--
“Ignorance might be bliss for the ignorant, but for the rest of us it's a fucking pain in the ass.”
	— Ricky Gervais





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Sun, 17 Dec 2023 00:08:01 GMT) Full text and rfc822 format available.

Message #20 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila.lendvai <at> gmail.com>
To: 67839 <at> debbugs.gnu.org
Cc: Attila Lendvai <attila <at> lendvai.name>
Subject: [PATCH 3/2] shepherd: Fix tests/replacement.sh
Date: Sun, 17 Dec 2023 01:06:24 +0100
* modules/shepherd.scm (main): Switch with-service-registry and
with-process-monitor.  Fix proposed by @emixa-d at
https://github.com/wingo/fibers/issues/29#issuecomment-1858922276.  This way
the parameterize of the process monitor covers everything else.
---
 modules/shepherd.scm | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index 77c6d18..3303de3 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -450,8 +450,8 @@ fork in the child process."
         ;; because POSIX threads and 'fork' cannot be used together.
         (run-fibers
          (lambda ()
-           (with-service-registry
-             (with-process-monitor
+           (with-process-monitor
+             (with-service-registry
 
                ;; Register and start the 'root' service.
                (register-services (list root-service))
-- 
2.41.0





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Sun, 17 Dec 2023 00:46:02 GMT) Full text and rfc822 format available.

Message #23 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila.lendvai <at> gmail.com>
To: 67839 <at> debbugs.gnu.org
Cc: Attila Lendvai <attila <at> lendvai.name>
Subject: [PATCH v2 1/2] shepherd: Make sure with-process-monitor covers
 everything needed.
Date: Sun, 17 Dec 2023 01:44:24 +0100
* modules/shepherd.scm (main): Switch with-service-registry and
with-process-monitor.  This way the parameterize of the process monitor covers
everything else.  This fixes the bug that caused `guix system reconfigure` to
hang in certain situations.  Fix proposed by @emixa-d at:
https://github.com/wingo/fibers/issues/29#issuecomment-1858922276.
---
 modules/shepherd.scm | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/modules/shepherd.scm b/modules/shepherd.scm
index efc5517..3303de3 100644
--- a/modules/shepherd.scm
+++ b/modules/shepherd.scm
@@ -450,13 +450,13 @@ fork in the child process."
         ;; because POSIX threads and 'fork' cannot be used together.
         (run-fibers
          (lambda ()
-           (with-service-registry
+           (with-process-monitor
+             (with-service-registry
 
-             ;; Register and start the 'root' service.
-             (register-services (list root-service))
-             (start-service root-service)
+               ;; Register and start the 'root' service.
+               (register-services (list root-service))
+               (start-service root-service)
 
-             (with-process-monitor
                ;; Replace the default 'system*' binding with one that
                ;; cooperates instead of blocking on 'waitpid'.  Replace
                ;; 'primitive-load' (in C as of 3.0.9) with one that does
-- 
2.41.0





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Sun, 17 Dec 2023 00:46:02 GMT) Full text and rfc822 format available.

Message #26 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Attila Lendvai <attila.lendvai <at> gmail.com>
To: 67839 <at> debbugs.gnu.org
Cc: Attila Lendvai <attila <at> lendvai.name>
Subject: [PATCH v2 2/2] service: Add asserts that used to make
 tests/replacement.sh fail.
Date: Sun, 17 Dec 2023 01:44:25 +0100
* modules/shepherd/service.scm (spawn-service-controller): Add two asserts.
This is the bug that causes `guix system reconfigure ...` to sometimes hang,
and subsequently all shepherd commands, because a match-error flies out from
the service-controller of a replaced service, and thus its fiber dies.  These
asserts get triggered without the previous commit that fixes the issue.
---
 modules/shepherd/service.scm | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/modules/shepherd/service.scm b/modules/shepherd/service.scm
index c3bdf44..0ee6929 100644
--- a/modules/shepherd/service.scm
+++ b/modules/shepherd/service.scm
@@ -382,9 +382,11 @@ denoting what the service provides."
 
 (define (spawn-service-controller service)
   "Return a channel over which @var{service} may be controlled."
+  (assert (current-process-monitor))
   (let ((channel (make-channel)))
     (spawn-fiber
      (lambda ()
+       (assert (current-process-monitor))
        ;; The controller writes to its current output port via 'local-output'.
        ;; Make sure that goes to the right port.  If the controller got a
        ;; wrong output port, it could crash and stop responding just because a
-- 
2.41.0





Information forwarded to bug-guix <at> gnu.org:
bug#67839; Package guix. (Sun, 17 Dec 2023 22:01:02 GMT) Full text and rfc822 format available.

Message #29 received at 67839 <at> debbugs.gnu.org (full text, mbox):

From: Ludovic Courtès <ludo <at> gnu.org>
To: Attila Lendvai <attila.lendvai <at> gmail.com>
Cc: Attila Lendvai <attila <at> lendvai.name>, 67839 <at> debbugs.gnu.org
Subject: Re: bug#67839: shepherd: sometimes hangs on `guix system reconfigure`
Date: Sun, 17 Dec 2023 22:59:59 +0100
Hi Attila,

Attila Lendvai <attila.lendvai <at> gmail.com> skribis:

> * modules/shepherd.scm (main): Switch with-service-registry and
> with-process-monitor.  This way the parameterize of the process monitor covers
> everything else.  This fixes the bug that caused `guix system reconfigure` to
> hang in certain situations.  Fix proposed by @emixa-d at:
> https://github.com/wingo/fibers/issues/29#issuecomment-1858922276.

[...]

> * modules/shepherd/service.scm (spawn-service-controller): Add two asserts.
> This is the bug that causes `guix system reconfigure ...` to sometimes hang,
> and subsequently all shepherd commands, because a match-error flies out from
> the service-controller of a replaced service, and thus its fiber dies.  These
> asserts get triggered without the previous commit that fixes the issue.

Good catch!!

I pushed these patches with small edits, in particular adding a test
that reproduces the bug without relying on assertion failures:

  5dbde1c support: ‘assert’ logs source location information.
  0bcf02a Update NEWS.
  c07f0a8 service: Add asserts to ensure a process monitor is running.
  9be0b7e shepherd: Make sure ‘with-process-monitor’ covers everything needed.

Thanks for the tedious but fruitful debugging work!

Ludo’.




bug closed, send any further explanations to 67839 <at> debbugs.gnu.org and Attila Lendvai <attila <at> lendvai.name> Request was from Ludovic Courtès <ludo <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 17 Dec 2023 22:01:03 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 15 Jan 2024 12:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 206 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.