From debbugs-submit-bounces@debbugs.gnu.org Mon Jun 11 10:06:43 2018 Received: (at submit) by debbugs.gnu.org; 11 Jun 2018 14:06:43 +0000 Received: from localhost ([127.0.0.1]:43514 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fSNT1-0008II-Bm for submit@debbugs.gnu.org; Mon, 11 Jun 2018 10:06:43 -0400 Received: from eggs.gnu.org ([208.118.235.92]:56334) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fSNSz-0008I6-9a for submit@debbugs.gnu.org; Mon, 11 Jun 2018 10:06:41 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fSNSq-0001YW-18 for submit@debbugs.gnu.org; Mon, 11 Jun 2018 10:06:35 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=-0.0 required=5.0 tests=BAYES_40 autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:47059) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from ) id 1fSNSp-0001YQ-Tg for submit@debbugs.gnu.org; Mon, 11 Jun 2018 10:06:31 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:47293) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fSNSg-0002pO-VQ for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:31 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1fSNSc-0001TT-TN for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:22 -0400 Received: from fencepost.gnu.org ([2001:4830:134:3::e]:44274) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1fSNSc-0001TH-PY for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:18 -0400 Received: from [193.50.110.75] (port=57708 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1fSNSc-0003FS-9d for bug-guix@gnu.org; Mon, 11 Jun 2018 10:06:18 -0400 From: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) To: bug-guix@gnu.org Subject: Multiple client 'build-paths' RPCs can lead to daemon deadlock X-URL: http://www.fdn.fr/~lcourtes/ X-Revolutionary-Date: 23 Prairial an 226 de la =?utf-8?Q?R=C3=A9volution?= X-PGP-Key-ID: 0x090B11993D9AEBB5 X-PGP-Key: http://www.fdn.fr/~lcourtes/ludovic.asc X-PGP-Fingerprint: 3CE4 6455 8A84 FDC6 9DB4 0CFB 090B 1199 3D9A EBB5 X-OS: x86_64-pc-linux-gnu Date: Mon, 11 Jun 2018 16:06:16 +0200 Message-ID: <87602ph0yv.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -5.0 (-----) X-Debbugs-Envelope-To: submit X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -6.0 (------) Hello, I tried running this: guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-gr= afts) | sort) & \ guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-gr= afts) | sort -r) =E2=80=A6 also in parallel with this (for good measure): guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-gr= afts) | sort -R) Since we have 3 clients, that leads to 3 guix-daemon processes, and those are stuck in a deadlock: --8<---------------cut here---------------start------------->8--- $ ps aux | grep guix-daemon [=E2=80=A6] root 20501 0.6 0.0 39292 12012 ? Ss 15:51 0:01 guix-daemo= n 20455 root 20503 4.3 0.0 39420 12096 ? Ss 15:51 0:12 guix-daemo= n 20491 root 22154 0.0 0.0 39028 11016 ? Ss 15:52 0:00 guix-daemo= n 22148 $ sudo strace -p 22154 -t strace: Process 22154 attached 15:57:03 select(0, 0x7ffdca30b610, NULL, NULL, {tv_sec=3D1, tv_usec=3D31418= 9}) =3D 0 (Timeout) 15:57:04 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binu= tils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-core= utils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gett= ext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glib= c-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnut= ls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-grof= f-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libs= tdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-modu= le-import-compiled.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:57:04 fcntl(8, F_GETFD) =3D 0 15:57:04 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:57:04 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:57:04 close(8) =3D 0 15:57:04 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-open= ssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 ^C $ sudo strace -p 20503 -t strace: Process 20503 attached 15:58:23 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec= =3D1, tv_usec=3D595302}) =3D 0 (Timeout) 15:58:25 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:25 fcntl(14, F_GETFD) =3D 0 15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:25 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:25 close(14) =3D 0 15:58:25 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-modu= le-import-compiled.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:25 fcntl(14, F_GETFD) =3D 0 15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:25 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:25 close(14) =3D 0 15:58:25 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-open= ssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:25 fcntl(14, F_GETFD) =3D 0 15:58:25 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:25 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:25 close(14) =3D 0 15:58:25 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec= =3D5, tv_usec=3D0}) =3D 0 (Timeout) 15:58:30 openat(AT_FDCWD, "/gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:30 fcntl(14, F_GETFD) =3D 0 15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:30 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:30 close(14) =3D 0 15:58:30 openat(AT_FDCWD, "/gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-modu= le-import-compiled.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:30 fcntl(14, F_GETFD) =3D 0 15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:30 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:30 close(14) =3D 0 15:58:30 openat(AT_FDCWD, "/gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-open= ssl-1.0.2o-doc.lock", O_RDWR|O_CREAT, 0600) =3D 14 15:58:30 fcntl(14, F_GETFD) =3D 0 15:58:30 fcntl(14, F_SETFD, FD_CLOEXEC) =3D 0 15:58:30 fcntl(14, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:58:30 close(14) =3D 0 15:58:30 select(445, [15 54 159 177 321 411 414 444], NULL, NULL, {tv_sec= =3D5, tv_usec=3D0}^Cstrace: Process 20503 detached $ sudo strace -p 20501 -t strace: Process 20501 attached 15:59:09 select(121, [20 30 120], NULL, NULL, {tv_sec=3D0, tv_usec=3D535316= }) =3D 0 (Timeout) 15:59:10 openat(AT_FDCWD, "/gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binu= tils-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-core= utils-8.29-debug.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gett= ext-minimal-0.19.8.1.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glib= c-bootstrap-0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnut= ls-3.5.18-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-grof= f-1.22.3-doc.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libs= tdc++-5.5.0.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 openat(AT_FDCWD, "/gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1= .4.18.lock", O_RDWR|O_CREAT, 0600) =3D 8 15:59:10 fcntl(8, F_GETFD) =3D 0 15:59:10 fcntl(8, F_SETFD, FD_CLOEXEC) =3D 0 15:59:10 fcntl(8, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start= =3D0, l_len=3D0}) =3D -1 EAGAIN (Risurco dumtempe ne disponeblas) 15:59:10 close(8) =3D 0 15:59:10 select(121, [20 30 120], NULL, NULL, {tv_sec=3D5, tv_usec=3D0}^Cst= race: Process 20501 detached $ sudo lsof | grep '/gnu/store/.*\.lock' guix-daem 20501 root 14uW REG 253,0 = 0 3150879 /gnu/store/zvvgr301hz95ac15rp0wsjy5j4cdgkg4-module-import-com= piled.lock guix-daem 20501 root 26uW REG 253,0 = 0 3212618 /gnu/store/760kj698c9x1vcmlzk1qr8i9hqw961n3-openssl-1.0.2o-do= c.lock guix-daem 20501 root 116uW REG 253,0 = 0 3212684 /gnu/store/amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock guix-daem 20503 root 8uW REG 253,0 = 0 3212520 /gnu/store/7rs9qx86f2w4hjbf9wjm10k4m2vvlbyv-groff-1.22.3-doc.= lock guix-daem 20503 root 50uW REG 253,0 = 0 3212514 /gnu/store/ghc9vcc8ydv305jz0spd73qrrf30gxik-binutils-bootstra= p-0.lock guix-daem 20503 root 155uW REG 253,0 = 0 3212547 /gnu/store/j1kpzcpi1m1ihzwyqb60zd4s74yp85g8-glibc-bootstrap-0= .lock guix-daem 20503 root 173uW REG 253,0 = 0 3212552 /gnu/store/4llnxsgr6mxxj6pk21d9vbwaxnkbn99k-gnutls-3.5.18-doc= .lock guix-daem 20503 root 317uW REG 253,0 = 0 3212602 /gnu/store/d645d919gp1zk8v65qa2r4ia5x1c5n6k-libstdc++-5.5.0.l= ock guix-daem 20503 root 407uW REG 253,0 = 0 3212655 /gnu/store/5hdpffkgkfszpsin2ir50zscz06x0mb5-coreutils-8.29-de= bug.lock guix-daem 20503 root 410uW REG 253,0 = 0 3212659 /gnu/store/aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0= .19.8.1.lock guix-daem 20503 root 440uW REG 253,0 = 0 3212686 /gnu/store/p262qk7zsd2wnlal6x1jax2xbqvdddqq-m4-1.4.18.lock --8<---------------cut here---------------end--------------->8--- So for example: =E2=80=A2 20501 tries to acquire aasfcliv85frq3aawpa4rw18nvlvl735-gettext-minimal-0.19.8.1.lock, which is held by 20503; =E2=80=A2 20503 tries to acquire amcs7l0ynj1qg6fp9ll3asiamd4zsq75-m4-1.4.18.lock, which is held by 20501. This comes from the fact that =E2=80=98LocalStore::buildPaths=E2=80=99 take= s the user-supplied derivation list as is, without sorting it, and then acquires locks in that order in =E2=80=98Worker::run=E2=80=99. A topological sort (or maybe an alphanumeric sort?) should allow us to guarantee that guix-daemon processes take locks in the same order, and then don=E2=80=99t end up in a deadlock. I discovered this bug while monitoring Cuirass on berlin: several sessions submit batches of 200 derivations in =E2=80=98build-paths=E2=80=99= RPCs, and sometimes most of the corresponding guix-daemon processes would end up being stuck in a lock-acquiring loop. Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Mon Jun 11 11:42:51 2018 Received: (at control) by debbugs.gnu.org; 11 Jun 2018 15:42:51 +0000 Received: from localhost ([127.0.0.1]:43590 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fSOy3-00023X-AX for submit@debbugs.gnu.org; Mon, 11 Jun 2018 11:42:51 -0400 Received: from mail2-relais-roc.national.inria.fr ([192.134.164.83]:14252) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1fSOy0-00023I-Pg for control@debbugs.gnu.org; Mon, 11 Jun 2018 11:42:49 -0400 X-IronPort-AV: E=Sophos;i="5.49,502,1520895600"; d="scan'208";a="331324972" Received: from unknown (HELO ribbon) ([193.50.110.75]) by mail2-relais-roc.national.inria.fr with ESMTP/TLS/AES256-GCM-SHA384; 11 Jun 2018 17:42:42 +0200 Date: Mon, 11 Jun 2018 17:42:42 +0200 Message-Id: <87zi01cost.fsf@gnu.org> To: control@debbugs.gnu.org From: ludo@gnu.org (Ludovic =?utf-8?Q?Court=C3=A8s?=) Subject: control message for bug #31785 MIME-version: 1.0 Content-type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -5.0 (-----) severity 31785 important From debbugs-submit-bounces@debbugs.gnu.org Fri Sep 06 05:05:36 2019 Received: (at 31785) by debbugs.gnu.org; 6 Sep 2019 09:05:37 +0000 Received: from localhost ([127.0.0.1]:36406 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1i6ABU-0006rO-LP for submit@debbugs.gnu.org; Fri, 06 Sep 2019 05:05:36 -0400 Received: from eggs.gnu.org ([209.51.188.92]:48037) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1i6ABS-0006r9-Op for 31785@debbugs.gnu.org; Fri, 06 Sep 2019 05:05:35 -0400 Received: from fencepost.gnu.org ([2001:470:142:3::e]:53162) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1i6AAh-0004sA-Eh for 31785@debbugs.gnu.org; Fri, 06 Sep 2019 05:05:28 -0400 Received: from [2001:660:6102:320:e120:2c8f:8909:cdfe] (port=37632 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1i6AAg-00050u-OG for 31785@debbugs.gnu.org; Fri, 06 Sep 2019 05:04:47 -0400 From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: 31785@debbugs.gnu.org Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock References: <87602ph0yv.fsf@gnu.org> Date: Fri, 06 Sep 2019 11:04:43 +0200 In-Reply-To: <87602ph0yv.fsf@gnu.org> ("Ludovic \=\?utf-8\?Q\?Court\=C3\=A8s\=22'\?\= \=\?utf-8\?Q\?s\?\= message of "Mon, 11 Jun 2018 16:06:16 +0200") Message-ID: <87lfv1eqxg.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic] X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Here=E2=80=99s another example from berlin today: --8<---------------cut here---------------start------------->8--- ludo@berlin$ sudo guix processes SessionPID: 38649 ClientPID: 38611 ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guil= e \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --= max-silent-time=3D36000 --timeout=3D216000 -s armhf-linux nmoldyn r-msnid r= -yamss netcdf-fortran python-netcdf4 LockHeld: /gnu/store/v6hcm1gvv3gdfbnq6cd4kn1i6ip3y74s-netcdf-fortran-4.4.4.= lock LockHeld: /gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.1.lock ChildProcess: 33181: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.= 4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902= 458/bin/.guix-real offload x86_64-linux 36000 1 216000 [...] SessionPID: 44043 ClientPID: 44023 ClientCommand: /gnu/store/7x9lrzs2l00mfcr3ya7hg1i06i7lq9pd-profile/bin/guil= e \ /home/mbakke/guix/master/scripts/guix build --no-grafts --keep-going --= max-silent-time=3D36000 --timeout=3D216000 -s armhf-linux netcdf-fortran r-= msnid nmoldyn r-yamss LockHeld: /gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.0.lock LockHeld: /gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.11.lock ChildProcess: 48997: /gnu/store/9alic3caqhay3h8mx4iihpmyj6ymqpcx-guile-2.2.= 4/bin/guile \ /gnu/store/vyhfp0gj4rx74yw0ybalzj8bvmqgp9pr-guix-1.0.1-4.c902= 458/bin/.guix-real offload x86_64-linux 36000 1 216000 ludo@berlin ~$ sudo strace -p 38649 -e openat,fcntl strace: Process 38649 attached openat(AT_FDCWD, "/gnu/store/ibdbwxb1i8yal8ilqxxgnzh6bxpinic1-nmoldyn-3.0.1= 1.lock", O_RDWR|O_CREAT, 0600) =3D 10 fcntl(10, F_GETFD) =3D 0 fcntl(10, F_SETFD, FD_CLOEXEC) =3D 0 fcntl(10, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start=3D0, l_l= en=3D0}) =3D -1 EAGAIN (Resource temporarily unavailable) openat(AT_FDCWD, "/gnu/store/69jjabypcpx83482rh3yp4a9ncjimw6b-r-yamss-1.10.= 0.lock", O_RDWR|O_CREAT, 0600) =3D 10 fcntl(10, F_GETFD) =3D 0 fcntl(10, F_SETFD, FD_CLOEXEC) =3D 0 fcntl(10, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start=3D0, l_l= en=3D0}) =3D -1 EAGAIN (Resource temporarily unavailable) ^Cstrace: Process 38649 detached ludo@berlin ~$ sudo strace -p 44043 -e openat,fcntl strace: Process 44043 attached openat(AT_FDCWD, "/gnu/store/ydxl0ws9kk6ix58m99ib7qhi157yq49h-r-msnid-1.18.= 1.lock", O_RDWR|O_CREAT, 0600) =3D 10 fcntl(10, F_GETFD) =3D 0 fcntl(10, F_SETFD, FD_CLOEXEC) =3D 0 fcntl(10, F_SETLK, {l_type=3DF_WRLCK, l_whence=3DSEEK_SET, l_start=3D0, l_l= en=3D0}) =3D -1 EAGAIN (Resource temporarily unavailable) ^Cstrace: Process 44043 detached --8<---------------cut here---------------end--------------->8--- A very palpable deadlock. Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Wed Nov 04 10:21:51 2020 Received: (at 31785) by debbugs.gnu.org; 4 Nov 2020 15:21:52 +0000 Received: from localhost ([127.0.0.1]:50270 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kaKbf-00030L-Kk for submit@debbugs.gnu.org; Wed, 04 Nov 2020 10:21:51 -0500 Received: from eggs.gnu.org ([209.51.188.92]:60520) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1kaKbd-000308-Sq for 31785@debbugs.gnu.org; Wed, 04 Nov 2020 10:21:50 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]:41337) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1kaKbW-0008Oq-US for 31785@debbugs.gnu.org; Wed, 04 Nov 2020 10:21:44 -0500 Received: from [2a01:e0a:1d:7270:af76:b9b:ca24:c465] (port=56034 helo=ribbon) by fencepost.gnu.org with esmtpsa (TLS1.2:RSA_AES_256_CBC_SHA1:256) (Exim 4.82) (envelope-from ) id 1kaKbV-0007OP-DN; Wed, 04 Nov 2020 10:21:41 -0500 From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: 31785@debbugs.gnu.org Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock References: <87602ph0yv.fsf@gnu.org> Date: Wed, 04 Nov 2020 16:21:39 +0100 In-Reply-To: <87602ph0yv.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Mon, 11 Jun 2018 16:06:16 +0200") Message-ID: <87361p9mgs.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 Cc: Mathieu Othacehe X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Hi, ludo@gnu.org (Ludovic Court=C3=A8s) skribis: > This comes from the fact that =E2=80=98LocalStore::buildPaths=E2=80=99 ta= kes the > user-supplied derivation list as is, without sorting it, and then > acquires locks in that order in =E2=80=98Worker::run=E2=80=99. This diagnostic is incorrect: =E2=80=98Goals=E2=80=99 is a set sorted accor= ding to =E2=80=98CompareGoalPtrs=E2=80=99, which is lexical sort that arranges so s= ubstitution goals come before derivation goals. Thus, =E2=80=98_topGoals=E2=80=99 and = =E2=80=98awake2=E2=80=99 in Worker::run are sorted in a deterministic fashion. The problem is that =E2=80=98Worker::waitForAWhile=E2=80=99 reshuffles the = order of goals by temporarily moving goals out of the way. This can happen when offloading replies =E2=80=9Cpostpone=E2=80=9D, which is inherently non-dete= rministic (which goals are put to sleep will vary from one session to another session.) When those goals are eventually woken up from =E2=80=98Worker::waitForInput= =E2=80=99, they=E2=80=99re reprocessed, in sorted order, but potentially with =E2=80= =9Choles=E2=80=9D compared to other =E2=80=98guix-daemon=E2=80=99 processes. That=E2=80=99s only a partial explanation; we need to go further to come up= with an actual deadlock scenario. Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 21 11:22:29 2024 Received: (at 31785) by debbugs.gnu.org; 21 Dec 2024 16:22:29 +0000 Received: from localhost ([127.0.0.1]:47418 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tP2FJ-0003Mb-Ai for submit@debbugs.gnu.org; Sat, 21 Dec 2024 11:22:29 -0500 Received: from eggs.gnu.org ([209.51.188.92]:60032) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tP2FH-0003MP-Sh for 31785@debbugs.gnu.org; Sat, 21 Dec 2024 11:22:28 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tP2FC-0004Dh-M0 for 31785@debbugs.gnu.org; Sat, 21 Dec 2024 11:22:22 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=gLOjRZxuK+bwiWhhRezOi0k5F6eQPhSVP9o/7zGUeEA=; b=bvzR7CWA+/liyVe8zu5Q F8K/ddPVG//swDgTYsIDM/sgVy30XPSi3Ju2h/+cEijnMV89ovUR3J/0PHozFp9mG8og17jP4LGJ0 lShpGzp+jNBC+INPC7TLtAfFMO2/r5MSYDij6iG+iz7HtJRzPvhjlZw9fbMOYB7/eOg3zg/2NaxCI A+VLFj1jVbTkrS08JA+w01eSeBUzy+O3t8qrtuz+xhW08+EJjYKnr5e3ttwNEUERgSKwI3XY0BfDv nRgsgw5dkcnSrNA+P8SHdfm+Q82INBEr3mSSR9OVWf18Gep4uXd1s3lkOJ4JT38QfgKU/eP+825v0 T3ZSTH209vewhQ==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: 31785@debbugs.gnu.org Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <87602ph0yv.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Mon, 11 Jun 2018 16:06:16 +0200") References: <87602ph0yv.fsf@gnu.org> Date: Sat, 21 Dec 2024 17:22:15 +0100 Message-ID: <878qs9gg5k.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) ludo@gnu.org (Ludovic Court=C3=A8s) skribis: > I tried running this: > > guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-= grafts) | sort) & \ > guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-= grafts) | sort -r) > > =E2=80=A6 also in parallel with this (for good measure): > > guix build --max-jobs=3D200 $(guix gc -R $(guix build -d inkscape --no-= grafts) | sort -R) > > Since we have 3 clients, that leads to 3 guix-daemon processes, and > those are stuck in a deadlock: This strikes again: =E2=80=98cuirass remote-worker=E2=80=99 processes on be= rlin occasionally end up deadlocking in the exact same way. When running =E2=80=98current remote-worker --workers=3D4=E2=80=99, 4 sessi= ons (4 clients) are used, which can lead to that situation, as in this example: --8<---------------cut here---------------start------------->8--- root@hydra-guix-126 ~# guix processes |guix shell recutils -- recsel -p 'Se= ssionPID,ClientCommand,LockHeld' SessionPID: 27250 ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/= guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sry= b24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=3Dcuira= ss-worker --workers=3D4 --systems=3Dx86_64-linux,i686-linux --publish-port= =3D5558 --substitute-urls=3Dhttp://141.80.167.131 SessionPID: 27269 ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/= guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sry= b24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=3Dcuira= ss-worker --workers=3D4 --systems=3Dx86_64-linux,i686-linux --publish-port= =3D5558 --substitute-urls=3Dhttp://141.80.167.131 LockHeld: /gnu/store/72s7500g3zg2p6fjdc1paazvm1w2xdr2-libva-2.19.0.lock LockHeld: /gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1.lock SessionPID: 27308 ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/= guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sry= b24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=3Dcuira= ss-worker --workers=3D4 --systems=3Dx86_64-linux,i686-linux --publish-port= =3D5558 --substitute-urls=3Dhttp://141.80.167.131 LockHeld: /gnu/store/zf5w9ypk8il0i9y22n81aamypr2qgsmm-dav1d-1.5.0.lock SessionPID: 27345 ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/= guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sry= b24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=3Dcuira= ss-worker --workers=3D4 --systems=3Dx86_64-linux,i686-linux --publish-port= =3D5558 --substitute-urls=3Dhttp://141.80.167.131 LockHeld: /gnu/store/0xbi2bgq34yyx2fqjjwpgdv4gkfyaf60-gst-plugins-bad-minim= al-1.22.3.lock LockHeld: /gnu/store/ij5igi5xrp4sx6c78nbvg24lb4ma2f4l-libcbor-0.11.0.lock LockHeld: /gnu/store/czfvm14yy517vb8w2hpp46nyrdrymqyp-libfido2-1.12.0.lock LockHeld: /gnu/store/1ldcq0p20nqy7d3mxdy4yra1ax5ik3xc-mpg123-1.31.2.lock LockHeld: /gnu/store/sadbf1fmb0n9k754x5jbbdklcxbjqlhx-openssh-9.9p1.lock LockHeld: /gnu/store/86rl29llmb7s4sl3bx0vl465mmq7nk6f-gcr-3.41.2.lock SessionPID: 27382 ClientCommand: /gnu/store/mfkz7fvlfpv3ppwbkv0imb19nrf95akf-guile-3.0.9/bin/= guile --no-auto-compile -e main -s /gnu/store/ll18sc406b5cqapmvz17v22gh4sry= b24-cuirass-1.2.0-11.e96f088/bin/.cuirass-real remote-worker --user=3Dcuira= ss-worker --workers=3D4 --systems=3Dx86_64-linux,i686-linux --publish-port= =3D5558 --substitute-urls=3Dhttp://141.80.167.131 --8<---------------cut here---------------end--------------->8--- Here process 27269 holds locks on libva and rav1e and waits forever trying to get the dav1d lock, held by 27308; process 27308 tries to get the rav1e lock; process 27345 tries to get the libva lock. FWIW, each of them is trying to substitute (not build) those things, via the =E2=80=98build-things=E2=80=99 call made after the =E2=80=9Csubstitutin= g ~a inputs for ~a=E2=80=9D message in remote-worker. Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 21 12:08:24 2024 Received: (at 31785) by debbugs.gnu.org; 21 Dec 2024 17:08:24 +0000 Received: from localhost ([127.0.0.1]:47547 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tP2xj-0005Vl-Tm for submit@debbugs.gnu.org; Sat, 21 Dec 2024 12:08:24 -0500 Received: from eggs.gnu.org ([209.51.188.92]:45972) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tP2xi-0005VX-FJ for 31785@debbugs.gnu.org; Sat, 21 Dec 2024 12:08:23 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tP2xb-00040l-NO; Sat, 21 Dec 2024 12:08:15 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=wzRm2Mb94YE1+SeYc9ondSLuc0rChA/NAzL/CoXZPwQ=; b=CTwvlSegBHWCauY1n99g 03OxaQ/VwnwZWswn03PJgtpZgXaCR73Ows1euGUsN5HfNQDWmp818Q4mBgByUcy4TXywQg2bNlV67 wqq5bolDT81e3OkTckadA3DfmZCY8eoN/BtDKSHeEhCAm1QArzzXLgtWJogY8XzDdliFjLbY4Neqi TVpiPP+LXZGz28B9GssDCo7Y1wf3v564YFzw2y5QyXbVt7PXHQyEDFup0kpa9vyS4om0DsexlnbOA c/2nCosqcokAIA7MZ5VDD3vsVHoKARNpOtM48Yvt+0rBH2Red0yTfkHaAdDSyQG6jjkn7ratSa9ud cXXBNALxEXSJaA==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: 31785@debbugs.gnu.org Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <878qs9gg5k.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Sat, 21 Dec 2024 17:22:15 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> Date: Sat, 21 Dec 2024 18:08:13 +0100 Message-ID: <874j2xge0y.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , Reepca Russelstein X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) (Cc=E2=80=99ing Reepca + Chris for insight! See for context.) Ludovic Court=C3=A8s skribis: > Here process 27269 holds locks on libva and rav1e and waits forever > trying to get the dav1d lock, held by 27308; process 27308 tries to get > the rav1e lock; process 27345 tries to get the libva lock. Additional data points: =E2=80=A2 All three store items are valid, yet client sessions are still = stuck with locks held and trying to acquire the other locks; =E2=80=A2 The associated timestamp of these three store items in /var/guix/db/db.sqlite is the same (one-second accuracy); =E2=80=A2 The timestamps corresponds exactly to that of the =E2=80=9Cfetc= hing path=E2=80=9D messages in the log: --8<---------------cut here---------------start------------->8--- root@hydra-guix-126 ~# grep -C3 -E "fetching.*(rav1e-0.7.1|libva-2.19.0|dav= 1d-1.5.0)'" /var/log/cuirass-remote-worker.log 2024-12-21 13:27:21 libgdata-0.18.1.tar.xz 832KiB 133.9MiB/s 00:00 ???= ????????????????? 100.0% 2024-12-21 13:27:21 2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/9zbfmr41v0g4a5wm5s4y= zwn8hg8l50b0-libgdata-0.18.1.tar.xz 2024-12-21 13:27:21 fetching path `/gnu/store/zf5w9ypk8il0i9y22n81aamypr2qg= smm-dav1d-1.5.0'... 2024-12-21 13:27:21 @ substituter-started /gnu/store/zf5w9ypk8il0i9y22n81aa= mypr2qgsmm-dav1d-1.5.0 substitute 2024-12-21 13:27:21 GmMxDR0c: substituting 14 inputs for /gnu/store/21m6i11= 0abpzdjsb2kbz2c6sm8zy3zpx-sugar-cellgame-activity-5-1.4a22fd1.drv 2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/lzip/zf5w9ypk8il0= i9y22n81aamypr2qgsmm-dav1d-1.5.0... -- 2024-12-21 13:27:21 guile_gi-0.3.2.tar.gz 876KiB 124.3MiB/s 00:00 ???= ????????????????? 100.0% 2024-12-21 13:27:21 2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/db700nxijpqn34a22nxp= ajbw2pwffkpv-guile_gi-0.3.2.tar.gz 2024-12-21 13:27:21 fetching path `/gnu/store/72s7500g3zg2p6fjdc1paazvm1w2x= dr2-libva-2.19.0'... 2024-12-21 13:27:21 @ substituter-started /gnu/store/72s7500g3zg2p6fjdc1paa= zvm1w2xdr2-libva-2.19.0 substitute 2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/zstd/ij5igi5xrp4s= x6c78nbvg24lb4ma2f4l-libcbor-0.11.0... 2024-12-21 13:27:21 -- 2024-12-21 13:27:21 2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/ij5igi5xrp4sx6c78nbv= g24lb4ma2f4l-libcbor-0.11.0 2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/zstd/72s7500g3zg2= p6fjdc1paazvm1w2xdr2-libva-2.19.0... 2024-12-21 13:27:21 fetching path `/gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3= dgz-rav1e-0.7.1'... 2024-12-21 13:27:21 @ substituter-started /gnu/store/0bbnhq7bagn6sbj2lmapmd= iiw50v3dgz-rav1e-0.7.1 substitute 2024-12-21 13:27:21 waiting for locks or build slots... 2024-12-21 13:27:21 --8<---------------cut here---------------end--------------->8--- It would seem that the root cause is that locks aren=E2=80=99t released even though substitution succeeded: --8<---------------cut here---------------start------------->8--- root@hydra-guix-126 ~# grep -E '(zf5w9ypk8il0i9y22n81aamypr2qgsmm|72s7500g3= zg2p6fjdc1paazvm1w2xdr2|0bbnhq7bagn6sbj2lmapmdiiw50v3dgz)' /var/log/cuirass= -remote-worker.log 2024-12-21 13:27:21 fetching path `/gnu/store/zf5w9ypk8il0i9y22n81aamypr2qg= smm-dav1d-1.5.0'... 2024-12-21 13:27:21 @ substituter-started /gnu/store/zf5w9ypk8il0i9y22n81aa= mypr2qgsmm-dav1d-1.5.0 substitute 2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/lzip/zf5w9ypk8il0= i9y22n81aamypr2qgsm -dav1d-1.5.0... 2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/zf5w9ypk8il0i9y22n81= aamypr2qgsmm-dav1d-1.5.0 2024-12-21 13:27:21 fetching path `/gnu/store/72s7500g3zg2p6fjdc1paazvm1w2x= dr2-libva-2.19.0'... 2024-12-21 13:27:21 @ substituter-started /gnu/store/72s7500g3zg2p6fjdc1paa= zvm1w2xdr2-libva-2.19.0 substitute 2024-12-21 13:27:21 Downloading http://141.80.167.131/nar/zstd/72s7500g3zg2= p6fjdc1paazvm1w2xdr -libva-2.19.0... 2024-12-21 13:27:21 fetching path `/gnu/store/0bbnhq7bagn6sbj2lmapmdiiw50v3= dgz-rav1e-0.7.1'... 2024-12-21 13:27:21 @ substituter-started /gnu/store/0bbnhq7bagn6sbj2lmapmd= iiw50v3dgz-rav1e-0.7.1 substitute 2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/72s7500g3zg2p6fjdc1p= aazvm1w2xdr2-libva-2.19.0 2024-12-21 13:27:21 gst-plugins-espeak-0.5.0-1.7f6e412 19KiB 4.5MiB/s 00:= 00 ???????????????????? 100.0%Downloading http://141.80.167.131/nar/zstd/0b= bnhq7bagn6sbj2lmapmdiiw50v3dgz-rav1e-0.7.1... 2024-12-21 13:27:21 @ substituter-succeeded /gnu/store/0bbnhq7bagn6sbj2lmap= mdiiw50v3dgz-rav1e-0.7.1 --8<---------------cut here---------------end--------------->8--- Thoughts? Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 21 17:46:02 2024 Received: (at 31785) by debbugs.gnu.org; 21 Dec 2024 22:46:02 +0000 Received: from localhost ([127.0.0.1]:48256 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tP8EU-0004iU-C6 for submit@debbugs.gnu.org; Sat, 21 Dec 2024 17:46:02 -0500 Received: from eggs.gnu.org ([209.51.188.92]:48570) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tP8ES-0004hy-2E for 31785@debbugs.gnu.org; Sat, 21 Dec 2024 17:46:00 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tP8CE-0000iP-QW; Sat, 21 Dec 2024 17:43:42 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=l/pFC9H+DvoUW1Tf9Btc5M+pzEpXj4/l4tORBxiG/5Q=; b=pEF8AW1Q813SFFBtZg3D wAXet0uICAK0KleXrS9f530fB/H7YU6N4fKPD5bXO7MCTSdheinign1C4p71HsUMq8RTQexhusiiH /tmWDF/0cbk7EfWaBDPf4K07JF6jztvMUr79vRdQeq7P6EDeylCZe8lh+UhRHdjGNe5hnzpp71YDa YduinTPOHgAdDzN0v7rcf0DbhDle0hXofqMYtPhY5eR/2hzfFvtlS/gxDxA4MiF0ElIIFHao7aGaP deZUf3i4jgF0E4Uq1/LLGg0dpsXS7k7FAYVEnMq9Z2AKKZieWTfxomboU46lBf0ah417Sbk9dtGSr xbYSYP9mhMR/Aw==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: 31785@debbugs.gnu.org Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <874j2xge0y.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Sat, 21 Dec 2024 18:08:13 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> Date: Sat, 21 Dec 2024 23:43:38 +0100 Message-ID: <87y108fyhx.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , Reepca Russelstein X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Ludovic Court=C3=A8s skribis: > It would seem that the root cause is that locks aren=E2=80=99t released e= ven > though substitution succeeded: was about an issue that looks exactly the same, but the fix is already included in our copy of =E2=80=98build.cc=E2=80=99. Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Tue Dec 24 08:12:17 2024 Received: (at 31785) by debbugs.gnu.org; 24 Dec 2024 13:12:18 +0000 Received: from localhost ([127.0.0.1]:59920 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tQ4hs-00085z-J9 for submit@debbugs.gnu.org; Tue, 24 Dec 2024 08:12:17 -0500 Received: from mailout.russelstein.xyz ([209.141.47.21]:45088) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tQ4hn-00085d-T6 for 31785@debbugs.gnu.org; Tue, 24 Dec 2024 08:12:14 -0500 DKIM-Signature: v=1; a=ed25519-sha256; q=dns/txt; c=relaxed/relaxed; d=russelstein.xyz; s=ed25519; h=Content-Type:MIME-Version:Message-ID:Date: References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ghIxgPpzvvRz9XVJd2/eSY3NQo2J+enLsBb2FP63JQQ=; b=JyEMuF0LAD/dNc87W/dkDBv8oa tzcl1TD4c8btRmIWOeFyCrbOJGhL4Wcr0/AVnhN02sUM/u60F83tbMpaZqDA==; DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=russelstein.xyz; s=rsa; h=Content-Type:MIME-Version:Message-ID:Date: References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ghIxgPpzvvRz9XVJd2/eSY3NQo2J+enLsBb2FP63JQQ=; b=OPJLZgP7ZTbE6tCwPEfWsSYFIq PPYIDUSDQtb9P0V0KQWEqtSCmmpSZMnwXf3sSQQYOTnlZcABTJH5y1bmSE1UzWwxkTgX3ZO+WY6xf sn+KjodqMlWf7mMXlMjlqjfN9SF7lV6h9u0ji0inFXnXZzbKCk+o+9QDrBzqHHda6LcquR8pQMHW1 gMCImYudn8qTF4vr2KMs47zU8yR89q1mjfaOQC5YaLqAP7q+5AorMs5Unbpz3G1tcFglgwCLK59YM zFkUKv0z8EZUSsvF18A/x/T9LMcl22pbRc6Gr0OAhWwt83bz6Lm0CUsP7hhNjpSt2kOBSPgPZtzv/ WgwhSpKqv2n8EBUXMi+IkOA2V0T1CMK5R+ZbcyHtVimpAboD3kaP1bd/ZiJpjbYc+eWEHMpIlkXG+ LOYPQ6TGCNKtnx9nO8GmTN90rrhU3BQzTYegiDTz/D/pw8PfOVftZ5zKs1XWjaBkhCbGF61irz+Aj H2D2lrgAyZTaNnjDtLbgFiIwzwncC/vCZ+gb0geOOVPWLE3F+7p73CGUheV35TwFgsCgTOYVY6Nak 3LIxpTrZ4KOjO7D6gkXakviqNRkTpzD69/e8iOLXqjqNXMRri6cTUx3Ed9I3aPrh6OtAFGIMViWUg MT7awrFt9SLl6usu0nX/r7n+AtKgSCJ6RY5i4tID0=; Received: by russelstein.xyz with esmtpsa (TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim 4.98) (envelope-from ) id 1tQ4fc-0000000070K-1fVo; Tue, 24 Dec 2024 07:09:58 -0600 From: Reepca Russelstein To: Ludovic =?utf-8?Q?Court=C3=A8s?= Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <874j2xge0y.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Sat, 21 Dec 2024 18:08:13 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> Date: Tue, 24 Dec 2024 07:08:52 -0600 Message-ID: <87bjx16xej.fsf@russelstein.xyz> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: multipart/signed; boundary="==-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 X-Spam-Bar: / X-Spam-Score-Int: 5 X-Spam-Report: Spam detection software, running on the system "Sanctum", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: (note: the following includes some "thinking out loud", skip to the patch if you're in a hurry) I'm a bit puzzled how exactly a true "forward-progress-not-possible" deadlock can even occur, especially in the manner you describe. The derivation graph is inherently a directed acyclic graph, so as [...] Content analysis details: (0.5 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 NO_RELAYS Informational: message was not relayed via SMTP 0.5 FROM_SUSPICIOUS_NTLD From abused NTLD X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , 31785@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --==-=-= Content-Type: multipart/mixed; boundary="=-=-=" --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable (note: the following includes some "thinking out loud", skip to the patch if you're in a hurry) I'm a bit puzzled how exactly a true "forward-progress-not-possible" deadlock can even occur, especially in the manner you describe. The derivation graph is inherently a directed acyclic graph, so as long as locks are only acquired for a given derivation after all of its inputs (or references, for substitutions) are present, and are released once the desired outputs are present, it should only even theoretically be possible to have a deadlock between different outputs of the same derivation. These requirements /seem/ to be satisfied by build.cc. Path locks are only acquired in DerivationGoal::tryToBuild and SubstitutionGoal::tryToRun, and the former is only reachable through DerivationGoal::inputsRealised (and DerivationGoal::buildDone in case of multiple rounds), and the latter is only reachable through SubstitutionGoal::referencesValid. As their names imply, having made it through them should suffice to prove that the inputs and references are present. That leaves the requirement that the locks are released once the desired outputs are present. It /is/ possible for a lock to stick around longer than desired (along with many other bad things) in DerivationGoal::tryToBuild, but only if a cached build failure is encountered, to my knowledge (notice how there is no call to done, amDone, or any rescheduling procedure in that case; the running DerivationGoal just drops off the face of the earth and tells nobody, including any goal waiting on it). This also happens in DerivationGoal::haveDerivation. I assume you're not using cached build failures, right? It defaults to off, so it should only be in use if you explicitly passed =2D-cache-failures to guix-daemon (or had your client pass a value of "true" for "build-cache-failure", but I don't see that string anywhere in the guile side of guix). One detail that worries me is that SubstitutionGoal.outputLock is a std::shared_ptr instead of a PathLocks. I cannot for the life of me figure out why this is the case, since no sharing of ownership of the underlying PathLocks seems to ever be done. outputLock.reset() seems to be used as if it were a synonym for outputLock.unlock() in many places, so whoever wrote those uses probably thought the same. As far as I can tell, it *should* always cause PathLocks::unlock to be called, so it shouldn't be an issue, it just worries me that I don't understand why it's done that way. Ahh, I think I may have just found another place in DerivationGoal::tryToBuild that might be the source of your issues. First, note the comment: /* Obtain locks on all output paths. The locks are automatically released when we exit this function or the client crashes. ... */ Well, as I noticed earlier in the cached build failures case, this isn't exactly accurate, and there's probably a reason why. The comment seems to believe specifically that leaving the scope of tryToBuild will automatically release the locks, which suggests to me that at one point outputLocks was a local variable, rather than a field of DerivationGoal. This theorized older version would be consistent with what we see in this (current) snippet: if (buildMode !=3D bmCheck && validPaths.size() =3D=3D drv.outputs.size= ()) { debug(format("skipping build of derivation `%1%', someone beat us t= o it") % drvPath); outputLocks.setDeletion(true); done(BuildResult::AlreadyValid); return; } If leaving the scope of tryToBuild would cause outputLocks.unlock to run, then outputLocks.setDeletion is all that would need to be called. But since that isn't the case, this will cause the lock to be held for as long as the DerivationGoal exists. (note that I haven't actually checked the commit history to see whether outputLocks was or wasn't at some point a local variable - I have no idea why the commenter thought returning would automatically release it) Perhaps this same error was made elsewhere. C-s to see where else setDeletion is used without an accompanying unlock and... sure enough, in SubstitutionGoal::tryToRun, we see: /* Check again whether the path is invalid. */ if (!repair && worker.store.isValidPath(storePath)) { debug(format("store path `%1%' has become valid") % storePath); outputLock->setDeletion(true); amDone(ecSuccess); return; } Now outputLock will be held for as long as this SubstitutionGoal exists. The Nix issue that you identified as closely resembling what you were encountering was resolved with this commit message (https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83= b8): =2D------------------------------------------------------------------ Fix deadlock in SubstitutionGoal We were relying on SubstitutionGoal's destructor releasing the lock, but if a goal is a top-level goal, the destructor won't run in a timely manner since its reference count won't drop to zero. So release it explicitly. =2D------------------------------------------------------------------ Hmmm. Destructors not running in a timely manner, and in this particular case - which is guaranteed to occur when one guix-daemon process gets blocked by another holding a lock and the one initially holding the lock produces the output path - the only way the locks get released is when the destructors are run. I do believe this may be related! One other possible reason for destructors not running in as timely a manner as one might hope (though probably not as severely as for top-level goals) would be that Worker::run uses a ready queue, awake2, of type Goals, which contains strong references. This ensures that any Goal on this scheduler turn will not be destroyed, even if it has already run, until the end of this scheduler turn is reached. This probably isn't the cause for what you're seeing, but this kind of detail matters a lot when so much is implicitly tied to the lifetime of references. This could be corrected by popping off Goals one at a time from awake2. Patch attached for DerivationGoal::tryToBuild and SubstitutionGoal::tryToRun. I haven't tested it beyond verifying that it compiles, but I've got a pretty good feeling=E2=84=A2 about it. The com= mit message includes discussion of a concrete scenario in which this could cause a deadlock. Interestingly enough, it seems to require at least 3 guix-daemon processes running during the "inciting event", though now that I think about it, the third doesn't need to be a persistent part of the deadlock. I did a brief glance over what I could find of Nix's corresponding code (of course much has changed), and it looks like there is a chance that at least their DerivationGoal::tryToBuild in src/libstore/build/derivation-goal.cc is affected by the same issue (you see the same pattern of calling outputLocks.setDeletion but not outputLocks.unlock in the same case). I couldn't quickly determine whether it was possible their substitutes were affected, as that code has more significantly diverged and I see no mention of locks in it. You may want to give them a nudge and see if this information is of use to them. Merry Christmas. =2D reepca --=-=-= Content-Type: text/x-patch Content-Disposition: attachment; filename=0001-nix-build.cc-explicitly-unlock-in-the-has-become-val.patch Content-Transfer-Encoding: quoted-printable From=202030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001 From: Reepca Russelstein Date: Tue, 24 Dec 2024 05:40:58 -0600 Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid case. (hopefully) fixes #31785. Similar to Nix issue #178, fixed in https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c83b= 8. We can't rely on Goal deletion to release our locks in a timely manner. In the case in which multiple guix-daemon processes simultaneously try produci= ng an output path path1, the one that gets there first (P1) will get the lock, and the second one (P2) will continue trying to acquire the lock until it is released. Once it has acquired the lock, it checks to see whether the path has already become valid in the meantime, and if so it reports success to those Goals waiting on its completion and finishes. Unfortunately, it fails to release the locks it holds first, so those stay held until that Goal gets deleted. Suppose we have the following store path dependency graph: path4 / | \ path1 path2 path3 P2 is now sitting on path1's lock, and will continue to do so until path4 is completed. Suppose there is also a P3, and it has been blocked while P1 builds path2. Now P3 is sitting on path2's lock, and can't acquire path1's lock to determine that it has been completed. Likewise P2 is sitting on path1's lock, and now can't acquire path2's lock to determine that it has b= een completed. Finally, P3 completes path3 while P2 is blocked. Now: =2D P1 knows that path1 and path2 are complete, and holds no locks, but can= 't determine that path3 is complete =2D P2 knows that path1 and path3 are complete, and holds locks on path1 and path3, but can't determine that path2 is complete =2D P3 knows that path2 and path3 are complete, and holds a lock on path2, = but can't determine that path1 is complete And none of these locks will be released until path4 is complete. Thus, we have a deadlock. To resolve this, we should explicitly release these locks as soon as they should be released. * nix/libstore/build.cc (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun): Explicitly release locks in the has-become-valid case. Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8 =2D-- nix/libstore/build.cc | 2 ++ 1 file changed, 2 insertions(+) diff --git a/nix/libstore/build.cc b/nix/libstore/build.cc index 43a8a371846..edd01bab34d 100644 =2D-- a/nix/libstore/build.cc +++ b/nix/libstore/build.cc @@ -1200,6 +1200,7 @@ void DerivationGoal::tryToBuild() if (buildMode !=3D bmCheck && validPaths.size() =3D=3D drv.outputs.siz= e()) { debug(format("skipping build of derivation `%1%', someone beat us = to it") % drvPath); outputLocks.setDeletion(true); + outputLocks.unlock(); done(BuildResult::AlreadyValid); return; } @@ -3070,6 +3071,7 @@ void SubstitutionGoal::tryToRun() if (!repair && worker.store.isValidPath(storePath)) { debug(format("store path `%1%' has become valid") % storePath); outputLock->setDeletion(true); + outputLock.reset(); amDone(ecSuccess); return; } =2D-=20 2.45.2 --=-=-=-- --==-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQFLBAEBCAA1FiEEdNapMPRLm4SepVYGwWaqSV9/GJwFAmdqsmUXHHJlZXBjYUBy dXNzZWxzdGVpbi54eXoACgkQwWaqSV9/GJyVJgf/TOqLfl5jKX9madCj9VGdaidy gMFJdgr1c9BhGzYFt+FCkBeftN+Dhas4ywtTgN+nmNCAAyMWQSxHK9VvZZFmMxsc sI5qBy+unAvEanR0dn0K02twtY1Zw+rvyZsLB8ECaxGSgBcH8/oalIm6nq/inM2G AHaI5JvPcmZZS1mDJ4FegWaGsxSFo+NdTOKMrHesByL1CqD8YtzVwGiD/tIu1rgT aBuaLWHVSWaLLn84mzqohlcTKlSPTyjnIZgwrBOftR+yhG6X5/LOJskhmjlPlN91 3KJTGunnyDI55ymJw26YDIrYi1aYQcyFS8Zz538kOjzha7nfS5budXHxUViZVA== =OikB -----END PGP SIGNATURE----- --==-=-=-- From debbugs-submit-bounces@debbugs.gnu.org Fri Dec 27 17:47:39 2024 Received: (at 31785) by debbugs.gnu.org; 27 Dec 2024 22:47:40 +0000 Received: from localhost ([127.0.0.1]:47915 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tRJ7L-0004aE-3Y for submit@debbugs.gnu.org; Fri, 27 Dec 2024 17:47:39 -0500 Received: from eggs.gnu.org ([209.51.188.92]:32896) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tRJ7I-0004a1-Af for 31785@debbugs.gnu.org; Fri, 27 Dec 2024 17:47:37 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tRJ54-0006vn-R4; Fri, 27 Dec 2024 17:45:18 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=Yt2BTIX7uu98trjgCi89w8SYMRRpCg28ysSgIg/EDY0=; b=ZA0SWuWGJTuCvv2CTjd3 sbUtnpHdNfVLM2QRmdnj7e+tdZPETX88ldxPcuxC2cWfzxpiW0HBPCgzWGqaLJORtI/cPvTRdIOPs N/VtVK7KMgoDFFS06zdFY8w4C2X8IlIwM/Z5CoVX9AKZSsQEU2BILmPYaAvObSONQtWf62OAOYVDU wprLPLK/74d3t75oJp/HKoFtnUQSB7O6rhWPekQHrxzSWzEV1n9Wa9HnMec54IPM4+W6vTMfea3El 9jFrs15AcCZpfLdAwnrOwHSDyFnJxaNSRpPeGGufpilTSXL86F1XzkrMwzTTj4Dn31ia43ZKTVgwA ZiNIPYSr9CPY1g==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: Reepca Russelstein Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <87bjx16xej.fsf@russelstein.xyz> (Reepca Russelstein's message of "Tue, 24 Dec 2024 07:08:52 -0600") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> <87bjx16xej.fsf@russelstein.xyz> Date: Fri, 27 Dec 2024 23:45:10 +0100 Message-ID: <87bjwwyccp.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=-=-=" X-Spam-Score: -1.6 (-) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , 31785@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -2.6 (--) --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Hello Reepca, So glad to get your message. :-) Reepca Russelstein skribis: > I assume you're not using cached build failures, right? It defaults to > off, so it should only be in use if you explicitly passed > --cache-failures to guix-daemon (or had your client pass a value of > "true" for "build-cache-failure", but I don't see that string anywhere > in the guile side of guix). Right, not using cached build failures on these machines. > One detail that worries me is that SubstitutionGoal.outputLock is a > std::shared_ptr instead of a PathLocks. I cannot for the > life of me figure out why this is the case, since no sharing of > ownership of the underlying PathLocks seems to ever be done. > outputLock.reset() seems to be used as if it were a synonym for > outputLock.unlock() in many places, so whoever wrote those uses probably > thought the same. As far as I can tell, it *should* always cause > PathLocks::unlock to be called, so it shouldn't be an issue, it just > worries me that I don't understand why it's done that way. Yes, I had the same thoughts. I checked the =E2=80=9Cshared_ptr outputLock;=E2=80=9D member is here since at least 2005, when the file was renamed to build.cc. > Ahh, I think I may have just found another place in > DerivationGoal::tryToBuild that might be the source of your issues. > First, note the comment: > > /* Obtain locks on all output paths. The locks are automatically > released when we exit this function or the client crashes. ... */ > > Well, as I noticed earlier in the cached build failures case, this isn't > exactly accurate, and there's probably a reason why. The comment seems > to believe specifically that leaving the scope of tryToBuild will > automatically release the locks, which suggests to me that at one point > outputLocks was a local variable, rather than a field of DerivationGoal. > This theorized older version would be consistent with what we see in > this (current) snippet: > > if (buildMode !=3D bmCheck && validPaths.size() =3D=3D drv.outputs.si= ze()) { > debug(format("skipping build of derivation `%1%', someone beat us= to it") % drvPath); > outputLocks.setDeletion(true); > done(BuildResult::AlreadyValid); > return; > } Uh! Sounds like a plausible problem. > If leaving the scope of tryToBuild would cause outputLocks.unlock to > run, then outputLocks.setDeletion is all that would need to be called. > But since that isn't the case, this will cause the lock to be held for > as long as the DerivationGoal exists. Yep. [...] > From 2030f198892f972ee6bc2fb8529cbd6afb6bc9ad Mon Sep 17 00:00:00 2001 > From: Reepca Russelstein > Date: Tue, 24 Dec 2024 05:40:58 -0600 > Subject: [PATCH] nix: build.cc: explicitly unlock in the has-become-valid > case. > > (hopefully) fixes #31785. Similar to Nix issue #178, fixed in > https://github.com/NixOS/nix/commit/29cde917fe6b8f2e669c8bf10b38f640045c8= 3b8. > > We can't rely on Goal deletion to release our locks in a timely manner. = In > the case in which multiple guix-daemon processes simultaneously try produ= cing > an output path path1, the one that gets there first (P1) will get the loc= k, > and the second one (P2) will continue trying to acquire the lock until it= is > released. Once it has acquired the lock, it checks to see whether the pa= th > has already become valid in the meantime, and if so it reports success to > those Goals waiting on its completion and finishes. Unfortunately, it fa= ils > to release the locks it holds first, so those stay held until that Goal g= ets > deleted. > > Suppose we have the following store path dependency graph: > > path4 > / | \ > path1 path2 path3 > > P2 is now sitting on path1's lock, and will continue to do so until path4= is > completed. Suppose there is also a P3, and it has been blocked while P1 > builds path2. Now P3 is sitting on path2's lock, and can't acquire path1= 's > lock to determine that it has been completed. Likewise P2 is sitting on > path1's lock, and now can't acquire path2's lock to determine that it has= been > completed. Finally, P3 completes path3 while P2 is blocked. > > Now: > > - P1 knows that path1 and path2 are complete, and holds no locks, but can= 't > determine that path3 is complete > - P2 knows that path1 and path3 are complete, and holds locks on path1 and > path3, but can't determine that path2 is complete > - P3 knows that path2 and path3 are complete, and holds a lock on path2, = but > can't determine that path1 is complete > > And none of these locks will be released until path4 is complete. Thus, = we > have a deadlock. > > To resolve this, we should explicitly release these locks as soon as they > should be released. > > * nix/libstore/build.cc > (DerivationGoal::tryToBuild, SubstitutionGoal::tryToRun): > Explicitly release locks in the has-become-valid case. > > Change-Id: Ie510f84828892315fe6776c830db33d0f70bcef8 I=E2=80=99m very much convinced by the patch. Yet it bothers me that I can= not reproduce the problem. I tried first with this test, which attempts to reproduce what you describe in the commit log above: --8<---------------cut here---------------start------------->8--- ;; https://issues.guix.gnu.org/31785 (use-modules (guix) ((gnu packages) #:select (specification->package)) (srfi srfi-1) (ice-9 threads) (ice-9 match) (ice-9 textual-ports)) (define (nonce) (logxor (car (gettimeofday)) (cdr (gettimeofday)) (getpid))) (define drv1 (computed-file "drv1" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv2 (computed-file "drv2" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv3 (computed-file "drv3" #~(begin #$(nonce) (sleep 3) (mkdir #$output)))) (define drv4 (computed-file "drv4" #~(begin #$(nonce) (sleep 3) (pk 'deps: #$drv1 #$drv2 #$drv3) (mkdir #$output)))) (%graft? #f) (define (log-port prefix) "Return a logging port that prefixes each line with PREFIX." (make-custom-textual-output-port "log" (lambda (str start count) (let ((str (substring str start (+ start count)))) (format (current-error-port) "~a: ~a" prefix str) count)) #f #f #f)) (setvbuf (current-error-port) 'line) (set-port-encoding! (current-error-port) "UTF-8") (let* ((builder (lambda (name lst) (call-with-new-thread (lambda () (parameterize ((current-build-output-port (log-port name))) (with-store store (set-build-options store #:verbosity 4) (run-with-store store (mlet %store-monad ((lst (mapm %store-monad lower-object lst))) (built-derivations lst))))))))) (thread1 (builder 'thread1 (list drv4))) (thread2 (builder 'thread2 (list drv4 drv1 drv2))) (thread3 (builder 'thread3 (list drv4 drv3 drv2)))) (join-thread thread1) (join-thread thread2) (join-thread thread3)) --8<---------------cut here---------------end--------------->8--- But there=E2=80=99s no deadlock, and I think that=E2=80=99s because the pro= blem we=E2=80=99re seeing has to do with substitute goals, and there=E2=80=99s no such goal he= re. So then I tried this: --=-=-= Content-Type: text/x-patch; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable diff --git a/tests/store.scm b/tests/store.scm index 45948f4f43..224b9867c4 100644 --- a/tests/store.scm +++ b/tests/store.scm @@ -1,5 +1,5 @@ ;;; GNU Guix --- Functional package management for GNU -;;; Copyright =C2=A9 2012-2021, 2023 Ludovic Court=C3=A8s +;;; Copyright =C2=A9 2012-2021, 2023, 2024 Ludovic Court=C3=A8s ;;; ;;; This file is part of GNU Guix. ;;; @@ -35,6 +35,7 @@ (define-module (test-store) #:use-module (gnu packages bootstrap) #:use-module (ice-9 match) #:use-module (ice-9 regex) + #:use-module (ice-9 threads) #:use-module (rnrs bytevectors) #:use-module (rnrs io ports) #:use-module (web uri) @@ -1042,6 +1043,52 @@ (define %shell (ensure-path s item) (path-info-nar-size (query-path-info s item))))) =20 +(test-assert "substitute deadlock" + (with-store s + (let* ((guile (package-derivation s %bootstrap-guile (%current-system)= )) + (c (random-text)) ;contents of the output + (drv1 (build-expression->derivation + s "substitute-me1" + `(begin ,c (exit 1)) ;would actually fail + #:guile-for-build guile)) + (drv2 (build-expression->derivation + s "substitute-me2" + `(begin ,c (exit 1)) ;would actually fail + #:guile-for-build guile)) + (drv3 (build-expression->derivation + s "depends-on-substitutable1" + `(call-with-output-file %output + (lambda (p) + (display ,c p))) + #:inputs `(("drv1" ,drv1)) + #:guile-for-build guile)) + (drv4 (build-expression->derivation + s "depends-on-substitutable2" + `(call-with-output-file %output + (lambda (p) + (display ,c p))) + #:inputs `(("drv2" ,drv2)) + #:guile-for-build guile))) + (with-derivation-substitute drv1 c + (with-derivation-substitute drv2 c + (system* "ls" "-l" "/tmp/subst") + (let* ((builder (lambda (drv) + (call-with-new-thread + (lambda () + (with-store store + (set-build-options store + #:use-substitutes? #t + #:substitute-urls + (%test-substitute-urls= )) + (build-things store + (list (derivation-file-name= drv)))))))) + (thread1 (builder drv3)) + (thread2 (builder drv4))) + (join-thread thread1) + (join-thread thread2) + (and (valid-path? s (derivation->output-path drv3)) + (valid-path? s (derivation->output-path drv4))))))))) + (test-assert "export/import several paths" (let* ((texts (unfold (cut >=3D <> 10) (lambda _ (random-text)) --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable But I think we=E2=80=99d have to be lucky for it to trigger the deadlock (l= ucky because the two substitutes should be fetched at the same instant, roughly). Given that it=E2=80=99s a nasty issue, I=E2=80=99d prefer to have an (ideal= ly automated) test. Thoughts? Ideas? Anyway, thanks for thinking through it and for explaining your reasoning and coming up with a patch! And: merry end-of-year holidays too. :-) Ludo=E2=80=99. --=-=-=-- From debbugs-submit-bounces@debbugs.gnu.org Sat Dec 28 02:09:44 2024 Received: (at 31785) by debbugs.gnu.org; 28 Dec 2024 07:09:44 +0000 Received: from localhost ([127.0.0.1]:48542 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tRQxE-00044A-09 for submit@debbugs.gnu.org; Sat, 28 Dec 2024 02:09:44 -0500 Received: from mailout.russelstein.xyz ([209.141.47.21]:34444) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tRQxA-00043s-1S for 31785@debbugs.gnu.org; Sat, 28 Dec 2024 02:09:42 -0500 DKIM-Signature: v=1; a=ed25519-sha256; q=dns/txt; c=relaxed/relaxed; d=russelstein.xyz; s=ed25519; h=Content-Type:MIME-Version:Message-ID:Date: References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ldcmPgyfQeG6UV21zmaHPxO9on3G7ugUGyD0nfNF3FQ=; b=z9n1BMyrJ18cYXggMXTDzC+PGZ u0QiwRzi0zGgd10xNykTBezsSUyHeq/FkHXNkzNr20DV4lS+im/w5VPK3SCw==; DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=russelstein.xyz; s=rsa; h=Content-Type:MIME-Version:Message-ID:Date: References:In-Reply-To:Subject:Cc:To:From:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=ldcmPgyfQeG6UV21zmaHPxO9on3G7ugUGyD0nfNF3FQ=; b=aez1yC/9RkHSEUOE9CDaRUhFbb tSiIK54dnEp2fKSxBmtHAoqYCdLaSnyWjRyNZ7T0mP8O5HX2F+k/4gwuNunEPzp2tIlUEjnSTKtor Pv1RKmFR+frIi3PxwfugMdilR2/VD//M5Vow1z8lPMsqMBDahc6f+zdBg6DLeDayPbfDJBH28RGpj L42uOjWznSHl9QYWsp1OuQFPFTUL6Lp5sT3Z079AsQJNsVosKtmpZ/h1Yrbha0v2/qmiPlqNyW3p7 npTLL9MZcUeSjtB8xUc9odtlocLzVyqOWuyKwPldPjdH2LcTQKahqhF6rFG+nPIesRWX6ZWBX7MiF 4HQtbhqW8eVaZBLjqWAbB2ntQm+rA/yl+nw0ApKj+TxaiOs8VGd5rVvMH62IrQqYL6yx8bOMse49P +XsqMBb3oF1/vXqzPoCtT+Q7Ei/Ph5ZNBgheUguGRUf5s+yhp3ZcL/AQBLBKDRfdPtmtXH3FPhXzH tnZ289cvT170SvfH0jS0hf7RVtWQJT5x+XbjPuej1nC6FSzgJOgwlGsIhxlKXf5Bc03kIlQdIiJ5N DrhG2kIan+JAx5sL8qFWrKU9MCMFIEw8P1SFFTasZ6pDZUdCSRJIHEIZ/ucK6wDb2F0P8ITFhVWCn BebDSCjy1QxtiC58o3nx0/M4I+ZOGJRLc+36txqQ8=; Received: by russelstein.xyz with esmtpsa (TLS1.3:ECDHE_SECP256R1__RSA_PSS_RSAE_SHA256__AES_256_GCM:256) (Exim 4.98) (envelope-from ) id 1tRQv1-000000002ly-287S; Sat, 28 Dec 2024 01:07:28 -0600 From: Reepca Russelstein To: Ludovic =?utf-8?Q?Court=C3=A8s?= Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <87bjwwyccp.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Fri, 27 Dec 2024 23:45:10 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> <87bjx16xej.fsf@russelstein.xyz> <87bjwwyccp.fsf@gnu.org> Date: Sat, 28 Dec 2024 01:06:52 -0600 Message-ID: <874j2o70c3.fsf@russelstein.xyz> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: multipart/signed; boundary="=-=-="; micalg=pgp-sha256; protocol="application/pgp-signature" X-Spam-Score: 0.5 X-Spam-Bar: / X-Spam-Score-Int: 5 X-Spam-Report: Spam detection software, running on the system "Sanctum", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: Ludovic Courtès writes: > I’m very much convinced by the patch. Yet it bothers me that I cannot > reproduce the problem. I tried first with this test, which attempts to > reproduce what you describe in the commit log above [...] Content analysis details: (0.5 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 NO_RELAYS Informational: message was not relayed via SMTP 0.5 FROM_SUSPICIOUS_NTLD From abused NTLD X-Spam-Score: 0.0 (/) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , 31785@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -1.0 (-) --=-=-= Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Ludovic Court=C3=A8s writes: > I=E2=80=99m very much convinced by the patch. Yet it bothers me that I c= annot > reproduce the problem. I tried first with this test, which attempts to > reproduce what you describe in the commit log above: > But there=E2=80=99s no deadlock, and I think that=E2=80=99s because the p= roblem we=E2=80=99re > seeing has to do with substitute goals, and there=E2=80=99s no such goal = here. The problem we've been seeing in the wild has to do with substitute goals, yes, but the same problem also exists for derivation goals. Starting the 3 builds at basically the same time seems to be a bit too "nice", given that some effort has already been made to make the scheduler more deterministic, and that they all use the exact same mechanism for deciding how long to wait before retrying the locks. Also, with only 3 inputs, even if the assignment of processes to output path roles (e.g. builds, sits-on, blocked) were completely random, only 2 out of 8 possible assignments to the latter two roles result in a deadlock. While 3 inputs makes for a nice, simple demonstration of the problem, for reliably recreating it, we're probably going to want more. Also, if I understand correctly, the issue with destructors not being run in a timely manner is only for top-level goals. I've managed to create a deadlock using a derivation with 10 inputs, passing all of the inputs and the dependent derivation as top-level derivations to build-things. I've also changed the duration of each input derivation build from 3 seconds to 4 and added a 1-second sleep between starting each thread. I have only seen this arrangement fail to create a deadlock once, but for good measure, I've subsequently bumped up the number of inputs to 15. Note that this means that this test will require 23 seconds to pass (3 daemon processes, 1 job per daemon process, 4 seconds per build round, 3 jobs per build round, 15 jobs in 5 rounds =3D 20 seconds, plus 3 seconds for the dependent derivation). I also had to set #:verbosity 3 instead of 4 because I kept getting encoding errors that killed one of my threads. These are most puzzling because they occur even after changing everything to just use (current-error-port) and making sure to run (set-port-conversion-strategy! (current-error-port) 'escape) I suspect there may be some underlying bug in guix or guile. Also, I don't have make-custom-textual-output-port here; it appears to only be in guile-next. Here is the reproducer: =2D-8<---------------cut here---------------start------------->8--- ;; https://issues.guix.gnu.org/31785 (use-modules (guix) ((gnu packages) #:select (specification->package)) (srfi srfi-1) (ice-9 threads) (ice-9 match) (ice-9 textual-ports)) (define (nonce) (logxor (car (gettimeofday)) (cdr (gettimeofday)) (getpid))) (define input-drvs (map (lambda (n) (computed-file (string-append "drv" (number->string n)) #~(begin #$(nonce) (sleep 4) (mkdir #$output)))) (iota 15))) (define top-drv (computed-file "top-drv" #~(begin #$(nonce) (sleep 3) (pk 'deps: #$@input-drvs) (mkdir #$output)))) (%graft? #f) (let* ((drvs (cons top-drv input-drvs)) (builder (lambda (name lst) (call-with-new-thread (lambda () (with-store store (set-build-options store #:verbosity 3 #:max-build-jobs 1) (run-with-store store (mlet %store-monad ((lst (mapm %store-monad lower-object lst))) (built-derivations lst)))))))) (thread1 (begin (sleep 1) (builder 'thread1 drvs))) (thread2 (begin (sleep 1) (builder 'thread2 drvs))) (thread3 (begin (sleep 1) (builder 'thread3 drvs)))) (join-thread thread1) (join-thread thread2) (join-thread thread3)) =2D-8<---------------cut here---------------end--------------->8--- P.S: If in attempting to turn this into a proper test, you try using the timeout argument to join-thread, be aware that a second attempt at calling join-thread on the same thread will fail with "In procedure lock-mutex: mutex already locked by thread". This is because join-thread in (ice-9 threads) has a bug in it: unlock-mutex is not called in the "else" case of the cond. I am mentioning this here in case I forget to make a proper report of it. =2D reepca --=-=-= Content-Type: application/pgp-signature; name="signature.asc" -----BEGIN PGP SIGNATURE----- iQFLBAEBCAA1FiEEdNapMPRLm4SepVYGwWaqSV9/GJwFAmdvo40XHHJlZXBjYUBy dXNzZWxzdGVpbi54eXoACgkQwWaqSV9/GJyiwAf+PIPwF7wE0fVHmD5KNzNsbN7V H7eOPm8VJJlKcowpGHM93qqP9vOqcP0r+8x/oYvzwkhDTTAscEQRdm1Nybd04vNj mnCbAksiJ2ULFaqQW4z6h8+UccCdm/o0BWsgtl+mJVaQ1Lc40L4RtScXELwDQoeF tamFGHCNeQwAbHEeBeYqmOobKv4wqKN9C0g1JMDPljJL0UXSWbXmL3tvmudcGYL3 D6rB8HBo7C9gWkKwrtO4X3OjI0mm9oaq6VBVbct63PQaAapbZxL+ZJt2mWsSDko7 EGnJyzy5bFZxHWAtfoLsYUk0swS1autVT8HNxq33/mkahw6bDRvczJUcK7abRQ== =RHfM -----END PGP SIGNATURE----- --=-=-=-- From debbugs-submit-bounces@debbugs.gnu.org Sun Dec 29 18:58:06 2024 Received: (at 31785) by debbugs.gnu.org; 29 Dec 2024 23:58:06 +0000 Received: from localhost ([127.0.0.1]:56595 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tS3Ac-0004kx-1S for submit@debbugs.gnu.org; Sun, 29 Dec 2024 18:58:06 -0500 Received: from eggs.gnu.org ([209.51.188.92]:51476) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tS3AZ-0004kO-Nu for 31785@debbugs.gnu.org; Sun, 29 Dec 2024 18:58:04 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tS3AS-0002uF-W3; Sun, 29 Dec 2024 18:57:57 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=RznLf7yIsVjCZEHscF8q83QnAqxOoK52fAFOgWk5iRA=; b=ID0KyssWu0NKNRc/6oYz JtZfET15qUbuuNxMw1nCBoSx24FEd7T4VLsNWSkhFesGquY1NnTf4PFx6wwgTdeXiE1Wp+bH/liqG sPFRZF9UpGNJl4TUKrI1IuwoO7DFNwogC/al20Iuyip9ptTQHGZzWVHRl8Hxom9iiFxtjdUgx5M/R 3pqm65ALXPj4iyPZcR9gIAexiyklDhf6Y1xbkO2z3RhmlGw7zxtb6gilcBEZAOs6vRjygJVjfZ1fb 0MGwjF5HnE6DmC97FYAqypDXmA1EJU36Nh1s7mrkIL64zpKTs8u/qBXkXLeCkZWFIWEH/pdkkPZc1 obSWO9rENdKZBw==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: Reepca Russelstein Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <874j2o70c3.fsf@russelstein.xyz> (Reepca Russelstein's message of "Sat, 28 Dec 2024 01:06:52 -0600") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> <87bjx16xej.fsf@russelstein.xyz> <87bjwwyccp.fsf@gnu.org> <874j2o70c3.fsf@russelstein.xyz> Date: Mon, 30 Dec 2024 00:57:54 +0100 Message-ID: <87v7v2ujnh.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , 31785@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Hey Reepca, Reepca Russelstein skribis: > Here is the reproducer: Brilliant! It works wonderfully well. I moved it to its own file to limit interference with others=E2=80=94plus i= t=E2=80=99s one of our oldest bugs so it deserves it. ;-) > P.S: If in attempting to turn this into a proper test, you try using the > timeout argument to join-thread, be aware that a second attempt at > calling join-thread on the same thread will fail with "In procedure > lock-mutex: mutex already locked by thread". This is because > join-thread in (ice-9 threads) has a bug in it: unlock-mutex is not > called in the "else" case of the cond. I am mentioning this here in > case I forget to make a proper report of it. Heh, how many bugs did you find while chasing this one? I pushed your patch together with the reproducer as 78da6951787f07e9460091885d7a9eb3e667b512. Now we have to update the =E2=80=98guix=E2=80=99 package, though perhaps th= at=E2=80=99ll have to wait until next year, we=E2=80=99ll see. Thanks a lot for your work and for your support! You=E2=80=99re my hero. = :-) Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Mon Dec 30 07:53:31 2024 Received: (at 31785-done) by debbugs.gnu.org; 30 Dec 2024 12:53:31 +0000 Received: from localhost ([127.0.0.1]:57723 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tSFH1-0008BP-CG for submit@debbugs.gnu.org; Mon, 30 Dec 2024 07:53:31 -0500 Received: from eggs.gnu.org ([209.51.188.92]:49014) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tSFGx-0008B8-Pt for 31785-done@debbugs.gnu.org; Mon, 30 Dec 2024 07:53:29 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tSFGr-0008JL-Jr; Mon, 30 Dec 2024 07:53:21 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=5AxeMcH8sf1B5M7NTgx/QwW8tNcdyxGcMKPCFNsuV9A=; b=Ukv7mKeicoKyt9U83wsg vnBImlMuqdC0P6Z2yBFPbaHYPv8ICOfmKtODD/pmZxWfP+W7AiEmVKG0C/u1k74BrGOZWWNaFxsuG 1LSTjegssbbQhg/g7lEOe660bkczlHn413EUy+5BrZMLucHr/J7ccCN3fszFr+6PvxDHlR8ne1vJs Tqb73mOmh3krGpPIkNnMQYPvoThPFfT1w8FnTzPk8pbF0HMuKYqRe30kPFTCGPALsAMUIOHqCZCE0 Sye4NzKTpxxk26OxYux7/SwuOPsiItOfdru+HHr7ay1YoqDDW0kc3aHSQMgMu2sV3MPs5LaM1NE0a z4FXG1N0+0b7Zg==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: Reepca Russelstein Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <87v7v2ujnh.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Mon, 30 Dec 2024 00:57:54 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> <87bjx16xej.fsf@russelstein.xyz> <87bjwwyccp.fsf@gnu.org> <874j2o70c3.fsf@russelstein.xyz> <87v7v2ujnh.fsf@gnu.org> Date: Mon, 30 Dec 2024 13:52:39 +0100 Message-ID: <87ikr1s57s.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785-done Cc: Christopher Baines , 31785-done@debbugs.gnu.org X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Ludovic Court=C3=A8s skribis: > I pushed your patch together with the reproducer as > 78da6951787f07e9460091885d7a9eb3e667b512. > > Now we have to update the =E2=80=98guix=E2=80=99 package, though perhaps = that=E2=80=99ll have to > wait until next year, we=E2=80=99ll see. Found the time to do it: pushed as aefe57c69ffb2892df8e180b5487a4ac705f42e3. Looks like we=E2=80=99re done, woohoo! Ludo=E2=80=99. From debbugs-submit-bounces@debbugs.gnu.org Thu Jan 02 17:36:40 2025 Received: (at 31785) by debbugs.gnu.org; 2 Jan 2025 22:36:40 +0000 Received: from localhost ([127.0.0.1]:47076 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from ) id 1tTTo0-0005DR-6V for submit@debbugs.gnu.org; Thu, 02 Jan 2025 17:36:40 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:34056) by debbugs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.84_2) (envelope-from ) id 1tTTny-0005DD-3x for 31785@debbugs.gnu.org; Thu, 02 Jan 2025 17:36:38 -0500 Received: from fencepost.gnu.org ([2001:470:142:3::e]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1tTTnq-00027g-VC; Thu, 02 Jan 2025 17:36:30 -0500 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=gnu.org; s=fencepost-gnu-org; h=MIME-Version:Date:References:In-Reply-To:Subject:To: From; bh=6pyoJVzn2p3+iKpOunBw4QvuRq/QA3HKBtad0F+KtYQ=; b=EGCc4pYxj8XhtAujqAXJ aKHsQm9kNiURbPmcWtNIAzVwnLf8Bo2/k/YEFDuO2PI4VEktHbd87jq2x4bfPaCQ0Pa/ipnh4avvq 248iNnImf8D8KPDQVCt/GtmwMjrJeGmmT+DLZD7y8OXIN5gZb4pOvgh4G4qYgvWc4QQwsyBUq8WLP NDzq+7Vn0S64Bthv51ZE3LUz/z5iUuWnxvq2max4o8KoF0sy9S7HstHQA7n39wdmmEtHCPhgu0Xzz Ye5U3d8no0r2rLEAlFF5ZzanEA9g6ioWI7tePC8TMxtO0irln8zAwNcEucIvKBbecCO3bBPWJsEkl 3cnb+uJ3hRn7BA==; From: =?utf-8?Q?Ludovic_Court=C3=A8s?= To: Reepca Russelstein Subject: Re: bug#31785: Multiple client 'build-paths' RPCs can lead to daemon deadlock In-Reply-To: <87v7v2ujnh.fsf@gnu.org> ("Ludovic =?utf-8?Q?Court=C3=A8s=22'?= =?utf-8?Q?s?= message of "Mon, 30 Dec 2024 00:57:54 +0100") References: <87602ph0yv.fsf@gnu.org> <878qs9gg5k.fsf@gnu.org> <874j2xge0y.fsf@gnu.org> <87bjx16xej.fsf@russelstein.xyz> <87bjwwyccp.fsf@gnu.org> <874j2o70c3.fsf@russelstein.xyz> <87v7v2ujnh.fsf@gnu.org> Date: Thu, 02 Jan 2025 23:36:25 +0100 Message-ID: <878qrssv12.fsf@gnu.org> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 31785 Cc: Christopher Baines , 31785@debbugs.gnu.org, guix-sysadmin X-BeenThere: debbugs-submit@debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: debbugs-submit-bounces@debbugs.gnu.org Sender: "Debbugs-submit" X-Spam-Score: -3.3 (---) Hello, I have now redeployed the x86 build machines behind ci.guix from current Guix, which includes this fix (I restarted guix-daemon on all of them). We have yet to do that on the non-x86 build machines. Happy new year, let it be deadlock-free! Ludo=E2=80=99. From unknown Tue Jun 17 22:19:01 2025 Received: (at fakecontrol) by fakecontrolmessage; To: internal_control@debbugs.gnu.org From: Debbugs Internal Request Subject: Internal Control Message-Id: bug archived. Date: Fri, 31 Jan 2025 12:24:15 +0000 User-Agent: Fakemail v42.6.9 # This is a fake control message. # # The action: # bug archived. thanks # This fakemail brought to you by your local debbugs # administrator