From unknown Thu Aug 14 22:21:41 2025
From: Ludovic Courtès
To: 46229@debbugs.gnu.org
Cc: Greg Hogan, Florent Pruvost, Efraim Flashner
Subject: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 09:55:19 +0100
Message-ID: <87r1m0i2vc.fsf@inria.fr>

Hello,

We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
in InfiniBand related routines:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ file core.20879
core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
$ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879
(gdb) bt
#0  0x00007f93b2789e88 in ibv_cmd_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#1  0x00007f93b28c57bb in hfi1_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
#2  0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#3  0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
#4  0x00007f93b27f4e83 in btl_openib_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
#5  0x00007f93b4516aaf in mca_btl_base_select ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
#6  0x00007f93b29552c2 in mca_bml_r2_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
#7  0x00007f93b4b81b54 in mca_bml_base_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#8  0x00007f93b4bc4ef8 in ompi_mpi_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#9  0x00007f93b4b5ee55 in PMPI_Init_thread ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#10 0x0000000000405b55 in main ()
--8<---------------cut here---------------end--------------->8---

Conversely, a pre-upgrade commit works fine:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
--8<---------------cut here---------------end--------------->8---

Does that ring a bell?

Thanks,
Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc

From unknown Thu Aug 14 22:21:41 2025
Subject: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
To: Ludovic Courtès
Cc: Florent Pruvost, 46229@debbugs.gnu.org, Greg Hogan
Date:
 Mon, 1 Feb 2021 11:13:17 +0200
From: Efraim Flashner
In-Reply-To: <87r1m0i2vc.fsf@inria.fr>

On Mon, Feb 01, 2021 at 09:55:19AM +0100, Ludovic Courtès wrote:
> Hello,
>
> We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
> in InfiniBand related routines:
>
> [...]
>
> Does that ring a bell?

I thought I built everything that depended on rdma-core, and
unfortunately I don't have a way to test it. As an actual user of the
package I trust you to revert the change if necessary.

I don't see anything on their mailing list pointing to this, or any
other bugs really.

http://vger.kernel.org/vger-lists.html#linux-rdma

-- 
Efraim Flashner   אפרים פלשנר
GPG key = A28B F40C 3E55 1372 662D 14F7 41AA E7DC CA3D 8351
Confidentiality cannot be guaranteed on emails sent or received unencrypted

From unknown Thu Aug 14 22:21:41 2025
Subject: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
From: Ludovic Courtès
To: 46229@debbugs.gnu.org
Cc: Florent Pruvost, Efraim Flashner, Greg Hogan
Date: Mon, 01 Feb 2021 11:13:07 +0100
In-Reply-To: <87r1m0i2vc.fsf@inria.fr>
Message-ID: <875z3chz9o.fsf@gnu.org>
Content-Type: text/plain;
 charset=utf-8

Ludovic Courtès wrote:

> $ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf -- environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong

A workaround is to ask Open MPI to ignore the Verbs library with:

  mpiexec --mca btl ^openib …

Ludo’.

From unknown Thu Aug 14 22:21:41 2025
Subject: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
To: 46229@debbugs.gnu.org
Cc: Florent Pruvost, Efraim Flashner, Greg Hogan
From: Ludovic Courtès
Date: Mon, 01 Feb 2021 12:10:34 +0100
In-Reply-To: <87r1m0i2vc.fsf@inria.fr>
Message-ID: <87r1m0gi1h.fsf@gnu.org>

Ludovic Courtès wrote:

> mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).

Now with a nicer backtrace:

--8<---------------cut here---------------start------------->8---
(gdb) bt full
#0  attr_optional (attr=0x0) at include/infiniband/cmd_ioctl.h:239
No locals.
#1  ibv_icmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2, channel=channel@entry=0x0,
    comp_vector=comp_vector@entry=0, flags=flags@entry=0, cq=cq@entry=0x1074c50, link=0x7ffe0a089690)
    at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:63
        cmdb = {{next = 0x7ffe0a089690, next_attr = 0x0, last_attr = 0x0, uhw_in_idx = 255 '\377',
            uhw_out_idx = 255 '\377', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 0 '\000',
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 0, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
              attrs = 0x7ffe0a0895f8}}}
        priv =
        handle =
        async_fd_attr =
        resp_cqe =
        ret = 0
#2  0x00007f9ec83f2e4e in ibv_cmd_create_cq (context=context@entry=0x1074890, cqe=cqe@entry=2,
    channel=channel@entry=0x0, comp_vector=comp_vector@entry=0, cq=cq@entry=0x1074c50, cmd=cmd@entry=0x0, cmd_size=0,
    resp=0x7ffe0a089760, resp_size=16) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/cmd_cq.c:137
        __cmdbtotal = 2
        cmdb = {{next = 0x0, next_attr = 0x7ffe0a0896d8, last_attr = 0x7ffe0a0896e8, uhw_in_idx = 255 '\377',
            uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000', uhw_out_headroom_dwords = 2 '\002',
            buffer_error = 0 '\000', fallback_require_ex = 0 '\000', fallback_ioctl_only = 0 '\000', hdr = {
              length = 0, object_id = 3, method_id = 0, num_attrs = 0, reserved1 = 0, driver_id = 0, reserved2 = 0,
              attrs = 0x7ffe0a0896c8}}, {next = 0x100081001, next_attr = 0x7ffe0a089768, last_attr = 0x6e0000005b,
            uhw_in_idx = 124 '|', uhw_out_idx = 0 '\000', uhw_in_headroom_dwords = 0 '\000',
            uhw_out_headroom_dwords = 0 '\000', buffer_error = 1 '\001', fallback_require_ex = 1 '\001',
            fallback_ioctl_only = 1 '\001', hdr = {length = 0, object_id = 0, method_id = 0, num_attrs = 0,
              reserved1 = 0, driver_id = 17, reserved2 = 15, attrs = 0x7ffe0a089700}}}
        __cmdbdummy =
#3  0x00007f9ec85257bb in hfi1_create_cq (context=0x1074890, cqe=2, channel=0x0, comp_vector=0)
    at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/providers/hfi1verbs/verbs.c:184
        cq = 0x1074c50
        resp = {ibv_resp = {cq_handle = 0, cqe = 0, driver_data = 0x7ffe0a089768}, offset = 0}
        ret =
        size =
#4  0x00007f9ec83fde41 in __ibv_create_cq_1_1 (context=0x1074890, cqe=<optimized out>, cq_context=0x0, channel=0x0,
    comp_vector=) at /tmp/guix-build-rdma-core-33.A.drv-0/rdma-core-33.1/libibverbs/verbs.c:509
        cq =
#5  0x00007f9ec8426a55 in opal_common_verbs_qp_test ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/libmca_common_verbs.so.40
No symbol table info available.
#6  0x00007f9ec8454e83 in btl_openib_component_init ()
   from /gnu/store/6gcssdkn8iaiagzdv0d9mi93gppc85r4-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
--8<---------------cut here---------------end--------------->8---

Version 29.2 is good and everything beyond that isn’t. This has to do
with those rdma-core changes:

--8<---------------cut here---------------start------------->8---
$ git log --oneline v26.4..v33.1 libibverbs/cmd_cq.c
317d8895 verbs: Enhance async FD usage
195c9191 verbs: Introduce verbs_cq for extended CQ operations
90a4d0cc verbs: Extend CQ KABI to get an async FD
--8<---------------cut here---------------end--------------->8---

(The first commit in the list above appeared in v30.)

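As an aside that is not part of the original message: a good/bad boundary like the one reported above (29.2 good, later releases bad) is the kind of result a bisection over releases produces. Below is a minimal sketch in Python, assuming the releases are ordered oldest to newest and that once a release is bad, all later ones are bad too. The tag list and the `crashes` predicate are hypothetical stand-ins; a real run would build each rdma-core release and rerun the IMB-MPI1 PingPong command from the report.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest version for which is_bad() is true.

    Assumes versions are ordered oldest to newest, the first one is good,
    the last one is bad, and every version after the first bad one is bad.
    """
    lo, hi = 0, len(versions) - 1  # invariant: versions[lo] good, versions[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid  # the boundary is at mid or earlier
        else:
            lo = mid  # the boundary is after mid
    return versions[hi]


# Hypothetical list of release tags, for illustration only.
tags = ["v29.2", "v30.0", "v31.0", "v32.0", "v33.1"]


def crashes(tag: str) -> bool:
    # Stand-in for the real check (build this tag, run the benchmark,
    # look for the segfault). Encodes "29.2 is good, the rest is not".
    return tag != "v29.2"


print(first_bad(tags, crashes))
```

With five candidate tags this needs two evaluations of `crashes` rather than one per tag, which matters when each evaluation means rebuilding the library and its dependents.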
I forgot to mention this happens with Omni-Path hardware:

--8<---------------cut here---------------start------------->8---
$ guix environment --ad-hoc rdma-core -- ibv_devinfo
hca_id: hfi1_0
        transport:                      InfiniBand (0)
        fw_ver:                         1.27.0
        node_guid:                      0011:7509:0107:573e
        sys_image_guid:                 0011:7509:0107:573e
        vendor_id:                      0x1175
        vendor_part_id:                 9456
        hw_ver:                         0x11
        board_id:                       Intel Omni-Path Host Fabric Interface Adapter 100 Series
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               4
                        port_lmc:               0x00
                        link_layer:             InfiniBand
--8<---------------cut here---------------end--------------->8---

Ludo’.

From unknown Thu Aug 14 22:21:41 2025
From: help-debbugs@gnu.org (GNU bug Tracking System)
To: Ludovic Courtès
Subject: bug#46229: closed (Re: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI)
References: <87ft2ggcq2.fsf@gnu.org> <87r1m0i2vc.fsf@inria.fr>
Reply-To: 46229@debbugs.gnu.org
Date: Mon, 01 Feb 2021 13:06:01 +0000
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="----------=_1612184761-12209-1"

This is a multi-part message in MIME format...

------------=_1612184761-12209-1
Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline

Your bug report

#46229: rdma-core 33.x breaks InfiniBand support in Open MPI

which was filed against the guix package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 46229@debbugs.gnu.org.

-- 
46229: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=46229
GNU Bug Tracking System
Contact help-debbugs@gnu.org with problems

------------=_1612184761-12209-1
Content-Type: message/rfc822
Content-Disposition: inline

From: Ludovic Courtès
To: 46229-done@debbugs.gnu.org
Cc: Florent Pruvost, Efraim Flashner, Greg Hogan
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 14:05:25 +0100
In-Reply-To: <87r1m0gi1h.fsf@gnu.org>
Message-ID: <87ft2ggcq2.fsf@gnu.org>
Content-Type: text/plain; charset=utf-8

Good news! This is fixed by:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=37e997bc7867901dc5eaf9060358dfddacae8dd6

Ludo’.

------------=_1612184761-12209-1
Content-Type: message/rfc822
Content-Disposition: inline

From: Ludovic Courtès
Subject: rdma-core 33.x breaks InfiniBand support in Open MPI
X-Debbugs-Cc: Greg Hogan, Florent Pruvost, Efraim Flashner
Date: Mon, 01 Feb 2021 09:55:19 +0100
Message-ID: <87r1m0i2vc.fsf@inria.fr>

------------=_1612184761-12209-1--