GNU bug report logs - #46229
rdma-core 33.x breaks InfiniBand support in Open MPI

Package: guix;

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Mon, 1 Feb 2021 08:56:01 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Ludovic Courtès <ludo <at> gnu.org>
Cc: tracker <at> debbugs.gnu.org
Subject: bug#46229: closed (rdma-core 33.x breaks InfiniBand support in
 Open MPI)
Date: Mon, 01 Feb 2021 13:06:01 +0000
[Message part 1 (text/plain, inline)]
Your message dated Mon, 01 Feb 2021 14:05:25 +0100
with message-id <87ft2ggcq2.fsf <at> gnu.org>
and subject line Re: bug#46229: rdma-core 33.x breaks InfiniBand support in Open MPI
has caused the debbugs.gnu.org bug report #46229,
regarding rdma-core 33.x breaks InfiniBand support in Open MPI
to be marked as done.

(If you believe you have received this mail in error, please contact
help-debbugs <at> gnu.org.)


-- 
46229: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=46229
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: <bug-guix <at> gnu.org>
Subject: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 09:55:19 +0100
Hello,

We noticed that the recent upgrade of rdma-core to 33.1¹ leads to segfaults
in InfiniBand-related routines:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ file core.20879 
core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
$ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879 
(gdb) bt
#0  0x00007f93b2789e88 in ibv_cmd_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#1  0x00007f93b28c57bb in hfi1_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
#2  0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#3  0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
#4  0x00007f93b27f4e83 in btl_openib_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
#5  0x00007f93b4516aaf in mca_btl_base_select ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
#6  0x00007f93b29552c2 in mca_bml_r2_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
#7  0x00007f93b4b81b54 in mca_bml_base_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#8  0x00007f93b4bc4ef8 in ompi_mpi_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#9  0x00007f93b4b5ee55 in PMPI_Init_thread ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#10 0x0000000000405b55 in main ()
--8<---------------cut here---------------end--------------->8---

Conversely, a pre-upgrade commit works fine:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
--8<---------------cut here---------------end--------------->8---

Does that ring a bell?

Thanks,
Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc


[Message part 3 (message/rfc822, inline)]
From: Ludovic Courtès <ludo <at> gnu.org>
To: 46229-done <at> debbugs.gnu.org
Cc: Florent Pruvost <florent.pruvost <at> inria.fr>,
 Efraim Flashner <efraim <at> flashner.co.il>, Greg Hogan <code <at> greghogan.com>
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in
 Open MPI
Date: Mon, 01 Feb 2021 14:05:25 +0100
Good news!  This is fixed by:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=37e997bc7867901dc5eaf9060358dfddacae8dd6

Ludo’.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.