GNU bug report logs - #46229
rdma-core 33.x breaks InfiniBand support in Open MPI

Package: guix

Reported by: Ludovic Courtès <ludovic.courtes <at> inria.fr>

Date: Mon, 1 Feb 2021 08:56:01 UTC

Severity: normal

Done: Ludovic Courtès <ludo <at> gnu.org>

Bug is archived. No further changes may be made.

From: help-debbugs <at> gnu.org (GNU bug Tracking System)
To: Ludovic Courtès <ludovic.courtes <at> inria.fr>
Subject: bug#46229: closed (Re: bug#46229: rdma-core 33.x breaks
 InfiniBand support in Open MPI)
Date: Mon, 01 Feb 2021 13:06:01 +0000
[Message part 1 (text/plain, inline)]
Your bug report

#46229: rdma-core 33.x breaks InfiniBand support in Open MPI

which was filed against the guix package, has been closed.

The explanation is attached below, along with your original report.
If you require more details, please reply to 46229 <at> debbugs.gnu.org.

-- 
46229: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=46229
GNU Bug Tracking System
Contact help-debbugs <at> gnu.org with problems
[Message part 2 (message/rfc822, inline)]
From: Ludovic Courtès <ludo <at> gnu.org>
To: 46229-done <at> debbugs.gnu.org
Cc: Florent Pruvost <florent.pruvost <at> inria.fr>,
 Efraim Flashner <efraim <at> flashner.co.il>, Greg Hogan <code <at> greghogan.com>
Subject: Re: bug#46229: rdma-core 33.x breaks InfiniBand support in
 Open MPI
Date: Mon, 01 Feb 2021 14:05:25 +0100
Good news!  This is fixed by:

  https://git.savannah.gnu.org/cgit/guix.git/commit/?id=37e997bc7867901dc5eaf9060358dfddacae8dd6

Ludo’.

[Message part 3 (message/rfc822, inline)]
From: Ludovic Courtès <ludovic.courtes <at> inria.fr>
To: <bug-guix <at> gnu.org>
Subject: rdma-core 33.x breaks InfiniBand support in Open MPI
Date: Mon, 01 Feb 2021 09:55:19 +0100
Hello,

We noticed that the recent rdma-core upgrade to 33.1¹ leads to segfaults
in InfiniBand-related routines:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=23a5dcce1d893b8f5c5301ae3c1af863776ed3cf --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks --with-debug-info=rdma-core -- mpiexec -np 2 IMB-MPI1 PingPong
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node devel02 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
$ file core.20879 
core.20879: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'IMB-MPI1 PingPong', real uid: 10218, effective uid: 10218, real gid: 11018, effective gid: 11018, execfn: '/gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1', platform: 'x86_64'
$ gdb /gnu/store/ls8pkyi05iabk952x7gy545lc7zyr4cv-profile/bin/IMB-MPI1 core.20879 
(gdb) bt
#0  0x00007f93b2789e88 in ibv_cmd_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#1  0x00007f93b28c57bb in hfi1_create_cq ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs/libhfi1verbs-rdmav33.so
#2  0x00007f93b2796331 in ibv_create_cq@@IBVERBS_1.1 ()
   from /gnu/store/n52snxjsq25m1wgmm6h1v60myld8dyjr-rdma-core-33.1/lib/libibverbs.so.1
#3  0x00007f93b27c0a55 in opal_common_verbs_qp_test ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmca_common_verbs.so.40
#4  0x00007f93b27f4e83 in btl_openib_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_btl_openib.so
#5  0x00007f93b4516aaf in mca_btl_base_select ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libopen-pal.so.40
#6  0x00007f93b29552c2 in mca_bml_r2_component_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/openmpi/mca_bml_r2.so
#7  0x00007f93b4b81b54 in mca_bml_base_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#8  0x00007f93b4bc4ef8 in ompi_mpi_init ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#9  0x00007f93b4b5ee55 in PMPI_Init_thread ()
   from /gnu/store/sk7ngrmr529050qx4nn545lfcxxqkh6h-openmpi-4.0.5/lib/libmpi.so.40
#10 0x0000000000405b55 in main ()
--8<---------------cut here---------------end--------------->8---

Conversely, a pre-upgrade commit works fine:

--8<---------------cut here---------------start------------->8---
$ guix time-machine --commit=c2538db5617032788ac2f140496d00d8107579c8 --  environment --pure --ad-hoc openmpi openssh intel-mpi-benchmarks -- mpiexec -np 2 IMB-MPI1 PingPong
--8<---------------cut here---------------end--------------->8---

Does that ring a bell?

Thanks,
Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix.git/commit/?id=c2739c0801ebc5461564e862ce8f08405e2782dc



This bug report was last modified 4 years and 165 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.