Flux-core: MVAPICH jobs hang under flux and jsrun at certain scales on Sierra

Created on 28 Jul 2018 · 25 Comments · Source: flux-framework/flux-core

From #1606:

Both the current master and my installed version of Flux hang at some scales. One such case:

PMI_LIBRARY=/usr/global/tools/pmi4pmix/blueos_3_ppc64le_ib/lib/libpmi.so time jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 128 /usr/global/tools/flux/blueos_3_ppc64le_ib/default/bin/flux start bash -c "unset PMI_LIBRARY; flux wreckrun -n 128 -N 128 virtual_ring_mpi"

This could be a bug in Flux or an issue in the system itself.

All 25 comments

One debugging idea:

Run the failing case with sh -c 'FLUX_PMI_DEBUG=1 virtual_ring_mpi 2>pmi_debug.log' and determine the last PMI client call made by each task. From #1606, the last call seemed to be a barrier exit, so if all tasks got the barrier exit command, then something might be wrong in the MPID_Init implementation (at this point perhaps run STAT on all tasks).
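A minimal sketch of how the "last PMI call per task" check could be scripted, under stated assumptions: each task writes its stderr to its own file (for example by using a per-task rank variable such as FLUX_TASK_RANK in the file name, which is assumed here and not confirmed in this thread), and the debug output has one call per line. The helper name last_pmi_calls is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print the last line of every per-task PMI debug log.
# Assumes each task wrote its stderr to <dir>/pmi_debug.<rank>.log, e.g. via
#   sh -c 'FLUX_PMI_DEBUG=1 virtual_ring_mpi 2>pmi_debug.$FLUX_TASK_RANK.log'
# The per-rank file naming and the rank variable are assumptions.
last_pmi_calls() {
    dir=${1:-.}
    for f in "$dir"/pmi_debug.*.log; do
        [ -f "$f" ] || continue          # skip if the glob matched nothing
        printf '%s: %s\n' "${f##*/}" "$(tail -n 1 "$f")"
    done
}
```

Running last_pmi_calls in the directory holding the logs gives one line per task, so a task whose last entry differs from the others stands out quickly.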

Sounds good.

BTW, we have no debugger support unfortunately so STAT and totalview won't work...

Also now that we have a version of MVAPICH that runs with my pmi4pmix, I can run this with jsrun to see if it runs or not. If not, we can indeed use STAT or totalview.

Worst case we could use the old pdsh gdb (or jr -x if that still exists).

We don't have jr -x on Sierra. So how do you pass the pids of the target with pdsh gdb though?

It could be done with a simple script, e.g. perhaps even pidof virtual_ring_mpi. Just some quick debug ideas...
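As a rough sketch of that "simple script" idea: the function below looks up local pids with pidof and asks gdb for a backtrace of every thread in batch mode. The function name dump_stacks is hypothetical, and whether gdb is allowed to attach on the compute nodes is an assumption that was not checked in this thread.

```sh
#!/bin/sh
# Hypothetical helper: dump backtraces of all threads for every local process
# with the given name. Meant to be run on each node, e.g. through pdsh.
dump_stacks() {
    name=$1
    # pidof exits non-zero when no matching process exists
    pids=$(pidof "$name") || { echo "no process named $name" >&2; return 1; }
    for pid in $pids; do
        echo "== $(hostname) pid $pid =="
        gdb -p "$pid" -batch -ex 'thread apply all bt' 2>&1
    done
}
```

On a node with a hung task, dump_stacks virtual_ring_mpi would print one stack dump per matching pid; the same call could be wrapped in a pdsh invocation over the job's node list.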

Also now that we have a version of MVAPICH that runs with my pmi4pmix, I can run this with jsrun to see if it runs or not.

@adammoody and @grondo: OK. It looks like the fault is either with MVAPICH or with the Sierra system itself.

sierra4371{dahn}29: jsrun -a 1 -c 40 -n 128 ./virtual_ring_mpi.mvapich.pmi1
[sierra309:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast

This has been hung for over 8 minutes now.

Begin: Sat Jul 28 08:39:17 PDT 2018
Sat Jul 28 08:48:17 PDT 2018
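Rather than watching the clock by hand, a reproduction run like this could be bounded with a time limit. The sketch below relies on GNU coreutils timeout, which exits with status 124 when the limit expires; the helper name run_with_limit is hypothetical, and whether the launcher cleans up its tasks when it receives the resulting SIGTERM is an assumption not verified here.

```sh
#!/bin/sh
# Hypothetical helper: run a command with a time limit and flag a likely hang.
# Usage: run_with_limit <duration> <command> [args...]
run_with_limit() {
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the status GNU timeout uses when the command timed out
        echo "possible hang: no exit within $limit" >&2
    fi
    return "$rc"
}
```

For example, run_with_limit 10m jsrun -a 1 -c 40 -n 128 ./virtual_ring_mpi.mvapich.pmi1 would give up after ten minutes and report the timeout on stderr.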

OK. Analysis with STAT (cc @lee218llnl, since this is the first success case on Sierra; he and I only recently finished our first cut there).

Function level merging:

[image: fn]

Line number level merging:

[image: ln]

Both graphs show that 57 tasks completed MPI_Init() and ran all the way down to MPI_Allreduce(), whereas the other 71 tasks are still stuck in RDMA connect within MPI_Init().

Nice work @dongahn!

@adammoody:

Finally, I ran Spectrum MPI at this scale and confirmed that it ran fine. So this isn't a system issue.

sierra4371{dahn}21: jsrun -a 1 -c 40 -n 128 ./virtual_ring_mpi.spectrum
MPI_Init time is 4.236000
size: 128
rcvbuf: 127

I am starting to suspect that

WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast

might have a bearing on all these issues.

Let's discuss this next week for debugging. (I will be busy Monday with visitors all day though.)

I will keep this ticket open for a while. This doesn't look like Flux's fault, but MVAPICH on Sierra is still extremely important for addressing our current use cases.

@adammoody: Any progress?

Oh, I was waiting to hear back from you. I sent an email to you on Saturday that probably got lost in the shuffle.

There are some variables that help with the hwloc error. With them, startup with the pmi4pmix library works for me, but I don't know whether that fixes the performance problem.

>>: jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 ./mpiHello
[sierra901:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
Number of tasks= 2 My rank= 0
Number of tasks= 2 My rank= 1

>>: setenv MPIRUN_RSH_LAUNCH 1
>>: setenv MV2_USE_MPIRUN_MAPPING 0
>>: jsrun -a 1 -c ALL_CPUS -g ALL_GPUS --bind=none -n 2 ./mpiHello
Number of tasks= 2 My rank= 0
Number of tasks= 2 My rank= 1

I was hoping you could try that to see if it helps.

Ah. Sorry somehow I missed that. Will try.

I guess I missed all of the updates on this issue, too. That stack trace you have may provide some other clues.

Having some trouble right now with MV2-2.3 startup on surface, too. Found that this variable helps, so you might also try that:

export MV2_USE_RDMA_CM=0

@dongahn, just curious whether you had a chance to give this a try.

@dongahn, also where did you get the pmi4pmix library from?

@adammoody: Sorry I have been swamped.

@dongahn, also where did you get the pmi4pmix library from?

I used the backward-compatible PMI library that comes with PMIX. I modified that a little, and after that it was easy to build it against the public PMIX library.

@dongahn, just curious whether you had a chance to give this a try.

Will try as soon as practical. I have a few high-priority user issues and work items to knock down at the moment.

We should ask IBM to supply the libpmi.so and libpmi2.so libraries they are suppressing in their PMIx build. Also, would be nice if they would split PMIx out to a separately versioned RPM package so we can track it, rather than stuffing it in with the spectrum MPI package.

@garlick: I did. But since this is not a requirement penciled in the SOW, they gave me a firm no. Our future co-design effort will prevent things like this from happening.

@adammoody: Sierra seems to be very busy: I can't get the 128 nodes I need to try MV2_USE_RDMA_CM=0.

@adammoody: I just tested this with jsrun at 128 Sierra nodes and MV2_USE_RDMA_CM=0 seems to work around the issue. I am testing this with Flux directly.

OK. This works around the hang with flux as well:

sierra4361{dahn}23: flux wreckrun -n 128 -N 128 ./virtual_ring_mpi.mvapich.pmi
[sierra3200:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 1.971083
size: 128
rcvbuf: 127
sierra4361{dahn}24:
sierra4361{dahn}24:
sierra4361{dahn}24: flux wreckrun -n 128 -N 128 ./virtual_ring_mpi.mvapich.pmi
[sierra3200:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 1.929974
size: 128
rcvbuf: 127
sierra4361{dahn}25:
sierra4361{dahn}25:
sierra4361{dahn}25: flux wreckrun -n 128 -N 128 ./virtual_ring_mpi.mvapich.pmi
[sierra3200:mpi_rank_0][smpi_load_hwloc_topology] WARNING! Invalid my_local_id: -1, Disabling hwloc topology broadcast
MPI_Init time is 1.989606
size: 128
rcvbuf: 127
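For reference, one way to apply the workaround to a single run without changing the calling shell is sketched below. The wrapper name without_rdma_cm is hypothetical, and the sketch assumes the launcher (jsrun or flux wreckrun) passes the caller's environment through to the MPI tasks, which this thread does not state explicitly.

```sh
#!/bin/sh
# Hypothetical wrapper: run one command with MV2_USE_RDMA_CM=0 set only for
# that command, leaving the calling shell's environment unchanged.
without_rdma_cm() {
    env MV2_USE_RDMA_CM=0 "$@"
}
```

A run such as without_rdma_cm flux wreckrun -n 128 -N 128 ./virtual_ring_mpi.mvapich.pmi would then carry the setting for that launch only, which makes it easy to compare against a run without it.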

OK, I just sent an email to @koning and @adammoody for further Flux/LBANN testing. I will follow up on that via email, so I am closing this ticket.
